A Toolbox for Predicting G-Quadruplex Formation and Stability

G-quadruplexes are four-stranded nucleic acid structures formed around a core of guanines, arranged in squares with mutual hydrogen bonding. Many of these structures are highly thermally stable, especially in the presence of monovalent cations, such as those found under physiological conditions. Understanding of their physiological roles is expanding rapidly, and they have been implicated in regulating gene transcription and translation, among other functions. We have built a community-focused website to act as a repository for the information that is now being developed. At its core, this site has a detailed database (QuadDB) of predicted G-quadruplexes in the human and other genomes, together with the predictive algorithm used to identify them. We also provide a QuadPredict server, which predicts thermal stability and acts as a repository for experimental data from all researchers. We also describe a number of other data sources that offer computational predictions. We anticipate that the wide availability of this information will be of use both to researchers already active in this exciting field and to those who wish to investigate a hypothesis about a particular gene.


Introduction
It was observed in 1910 [1] that guanosine, unlike the other nucleobases, could form a gel at sufficiently high concentration, and in 1962 [2] it was discovered that four guanosines can self-assemble to form a hydrogen-bonded square, with bonds between the N1-O6 and N2-N7 positions. This structure is known as a G-tetrad or G-quartet. As with any nucleobase, these structures have a strong propensity to stack on each other via π-π interactions, forming four-stranded helices called G-quadruplexes, with the phosphate backbone perpendicular to the plane of the G-quartets. The four strands may come from separate molecules, or from only two or one, with loops joining them together [3][4][5][6][7].
G-quadruplexes form with great thermal stability, [8] and have been found experimentally to form from genomic sequences in critical regions such as telomeres, gene promoters and UTRs, [9,10] and to have physiological effects in each of these regions. In telomeres, their formation reduces the activity of telomerase, the upregulation of which has been associated with 85% of cancers; this has led to much pharmaceutical interest [11]. G-quadruplexes in gene promoters, such as those of the oncogenes c-myc and c-kit, [12,13] have been shown to control transcriptional activity in vitro, although interestingly their formation can lead to either an increase or a decrease of activity in different systems. It has been shown that G-quadruplex formation in the 5′ UTR can decrease translational activity, [14] and there have been suggestions of other physiological effects. A wide variety of proteins have been found to interact specifically with them, [15] and they have been shown experimentally to form in vivo [16][17][18].
In parallel with the experimental work, computational techniques have been developed to predict which sequences will form G-quadruplexes [22][23][24]. A variety of algorithmic rules can be used for this purpose, [25,26] although some are more widely used and accepted than others. There is not sufficient evidence for any of them to be held as absolutely true, and it is only recently that any work has been done to try to predict the relative stabilities of possible G-quadruplex structures, rather than just whether they could form or not. Despite this limitation, computational methods have led to a number of discoveries, including the observations that G-quadruplexes are relatively rare in the human genome, but more prevalent than expected in gene promoters [27]. Some of the computational discoveries have been recently reviewed [25,26].
The field as a whole has grown very significantly in recent times, with a roughly exponential rise in publications (see Figure 1), including over 350 in 2009. A dedicated book has been produced, [28] together with special issues of some journals focused on this topic, and some databases on particular aspects of G-quadruplexes. A few G-quadruplex based drugs have also entered clinical trials. A series of International Conferences has been initiated, the first two hosted in Louisville, KY [29,30]. At the first of these, it was suggested that a central and coherent website to store and provide data related to G-quadruplexes should be produced, and we volunteered to provide such a repository [29], hosted at the URL http://www.quadruplex.org/.
Here, we describe the features available at that website, in particular the core databases of predicted G-quadruplexes and a new tool to estimate the thermal stability of these structures computationally. We also describe the other online sources of predictive data for G-quadruplexes, so that researchers may choose the most appropriate tool for their work.

QuadDB: A Database of G-Quadruplex Predictions
The core quadruplex database (QuadDB, http://www.quadruplex.org/?view=quadbase) provides both static and searchable data for researchers on computationally predicted G-quadruplexes (Putative Quadruplex Sequences, PQS). These have been generated as previously described [22], using our favoured predictive algorithm, which identifies sequences on either strand of the form G_{3+}N_{1-7}G_{3+}N_{1-7}G_{3+}N_{1-7}G_{3+}, that is, four runs of three or more guanines separated by three loops of one to seven nucleotides each. This has been shown experimentally to be a good predictor of in vitro G-quadruplex formation [31]. It aims to identify specific G-quadruplexes that may form, providing a hypothesis that can be tested in vitro using simple biophysical methods.
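The pattern above maps naturally onto a regular expression. The following is a minimal Python sketch of such a search, not the quadparser source: the function name `find_pqs`, the tuple layout of the results and the choice to report only non-overlapping matches are assumptions made for illustration. The reverse strand is handled by searching the forward strand for the complementary C-rich pattern.

```python
import re

# Four runs of >=3 G separated by three loops of 1-7 nucleotides (forward strand).
G_PATTERN = re.compile(r"G{3,}(?:[ACGTN]{1,7}G{3,}){3}")
# The same motif on the reverse strand appears as a C-rich pattern on the forward strand.
C_PATTERN = re.compile(r"C{3,}(?:[ACGTN]{1,7}C{3,}){3}")


def find_pqs(sequence):
    """Return sorted (start, end, strand, text) tuples for non-overlapping matches.

    Coordinates are 0-based, end-exclusive, on the forward strand.
    For "-" hits the reported text is the forward-strand (C-rich) sequence.
    """
    seq = sequence.upper()
    hits = []
    for strand, pattern in (("+", G_PATTERN), ("-", C_PATTERN)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand, match.group()))
    return sorted(hits)


if __name__ == "__main__":
    # Four human telomeric repeats contain one candidate on the forward strand.
    for hit in find_pqs("TTAGGGTTAGGGTTAGGGTTAGGGTT"):
        print(hit)
```

Because `finditer` returns non-overlapping matches, a long G-rich region that could fold in several alternative ways is reported only once per scan; enumerating every alternative would need a different search strategy.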

Quadparser.
For any researcher interested in identifying PQS in specific sequences, we provide the quadparser program pre-compiled for MS Windows and Mac OS X with detailed instructions. The program is customisable, so that different patterns can be searched for. Different loop length constraints, G-tract lengths and so forth may all be set, so that the algorithm can be adjusted to fit with the particular context desired. Quadparser has a variety of output styles for different uses, and reads sequence data in FASTA format.
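To illustrate what adjusting loop lengths and G-tract lengths amounts to, here is a hedged Python sketch that builds a search pattern from such parameters and reads FASTA-formatted input. It is not the quadparser program and does not reproduce its options or output styles; `build_pqs_pattern`, `read_fasta` and all parameter names are hypothetical.

```python
import re


def build_pqs_pattern(min_tract=3, min_loop=1, max_loop=7, tracts=4, base="G"):
    """Compile a regex for `tracts` runs of `base` joined by bounded loops.

    The defaults correspond to the G3+ N1-7 pattern described in the text.
    """
    tract = f"{base}{{{min_tract},}}"
    loop = f"[ACGTN]{{{min_loop},{max_loop}}}"
    return re.compile(f"{tract}(?:{loop}{tract}){{{tracts - 1}}}")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


if __name__ == "__main__":
    # A relaxed search: G-tracts of two or more, loops of one to three bases.
    relaxed = build_pqs_pattern(min_tract=2, max_loop=3)
    for name, seq in read_fasta([">demo example", "ttggaagg", "aaggtgg"]):
        for match in relaxed.finditer(seq):
            print(name, match.start(), match.end(), match.group())
```

Keeping the pattern construction separate from the file reading means the same parameters can be reused for a C-rich reverse-strand search by passing `base="C"`.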

Data Search.
The Data search section allows a researcher to identify any PQS in gene promoters (defined as the 1 kb upstream of the TSS) or UTRs for their gene of interest. The genes may be identified by Ensembl ID, HGNC code or description. The output provides full details of the gene, including genomic parameters, and the location and sequence of PQS in the appropriate regions of every transcript of the gene. Links are also provided to Ensembl so the PQS may be seen in context. Figure 2 displays the output when searching the human genome for PQS in the promoter or UTRs of c-kit (HGNC nomenclature KIT). Currently, searches may be performed against the human, chimpanzee and mouse genomes.
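The promoter definition used here (1 kb upstream of the TSS) depends on the strand of the gene. The sketch below shows one way to compute that window and to test whether a predicted PQS falls inside it. It is an illustration only: the function names, the 1-based inclusive coordinate convention and the decision to exclude the TSS base itself are assumptions, not a description of how the website computes its regions.

```python
def promoter_window(tss, strand, length=1000, chrom_length=None):
    """Return the 1-based inclusive (start, end) of the region upstream of a TSS.

    "Upstream" means lower coordinates for "+" genes and higher coordinates
    for "-" genes. The TSS base itself is excluded (an assumption).
    Returns None if the window is empty after clipping to the chromosome.
    """
    if strand == "+":
        start, end = tss - length, tss - 1
    elif strand == "-":
        start, end = tss + 1, tss + length
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return (start, end) if start <= end else None


def overlaps(window, pqs_start, pqs_end):
    """True if the inclusive interval [pqs_start, pqs_end] intersects the window."""
    if window is None:
        return False
    start, end = window
    return pqs_start <= end and pqs_end >= start


if __name__ == "__main__":
    window = promoter_window(5000, "-")
    print(window, overlaps(window, 5990, 6010))
```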

Data Download.
As a convenient alternative to gene-by-gene searches or using the quadparser program, we also provide a downloadable listing of every PQS identified in various genomes. We currently offer these data for human (builds 34, 35 and 36 for backward compatibility), chimpanzee (2.1), mouse (37), rat (3.4), dog (2), chicken (2), zebrafish (7), fruitfly (5.4), roundworm (180) and yeast (1.01) genomes. In each case the data provide genomic coordinates for each PQS, together with the strand, sequence and a unique identifier. Data may be downloaded for a whole genome or by chromosome.

QuadPredict: Predicting G-Quadruplex Stability
The thermal stability of G-quadruplexes varies with the concentration of monovalent cations, specifically Na+ and K+. However, even for fixed concentrations, the exact details of the sequence, and hence the structure formed, make a very large difference. G-quadruplexes can vary from those which are too unstable to form at 5 °C to those which will resist temperatures above 95 °C [31]. It is therefore necessary to predict not just which sequences can form G-quadruplexes at all, but also the stability of the structures they form. Measurements of thermal stability are relatively easy to perform, and have led to a series of studies of different aspects of the relationship between sequence and stability [31][32][33][34]. However, such studies do not enable prediction for unmeasured sequences, forcing researchers to make informed guesses as to the stability of novel sequences. We recently developed [35] a Bayesian learning algorithm that is capable of making accurate predictions of thermal stability for new sequences, having been trained on a collection of measured sequences. Full details of the methodology and the parameters considered are available elsewhere [35]. We provide an interface to this system at http://www.quadruplex.org/?view=quadpredict, enabling researchers to make easy predictions of melting temperatures under various conditions for any desired sequence. Figure 3 gives an example of such predictions.
One feature of the Bayesian inference we use is that in addition to predictions of the melting temperature, we also provide uncertainties in the values for each sequence. In general, the uncertainty increases for sequences that are highly unlike those in our training set. This therefore enables researchers to decide rationally how much faith to place in a particular prediction.
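The behaviour described above, where predictive uncertainty grows for inputs unlike the training set, can be seen in a small generic example. The sketch below uses Gaussian-process regression with a squared-exponential kernel on a single abstract feature, chosen purely for illustration. It is not the QuadPredict model, whose features and parameters are given in [35], and the training values are invented toy numbers, not measurements.

```python
import math


def rbf(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_predict(train_x, train_y, query, noise=0.01):
    """Return (mean, standard deviation) of a zero-mean GP prediction at `query`."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(train_x)]
            for i, a in enumerate(train_x)]
    k_star = [rbf(query, a) for a in train_x]
    alpha = solve(gram, train_y)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    weights = solve(gram, k_star)
    var = rbf(query, query) + noise - sum(k * w for k, w in zip(k_star, weights))
    return mean, math.sqrt(max(var, 0.0))


# Invented toy inputs and targets (abstract units), used only to show the effect.
TRAIN_X = [0.0, 1.0, 2.0]
TRAIN_Y = [0.5, -0.2, 0.3]

if __name__ == "__main__":
    for query in (1.0, 10.0):
        mean, std = gp_predict(TRAIN_X, TRAIN_Y, query)
        print(f"query={query}: mean={mean:.3f}, std={std:.3f}")
```

A query close to the training inputs yields a small standard deviation, whereas a distant query returns essentially the prior uncertainty, which is the signal a user would rely on when deciding how much weight to give a prediction.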
We intend to develop the training data further, and have already employed a rational active learning protocol to collect more data and reduce the uncertainties below those originally presented. We will continue to do this, and will also provide an opportunity for researchers to contribute their own data, so that the Bayesian inference can become increasingly accurate. We hope that depositing data publicly may become a standard requirement for publication of G-quadruplex thermal data.
We allow researchers to discover whether particular sequences they are interested in are already in our database of measurements, with information about exactly how each experiment was performed. We hope that these facilities will prove useful to all those working in this field. As well as those interested in biological aspects of G-quadruplexes, we feel this facility may be particularly helpful for those working in nanotechnology or materials science, providing them with a method of rationally selecting G-quadruplex-forming oligonucleotides.

Other G-Quadruplex Computational Tools
There are a number of other tools that may be used to predict the existence of G-quadruplexes in DNA, and links to these are provided from http://www.quadruplex.org/. Bagga and coworkers offer QGRS mapper [36], which uses an algorithm similar to quadparser. It has different default parameters, in particular looking at sequences with fewer consecutive guanines and longer loops, but essentially looks for much the same sequences. Interestingly, it includes a scoring parameter for the different possible G-quadruplexes that can be formed. Although this is loosely based on empirical evidence, it is not clear how the "G-score" produced, which ranges up to a maximum of 105, relates to stability. To the best of our knowledge, no empirical tests of the validity of the G-score have been performed, even as a ranking list, but it is still a useful formulation of established rules of thumb. In addition to QGRS mapper, which provides the facility to search by gene, they offer specialised databases, GRSDB2 and GRS UTRdb, [37] for searching pre-mRNAs and UTR sequences.
Maiti and coworkers offer a site called Quadfinder, [38] which implements essentially the same algorithm as quadparser. (At the time of writing it does not appear to be functioning.) At the same institute, Chowdhury and coworkers have a site called QuadBase, [39] again using essentially the same algorithm. They focus on cross-species analysis, offering an ortholog analysis for finding conserved G-quadruplexes across either prokaryotes (ProQuad; see also [40]) or eukaryotes (EuQuad). It should be noted that conservation is judged by presence alone, and no sequence comparison is performed.
Lastly in this category is the Greglist database of potential G-quadruplex regulated genes, which lists all human genes that have a G-quadruplex in the 1 kb region upstream of the transcription start site. The quadparser algorithm is used to predict these sequences [41].
A completely different approach to G-quadruplex prediction is taken by the Maizels lab [42,43]. Whereas other methods aim to predict specific G-quadruplex sequences, largely driven by the desire of structural biologists to have structures to study, and by the desire of medicinal chemists to have a defined form to target, the G4 calculator from Eddy and Maizels accepts that many of these structures are highly polymorphic in vivo. As a result, it does not aim to predict individual structures but looks at the density of sequences likely to lead to G-quadruplex structures. Given that this is an entirely orthogonal approach, it is striking that in many cases, particularly for the gene functions that are likely to be regulated by G-quadruplexes, very similar conclusions arise from this approach and from the quadparser model. We strongly recommend that for any large-scale genomic studies, both approaches are used to corroborate the results found.
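To make the contrast with motif matching concrete, the following Python sketch scores a sequence by the density of G-runs in sliding windows rather than by locating individual motifs. It is a generic illustration of a density-style measure, not the G4 calculator's algorithm, which is not specified here: the window size, step, run length and the function name `g_run_density` are all assumptions, and only the forward strand is scanned.

```python
import re

# A run of three or more consecutive guanines (assumed minimum run length).
G_RUN = re.compile(r"G{3,}")


def g_run_density(sequence, window=100, step=20):
    """Return (window_start, count) pairs of G-runs starting inside each window.

    Windows are 0-based and half-open: [window_start, window_start + window).
    A sequence shorter than one window yields a single window at position 0.
    """
    seq = sequence.upper()
    run_starts = [match.start() for match in G_RUN.finditer(seq)]
    last_start = max(len(seq) - window, 0)
    densities = []
    for left in range(0, last_start + 1, step):
        right = left + window
        count = sum(left <= start < right for start in run_starts)
        densities.append((left, count))
    return densities


if __name__ == "__main__":
    demo = "GGGA" * 5 + "A" * 20
    for window_start, count in g_run_density(demo, window=20, step=10):
        print(window_start, count)
```

A density profile of this kind says nothing about which structure forms, only where G-rich potential is concentrated, which is why it complements a motif search rather than replacing it.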

Conclusions
Computational methods have been of great use in understanding the role that G-quadruplexes may play in biology, unveiling their function in gene promoters [27,42] and in regulating translation [44]. They have also revealed that stable G-quadruplexes are generally located in nucleosome-free regions [45]. Stability predictions have been used to develop experimental methods to directly visualise G-quadruplexes using AFM [46]. We anticipate that greater availability of ever more reliable tools will both improve the quality of informatic research in this area and make it increasingly easy for experimentalists to access computational results.