Copyright © 2008 The Author(s)

DiProDB: a database for dinucleotide properties

1Biocomputing Group, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstrasse 11, 07745 Jena, 2Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany and 3Theoretical Systems Biology, Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK

*To whom correspondence should be addressed. Tel: +44 1603 255313; Fax: +44 1603 255128; Email: thomas.wilhelm/at/bbsrc.ac.uk

Received August 1, 2008; Revised September 3, 2008; Accepted September 3, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

DiProDB (http://diprodb.fli-leibniz.de) is a database of conformational and thermodynamic dinucleotide properties. It includes datasets both for DNA and RNA, as well as for single and double strands. The data have been shown to be important for understanding different aspects of nucleic acid structure and function, and they can also be used for encoding nucleic acid sequences. The database is intended to facilitate further applications of dinucleotide properties. Because many of the property datasets are highly correlated, the database comes with a correlation analysis facility. Authors who have determined new sets of dinucleotide property values are invited to submit them to DiProDB.

INTRODUCTION

Nucleic acid properties are governed by the corresponding nucleotide sequence. More specifically, many properties, such as nucleic acid stability, seem to depend primarily on the identity of nearest-neighbour nucleotides (1).
The corresponding nearest-neighbour model is also the basis for RNA secondary structure prediction by free-energy minimization (2). It is known that not only thermodynamic but also conformational nucleotide properties may play a role. It has been shown, for example, that promoter locations can be predicted using dinucleotide stiffness parameters derived from molecular dynamics simulations (3). Curved DNA is also known to play a role in prokaryotic gene expression (4), and physical DNA profiles have been used for improved promoter prediction (5,6). There are numerous other examples, but a comprehensive overview is beyond the scope of this brief database description.

We are currently developing a Genome Browser that encodes complete eukaryotic or prokaryotic genomes by thermodynamic and conformational dinucleotide properties. In this context, we have collected more than 100 sets of dinucleotide properties from the literature. There are currently two related data collections: the PROPERTY DB (srs6.bionet.nsc.ru/srs6bin/cgi-bin/wgetz?-page+LibInfo+-id+1pFZP1TuQpU+-lib+PROPERTY) with about 30 property sets (7) and plot.it (hydra.icgeb.trieste.it/dna/plot_it.html) with about 50 sets (Vlahovicek,K. and Pongor,S., unpublished data). Both databases lack many of the existing datasets and, in addition, make it difficult to trace the original data sources. Neither is included in the NAR Database Collection. Therefore, we have set up the database DiProDB, which is intended as a one-stop resource for these properties. With DiProDB we want to provide reliable, easily accessible and comprehensive information on dinucleotide properties that may stimulate the application of these data to a diversity of biological problems.

DATABASE CONTENT

DiProDB currently includes 115 dinucleotide datasets.
They were collected from the literature and are classified according to nucleic acid type (DNA or RNA), strand information (double or single), how the data were obtained (experimental or theoretical/calculated) and the general type of the dinucleotide property: thermodynamic (e.g. free energy), conformational (e.g. twist) or letter-based (e.g. GC content). We include the letter-based data to demonstrate relations to thermodynamic and conformational properties. Moreover, most current motif discovery approaches are letter-based. An example from our work is the identification of significant purine–pyrimidine patterns in restriction enzyme binding sites (8). The number of datasets in each category is shown in Table 1. For each dataset, the 16 dinucleotide values, the unit of measurement, the reference, the classification features and comments are provided. If a dataset refers to RNA, this is stated in the property name; if the name does not mention a nucleic acid, the dataset refers to DNA.
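As a rough illustration of how one such dataset might be represented and used to encode a sequence, the following Python sketch builds a letter-based GC-content set and maps a sequence onto its overlapping dinucleotides. The `PropertySet` class, its field names and the `encode` function are hypothetical and not part of DiProDB; the GC values are simple letter counts computed here, not values taken from the database.

```python
from dataclasses import dataclass
from itertools import product

# The 16 dinucleotides over the DNA alphabet, in lexicographic order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


@dataclass(frozen=True)
class PropertySet:
    """Hypothetical record for one dataset; field names are illustrative only."""
    name: str
    nucleic_acid: str  # "DNA" or "RNA"
    strand: str        # "double" or "single"
    how_created: str   # "experimental" or "calculated"
    kind: str          # "thermodynamic", "conformational" or "letter-based"
    unit: str
    values: dict       # dinucleotide -> value, 16 entries


def gc_content_set() -> PropertySet:
    """Letter-based toy set: number of G or C letters in each dinucleotide."""
    values = {d: sum(base in "GC" for base in d) for d in DINUCLEOTIDES}
    # The classification labels below are assumptions made for this example.
    return PropertySet("GC content (count)", "DNA", "single", "calculated",
                       "letter-based", "count", values)


def encode(sequence: str, prop: PropertySet) -> list:
    """Map a sequence of length n to n-1 values, one per overlapping dinucleotide."""
    seq = sequence.upper()
    return [prop.values[seq[i:i + 2]] for i in range(len(seq) - 1)]


print(encode("GATTACA", gc_content_set()))  # → [1, 0, 0, 0, 1, 1]
```

Any of the 16-value datasets could be substituted for the toy GC count in the same way, giving one numerical profile per property along the sequence.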
USER INTERFACE

DiProDB displays all data in a single table (see Figure 1).
DATA ANALYSES

The DiProDB website contains a Correlate option with which users can calculate Pearson's correlation coefficients or Spearman's rank correlation coefficients for all or selected properties. This allows easy identification of dependencies between different dinucleotide properties. An example is shown in Figure 2. Based on these correlations, we performed several hierarchical clustering analyses to gain deeper insight into the overall correlation of the datasets (Figure 3).
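The kind of calculation behind the Correlate option can be sketched as follows. This is a minimal illustration, not the DiProDB implementation: the function names are hypothetical, and the three property vectors are toy letter counts computed on the fly rather than database values.

```python
from itertools import product
from math import sqrt

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy letter-based properties, one value per dinucleotide.
gc = [sum(b in "GC" for b in d) for d in DINUCLEOTIDES]
at = [sum(b in "AT" for b in d) for d in DINUCLEOTIDES]
purine = [sum(b in "AG" for b in d) for d in DINUCLEOTIDES]

print(pearson(gc, at), spearman(gc, at), pearson(gc, purine))
```

In this toy case the GC and AT counts are perfectly anti-correlated because they always sum to two, whereas the GC and purine counts are uncorrelated. With only 16 values per property, ties are common, which is why the rank function averages tied positions.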
To gain more insight into the data, we performed two principal component analyses (PCA) (15). The complete data of 115 properties for 16 dinucleotides correspond to 115 points in 16-dimensional space (or 16 points in 115-dimensional space). PCA helps to reveal the internal structure of such high-dimensional data by providing lower-dimensional pictures of the ‘cloud’ in coordinates corresponding to maximum variance of the data (http://en.wikipedia.org/wiki/Principal_components_analysis). The cloud of all 115 properties in the first two principal components (PCs, the new coordinates) is shown in Figure 4.
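To make the variance decomposition concrete, the sketch below computes the fraction of variance carried by each principal component for a toy set of four letter-based descriptors, treating the 16 dinucleotides as data points. It is an illustration under stated assumptions, not the analysis performed for the paper: the descriptors are letter counts computed here, all function names are hypothetical, and a basic Jacobi eigenvalue routine stands in for whatever PCA software was actually used.

```python
from itertools import product
from math import atan, cos, pi, sin

# The 16 DNA dinucleotides; each one is a data point in this sketch.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def letter_count(letters):
    """Toy descriptor: number of bases of each dinucleotide found in `letters`."""
    return [sum(base in letters for base in d) for d in DINUCLEOTIDES]


def covariance_matrix(columns):
    """Sample covariance matrix of equally long property columns."""
    n = len(columns[0])
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([v - mean for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centred]
            for ci in centred]


def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                diff = a[q][q] - a[p][p]
                theta = pi / 4 if diff == 0 else 0.5 * atan(2 * a[p][q] / diff)
                c, s = cos(theta), sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def explained_variance(columns):
    """Fraction of the total variance carried by each principal component."""
    eigenvalues = jacobi_eigenvalues(covariance_matrix(columns))
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


descriptors = [letter_count("GC"), letter_count("AT"),
               letter_count("AG"), letter_count("CT")]
print([round(f, 3) for f in explained_variance(descriptors)])
```

Because the GC and AT counts always sum to two, as do the purine and pyrimidine counts, the four descriptors collapse onto two components, each carrying about half of the variance. The same reading applies to the percentages reported for the full dataset: the number of components with non-negligible variance indicates how many weakly correlated properties are needed.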
Finally, we also performed a PCA calculating the 115 principal components for the 16 dinucleotides. The first 15 PCs carry information (23%, 21%, 14%, 12%, 6%, etc.), roughly indicating that about this number of weakly correlated properties is needed to represent all the information in the complete set of 115 properties. The Supplementary Material also contains a corresponding PC1–PC2 plot, together with detailed information about the PCAs performed.

OUTLOOK

So far, DiProDB contains 115 sets of dinucleotide properties. This number is to be increased in the future. We also invite other authors to submit their measured or calculated dinucleotide properties to DiProDB.

FUNDING

Funding for open access charge: Biotechnology and Biological Sciences Research Council (BBSRC) IFR Core Strategic Grant.

Conflict of interest statement. None declared.

Supplementary data are available at NAR Online.

ACKNOWLEDGEMENTS

We are grateful to Friedrich Haubensak for setting up the database and to Rolf Hühne for helpful comments on the database layout.

REFERENCES

1. SantaLucia J Jr. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbour thermodynamics. Proc. Natl Acad. Sci. USA. 1998;95:1460–1465.
2. Mathews DH, Turner DH. Prediction of RNA secondary structure by free energy minimization. Curr. Opin. Struct. Biol. 2006;16:270–278.
3. Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biol. 2007;8:R263.
4. Pérez-Martín J, Rojo F, de Lorenzo V. Promoters responsive to DNA bending: a common theme in prokaryotic gene expression. Microbiol. Rev. 1994;58:268–290.
5. Abeel T, Saeys Y, Rouzé P, Van de Peer Y. ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics. 2008;24:i24–i31.
6. Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33:4255–4264.
7. Ponomarenko JV, Ponomarenko MP, Frolov AS, Vorobyev DG, Overton GC, Kolchanov NA. Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics. 1999;15:654–668.
8. Nikolajewa S, Beyer A, Friedel M, Hollunder J, Wilhelm T. Common patterns in type II restriction enzyme binding sites. Nucleic Acids Res. 2005;33:2726–2733.
9. Karas H, Knüppel R, Schulz W, Sklenar H, Wingender E. Combining structural analysis of DNA with search routines for the detection of transcription regulatory elements. Comput. Appl. Biosci. 1996;12:441–446.
10. Pérez A, Noy A, Lankas F, Luque FJ, Orozco M. The relative flexibility of B-DNA and A-RNA duplexes: database analysis. Nucleic Acids Res. 2004;32:6144–6151.
11. Gorin AA, Zhurkin VB, Olson WK. B-DNA twisting correlates with base-pair morphology. J. Mol. Biol. 1995;247:34–48.
12. Suzuki M, Yagi N, Finch JT. Role of base-backbone and base-base interactions in alternating DNA conformations. FEBS Lett. 1996;379:148–152.
13. Shpigelman ES, Trifonov EN, Bolshoy A. CURVATURE: software for the analysis of curved DNA. Comput. Appl. Biosci. 1993;9:435–440.
14. Ward JH. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963;58:236–244.
15. Pearson K. On lines and planes of closest fit to systems of points in space. Philos. Magazine. 1901;2:559–572.