Nucleic Acids Res. 2004 Jan 1; 32(Database issue): D3–D22.
PMCID: PMC308877

The Molecular Biology Database Collection: 2004 update


The Molecular Biology Database Collection is a public resource listing key databases of value to the biologist, including those featured in this issue of Nucleic Acids Research, and other high-quality databases. All databases included in this Collection are freely available to the public. This listing aims to serve as a convenient starting point for searching the web for reliable information on various aspects of molecular biology, biochemistry and genetics. This year’s update includes 548 databases, 162 more than the previous one. The databases are organized in a hierarchical classification that should simplify finding the right database for each given task. Each database in the list comes with a recently updated brief description. The database list and the database descriptions can be accessed online at the Nucleic Acids Research web site http://nar.oupjournals.org/.

The great challenge in biological research today is how to turn data into knowledge. I have met people who think data is knowledge but these people are then striving for a means of turning knowledge into understanding.

Sydney Brenner. The Scientist 16[6]:12, March 18, 2002


The 50th anniversary of Watson and Crick’s discovery of the DNA double-helix structure last year was marked by the formal completion of the Human Genome Project (1). Thanks to the ever-increasing pace of DNA sequencing, this 3-billion-letter text was unraveled barely 8 years after the completion of the first genome of a cellular life form, the 2000-fold smaller genome of Haemophilus influenzae strain Rd KW20 (2). The history of genome sequencing shows that the amount of accumulated DNA sequence data keeps growing at an exponential rate, nearly doubling every year. Genomes of more than a hundred organisms from all major phylogenetic lineages are already available in GenBank and sequencing of many more is currently under way. These sequence data have stimulated research in more areas of life sciences than anybody could have expected just a few years ago. They have already spawned a revolution in microbiology and, with the progress of eukaryotic genome projects, will soon impact such areas as entomology and veterinary science. Unfortunately, a great majority of biologists, chemists and physicians still have only a vague idea of how to use these data or even where to find them. For the last 10 years, Nucleic Acids Research has been devoting a special issue to the molecular biology database compilation (3), which, together with the recently launched NAR Web Server Issue (4), should help meet the challenge of bringing molecular biology data and computational tools to every laboratory bench and making them an integral part of every biologist’s tool kit.

In order to have a real impact, molecular biology data need to be properly organized and curated. The database structure should help in improving the signal-to-noise ratio, making it easy to extract useful information. At the very beginning of the genome sequencing era, Walter Gilbert and colleagues warned of ‘database explosion’, stemming from the exponentially increasing amount of incoming DNA sequence and the unavoidable errors it contains (5). Luckily, this threat has not materialized so far, due to the corresponding growth in computational power and storage capacity and the strict requirements for sequence accuracy. However, having managed so far to cope with data accumulation in terms of the capacity to store sequence data, we have fared much worse in terms of our capacity to comprehend these data. Even though at least 50–70% of proteins encoded in any genome are homologous to proteins that are already in the database, every newly sequenced genome encodes hundreds or thousands of novel proteins that have never been seen before and whose very existence in the live cell, let alone function, is uncertain. Even for Escherichia coli, arguably the best-studied organism on this planet, almost half of the ∼4288 proteins encoded in the genome have never been studied experimentally and, at the current rate of their experimental characterization, it could take many years before this task is completed (6). For eukaryotes with their much larger genome sizes, complex gene organization, multitude of regulatory interactions and the abundance of proteins without evident enzymatic activities, the task of comprehending the genomic information is vastly more challenging.

In a way, the proliferation of molecular biology databases can be seen as a natural response of the biological community as a whole to the challenge of staying current in this ever-increasing flow of information that faces every individual biologist. It allows one to rely on the expertise of others, typically well-known professionals in the field, to sort through the raw data and come up with a curated digest, not unlike the immensely popular mini-reviews that now show up in nearly every journal. The difference, of course, is that the databases are freely available on the web and are continuously updated, which makes each of them a live resource, rather than just a snapshot.

So what’s the purpose of this compilation in the era of Google, HotBot, Overture and dozens of other search engines? Unfortunately, these engines rank web sites by popularity, not by their relevance to scientists, and are unable to discriminate between reliable and unreliable web sites. Thus, a recent Google search for ‘mitochondrial myopathy’ returned a huge number of links, many of them relevant, but clicking the very first of those links launched a series of new windows offering a trial subscription to a web service, cheap airline tickets, and several more items not to be named here. Even the target window was mostly devoted to the importance of treating mitochondrial myopathies with a vegetarian diet, hardly what I was looking for. In contrast, the same search of the OMIM database yielded just 38 links, all of which were relevant and provided reliable information on this family of diseases. Thus, I hope that this compilation will help bridge the ‘digital divide’ between those researchers who create molecular biology databases and those who would benefit most from using them but are either unaware that such databases exist or are just too busy to spend valuable time sorting through dubious web links.

Certainly, this listing is far from being complete. In order to be included, databases had to provide added value to the user and be publicly available to anyone without any need for registration or subscription. The latter requirement left out a number of useful and otherwise worthy databases, previously described in NAR, such as the Asthma and Allergy Gene Database (7) or BioKnowledge Library databases YPD, PombePD and WormPD (8) from Proteome Inc., currently owned by Incyte. However, exceptions were made for the databases described in this volume and for those databases that allow some limited access without registration. Naturally, the database list has grown since the last issue. This edition includes 548 databases, an increase of 162 over last year’s list (3). While most of these new databases have been created only recently, we have also added some well-known databases that were missing before, such as Colibri, FSSP (now superseded by Dali but still widely used) and GtRDB. We have also introduced a hierarchical classification of databases that should simplify searching the list. Due to the limitations of every classification, in the online version of this list, available at http://nar.oupjournals.org/, some databases appear more than once. Doing that in the print version (Table 1) would have consumed too much valuable space.
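The arrangement described above, a hierarchy in which the online list may repeat a database under several headings while the print table lists it once, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the category paths and database names are placeholders rather than entries from the actual Collection, and the rule of keeping only the first listing is one plausible way to save space, not necessarily the one used for Table 1.

```python
from collections import defaultdict

# Hypothetical category paths and placeholder database names
# (illustrative only; not taken from the actual listing).
CLASSIFICATION = {
    ("Category A", "Subcategory A1"): ["DatabaseX", "DatabaseY"],
    ("Category A", "Subcategory A2"): ["DatabaseZ"],
    ("Category B", "Subcategory B1"): ["DatabaseY", "DatabaseW"],
}


def index_by_database(classification):
    """Invert the hierarchy: map each database to every category path listing it."""
    index = defaultdict(list)
    for path, databases in classification.items():
        for name in databases:
            index[name].append(path)
    return dict(index)


def multiply_listed(classification):
    """Return the sorted names of databases that appear under more than one category."""
    index = index_by_database(classification)
    return sorted(name for name, paths in index.items() if len(paths) > 1)


def print_version(classification):
    """Keep only the first listing of each database (assumed space-saving rule)."""
    seen = set()
    table = {}
    for path, databases in classification.items():
        kept = [name for name in databases if name not in seen]
        seen.update(kept)
        table[path] = kept
    return table


if __name__ == "__main__":
    print("Listed more than once online:", multiply_listed(CLASSIFICATION))
    for path, names in print_version(CLASSIFICATION).items():
        print(" / ".join(path), "->", ", ".join(names))
```

In this toy layout the online view holds five entries for four distinct databases, whereas the deduplicated print view holds exactly four, which is the trade-off between convenience of lookup and printed space.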

Table 1. Molecular Biology Database Collection

Suggestions for the inclusion of additional database resources in this Collection are encouraged and should be directed to Dr Alex Bateman at nardatabase@mrc-lmb.cam.ac.uk and to the author at galperin@ncbi.nlm.nih.gov.

Supplementary Material

[Database Listing]


I thank Andreas Baxevanis for keeping this invaluable resource running for the last 4 years and for helpful comments. The hierarchical classification of databases was originally developed for our recent book with Eugene Koonin (6). I thank Rich Roberts and my colleagues at NCBI for support and helpful advice and Alice Ellingham and Gill Smith for logistical support and assistance in tracking the database list.


1. Collins,F.S., Morgan,M. and Patrinos,A. (2003) The Human Genome Project: lessons from large-scale biology. Science, 300, 286–290.
2. Fleischmann,R.D., Adams,M.D., White,O., Clayton,R.A., Kirkness,E.F., Kerlavage,A.R., Bult,C.J., Tomb,J.-F., Dougherty,B.A., Merrick,J.M. et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496–512.
3. Baxevanis,A.D. (2003) The Molecular Biology Database Collection: 2003 update. Nucleic Acids Res., 31, 1–12.
4. Editorial. (2003) Nucleic Acids Res., 31, 3289.
5. Bhatia,U., Robison,K. and Gilbert,W. (1997) Dealing with database explosion: a cautionary note. Science, 276, 1724–1725.
6. Koonin,E.V. and Galperin,M.Y. (2002) Sequence–Evolution–Function. Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, Boston, MA.
7. Immervoll,T. and Wjst,M. (1999) Current status of the Asthma and Allergy Database. Nucleic Acids Res., 27, 213–214.
8. Costanzo,M.C., Crawford,M.E., Hirschman,J.E., Kranz,J.E., Olsen,P., Robertson,L.S., Skrzypek,M.S., Braun,B.R., Hopkins,K.L., Kondu,P. et al. (2001) YPD, PombePD and WormPD: model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res., 29, 75–79.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press