From Data and Information to Facts:
GenBank and the Human Genome
GenBank
is the NIH database maintained and distributed by NCBI that stores
all known public DNA sequences. Sequence data are submitted to GenBank
by individual scientists around the world, as well as by
the large centers involved in the Human Genome Project. The number
of DNA sequences stored in the GenBank database, from all organisms,
recently reached colossal heights and continues to grow at a rapid
rate. GenBank is an international collaborative project, with partners
located at the European Bioinformatics Institute in the United Kingdom
and the National Institute of Genetics in Japan.
NCBI investigators are using the human data stored in GenBank,
both draft and finished nucleotide sequences, to generate an assembly
of the human genome. Assembling and ordering the individual
sequences is a critical phase of the project, involving many steps.
Updated assemblies—incorporating new data, filling in existing
gaps, and increasing overall accuracy—are released to the public
on a regular basis.
NCBI investigators are also engaged in the essential process of
annotating, or labeling, the biologically important areas
of the human genome. This process includes the correct placement
of known human genes into their proper genomic context as well as
the prediction of previously unidentified genes from the genomic
sequence. For the first task, mRNAs from the NCBI RefSeq
collection are placed on the genome primarily by alignment. RefSeq
mRNAs are reference sequence standards for the human genome.
To generate a sequence standard, NCBI investigators first collaborate
with external organizations to gather various types of information. They
next assimilate these data using both computational tools and scientific
judgment to determine what sequence is an appropriate representation
for a gene.
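As a rough illustration of what placing a sequence "by alignment" means, the sketch below scores a short query against a longer target using a basic Smith-Waterman local alignment. It is not NCBI's pipeline: real mRNA-to-genome placement must also handle introns (spliced alignment) and far larger sequences. The function name `local_align`, the scoring values, and the example sequences are all assumptions made for illustration.

```python
def local_align(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, query_end, target_end) for the best local alignment.

    Positions are 1-based end coordinates of the highest-scoring alignment.
    The scoring parameters are illustrative defaults, not NCBI's values.
    """
    prev = [0] * (len(target) + 1)  # scores for the previous query row
    best = (0, 0, 0)
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            # Extend the diagonal with a match or mismatch.
            step = match if query[i - 1] == target[j - 1] else mismatch
            score = max(
                0,                  # start a new local alignment here
                prev[j - 1] + step, # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            curr[j] = score
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best


# Hypothetical toy sequences: the query occurs once inside the target.
print(local_align("GATTACA", "CCCGATTACATTT"))
```

The returned end coordinates indicate where on the target the query fits best; a production tool would also trace back through the scoring matrix to recover the start position and the full alignment.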
OMIM
is a Web-based catalog that contains thousands of entries for genes
and genetic disorders and serves as a phenotypic companion to
the Human Genome Project. The OMIM cytogenetic and morbid maps
present cytogenetic locations for those genes with published locations
and provide an alphabetical list of all the diseases described in
OMIM.
To validate the findings generated through computer-based comparative
analysis, it is essential to consider the results of wet-bench biology
reported in the scientific literature. Therefore, the integration
of scientific data with the literature is a necessary step for creating
a unified information resource in the life sciences. To this end,
OMIM provides users with a direct link to PubMed,
NCBI's literature retrieval system.
PubMed
provides Web-based access to over 11 million citations, abstracts,
and indexing terms for journal articles in the biomedical sciences.
It also includes links to full-text journals. Currently,
approximately 20 million searches are conducted per month, and as
many as 140,000 different users seek information daily via PubMed.
PubMed Central (PMC),
a digital archive of life sciences journal literature, was launched
in January 2001 and offers a new model for electronic scientific
communication and data retrieval. The value of PubMed Central, in
addition to its role as an archive, lies in what can be done when
data from diverse sources are stored in a common format in a single
repository. PMC currently provides free and unrestricted access
to the full text of 104 life sciences journals, with more
forthcoming.