National Center for Biotechnology Information
 
 
 

A STORY OF DISCOVERY

From Sequence to Survival:
The Search to Understand Disease

 
Biology today is being transformed by an explosive growth of data emerging from laboratories worldwide. The challenge is to transform data into knowledge, knowledge that will lead to a better understanding of the biological processes underlying both health and disease. The quest for this knowledge drives NCBI investigators to develop new methods for integrative, computer-based data analysis to mine massive and complex datasets. Once developed, these software tools may then be used by the research community to answer specific scientific questions.

NCBI's Web-based approach for disseminating its resources to the research and medical communities has enabled scientists worldwide to integrate seemingly disparate data and to shape more biologically meaningful views of this information, which has, in turn, generated new knowledge. This multi-step process is best illustrated by the examples outlined below.

 
 

From Data and Information to Facts:
GenBank and the Human Genome

GenBank is the NIH database, maintained and distributed by NCBI, that stores all known public DNA sequences. Sequence data are submitted to GenBank by individual scientists around the world, as well as by the large centers involved in the Human Genome Project. The number of DNA sequences stored in GenBank, from all organisms, is already very large and continues to grow rapidly. GenBank is an international collaborative project, with partners located at the European Bioinformatics Institute in the United Kingdom and the National Institute of Genetics in Japan.

NCBI investigators are using the human data stored in GenBank, both draft and finished nucleotide sequences, to generate an assembly of the human genome. Assembling and ordering the individual sequences is a critical phase of the project, involving many steps. Updated assemblies—incorporating new data, filling in existing gaps, and increasing overall accuracy—are released to the public on a regular basis.

NCBI investigators are also engaged in the essential process of annotating, or labeling, the biologically important areas of the human genome. This process includes placing known human genes into their proper genomic context as well as predicting previously unidentified genes from the genomic sequence. For the first task, mRNAs from the NCBI RefSeq collection are placed on the genome primarily by alignment. RefSeq mRNAs are reference sequence standards for the human genome. To generate a sequence standard, NCBI investigators first collaborate with external organizations to gather various types of information. They then assimilate these data, using both computational tools and scientific judgment, to determine which sequence is an appropriate representation of a gene.
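To make the idea of placing an mRNA on the genome "by alignment" more concrete, the following is a minimal sketch only. It is not NCBI's method: the function name `place_mrna`, the `min_block` parameter, and the example sequences are all assumptions made for illustration. It greedily finds exact-match blocks (standing in for exons) in left-to-right order, whereas real alignment tools tolerate mismatches and use splice-site signals.

```python
def place_mrna(genome, mrna, min_block=4):
    """Toy placement of an mRNA on a genomic sequence.

    Returns a list of (start, end) genome coordinates (end exclusive),
    one per exact-match block, or None if the mRNA cannot be placed.
    Hypothetical helper for illustration; exact matches only.
    """
    blocks = []
    g_pos = 0  # next genome position allowed for a block
    m_pos = 0  # next mRNA position still to be placed
    while m_pos < len(mrna):
        placed = False
        length = len(mrna) - m_pos
        # Try the longest remaining prefix first, then shorter ones.
        while length >= min_block:
            start = genome.find(mrna[m_pos:m_pos + length], g_pos)
            if start != -1:
                blocks.append((start, start + length))
                g_pos = start + length
                m_pos += length
                placed = True
                break
            length -= 1
        if not placed:
            return None  # no acceptable block found
    return blocks


# Made-up sequences: two "exons" separated by a G-rich "intron".
genome = "TTTT" + "ACGTAC" + "GGGGGGGG" + "CATGCA" + "TTT"
mrna = "ACGTAC" + "CATGCA"
print(place_mrna(genome, mrna))  # [(4, 10), (18, 24)]
```

The two reported blocks correspond to the two segments of the made-up mRNA, with the gap between them playing the role of an intron in the genomic sequence.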

OMIM (Online Mendelian Inheritance in Man) is a Web-based catalog that contains thousands of entries for genes and genetic disorders and serves as a phenotypic companion to the Human Genome Project. The OMIM cytogenetic and morbid maps present cytogenetic locations for those genes with published locations and provide an alphabetical list of all the diseases described in OMIM.

To validate the findings generated through computer-based comparative analysis, it is essential to consider the results of wet-bench biology reported in the scientific literature. Integrating scientific data with the literature is therefore a necessary step in creating a unified information resource for the life sciences. To this end, OMIM provides users with direct links to PubMed, NCBI's literature retrieval system.

PubMed provides Web-based access to over 11 million citations, abstracts, and indexing terms for journal articles in the biomedical sciences. It also includes links to full-text journals. Currently, approximately 20 million searches are conducted per month, and as many as 140,000 different users seek information daily via PubMed.

PubMed Central (PMC), a digital archive of life sciences journal literature, was launched in January 2001 and offers a new model for electronic scientific communication and data retrieval. The value of PubMed Central, in addition to its role as an archive, lies in what can be done when data from diverse sources are stored in a common format in a single repository. PMC currently provides free and unrestricted access to the full text of 104 life sciences journals, with more forthcoming.

 
 

From Facts to Knowledge

Each database described above is, by itself, informative and useful. Yet, as illustrated, it is only after the components become linked to form a single integrated resource that the information stored in each database can be analyzed as part of the bigger whole. For example, by integrating various forms of information relative to a particular protein, a researcher may be able to elucidate a previously unknown function. Suddenly, certain steps within a complex biological pathway never before understood become clear. Researchers may then build on this information to gain an insight into what goes awry in the pathway in a disease state. Long-term goals would include the development of novel diagnosis and treatment strategies.
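As a rough illustration of what "linking" separate resources can mean in practice, here is a minimal sketch that joins records from three stand-in databases on a shared gene identifier. Everything in it is hypothetical: the dictionaries, the identifiers such as `GENE_A` and `SEQ_0001`, and the `integrated_view` function are placeholders, not real NCBI records or APIs.

```python
# All records below are hypothetical placeholders, not real database entries.
sequence_records = {
    "GENE_A": {"accession": "SEQ_0001", "chromosome": "2"},
}
disorder_records = {
    "GENE_A": {"disorder": "example hereditary disorder"},
}
literature_records = {
    "GENE_A": ["CITATION_1", "CITATION_2"],
    "GENE_B": ["CITATION_3"],
}


def integrated_view(gene, sequences, disorders, literature):
    """Collect what each stand-in database knows about one gene."""
    return {
        "gene": gene,
        "sequence": sequences.get(gene),        # None if no sequence record
        "disorder": disorders.get(gene),        # None if no disorder record
        "citations": literature.get(gene, []),  # empty list if no citations
    }


view = integrated_view("GENE_A", sequence_records, disorder_records, literature_records)
print(view["sequence"]["chromosome"], view["disorder"]["disorder"], len(view["citations"]))
```

The point of the sketch is only that a shared key lets a single query pull together sequence, phenotype, and literature information that would otherwise have to be looked up separately.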

 
 

Discovery of Disease Genes

Although the example above is greatly simplified, the integrated resources developed and disseminated by NCBI and the Human Genome Project have led to many scientific advances. The discovery of the genes for hereditary nonpolyposis colorectal cancer (HNPCC) is one such advance.

HNPCC is believed to account for one in six colon cancer cases. Most forms of cancer appear to be non-inherited, but in some forms an individual has a hereditary risk attributable to a single altered gene. Scientists had known for years that an altered gene was to blame for HNPCC, but they had few clues as to where the gene might reside, and finding it proved difficult. Finally, using the various tools emerging from the Human Genome Project, an international research team tracked the gene to a region of chromosome 2. Several months later, two research teams zeroed in on the culprit. Three months after that, researchers identified a second gene, on chromosome 3, that was also associated with this form of cancer.

It is now known that, together, mutations within these two genes account for most cases of HNPCC. Researchers used this information to develop a blood test to screen select individuals for these gene mutations. Detecting the presence of the mutated genes for HNPCC within a family allows physicians to target those family members most likely to benefit from treatment. By identifying an unaffected family member at risk for HNPCC, physicians can monitor that person more closely for signs of disease development. Family members determined to be non-carriers no longer have to undergo extensive medical examinations. Most importantly, patients who show early signs of cancer and are found to carry a gene mutation may undergo prompt medical treatment. This is truly a medical advance, because HNPCC, when diagnosed and treated early, is nearly 100 percent curable.

 
 
Revised: May 21, 2004.