National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, US Department of Health and Human Services
Spring 2003 issue of NCBI News

In this issue

The Reference Human Genome

SARS Coronavirus Resource

Gene Expression Omnibus (GEO)

Major Histocompatibility Complex database (dbMHC)

RefSeq Release 1 Ready for Download

GenBank Release 137

New Microbial Genomes in GenBank

Sequence Revision History Page Offers New Comparison Function





The Reference Human Genome at NCBI

The Human Genome Project, a 13-year international collaborative effort, reached a major milestone in April 2003 with the release of the first reference sequence for the human genome. This finished sequence follows the working draft, which was completed in 2001 and described in the February 15, 2001 issue of Nature.[1] Over the course of the sequencing effort, sequencing centers from around the world deposited billions of letters of human DNA sequence into GenBank® and its collaborating databases, DDBJ and EMBL, where the data were immediately made available to researchers.

Between the appearance of the draft sequence and the release of the finished reference sequence, NCBI maintained interim assemblies of the data to ensure access by the research community to the most complete genome draft. Now, the finished reference human genome sequence, along with the results of NCBI analysis and annotations, is available for viewing and downloading.

The Genome and its Genes

The reference human genome, a small portion of which is shown in Figure 1, consists of 24 finished chromosomes of 2.9 billion bases and covers about 99 percent of the gene-containing DNA. The sequence is accurate, on average, to the level of one error per 10,000 bases. Small updates to the assembly will continue as complex regions are further refined and the small number of remaining gaps between the large stretches of contiguous sequence, or “contigs”, are closed.
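As a rough illustration of what the quoted accuracy means in absolute terms, the two figures above can be combined in a short calculation. This is only a back-of-the-envelope sketch that takes the stated averages at face value; the function name is hypothetical and the result is not a figure reported in this article.

```python
# Figures quoted in the paragraph above (both are approximate averages).
GENOME_BASES = 2_900_000_000   # about 2.9 billion bases
BASES_PER_ERROR = 10_000       # about one error per 10,000 bases


def expected_errors(total_bases, bases_per_error):
    """Return the rough number of erroneous bases implied by an average error rate."""
    return total_bases / bases_per_error


if __name__ == "__main__":
    # Prints the implied order of magnitude of erroneous bases genome-wide.
    print(f"{expected_errors(GENOME_BASES, BASES_PER_ERROR):,.0f}")
```

Because the error rate is an average, the actual distribution of errors along the chromosomes is not uniform, and this number should be read only as an order of magnitude.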

NCBI identifies known genes in the genome by aligning Reference Sequence (see RefSeq below) and GenBank mRNAs to the assembled genomic sequence using MegaBLAST. NCBI also predicts genes computationally, but includes predicted genes in the annotation only if they do not overlap a gene model based on an mRNA alignment. About 25,000 genes have been annotated on the genome using these two methods. Sequence variations are mapped to the reference genome via BLAST®, using the data in the Database of Single Nucleotide Polymorphisms (dbSNP). For more information on NCBI’s human genome assembly or annotation, see the Web pages referenced in the box entitled “Human Genome Build Information”.
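The rule for combining the two kinds of gene models, in which a predicted gene is kept only when it does not overlap an mRNA-based model, can be sketched as a simple interval filter. This is a minimal illustration and not NCBI's annotation pipeline: the `GeneModel` class, the function names, and the coordinates are all hypothetical, and the sketch assumes that "overlap" means any shared base on the same chromosome, ignoring strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical gene model: a named interval on a chromosome (1-based, inclusive)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True if the two models share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def merge_annotations(mrna_models, predicted_models):
    """Keep every mRNA-based model, plus predictions that overlap none of them."""
    kept = [
        p for p in predicted_models
        if not any(overlaps(p, m) for m in mrna_models)
    ]
    return list(mrna_models) + kept


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    mrna = [GeneModel("geneA", "chr1", 100, 500)]
    predicted = [
        GeneModel("pred1", "chr1", 450, 800),   # overlaps geneA, so it is dropped
        GeneModel("pred2", "chr1", 900, 1200),  # no overlap, so it is kept
    ]
    for model in merge_annotations(mrna, predicted):
        print(model.name)
```

The pairwise comparison is quadratic in the number of models; a genome-scale version would more likely sort the intervals or use an interval index, but the filtering rule itself is the same.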

Figure 1: Map Viewer display for the human BRCA1 gene showing, from the right, the NCBI gene model, 13 transcript variants, a GenomeScan predicted gene model, and UniGene cluster sequences that map to the region.


Exploring the Genome

The Human Genome Resource page, found under ‘Hot Spots’ on the NCBI Home Page, provides an entry into the NCBI resources, databases, and tools related to the human reference genome. Three primary resources accessible from this page, as well as from the NCBI Home Page, are RefSeq, LocusLink, and the Map Viewer.

RefSeq and LocusLink

The human portion of the RefSeq database (for more information, see “RefSeq Release 1 is Ready for Download”, this issue) includes the transcript and associated protein sequences derived from GenBank submissions, the gene models derived from the genome by prediction, and the contig and chromosomal records for the reference genome itself. RefSeqs are recognized by accession numbers beginning with two letters, indicating the type of sequence, and an underscore. Transcript and protein RefSeqs with the prefixes “NM_” and “NP_”, respectively, are derived from GenBank submissions and therefore are considered to be experimentally supported to some degree. Predicted transcripts and their protein translation products bear, respectively, the prefixes “XM_” and “XP_”. Genomic contigs begin with “NT_” while reference records for the 24 human chromosomes comprise the series “NC_000001-NC_000024”. The RefSeq contigs, transcripts, and proteins are also retrievable with standard Entrez queries by accession number, gene symbol, or protein name and can be restricted to the RefSeq entries using ‘Entrez Limits’.
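The accession prefixes described above lend themselves to a small lookup. The following sketch classifies a RefSeq accession by its prefix and lists the chromosome record series; the function names and the example accessions are hypothetical, and the optional ".version" suffix accepted by the pattern is an assumption that the text above does not discuss.

```python
import re

# Prefix meanings as described in the paragraph above.
REFSEQ_PREFIXES = {
    "NM_": "transcript derived from GenBank submissions",
    "NP_": "protein derived from GenBank submissions",
    "XM_": "predicted transcript",
    "XP_": "predicted protein",
    "NT_": "genomic contig",
    "NC_": "chromosome reference record",
}

# Two letters, an underscore, digits, and (as an assumption) an optional version suffix.
_ACCESSION = re.compile(r"^([A-Z]{2}_)(\d+)(?:\.\d+)?$")


def classify_refseq(accession):
    """Return a description of the RefSeq type, or None if the prefix is not recognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


def human_chromosome_accessions():
    """Return the series NC_000001 through NC_000024 named in the text."""
    return [f"NC_{n:06d}" for n in range(1, 25)]


if __name__ == "__main__":
    # Illustrative accessions only; they are not taken from this article.
    for acc in ["NM_000001", "XP_123456.1", "NT_000002", "AB123456"]:
        print(acc, "->", classify_refseq(acc))
```

Accessions that lack the two-letter-plus-underscore form, such as ordinary GenBank accessions, return `None` rather than raising an error, which makes the function easy to use as a filter over mixed lists.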

LocusLink offers a single query interface to gene loci for many organisms, and includes all human genes defined by the genome annotation process. LocusLink reports display descriptive information and links to related NCBI resources such as RefSeq, NCBI's Map Viewer, Evidence Viewer, Model Maker, BLAST Link, UniGene, protein domains from NCBI's Conserved Domain Database, and the HomoloGene database. Follow the links under ‘Hot Spots’ on the NCBI Home Page to reach the LocusLink and RefSeq pages.

The Map Viewer

The NCBI Map Viewer, available under ‘Hot Spots’ on the NCBI Home Page and via the Entrez Links menu for nucleotide and protein records shown in the Map Viewer, generates graphical views, such as that shown in Figure 1, of aligned chromosomal maps for human and other organisms. A flexible query interface that supports gene names or symbols, marker names, SNP identifiers, accession numbers and other identifiers makes it easy to navigate to a gene or region of interest. The Map Viewer for the human reference genome displays cytogenetic maps, physical maps, maps showing predicted gene models, EST alignments with links to UniGene clusters from human and related organisms, and mRNA alignments used to construct gene models. A tabular view of the data allows convenient export of the information shown in the graphical display. Map Viewer displays are linked to supporting resources such as LocusLink, the Evidence Viewer, and Model Maker; the latter two tools are described in the shaded box entitled “Human Genome Tools”. Segments of the genomic assembly shown in the graphical view may be downloaded using the Map Viewer's “Download/View Sequence” link. A Map Viewer help document is available via the “Human Maps Help” link on the Map Viewer page. See also chapters in the NCBI Handbook, available by clicking “NCBI Handbook” under ‘Hot Spots’ on the NCBI Home Page.—DW

Human Genome Tools

Model Maker (MM)—allows the construction of transcript models using the pre-computed alignments of NCBI RefSeqs, GenBank transcripts, ESTs, and predicted transcripts to the NCBI human genome assembly.

The Evidence Viewer (EV)—displays the alignments of RefSeq transcripts, GenBank mRNAs or predicted transcripts, and ESTs that support an NCBI gene model. The EV produces a graphical summary along with a detailed exon-by-exon view and shows any proteins annotated on the transcript.

Human Genome BLAST—The human genome BLAST page offers MegaBLAST for rapid searches of the reference genome. Standard variants of BLAST are also available to search the RefSeq transcripts and proteins arising from NCBI annotations. Human Genome BLAST hits are displayed in the Map Viewer to show their genomic context.

Human Genome Build Information

(see "Available Documentation" links at the bottom of the Web page)


[1] Nature. 2001 Feb 15;409(6822):745-964.
