Between the appearance of the draft sequence and the release of the finished reference sequence, NCBI maintained interim assemblies of the data so that the research community always had access to the most complete draft of the genome. Now, the finished reference human genome sequence, along with the results of NCBI analysis and annotations, is available for viewing and downloading.
The Genome and its Genes
The reference human genome, a small portion of which is shown in Figure 1, consists of 24 finished chromosomes comprising 2.9 billion bases and covers about 99 percent of the gene-containing DNA. The sequence is accurate, on average, to the level of one error per 10,000 bases. Small updates to the assembly will continue as complex regions are further refined and as the few remaining gaps between the large stretches of contiguous sequence, or “contigs”, are closed.
NCBI identifies known genes in the genome by aligning Reference Sequence (see RefSeq below) and GenBank mRNAs to the assembled genomic sequence using MegaBLAST. NCBI also predicts genes computationally, but includes predicted genes in the annotation only if they do not overlap a gene model based on an mRNA alignment. About 25,000 genes have been annotated on the genome using these two methods. Sequence variations are mapped to the reference genome via BLAST®, using the data in the Database of Single Nucleotide Polymorphisms (dbSNP). For more information on NCBI’s human genome assembly or annotation, see the Web pages referenced in the box entitled “Human Genome Build Information”.
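The rule for combining the two annotation methods (keep every mRNA-based gene model, and add a predicted gene only if it overlaps none of them) can be illustrated with a minimal sketch. This is not NCBI's pipeline: the GeneModel record, the function names, the coordinates, and the choice to treat any shared base on the same chromosome as an overlap (ignoring strand) are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical record for a gene model placed on the assembly."""
    name: str
    chrom: str
    start: int  # first base, inclusive
    end: int    # last base, inclusive


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    # Assumption: two models overlap if they share at least one base
    # on the same chromosome; strand is ignored in this sketch.
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def merge_annotation(mrna_models: List[GeneModel],
                     predicted_models: List[GeneModel]) -> List[GeneModel]:
    """Keep all mRNA-based models; add predictions that overlap none of them."""
    kept = [p for p in predicted_models
            if not any(overlaps(p, m) for m in mrna_models)]
    return list(mrna_models) + kept


# Made-up coordinates, for illustration only.
mrna = [GeneModel("geneA", "chr1", 100, 500)]
predicted = [
    GeneModel("pred1", "chr1", 400, 800),  # overlaps geneA, dropped
    GeneModel("pred2", "chr1", 600, 900),  # no overlap, kept
    GeneModel("pred3", "chr2", 100, 500),  # different chromosome, kept
]
annotated_names = [g.name for g in merge_annotation(mrna, predicted)]
```

The pairwise comparison is quadratic in the number of models, which is fine for a sketch; a genome-scale implementation would more likely sort the models or use an interval index.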
Exploring the Genome
The page, found under ‘Hot Spots’ on the NCBI Home Page, provides an entry into the NCBI resources, databases, and tools related to the human reference genome. Three primary resources accessible from this page, as well as from the NCBI Home Page, are RefSeq, LocusLink, and the Map Viewer.
RefSeq and LocusLink
The human portion of the RefSeq database (for more information, see “RefSeq Release 1 is Ready for Download”, this issue) includes the transcript and associated protein sequences derived from GenBank submissions, the gene models derived from the genome by prediction, and the contig and chromosomal records for the reference genome itself. RefSeqs are recognized by accession numbers beginning with two letters, indicating the type of sequence, and an underscore. Transcript and protein RefSeqs with the prefixes “NM_” and “NP_”, respectively, are derived from GenBank submissions and therefore are considered to be experimentally supported to some degree. Predicted transcripts and their protein translation products bear, respectively, the prefixes “XM_” and “XP_”. Genomic contigs begin with “NT_” while reference records for the 24 human chromosomes comprise the series “NC_000001-NC_000024”. The RefSeq contigs, transcripts, and proteins are also retrievable with standard Entrez queries by accession number, gene symbol, or protein name and can be restricted to the RefSeq entries using ‘Entrez Limits’.
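As an illustration of the accession prefixes just described, the following sketch classifies a RefSeq accession by its two-letter prefix and lists the chromosome series NC_000001 through NC_000024. The function names and the pattern used to parse an accession (two capital letters, an underscore, digits, and an optional version suffix) are assumptions for this example; only the six prefixes named above are covered, and the sample accessions are made up.

```python
import re

# Prefix meanings as described in the text above.
REFSEQ_PREFIXES = {
    "NM": "transcript derived from GenBank submissions",
    "NP": "protein derived from GenBank submissions",
    "XM": "predicted transcript",
    "XP": "predicted protein",
    "NT": "genomic contig",
    "NC": "chromosome reference record",
}

# Assumed shape: two capital letters, underscore, digits, optional ".version".
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def classify_refseq(accession):
    """Return a description of the record type, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


def human_chromosome_accessions():
    """List the 24 chromosome accessions, NC_000001 through NC_000024."""
    return ["NC_%06d" % n for n in range(1, 25)]


chromosomes = human_chromosome_accessions()
kind = classify_refseq("XM_000001")  # made-up accession, for illustration
```

Returning None for anything outside the six listed prefixes keeps the sketch from guessing at record types the text does not describe.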
LocusLink offers a single query interface to gene loci for many organisms, and includes all human genes defined by the genome annotation process. LocusLink reports display descriptive information and links to related NCBI resources such as RefSeq, NCBI's Map Viewer, Evidence Viewer, Model Maker, BLAST Link, UniGene, protein domains from NCBI's Conserved Domain Database, and the Homologene database. Follow the links under ‘Hot Spots’ on the NCBI Home Page to reach the LocusLink and RefSeq pages.
The Map Viewer
The NCBI Map Viewer, available under ‘Hot Spots’ on the NCBI Home Page and via the Entrez Links menu for nucleotide and protein records shown in the Map Viewer, generates graphical views, such as that shown in Figure 1, of aligned chromosomal maps for human and other organisms. A flexible query interface that supports gene names or symbols, marker names, SNP identifiers, accession numbers, and other identifiers makes it easy to navigate to a gene or region of interest. The Map Viewer for the human reference genome displays cytogenetic maps, physical maps, maps showing predicted gene models, EST alignments with links to UniGene clusters from human and related organisms, and mRNA alignments used to construct gene models. A tabular view of the data allows convenient export of the information shown in the graphical display. Map Viewer displays are linked to supporting resources such as LocusLink, the Evidence Viewer, and Model Maker; the latter two tools are described in the shaded box entitled “Human Genome Tools”. Segments of the genomic assembly shown in the graphical view may be downloaded using the Map Viewer's “Download/View Sequence” link. A Map Viewer help document is available via the “Human Maps Help” link on the Map Viewer page. See also the relevant chapters in the NCBI Handbook, available via a link under ‘Hot Spots’ on the NCBI Home Page.
Nature. 2001 Feb 15;409(6822):745-964