NCBI News









New Genome Builds and Annotations at NCBI

One Build, Multiple Rounds of Annotation

As reference genome assemblies such as those for human, mouse, and rat stabilize, a single build is expected to pass through multiple cycles of annotation. For this reason, a “build X version Y” identification system is now being used for many genomes shown in the Map Viewer. An identifier such as “build 34 version 1” indicates the first version of annotation for genome build 34.
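The identifier scheme lends itself to simple parsing and comparison. The following is a minimal sketch, assuming identifiers are written exactly as “build X version Y” with integer X and Y; the `BuildId` type and `parse_build_id` helper are hypothetical names used for illustration and are not NCBI tools.

```python
import re
from typing import NamedTuple


class BuildId(NamedTuple):
    build: int    # genome assembly build number
    version: int  # annotation round applied to that build


# Assumed format: the words "build" and "version", each followed by an integer.
_PATTERN = re.compile(r"build\s+(\d+)\s+version\s+(\d+)", re.IGNORECASE)


def parse_build_id(text: str) -> BuildId:
    """Parse an identifier such as 'build 34 version 1'."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a build/version identifier: {text!r}")
    return BuildId(int(match.group(1)), int(match.group(2)))


# Tuples compare field by field, so the newest annotation of a build is the max.
ids = [parse_build_id(s) for s in ("build 34 version 3", "build 34 version 1")]
latest = max(ids)
```

Because the build number is compared before the version, sorting a mixed list groups annotation rounds under their assembly build.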

Gene Annotation

NCBI has switched its standard method of predicting gene models from GenomeScan to Gnomon, a program developed by NCBI scientists. Gnomon differs from GenomeScan in placing greater emphasis on coding propensity and on matches to existing proteins when predicting genes. Gnomon also checks more rigorously for reading-frame shifts within transcript models, which are often indicative of pseudogenes. As a result, the number of genes appearing in the most recent NCBI annotations of the human, mouse, and rat genomes has decreased significantly, while the number of models identified as pseudogenes has increased. About 20% of the gene models appearing in human genome build 34 version 1 were produced using Gnomon; the remaining 80% were derived from NCBI RefSeq transcript alignments. For more on the Gnomon algorithm, see the shaded box entitled “Gnomon”.
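To illustrate what a shift in reading frame means, here is a simplified sketch that is not Gnomon's actual procedure. It assumes the insertions and deletions in a coding alignment are given as signed lengths in order (positive for insertions, negative for deletions) and tracks the net frame offset after each one. The function names and the input format are assumptions made for this example.

```python
def frame_offsets(indels):
    """Return the net reading-frame offset (0, 1, or 2) after each indel.

    indels: signed lengths in alignment order (+insertion, -deletion).
    An offset of 0 means the downstream sequence is back in frame.
    """
    offsets = []
    net = 0
    for length in indels:
        net = (net + length) % 3
        offsets.append(net)
    return offsets


def ends_out_of_frame(indels):
    """True if the indels taken together leave the reading frame shifted."""
    return sum(indels) % 3 != 0
```

An indel whose length is a multiple of three leaves the frame intact, and a later indel can restore a frame that an earlier one disrupted, which is why the offset is tracked cumulatively.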

Human Genome Build 34 Version 3

The NCBI Map Viewer now shows build 34 version 3 of the human genome reference sequence, which is based on data available as of July 2003 and includes the pseudoautosomal region of the Y chromosome. Supplementing the reference sequence are a separate assembly of chromosome 7 submitted by The Center for Applied Genomics (TCAG) and the reference sequence for the DR51 haplotype in the Major Histocompatibility Complex region. Transcripts annotated on the TCAG assembly by the TCAG group are shown on a separate track in the Map Viewer. Other new tracks for the human Map Viewer show the alignment to the genome of all human, mouse, rat, pig, and cow ESTs, along with mRNAs. A new ab initio track, which displays Gnomon gene predictions, replaces the GenomeScan map.

Rat Genome Build 2 Version 1

NCBI build 2 version 1 of the rat genome reference sequence uses the Rat Genome Sequencing Consortium version 3.1 assembly, which covers 2.8 billion bases of the genome in Whole Genome Shotgun (WGS) contigs. Shown alongside this assembly in the Map Viewer is the NCBI “NT” assembly, which covers 25 million bases of sequence and uses contigs assembled by NCBI from finished BAC sequences in GenBank. New in the Map Viewer for this build are tracks showing the alignment of all human, mouse, and rat ESTs to the rat genome, as well as the ab initio track, which replaces the GenomeScan map for the display of gene models.

Mouse Build 32 Version 1, Anopheles gambiae Build 2 Version 1

Mouse build 32 version 1, based on data available as of September 2003, includes 24,819 mapped genes. Shown with the build 32 version 1 reference assembly in the Map Viewer is the Celera assembly of chromosome 16 with NCBI annotations. Build 2 version 1 of the Anopheles gambiae genome, based on data available as of July 2003, is also available for browsing in the Map Viewer, with over 12,000 annotated genes.

Magnaporthe grisea, Bos taurus, Sus scrofa, and Canis familiaris Are New in Map Viewer

NCBI has recently created Map Viewer displays for four more organisms. The display for Magnaporthe grisea, a pathogen of rice that is a close relative of Neurospora crassa, includes contig, gene, and transcript tracks. For cow [1] and pig [2], the Map Viewer displays Meat Animal Research Center (MARC) linkage maps, while for the dog the Map Viewer shows the Canine 1Mb Radiation Hybrid map (RHDF5000) [3]. New Genome Guide pages, created by NCBI in cooperation with the genomic research communities to provide links to an array of genome-specific resources, are available for cow, pig, and dog. These pages can be seen at:

1  Keele, J. A second-generation linkage map of the bovine genome. Genome Res. 1997 7(3): 235-49. PMID: 9074927
2  Rohrer GA, et al. A comprehensive map of the porcine genome. Genome Res. 1996 6(5): 371-91. PMID: 8743988
3  Guyon R, et al. A 1-Mb resolution radiation hybrid map of the canine genome. PNAS USA. 2003 100(9): 5296-301. PMID: 12700351

 
Gnomon

To create a gene model, Gnomon finds the best self-consistent set of transcript and protein alignments to a genomic region and uses these alignments as constraints for a Hidden Markov Model (HMM)-based gene prediction. Several steps are involved.

Gnomon evaluates the statistical properties of the transcripts aligned to a genome in order to determine their most probable coding regions. For each gene model, the set of non-overlapping transcript alignments with the best coding propensity is chosen, after which the best matching proteins for the transcript sequences are aligned to the genomic DNA sequence. For HMM-based gene models without supporting transcript evidence, the proteins that are the best matches to the translated genomic sequence are aligned. Gnomon checks that every exon of the resulting predicted gene is in a reading frame consistent with the protein alignment. The program is nevertheless free to choose splice sites and to introduce additional exons between segments of the protein alignment.
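The step of choosing a best-scoring set of non-overlapping alignments can be pictured with a standard technique, weighted interval scheduling by dynamic programming. The sketch below is an illustration only and is not Gnomon's implementation; the `Alignment` type, the `score` field (standing in for coding propensity), and the half-open coordinate convention are all assumptions.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Alignment(NamedTuple):
    start: int    # genomic start, inclusive (assumed convention)
    end: int      # genomic end, exclusive (assumed convention)
    score: float  # hypothetical stand-in for coding propensity


def best_non_overlapping(alignments: List[Alignment]) -> List[Alignment]:
    """Pick the non-overlapping subset with the highest total score."""
    items = sorted(alignments, key=lambda a: a.end)
    ends = [a.end for a in items]
    best = [0.0] * (len(items) + 1)  # best[i]: top score using the first i items
    take = [False] * len(items)      # whether item i is in the optimum for i + 1
    prev = [0] * len(items)          # how many earlier items are compatible with i
    for i, a in enumerate(items):
        # Count earlier alignments that end at or before this one starts.
        p = bisect_right(ends, a.start, 0, i)
        prev[i] = p
        with_a = best[p] + a.score
        if with_a > best[i]:
            best[i + 1] = with_a
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Walk back through the table to recover the chosen alignments.
    chosen = []
    i = len(items)
    while i > 0:
        if take[i - 1]:
            chosen.append(items[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


example = [
    Alignment(0, 100, 5.0),
    Alignment(50, 150, 8.0),
    Alignment(100, 200, 4.0),
]
selected = best_non_overlapping(example)
```

In this example the single highest-scoring alignment overlaps both of the others, so the two flanking alignments are selected together because their combined score is higher.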


— DW



NCBI News | Fall/Winter 2002 NCBI News: Spring 2003