NCBI News









Towards a Uniform Human Genome Annotation: the Consensus CDS Database

Annotations of genes on the human genome are displayed within several public resources. These annotations are made using different methods, resulting in gene coordinates and sequences that are similar but not always identical. The human genome sequence is now sufficiently stable to begin compiling a standard set of gene annotations by identifying those gene placements that are identical across resources. The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human protein-coding regions that are consistently annotated and of high quality.

The CCDS set is built by consensus among the collaborating members, which include the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), and the University of California, Santa Cruz (UCSC).

Annotated genes that are included in the CCDS set are given a unique identifier and version number (e.g., CCDS1.1, CCDS234.1), akin to the GenBank "accession.version" system. If the CDS structure changes, or if the underlying genome sequence changes, the version number is incremented. As the annotation and the sequence-based genome browsers go through their update cycles, the CCDS set will be mapped forward and its identifiers maintained. All changes to existing CCDS genes are made by agreement among the collaborators.
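The identifier scheme described above can be illustrated with a short Python sketch. It assumes only the "CCDS<number>.<version>" form shown in the examples (CCDS1.1, CCDS234.1); the names CcdsId, parse_ccds_id, and bump_version are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class CcdsId(NamedTuple):
    """A CCDS identifier split into its stable number and its version."""
    number: int
    version: int

    def __str__(self) -> str:
        return f"CCDS{self.number}.{self.version}"


# Assumed pattern, based only on the examples CCDS1.1 and CCDS234.1.
_CCDS_PATTERN = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(text: str) -> CcdsId:
    """Parse a string such as 'CCDS234.1' into its two parts."""
    match = _CCDS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a CCDS identifier: {text!r}")
    return CcdsId(int(match.group(1)), int(match.group(2)))


def bump_version(ccds_id: CcdsId) -> CcdsId:
    """Return the identifier with the same number and the next version,
    as would follow a change to the CDS structure or genome sequence."""
    return ccds_id._replace(version=ccds_id.version + 1)
```

For instance, str(bump_version(parse_ccds_id("CCDS234.1"))) gives "CCDS234.2": the number stays fixed across updates and only the version changes.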

The CCDS set is calculated on the basis of coordinated whole-genome annotation updates carried out by the NCBI and Ensembl. To be included in the CCDS set, coding regions must be annotated as full-length, with an initiating ATG and a valid stop codon; must be translated from the genome without frameshifts; and must use consensus splice sites.
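The inclusion criteria above lend themselves to a simple check. The following Python sketch is illustrative only and is not the project's actual pipeline. It assumes that exon and intron sequences are given on the sense strand, that "consensus splice sites" means the canonical GT...AG intron boundaries (the real criteria may be broader), and that a CDS length divisible by three with no internal stop codon is an adequate stand-in for "no frameshifts". The function name passes_basic_cds_checks is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_basic_cds_checks(exons, introns):
    """Apply simplified versions of the CCDS inclusion criteria.

    exons   -- coding exon sequences in transcript order (sense strand)
    introns -- intron sequences lying between consecutive exons
    """
    # There must be exactly one intron between each pair of exons.
    if not exons or len(introns) != len(exons) - 1:
        return False

    cds = "".join(exons).upper()

    # Full-length and in frame: at least a start and a stop codon,
    # and a total length that is a multiple of three.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False

    # Initiating ATG.
    if not cds.startswith("ATG"):
        return False

    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

    # Valid stop codon at the end, and none before it.
    if codons[-1] not in STOP_CODONS:
        return False
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False

    # Assumed consensus splice sites: intron starts with GT, ends with AG.
    for intron in introns:
        seq = intron.upper()
        if len(seq) < 4 or not (seq.startswith("GT") and seq.endswith("AG")):
            return False

    return True
```

With the made-up sequences exons = ["ATGGCC", "AAATAA"] and introns = ["GTAAGTTTTCAG"], the function returns True; changing the intron to begin with "GC" makes it return False.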

Annotations are made via a mixture of manual curation and automated computational processing. Genome annotations resulting from the NCBI and Ensembl pipelines are first compared to identify annotated coding regions that have identical locations on the genome. Then, lower-quality CDSs are removed from this core set pending additional review by the collaborating groups. Quality tests include analyses of putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology.
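The first step, finding coding regions that two annotation sets place identically, amounts to a set intersection on coordinates. The Python sketch below shows the idea under stated assumptions: each CDS is reduced to a chromosome, a strand, and an ordered tuple of coding exon (start, end) pairs, and two placements count as identical only if all of these match exactly. The names CdsPlacement and consensus_placements are hypothetical, and the later quality filtering is not modeled.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class CdsPlacement(NamedTuple):
    """Genomic placement of one annotated coding region."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # coding exon (start, end) pairs


def consensus_placements(
    set_a: Iterable[CdsPlacement],
    set_b: Iterable[CdsPlacement],
) -> Set[CdsPlacement]:
    """Return the placements present, with identical coordinates,
    in both annotation sets."""
    return set(set_a) & set(set_b)


# Made-up coordinates, for illustration only.
pipeline_a = [
    CdsPlacement("chr1", "+", ((100, 200), (300, 400))),
    CdsPlacement("chr1", "-", ((500, 600),)),
]
pipeline_b = [
    CdsPlacement("chr1", "+", ((100, 200), (300, 400))),
    CdsPlacement("chr1", "-", ((500, 601),)),  # differs by one base
]

shared = consensus_placements(pipeline_a, pipeline_b)
```

Here shared contains only the two-exon placement on the plus strand; the second pair differs by a single base at one exon boundary and so does not qualify as a consensus placement.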

As of March 2005, the initial CCDS dataset contains 14,795 coding sequences from 13,142 genes, representing more than half of the human genes according to the current gene count.

Visit the CCDS Project Web site at:

