Releases & Statistics
The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
Available information includes:
Announcements
CCDS Release 19 - Update for Mouse (July 30, 2015)
The NCBI, Ensembl, and Sanger (Havana) annotation of the GRCm38 reference genome (assembly GCF_000001635.23, NCBI annotation release 105, Ensembl annotation release 81) was analyzed to identify additional coding sequences (CDS) that are consistently annotated. CCDS data is available on the CCDS web site and FTP site and will become available on the collaborators' genome and/or gene browser web sites according to each browser's update cycle.
This update adds 1,003 new CCDS IDs and 148 genes to the mouse CCDS set. CCDS Release 19 includes a total of 24,834 CCDS IDs that correspond to 20,215 GeneIDs. See the Releases & Statistics report for details.
CCDS Release 18 - Update for Human (May 12, 2015)
The NCBI, Ensembl, and Sanger (Havana) annotation of the GRCh38 reference genome (assembly GCF_000001405.28, NCBI annotation release 107, Ensembl annotation release 79) was analyzed to identify additional coding sequences (CDS) that are consistently annotated. CCDS data is available on the CCDS web site and FTP site and will become available on the collaborators' genome and/or gene browser web sites according to each browser's update cycle.
This update adds 808 new CCDS IDs and 86 genes to the human CCDS set. CCDS Release 18 includes a total of 31,371 CCDS IDs that correspond to 18,826 GeneIDs. See the Releases & Statistics report for details.
See Past Announcements
Overview
Gene annotation is provided by multiple public resources that use different methods, so the resulting information is similar but not always identical. The human and mouse genome sequences are now sufficiently stable to start identifying those gene placements that are identical, and to make those data public and supported as a core set by the three major public genome browsers. The long-term goal is to support convergence towards a standard set of gene annotations.
Toward this end, the Consensus CDS (CCDS) project was established. The CCDS project is a collaborative effort to identify a core set of protein coding regions that are consistently annotated and of high quality.
Access and Availability
Initial results from the Consensus CDS project are now available from the participants' genome browser web sites. In addition, CCDS identifiers are indicated on the relevant NCBI RefSeq and Entrez Gene records and in Map Viewer displays of RNA (RefSeq) and Gene annotations on the reference assembly. CCDS reports can be accessed by following the provided links, or by directly querying the underlying database using the query interface provided at the top of this page.
The CCDS dataset is also available for anonymous FTP.
Collaborators
The CCDS set is built by consensus among the collaborating members.
We envision that the CCDS set will become more complete as the independent curation groups agree on cases where they initially differ, as additional experimental validation of weakly supported genes occurs, and as automatic annotation methods continue to improve. Communication among the CCDS collaborating groups is an ongoing activity that will resolve differences and identify refinements between CCDS update cycles.
CCDS Identifiers and Tracking
Annotated genes that are included in the CCDS set are associated with a unique identifier number and a version number (e.g., CCDS1.1, CCDS234.1). The version number is updated if the CDS structure changes, or if the underlying genome sequence changes at that location. As the genome browsers go through annotation-based and sequence-based update cycles, the CCDS set will be mapped forward with its identifiers maintained. All changes to existing CCDS genes are made by agreement among the collaborators; no single group will change the set unilaterally.
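The identifier scheme described above can be illustrated with a minimal sketch. It assumes only the textual form shown in the examples (the prefix CCDS, an identifier number, a dot, and a version number); the names `CcdsId`, `parse_ccds_id`, and `same_record` are hypothetical helpers introduced for illustration and are not part of any CCDS tool.

```python
import re
from typing import NamedTuple


class CcdsId(NamedTuple):
    """A CCDS identifier split into its stable number and its version."""
    number: int
    version: int


# Assumed textual form, based on the examples CCDS1.1 and CCDS234.1.
_CCDS_RE = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(text: str) -> CcdsId:
    """Parse a versioned identifier such as 'CCDS234.1'."""
    match = _CCDS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned CCDS identifier: {text!r}")
    return CcdsId(int(match.group(1)), int(match.group(2)))


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share the identifier number,
    regardless of version."""
    return parse_ccds_id(a).number == parse_ccds_id(b).number
```

Separating the two parts lets a client track a record across updates by its number alone, while a change in the version signals that the CDS structure or the underlying sequence changed at that location.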
Process Flow and Quality Testing
The CCDS set is calculated following coordinated whole genome annotation updates carried out by the NCBI, WTSI, and Ensembl. Annotation updates represent genes that are defined by a mixture of manual curation and automated computational processing.
The main curation groups are the Havana team at the WTSI and the RefSeq annotation group at NCBI. In addition, the manually curated information on chromosome 14 (Genoscope) and chromosome 7 (WUSTL) has been brought in via the Vega resource. Automated annotation comes from the Ensembl group and the NCBI genome annotation computational pipeline. Curated information is favored over automated information, and the information must both be consistent between the Hinxton (Vega/Ensembl) and NCBI groups and pass stringent QC checks.
The general process flow for defining the CCDS gene set includes:
- compare genome annotation results
- identify annotated coding regions that have identical location coordinates on the genome
- evaluate quality
- remove lower-quality CDSs from the core set pending additional review among the collaboration groups
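The steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual pipeline: each annotated CDS is represented as a hypothetical `(chromosome, strand, exon_intervals)` tuple, "identical location coordinates" is taken to mean that all three fields match exactly, and `passes_qc` stands in for whatever quality tests are applied.

```python
from typing import Callable, Iterable, List, Set, Tuple

Exon = Tuple[int, int]                       # (start, end) on the genome
Cds = Tuple[str, str, Iterable[Exon]]        # (chromosome, strand, exons)
CdsKey = Tuple[str, str, Tuple[Exon, ...]]   # hashable, order-normalized form


def cds_key(chrom: str, strand: str, exons: Iterable[Exon]) -> CdsKey:
    """Normalize a CDS so that exon order does not affect comparison."""
    return (chrom, strand, tuple(sorted(exons)))


def identical_cds(set_a: Iterable[Cds], set_b: Iterable[Cds]) -> Set[CdsKey]:
    """Compare two annotation sets and keep CDSs with identical coordinates."""
    keys_a = {cds_key(*cds) for cds in set_a}
    keys_b = {cds_key(*cds) for cds in set_b}
    return keys_a & keys_b


def build_core_set(
    set_a: Iterable[Cds],
    set_b: Iterable[Cds],
    passes_qc: Callable[[CdsKey], bool],
) -> Tuple[List[CdsKey], List[CdsKey]]:
    """Return (accepted, held_for_review) among the matching CDSs."""
    matched = identical_cds(set_a, set_b)
    accepted = sorted(k for k in matched if passes_qc(k))
    held = sorted(k for k in matched if not passes_qc(k))
    return accepted, held


# Toy data: the first CDS matches in both sets, the second differs by one end.
group_a = [("chr1", "+", [(100, 200), (300, 400)]), ("chr2", "-", [(50, 80)])]
group_b = [("chr1", "+", [(300, 400), (100, 200)]), ("chr2", "-", [(50, 90)])]
accepted, held = build_core_set(group_a, group_b, passes_qc=lambda key: True)
```

CDSs that fail the quality step are returned separately rather than discarded, mirroring the idea that they are held out of the core set pending further review.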
The CCDS set includes coding regions that are annotated as full-length (with an initiating ATG and a valid stop codon), can be translated from the genome without frameshifts, and use consensus splice sites. The number and type of quality tests performed may be expanded in the future; they currently include consistency in cross-species comparative analysis and analyses of putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology.
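The sequence-level criteria in the first sentence above can be sketched as simple checks. This is a hedged illustration, not the project's QC code: it assumes the standard genetic code stop codons (TAA, TAG, TGA), treats "no frameshifts" as a length divisible by three with no in-frame internal stop, and takes "consensus splice sites" to mean the canonical GT...AG intron boundaries. The function names are hypothetical.

```python
from typing import List

# Assumption: standard genetic code; other codes use different stops.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> List[str]:
    """Return a list of problems found in a spliced CDS sequence
    (an empty list means every check passed)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of three")
    if not seq.startswith("ATG"):
        problems.append("missing initiating ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    # Scan in-frame codons from the start, excluding the final three bases.
    internal = (seq[i:i + 3] for i in range(0, len(seq) - 3, 3))
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("internal stop codon")
    return problems


def has_canonical_splice_sites(intron: str) -> bool:
    """True when an intron starts with GT and ends with AG
    (assumed meaning of 'consensus splice sites' for this sketch)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
```

For example, `check_cds("ATGAAATAA")` reports no problems, while a sequence that begins with CTG is flagged as missing the initiating ATG. The remaining tests listed above (comparative analysis, pseudogene detection, transcript support, protein homology) need external data and are outside the scope of this sketch.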
Publications
Please use the following citations for CCDS:
The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.
Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D.
Genome Res. 2009 Jul;19(7):1316-23.
PMID: 19498102
Tracking and coordinating an international curation effort for the CCDS Project.
Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, Diekhans M, Harrow J, Pruitt KD.
Database 2012 Mar 20;2012:bas008. doi: 10.1093/database/bas008.
PMID: 22434842
Current status and new features of the Consensus Coding Sequence database.
Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD.
Nucleic Acids Res. 2014 Jan 1;42(1):D865-72. doi: 10.1093/nar/gkt1059.
PMID: 24217909