The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
Available information includes:
Announcements
Identifying Inclusion in MANE November 9, 2022
Sequence identifiers that are included in a MANE set are now identified in individual CCDS report pages for human records. MANE is the Matched Annotation from NCBI and EMBL-EBI project. There is a link to information about MANE on the left hand side of the CCDS page, under the Related Resources heading.
CCDS Release 24 - Update for Human October 26, 2022
The NCBI and Ensembl/Havana annotation of the GRCh38.p14 reference genome (assembly GCF_000001405.40, NCBI annotation release 110, Ensembl annotation release 108) was analyzed to identify additional coding sequences (CDSs) that are consistently annotated. CCDS data is available on the CCDS web site and FTP site and will become available on the collaborators' genome and/or gene browser web sites according to each browser's update cycle.
This update adds 2,746 new CCDS IDs and adds 237 genes to the human CCDS set. CCDS Release 24 includes a total of 35,608 CCDS IDs that correspond to 19,107 GeneIDs, with 48,062 protein sequences from NCBI and 47,762 from Ensembl. See the Releases & Statistics report for details.
See Past Announcements
Overview
Annotation of genes is provided by multiple public resources using different methods, resulting in information that is similar but not always identical. The human and mouse genome sequences are now sufficiently stable to start identifying those gene placements that are identical, and to make those data public and supported as a core set by the three major public genome browsers. The long term goal is to support convergence towards a standard set of gene annotations.
Toward this end, the Consensus CDS (CCDS) project was established. The CCDS project is a collaborative effort to identify a core set of protein coding regions that are consistently annotated and of high quality.
Access and Availability
Initial results from the Consensus CDS project are now available from the participants' genome browser Web sites. In addition, CCDS identifiers are indicated on the relevant NCBI RefSeq and Entrez Gene records. CCDS reports can be accessed by following provided links, or by directly querying the underlying database using the query interface provided at the top of this page.
The CCDS dataset is also available for anonymous FTP.
Collaborators
The CCDS set is built by consensus among the collaborating members, which include:
We envision that the CCDS set will become more complete as the independent curation groups agree on cases where they initially differ, as additional experimental validation of weakly supported genes occurs, and as automatic annotation methods continue to improve. Communication among the CCDS collaborating groups is an ongoing activity that will resolve differences and identify refinements between CCDS update cycles.
CCDS Identifiers and Tracking
Annotated genes that are included in the CCDS set are associated with a unique identifier number and version number (e.g., CCDS1.1, CCDS234.1). The version number will update if the CDS structure changes, or if the underlying genome sequence changes at that location. With annotation and sequence-based genome browser update cycles, the CCDS set will be mapped forward, maintaining identifiers. All changes to existing CCDS genes are made by collaboration agreement; no single group will change the set unilaterally.
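As an illustration of the identifier scheme described above, the following minimal sketch splits a versioned identifier such as CCDS234.1 into its identifier number and version number. The names `CcdsId`, `parse_ccds_id`, and `same_record` are hypothetical helpers introduced here for illustration; they are not part of any CCDS tool or API, and the pattern assumes the "CCDS<number>.<version>" form shown in the examples.

```python
import re
from typing import NamedTuple


class CcdsId(NamedTuple):
    """Hypothetical container for a parsed CCDS identifier."""
    number: int   # identifier number, kept as the set is mapped forward
    version: int  # version number, updated when the CDS or sequence changes


# Assumed pattern, based on the examples CCDS1.1 and CCDS234.1.
_CCDS_RE = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(text: str) -> CcdsId:
    """Split a versioned identifier into its number and version."""
    match = _CCDS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned CCDS identifier: {text!r}")
    return CcdsId(int(match.group(1)), int(match.group(2)))


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an identifier number, ignoring version."""
    return parse_ccds_id(a).number == parse_ccds_id(b).number
```

Comparing on the identifier number alone, as `same_record` does, is one way to follow a record across update cycles in which only the version number changes.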
Process Flow and Quality Testing
The CCDS set is calculated following coordinated whole genome annotation updates carried out by NCBI and Ensembl. Annotation updates represent genes that are defined by a mixture of manual curation and automated computational processing.
The main curation groups are the Havana team at EMBL-EBI and the RefSeq annotation group at NCBI. Automated annotation comes from the Ensembl group and the NCBI genome annotation computational pipeline. Curated information is favored over automated information, and the information must be consistent between the EMBL-EBI and NCBI groups and pass stringent quality control (QC) checks.
The general process flow for defining the CCDS gene set includes:
- compare genome annotation results
- identify annotated coding regions that have identical location coordinates on the genome
- quality evaluation
- remove lower quality CDSs from the core set pending additional review among the collaboration groups.
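The comparison step in the list above can be sketched roughly as follows. This is only an illustration of matching coding regions by identical genome coordinates, not the project's actual pipeline; the function names, the dictionary inputs, and the representation of a CDS as chromosome, strand, and exon intervals are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed representation of a CDS location:
# (chromosome, strand, ordered (start, end) intervals of the coding exons)
CdsKey = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def cds_key(chrom: str, strand: str, exons: Iterable[Tuple[int, int]]) -> CdsKey:
    """Build a hashable, order-independent key for a CDS location."""
    intervals = tuple(sorted((int(start), int(end)) for start, end in exons))
    return (chrom, strand, intervals)


def consensus_candidates(
    set_a: Dict[str, CdsKey], set_b: Dict[str, CdsKey]
) -> List[Tuple[str, str]]:
    """Pair annotations from two sets whose CDS coordinates are identical."""
    by_location: Dict[CdsKey, List[str]] = {}
    for name_b, key in set_b.items():
        by_location.setdefault(key, []).append(name_b)

    pairs = []
    for name_a, key in set_a.items():
        for name_b in by_location.get(key, []):
            pairs.append((name_a, name_b))
    return pairs
```

In this sketch a pair is reported only when every coding interval matches exactly; any quality evaluation or review of the matched pairs would be a separate step.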
The CCDS set includes coding regions that are annotated as full-length (with an initiating ATG and valid stop codon), can be translated from the genome without frameshifts, and use consensus splice sites. The number and type of quality tests performed may be expanded in the future; currently they include consistency in cross-species comparative analysis; analysis to identify putative pseudogenes and retrotransposed genes; and checks of consensus splice sites, supporting transcripts, and protein homology.
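A few of the full-length criteria above can be expressed as simple sequence checks. The sketch below is a simplified illustration, not the project's QC code: `cds_problems` is a hypothetical helper, the input is assumed to be the spliced coding sequence as a DNA string, the stop codons are those of the standard genetic code, and an in-frame length test stands in for the fuller "no frameshifts" requirement. Splice-site and comparative tests are not covered.

```python
from typing import List

# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> List[str]:
    """Return a list of basic full-length problems found in a spliced CDS."""
    seq = cds.upper()
    problems = []

    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("missing initiating ATG")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop codon before the end")

    return problems
```

An empty list means the sequence passed these particular checks only; it does not imply that the sequence would qualify for the CCDS set.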
Publications
Please use the following citations for CCDS:
The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.
Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D.
Genome Res. 2009 Jul;19(7):1316-23.
PubMed: PMID: 19498102
Tracking and coordinating an international curation effort for the CCDS Project.
Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, Diekhans M, Harrow J, Pruitt KD.
Database 2012 Mar 20;2012:bas008. doi: 10.1093/database/bas008.
PubMed: PMID: 22434842
Current status and new features of the Consensus Coding Sequence database.
Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD.
Nucleic Acids Res. 2014 Jan 1;42(1):D865-72. doi: 10.1093/nar/gkt1059.
PubMed: PMID: 24217909
Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.
Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, Girón CG, Diekhans M, Barnes I, Bennett R, Berry AE, Cox E, Davidson C, Goldfarb T, Gonzalez JM, Hunt T, Jackson J, Joardar V, Kay MP, Kodali VK, Martin FJ, McAndrews M, McGarvey KM, Murphy M, Rajput B, Rangwala SH, Riddick LD, Seal RL, Suner MM, Webb D, Zhu S, Aken BL, Bruford EA, Bult CJ, Frankish A, Murphy T, Pruitt KD.
Nucleic Acids Res. 2018 Jan 4;46(D1):D221-D228. doi: 10.1093/nar/gkx1031.
PubMed: PMID: 29126148
PubMed Central: PMCID: PMC5753299