Proteomics. 2004 Jun;4(6):1712-26.

Has the yo-yo stopped? An assessment of human protein-coding gene number.

Author information

  • Oxford GlycoSciences, Abingdon, UK.


Since the identification of approximately 25,000 proteins from the draft human genome assembly in 2001, estimates of the total have oscillated between 30,000 and 70,000. The recently announced genome closure has not generated a consensus gene count, despite this being a key parameter for many areas of biology, including drug target discovery and characterization of the human proteome. Contrary to earlier predictions of constitutive under-detection for eukaryotic genes, the latest model organism updates have produced minor increases in the worm, but fly and yeast gene numbers have decreased. The postdraft, precompletion interval has produced large increases in human transcript coverage, continuous improvements in genome assembly and refinements in automated genomic annotation. Notably, these enhancements have resulted in an Ensembl human protein-coding gene number of 22,184, a decrease of 1,862 since the first release. Longitudinal database surveys indicate that redundancy-reduced human mRNA and protein collections are flattening out at approximately 28,000, although Ensembl maps approximately 20,000 known sequences. Observations suggest that high-throughput cloning projects are predominantly extending known genes or sampling new splice forms, and that novel protein discovery has slowed to a trickle. The hypothesis that substantial numbers of short proteins remain experimentally and computationally undetected in mammalian genomes is supported neither by sequence data nor by the extensive homology between mouse and human proteins. Aggregating the independent annotations for complete transcripts from seven completed human chromosomes extrapolates to approximately 25,000 genes. The inclusion of partial putative genes would increase this to above 30,000, but recent data suggest these represent predominantly nonprotein-coding transcripts. Mass spectrometry-based proteomics has already verified more than 10% of human genes but has not identified significant numbers of unpredicted proteins. The available data are thus converging to a basal protein-coding gene number well below 30,000, which could even be as low as 25,000.

[PubMed - indexed for MEDLINE]