Statistics on previous AceView builds
Statistics on Human build 33 (June 5th, 2003)
Thanks to the growth in mRNA and EST sequences submitted to the public databases and to improvements in the genome sequence, we align 20% more mRNAs and ESTs on this new build, in 7% fewer genes. On this genome sequence, we map 3,449,800 cDNA sequences: 3,313,569 ESTs (67% of all ESTs in GenBank/dbEST, a 9% increase relative to the previous release) and 117,475 mRNAs from the public databases (93%), as well as 18,756 NCBI RefSeqs, representing 99.5% of the current RefSeq set. Although we do not mask repeats, only 1.5% of the mRNAs and 2.75% of the ESTs are ambiguous, matching the genome with indistinguishable quality in more than one place. As for the quality of the alignments, mRNAs measure 2,017 bp on average and align over 98.3% of their length with 99.7% accuracy. ESTs, which are especially prone to sequencing errors near the end of the read, measure 531 bp on average, yet align over 93% of their length with over 98% accuracy.
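As a quick consistency check on the figures above, the three categories of mapped sequences can be summed and compared with the quoted total. This is only an arithmetic sketch; the variable names are illustrative and not part of AceView.

```python
# Counts quoted in the build 33 paragraph above.
ests = 3_313_569      # ESTs mapped on the genome
mrnas = 117_475       # mRNAs from the public databases
refseqs = 18_756      # NCBI RefSeq sequences

# The three categories should add up to the quoted number of cDNA sequences.
total_cdna = ests + mrnas + refseqs

# Share of the mapped sequences that are ESTs (a derived figure, not stated in the text).
est_fraction = ests / total_cdna

print(f"total cDNA sequences mapped: {total_cdna:,}")
print(f"fraction that are ESTs: {est_fraction:.3f}")
```

The sum reproduces the 3,449,800 cDNA sequences stated in the text.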
AceView clusters the 3.5 million reads into 30,953 genes with at least one validated [gt-ag] or [gc-ag] spliced intron and on average 4.9 alternative variants per gene (altogether 152,371 different mRNAs), exactly as we had reported in the Lander et al. main genome paper! Another 515 genes have no confirmed introns, yet encode proteins of more than 300 amino acids and are hence likely to be functional intronless genes. The remaining 48,122 genes are unspliced, partial, or non-coding and will require further investigation. Altogether, in this build we annotate 79,590 genes, with 201,359 alternative transcript variants.
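The gene tally in this paragraph can be verified with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Gene categories quoted in the build 33 paragraph above.
spliced_genes = 30_953            # at least one validated gt-ag or gc-ag intron
intronless_coding_genes = 515     # no confirmed intron, protein > 300 amino acids
other_genes = 48_122              # unspliced, partial, or non-coding

# The three categories should add up to the total number of annotated genes.
total_genes = spliced_genes + intronless_coding_genes + other_genes

# Average number of alternative variants per spliced gene.
spliced_variants = 152_371
variants_per_spliced_gene = spliced_variants / spliced_genes

print(f"total genes: {total_genes:,}")
print(f"variants per spliced gene: {variants_per_spliced_gene:.1f}")
```

The categories sum to the 79,590 genes reported, and the ratio rounds to the 4.9 variants per gene given in the text.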
In this release, 74,114 (64%) of the spliced transcripts are supported by a single identified clone covering the entire CDS. Because we merge partial transcripts whenever no conflict arises, the remainder could represent an inappropriate concatenation of two clones (28%) or more (8%).
Statistics on Human build 31 (for comparison,
Thanks to code improvements, we now unambiguously map 2,763,401 ESTs (representing 58% of the ESTs currently in GenBank/dbEST) and 83,872 mRNAs from the public databases, as well as 18,000 NCBI RefSeqs, representing 99.3% of the current RefSeq set. AceView clusters these into 83,874 genes, with altogether 210,122 alternative transcript variants. Of these genes, 33,286 have at least one validated gt-ag or gc-ag spliced intron, with on average 4.6 alternatively spliced variants.
Statistics on Human build 30 (August 2002)
In this human genome release, known as NCBI build 30, after filtering, we present on this site 66,830 genes, containing 138,040 reconstructed mRNAs, supported by 1,898,911 mRNAs and ESTs. We currently align 14,970 (95.0%) of the 15,748 NM reference sequences, confirming the near completion of the human genome sequence.