Statistics on previous AceView builds
Statistics on Human build 33 (June 5th, 2003)
Thanks to the increase in mRNA and EST sequences submitted to the public
databases and to improvements in the genome sequence, we align 20% more mRNAs
and ESTs on this new build, in 7% fewer genes. On this genome sequence, we
map 3,449,800 cDNA sequences: 3,313,569 ESTs (67% of all ESTs in GenBank/dbEST,
a 9% increase relative to the previous build) and 117,475 mRNAs from the public
databases (93%), as well as 18,756 NCBI RefSeqs, representing 99.5% of the
current RefSeq set. Although we do not mask repeats, only 1.5% of the mRNAs
and 2.75% of the ESTs are ambiguous, matching the genome with indistinguishable
quality in more than one place. As for alignment quality, mRNAs average
2,017 bp and align over 98.3% of their length with 99.7% accuracy. ESTs, which
are especially prone to sequencing errors near the end of the read, average
531 bp, yet align over 93% of their length with over 98% accuracy.
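As a quick consistency check on the figures above, the three categories of mapped sequences should add up to the quoted total of 3,449,800 cDNA sequences. The sketch below only restates numbers from the text; the variable names are illustrative and not part of AceView.

```python
# Mapped sequence counts quoted in the text for build 33.
ests = 3_313_569      # ESTs from GenBank/dbEST
mrnas = 117_475       # mRNAs from the public databases
refseqs = 18_756      # NCBI RefSeqs

total_cdna = ests + mrnas + refseqs
print(f"total mapped cDNA sequences: {total_cdna:,}")

# Share of each category in the total (derived here, not stated in the text).
for name, count in [("ESTs", ests), ("mRNAs", mrnas), ("RefSeqs", refseqs)]:
    print(f"{name}: {count / total_cdna:.1%}")
```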
AceView clusters the 3.5 million reads into 30,953 genes with at least one
validated [gt-ag] or [gc-ag] spliced intron and on average 4.9 alternative
variants per gene (altogether 152,371 different mRNAs), exactly as we had
reported in the Lander et al. main genome paper! Another 515 genes do not have
confirmed introns, yet encode proteins of more than 300 amino acids and are
hence likely functional intronless genes. The remaining 48,122 genes are
unspliced, partial, or non-coding and will require further investigation.
Altogether in this build, we annotated 79,590 genes, with 201,359 alternative
transcript variants.
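The gene categories above can be cross-checked with a few lines of arithmetic: the three classes should sum to the 79,590 annotated genes, and the quoted average of 4.9 variants per spliced gene should follow from 152,371 mRNAs over 30,953 genes. This is a minimal sketch using only numbers from the text; the variable names are illustrative.

```python
# Gene categories quoted in the text for build 33.
spliced_genes = 30_953       # at least one validated gt-ag or gc-ag intron
intronless_genes = 515       # no confirmed intron, protein > 300 amino acids
other_genes = 48_122         # unspliced, partial, or non-coding

total_genes = spliced_genes + intronless_genes + other_genes
print(f"total annotated genes: {total_genes:,}")

# Average number of alternative variants per spliced gene.
spliced_mrnas = 152_371
variants_per_gene = spliced_mrnas / spliced_genes
print(f"variants per spliced gene: {variants_per_gene:.1f}")
```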
In this release, 74,114 (64%) of the spliced transcripts are
supported by a single identified clone covering the entire CDS; the remainder
could represent an inappropriate concatenation of two clones (28%) or more (8%),
because we merge partial transcripts when no conflicts arise.
Statistics on Human build 31 (for comparison,
Thanks to code improvements, we now unambiguously map 2,763,401 ESTs
(representing 58% of the ESTs currently in GenBank/dbEST) and 83,872 mRNAs from
the public databases, as well as 18,000 NCBI RefSeqs, representing 99.3% of the
current RefSeq set. AceView clusters these into 83,874 genes, with altogether
210,122 alternative transcript variants. Of these, 33,286 genes have at least
one validated gt-ag or gc-ag spliced intron, with on average 4.6 alternatively
spliced variants.
Statistics on Human build 30 (August 2002)
In this human genome release, known as NCBI build 30, we present on this site,
after filtering, 66,830 genes containing 138,040 reconstructed mRNAs, supported
by 1,898,911 mRNAs and ESTs. We currently align 14,970 (95.0%) of the 15,748 NM
reference sequences, confirming the near completion of the human genome
sequence.