We are pleased to present, at
long last, a new release of the human genes, built in AceView on the current
genome (NCBI 36.2/ UCSC hg18), using
last release, we were happy to observe that AceView transcripts, in the ENCODE
regions, are extremely similar to the Havana/Vega hand curated transcripts that
were selected as the Gold standard by NHGRI. The other methods participating in
the Gencode project see at best one third of the transcripts seen by
The Pfam motifs and associated
InterPro and GO descriptions used to annotate the proteins are version 20.0 of Pfam, downloaded from
- Promoter sequence: To facilitate the search for promoter elements, we currently provide the 5 kb sequence upstream of each mRNA. We noticed that Xie et al. (2005) define each promoter region as the non-coding sequence contained within a 4-kilobase (kb) window centered at the annotated transcriptional start site (TSS). Please tell us if you would prefer this 4 kb sequence instead of the 5 kb upstream.
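To make the two promoter definitions concrete, here is a minimal sketch that computes both windows from a TSS coordinate. It is only an illustration, not the AceView code: the function names, the 1-based inclusive coordinate convention and the way the centered window is split around the TSS are our assumptions, and the exclusion of coding sequence used by Xie et al. is not handled.

```python
def upstream_window(tss, strand, length=5000):
    """Return (start, end), 1-based inclusive, of the `length` bp upstream of the TSS.

    Hypothetical helper: 'upstream' means lower coordinates on the '+' strand
    and higher coordinates on the '-' strand. The end of the chromosome is not
    checked because the chromosome length is not known here.
    """
    if strand == "+":
        return (max(1, tss - length), tss - 1)
    if strand == "-":
        return (tss + 1, tss + length)
    raise ValueError("strand must be '+' or '-'")


def centered_window(tss, strand, length=4000):
    """Return (start, end), 1-based inclusive, of a `length` bp window centered at the TSS.

    Assumption: half of the window lies upstream of the TSS and the other half
    starts at the TSS itself.
    """
    half = length // 2
    if strand == "+":
        return (max(1, tss - half), tss + half - 1)
    if strand == "-":
        return (max(1, tss - half + 1), tss + half)
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(upstream_window(10000, "+"))   # 5 kb upstream on the direct strand
    print(centered_window(10000, "+"))   # 4 kb window centered at the TSS
```

The 5 kb window never overlaps the transcript, whereas the centered window includes roughly 2 kb of transcribed sequence; this is the practical difference between the two options.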
October 7, 2005: The NCBI version of the Acedb/AceView software was made available on the ftp site:
August, 2005: new release, with updated code and data for human and worm
The Pfam motifs and InterPro
descriptions used to annotate the proteins are version 17.0 of Pfam,
To reconstruct the AceView mRNAs, we use 91% of the GenBank mRNA (174,129) and 76% of the dbEST human data (4,325,853), a 5% increase relative to the Nov04 version. We also use 23,895 RefSeq. Quality of the alignments remains excellent. Between 1.1% and 2.2% of the sequences map ambiguously, in about 1500 fully or partly repeated genes.
The 4.5 million cDNA sequences aligned on the genome define 312,932 different standard introns, all well supported (i.e. a cDNA sequence exactly matches the genome over 16 bp, equally split on both sides of the intron boundary). 97.7% are gt-ag, 2.2% gc-ag (a bit high?) and 0.08% at-ac (265 cases). An additional 4485 introns are anomalous, and most probably correspond to defects in the cDNA clones (or the genome).
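The two criteria used above (the boundary dinucleotides, and an exact 16 bp match split equally across the junction) can be sketched as follows. This is an illustration under our own assumptions, not the AceView implementation: the function names and the argument layout are hypothetical, and sequences are assumed to be given on the strand of the gene.

```python
# Boundary dinucleotide pairs counted as standard in the text above.
STANDARD_BOUNDARIES = {
    ("gt", "ag"): "gt-ag",
    ("gc", "ag"): "gc-ag",
    ("at", "ac"): "at-ac",
}


def classify_intron(intron_seq):
    """Classify an intron by its first and last two bases."""
    if len(intron_seq) < 4:
        raise ValueError("intron too short to have two boundary dinucleotides")
    s = intron_seq.lower()
    return STANDARD_BOUNDARIES.get((s[:2], s[-2:]), "non-standard")


def junction_supported(cdna, junction, upstream_exon, downstream_exon, flank=8):
    """Check that the cDNA matches the genome exactly over 2 * flank bp.

    `junction` is the number of cDNA bases that precede the exon-exon junction;
    `upstream_exon` and `downstream_exon` are the genomic exon sequences.
    With flank=8 this is the 16 bp criterion, equally split on both sides.
    """
    left = cdna[max(0, junction - flank):junction].lower()
    right = cdna[junction:junction + flank].lower()
    return (
        len(left) == flank
        and len(right) == flank
        and left == upstream_exon[-flank:].lower()
        and right == downstream_exon[:flank].lower()
    )


if __name__ == "__main__":
    print(classify_intron("GTAAGTTTTCAG"))
    print(junction_supported("AAAACCCCGGGGTTTT", 8, "TTAAAACCCC", "GGGGTTTTAA"))
```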
We currently annotate 65% of the introns as alternative, but this is an upper bound because we over-count a little, due to a newly introduced bug.
Introns are easy to define, and very reliable markers of expression and of the splicing and alternative splicing patterns. Long primers sitting across exon boundaries could be excellent tools for microarray building, or RT-PCR experiments. We plan to add such sequences, possibly 80 or 100 bp (40/50 bp on each side?), taken on the reconstructed mRNA model, and to count the support and alternative nature of the primer. We are still thinking about how to present these data, in each gene or mRNA model as well as in bulk, so that they are useful to our users. If you have comments or ideas, please tell us.
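As a sketch of what such junction-spanning sequences could look like, the function below cuts, from a reconstructed mRNA model, the 40 bp on each side of every exon-exon junction. The function name and the input format (a list of exon lengths plus the mRNA sequence) are assumptions made for this example; nothing here reflects a released AceView file format.

```python
def junction_probes(exon_lengths, mrna, flank=40):
    """Return [(junction_number, probe_sequence), ...] for an mRNA model.

    `exon_lengths` lists the length of each exon in 5' to 3' order and must
    add up to len(mrna). A junction is skipped when the mRNA does not extend
    `flank` bases on both sides of it. Note that a probe can cover more than
    one junction when an exon is shorter than `flank`.
    """
    if sum(exon_lengths) != len(mrna):
        raise ValueError("exon lengths do not add up to the mRNA length")
    probes = []
    position = 0
    for number, length in enumerate(exon_lengths[:-1], start=1):
        position += length  # first base of the next exon (0-based)
        if position - flank < 0 or position + flank > len(mrna):
            continue
        probes.append((number, mrna[position - flank:position + flank]))
    return probes


if __name__ == "__main__":
    toy_mrna = "A" * 50 + "C" * 60 + "G" * 45  # three toy exons
    for number, probe in junction_probes([50, 60, 45], toy_mrna):
        print(number, len(probe), probe)
```

Setting `flank=50` gives the 100 bp alternative mentioned above.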
Using this simple measure of standard intron counts, we may ask what is the added value of each type of data:
- The NM and NR RefSeqs see a total of 171,864 standard introns (and 442 non-standard), or 55% of the total number of introns in the human cDNAs from the public databases. Therefore, despite the small size of the RefSeq database, they already touch more than half of the known standard introns. Their pattern is: 99.07% gt-ag, 0.80% gc-ag, 0.11% U12 at-ac (182 cases).
- If we considered only the 174,129 GenBank mRNAs, they define a total of 226,014 standard introns, 72% of the total (98.7% gt-ag, 1.17% gc-ag, 0.10% at-ac (228 cases); in addition to 1623 non-standard introns).
About the genes:
AceView clusters the 4.524 million reads into 56,491 genes with standard introns or well-coding (in addition to 40,567 putative genes, more partial or dubious, and 251,183 cloud genes, not supported by many clones, unspliced and not clearly coding, which may represent intermediates in the transcription process).
- 35,413 genes have fully supported standard introns. Of these, 16,497 (47%) have at least one RefSeq; 22,280 encode putative proteins above 100 amino acids; the remaining 13,133 may encode short proteins, be non-coding, or be partial.
- 22,469 genes supported by mRNAs or ESTs do not have introns, yet encode putative CDS of more than 100 aminoacids (1005 are above 300 aminoacids). These intronless genes are possibly functional in the proteome. 1297 of these genes (5.8%) are represented in RefSeq (Aug 11, 05).
- If we consider the potential for protein-coding of all genes, with and without introns, we find that 44,749 genes in this release encode a putative CDS of more than 100 aa, of which 14,152 encode CDS above 300 residues.
- But alternative splicing generates on average about 5 variants and 4 protein isoforms per gene with introns, so the variety of proteins expected to be present in the cells is much greater: AceView annotates 133,744 different protein isoforms of more than 100 amino acids, of which 35,536 are above 300 amino acids. 69% of the AceView CDS, both above 100 aa and above 300 aa, are fully encoded within a single identified clone in GenBank/dbEST (91,738 different proteins > 100 aa and 24,367 distinct proteins > 300 aa have the coding part fully covered by a known cDNA), bringing the strictly non-dubious protein isoforms to 3.1 per gene with introns on average. The remaining 31% of AceView proteins could still result from an inappropriate concatenation of two clones (26%) or more (5%), because, if no conflicts arise, we merge partial mRNAs.
- 6% of the proteins of more than 300 amino acids are encoded by more than one alternative mRNA variant (38,013 mRNA variants encode only 35,536 different proteins). This is true for 9% of the proteins above 100 aa (185,073 mRNAs give only 168,366 distinct proteins). These percentages are low because the vast majority (91 to 94%) of the alternative introns, alternative promoters or alternative last exons affect the coding region.
- Using stringent criteria to define antisense genes (part of our standard annotation, under regulation), we identify 10,960 genes with standard introns antisense to another gene with standard introns, i.e. 31% of all genes with standard introns might undergo this type of flip-flop regulation. Also, the function of up to one third of the non-coding genes with introns could be to regulate the gene on the other strand by antisense.
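Alignment statistics (table only partly recovered; row labels and several cells are missing)
Headers: Total available in public DB apr/aug05 | Alignment quality: total length aligned, # diff from genome | Ave. bp aligned per sequence, % | % bp diff relative to genome | Multi-aligned sequences (%) | In # (repeated) genes
Recovered cells: 67.927 Mb, 66,529 diff | (2830 bp) 99.9%
Recovered cells: 343.198 Mb, 748,102 diff | (1929 bp) 99.3%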
January-April, 2005: minor modifications to the site, as they are made public
April 10: We updated the worm version to the current freeze of the genome (known as WS140).
March 23: We noticed to our dismay that the lists of genes or clones with specific properties were no longer accessible from our GOLD work, and re-established the links from the supplementary material. We do not know how long the outage lasted, but are very sorry for it.
January 18: We fixed the PFAM and AceView mRNA Blast searches, which were not properly connected to the current InterPro definition files for PFAM, and to the current AceView mRNA files for tBlastn.
January 16: On the download site, we added a file providing the relation between all the aligned ESTs/mRNAs, the genes, the locusIds and the alternative variants, with indication of quality of match, tissue of origin and type of AceView gene, for all genes (main, putative and cloud). Help on the format of the accession to gene table has been updated, as five more data columns have been added to enrich the data in this table. Meanwhile, we got rid of the 0.5% worst-quality alignments (30,403 alignments/6,428,329) by removing the ESTs or mRNAs aligned at AceView quality 10 or 11 (see the help here). Two files (the way all the ESTs/mRNAs are aligned in each gene, for all non-cloud genes (55.3 MB), and the same file for cloud genes (5.7 MB)) thus became redundant and were removed. The link to download the GOLD alignments was broken and has been restored.
November 24, 2004: New WormGenes AceView version (on the WS130 genome) and improved human genes on human build 35
- In nematode, we are manually editing all the genes to complement and enhance the WormBase view. Manually, it is easy to distinguish a complex locus from concatenated genes that touch through their 3' UTR/5' UTR. Complex loci were initially defined on the basis of complementation tests in phage or Drosophila. In such a gene, two complete alternative variants may not share a base, although a third variant bridges them and overlaps both. By convention, we denote this property by adding the letter C for complex behind the name (e.g., 1C777C) or by a + sign if the gene had acquired two independent official gene names (e.g., mai-1+gpd-2+gpd-3). This convention follows that chosen by Ed Lewis for the Drosophila complex genes (e.g. BXC for bithorax complex locus, of which bx, Ubx, Cbx and pbx for example are alleles). In the case of nearby genes expressed from the same strand but with overlap between the 3' end of the first gene and the 5' end of the second, we use the suffix Co (for sequence in COmmon) appended after the gene name (e.g., cul-1Co or 5K225Co). We may also use the sign AND if both genes have a name (e.g., mev-1ANDced-9).
October 24, 2004 New version on human build 35 : new features
September 23, 2004 News on human build 35 : statistics
The Pfam motifs used to annotate the proteins from the human build 35/July
2004 are version 14.0 of Pfam, downloaded from
Statistics for this release (build 35/July 2004):
The percentage of mRNAs and ESTs that we align increased only by 0.55% and 0.64% respectively, because the genome sequence gained in quality more than in quantity over the past year, from build 34 to 35. Yet, thanks to the increase in mRNA and EST sequences submitted to the public databases, we now identify a total of 96,797 well supported transcribed genes, a net increase of 11,839 relative to the release of July 2003 (on build 34). In particular, we gained 1,248 genes with standard introns, bringing the total number of independent genes with standard introns to 32,748.
On the current genome sequence, we reliably map 4,297,980 cDNA sequences present in GenBank August 2, 2004, a net increase of 269,318 mRNA or EST sequences from last year. That includes 4,131,646 ESTs (73.37% of all ESTs in GenBank/dbEST) and 166,334 mRNAs from GenBank (94.96%). In addition, this build includes alignments for 21,565 NCBI RefSeq: the increase of 1,622 RefSeq mRNAs mostly corresponds to creations by the NCBI RefSeq group, which generates this useful resource.
Although we do not mask for repeats, only 1.1% of the mRNAs and 2.2% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place. The percentages of clones did not change since last time, but AceView has greatly improved in its ability to identify truly repeated genes. To make this more visible graphically, we now draw the clones aligning in multiple genes at the same quality in blue (rather than black), but keep the same color code for point variations relative to the genome: a thin red line for a single base change, transition or transversion, blue line for a single base insertion or deletion. (from build 34, to be updated: As for the current quality of the alignments, mRNAs on average measure 2,132 bp and align over 98.8% of their length with 99.74% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 532 bp, and align over 93.3% of their length with 98.16% accuracy.)
AceView clusters the 4.30 million reads into a total of 96,797 genes (and 248,728 cloud genes, not supported by many clones, unspliced and not clearly coding, which may represent intermediates in the transcription process). If we push in the direction of the current trend for low numbers of genes, there are a total of 46,220 genes in this release that either encode a putative CDS of more than 100 aa or are spliced with standard introns. 32,618 genes encode a putative CDS of more than 100 aminoacids and 32,748 genes have at least one validated gt-ag or gc-ag spliced intron.
Using stringent criteria to define antisense genes, we find that 8,572 genes with standard introns in this release are antisense to another gene with standard introns, i.e. 26% of all genes with standard introns might undergo this type of flip-flop regulation. Also, the function of up to one third of the non-coding genes with introns could be to regulate the gene on the other strand by antisense. Finally, 13,472 genes have no standard introns, yet they encode a putative CDS of more than 100 amino-acids (346 are above 300 aa), hence are possibly functional intronless genes.
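The gene counts quoted for this release can be cross-checked with simple arithmetic. The short script below is only a consistency check on the published figures; the number of genes that are both coding and spliced is not stated in the text and is derived here by inclusion-exclusion, under the assumption that the intronless coding genes are exactly the coding genes without standard introns.

```python
# Figures quoted above for build 35.
coding = 32_618            # genes with a putative CDS of more than 100 aa
spliced = 32_748           # genes with at least one standard intron
coding_or_spliced = 46_220
antisense_spliced = 8_572  # spliced genes antisense to another spliced gene

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B| (derived, not quoted).
coding_and_spliced = coding + spliced - coding_or_spliced

# Coding genes without standard introns.
intronless_coding = coding - coding_and_spliced

# Share of spliced genes involved in an antisense pair.
antisense_percent = 100 * antisense_spliced / spliced

print(coding_and_spliced)            # 19146
print(intronless_coding)             # 13472, the figure quoted above
print(round(antisense_percent))      # 26, the percentage quoted above
```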
The genes with introns have on average 5.37 alternative variants per gene, of which 4.51 have introns.
We annotate two proteins in 3.2% of the AceView mRNAs (7,607/239,991), but do not for now make public the uORF annotations, so as not to scare the users away!
In this AceView release, 69% of the CDS > 100 aa are fully supported by a single identified clone covering the entire CDS (64,656 CDS from 30,110 genes encoding > 100 aa), bringing the strictly non-dubious protein isoforms to 2.8 per gene with introns on average. The remaining 31% of AceView proteins could still result from an inappropriate concatenation of two clones (26%) or more (5%, down from 8% last time!), because, if no conflicts arise, we merge partial mRNAs. We hope that more of you will sequence the entire insert of the clones that we now have to concatenate for lack of more complete data.
May, 2004 New build for Arabidopsis thaliana
News January 2004 on human build 34 and new features
June 11, 2004: A file describing the association of each mRNA/EST accession satisfactorily aligned in AceView build 34 to the corresponding AceView gene is now available from the downloads page.
January 12, 2004: We released our new AceView genes, built over the human genome sequence from July 2003 (build 34/golden path hg16).
The Pfam motifs used for human December 2003 (build 34) are from version 10.0 downloaded September 24th, 2003.
We have improved the alignments, the clustering, the analysis and the presentation of the results, in part following reports or requests from our users, be they thanked wholeheartedly for their comments!
- We have given unique names to the mRNA models, so that there will be no confusion from one release to the next: for this release, Dec03 was appended to each mRNA name. It should be expected that variants may change sequence from one release to the next: variant aDec03 will possibly differ from variant aMar04. Indeed, let us recall that the gene names are tracked in AceView, but the variant names are release dependent.
- In this release, we have allowed multiple proteins per mRNA, although we limited this to cases where
o the two proteins do not overlap and have similar sizes; this may reflect an actual operon-like precursor (we have noticed a number of confirmed ones even in human), or a bug of AceView missing a 3' end or a 5' end and concatenating cDNAs that only exist separately
o the two proteins are only marginally overlapping (possible reasons for this could be: the genome has a single base deletion or insertion relative to the mRNA creating a frameshift, or more rarely there is a translational frameshift, or else there is a bug in the chaining of cDNA clones by AceView, and a real 3' or 5' end was missed)
o the two proteins overlap, but they have about equal length so we cannot decide which frame is the biological one
o In all cases, looking at the Annotated mRNA usually shows whether one or both proteins are real, since both have been annotated on the graphical view (but the text explanation in the annotated page has not yet been modified to clearly treat this case, and only one protein, the best, is annotated there).
- Sequences of the mRNAs (derived from the genome or from the cDNA consensus .AM), 5' UTR, 3' UTR, coding regions (nucleotide and protein sequences), introns and exons and upstream genomic area (some containing the promoter) are now available by clicking from the corresponding tables (mRNA, Proteins, Introns and exons). We have also added sequences containing the putative promoters. Most sequences are now available from a natural place: the mRNAs table, the Proteins table or the Introns and exons table. We were also asked to provide pairs of primers to amplify each and every exon, and that will come in the next release.
- The GO terms, in the summary table in the top of the gene page, are now comprehensively connected to all genes with the same terms.
- We have changed the Table of contents in both its look and the organization of the data, especially in the Gene summary section. Please tell us if you notice some information is missing.
- We have fixed the bug that used to split some genes vertically (hence the slightly reduced number of spliced genes in this release) and a bug that in 90 genes was creating introns with negative length (thanks to our nice users).
- We have also fixed a few bugs in the analysis: our counts of alternative promoters were sometimes inaccurate; the length of sequence involved in an antisense was cumulative over the alternative mRNAs; our table of Main cDNA clones, which should allow you to get all clones required to fully support all variants, was sometimes incomplete; a sentence about operons was there irrespective of the data.
Human build 34 Sorry we forgot to update this part until
Thanks to the increase in mRNA and EST sequences submitted to the public databases and to the improvement of the genome sequence, we align on this new build 7% more mRNAs and ESTs in 547 more genes with standard introns. On this genome sequence, we map 4,028,662 cDNA sequences: 3,910,845 ESTs (73% of all ESTs in GenBank/dbEST, a 7% increase relative to last time) and 117,817 mRNAs from the public databases (94.4%), as well as 19,943 NCBI RefSeqs, representing 99.8% of the current RefSeq collection.
Although we do not mask for repeats, only 1.0% of the mRNAs and 2.2% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place. As for the current quality of the alignments, mRNAs on average measure 2,132 bp and align over 98.8% of their length with 99.74% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 532 bp, and align over 93.3% of their length with 98.16% accuracy.
AceView clusters the 4.02 million reads into 31,500 genes with at least one validated gt-ag or gc-ag spliced intron and on average 5.36 alternative variants per gene, 5.57 protein isoforms (because we accept a limited number of mRNAs where we annotate two proteins). Another 14,199 genes do not have confirmed introns, yet have open reading frames of more than 100 amino-acids (of which 596 are above 300 aa) and are hence likely functional intronless genes. The remaining 223,435 genes are unspliced, partial, or non-coding and will require further investigation. We have so many genes in this build because we used to filter the last category to keep only the genes that were long or supported by at least six clones. But people were sometimes searching AceView to locate the best alignment of a clone, and failing to find it because of the filter. We call this class of genes the cloud, and indicate this in the gene title. Their biological relevance remains to be established.
In this release, 98,629 (68%) of the spliced mRNAs are supported by a single identified clone covering the entire CDS; the remaining could represent an inappropriate concatenation of two clones (24%) or more (8%), because, if no conflicts arise, we merge partial cDNAs.
We describe below the new releases and a few new features for
- human (June 5th, 2003)
- the worm Caenorhabditis elegans (June 23rd, 2003)
- Arabidopsis (May 15th, 2003)
News on Human Build 33 (June 5th) posted September 20th
We released our new Acembly genes, built over the human genome sequence from April 2003 (build 33 / golden path hg15), on June 5th.
Most significant improvements: Progress for this build (September 20th, 2003)
The most significant improvements for this build are:
- Tables of genes with a given Psort motif are now available from the Table of Psort motifs. They used to be partly accessible by clicking on the red boxes in the mRNA diagram, but most of the time we could not show the whole list because we had put a limit on the maximal number of genes in a single table. The table of genes with a coiled-coil, which are putatively involved in protein interactions, has been most appreciated by our loudest users.
- Any AceView table now comes with an indication of the level of expression of each gene in the table, through the number of cDNA clones that Acembly assigns to that gene. We also re-ordered the genes, not by their alphabetical name, but according to their map position on the chromosomes. In human, we now provide in the tables both the known cytolocation and the actual chromosome on which Acembly assembled it, for a quick comparison.
- Minor detail fixed to avoid spaghetti genes in Acembly. We unplug a very distant first or last exon if it has too many errors at an early step in the program. We may have lost a few small good looking genes in the process. We also have a bug in this release: some genes with very many clones are split into pieces and appear as a cluster of overlapping small genes. This will be fixed in release 34.
- The query system has been improved. We fused what we used to call extended search and fast search into a new single query box which tries to implement the notoriously impossible 'what you get is what you wish' paradigm. By observing the queries we receive, and the data in the various fields of the database that we want accessible, we ended up ordering and classifying the data to be searched, and have drawn up a set of reasonable rules (i.e. based on common sense) for extending what users type in the query box. The algorithm nearly always gives us a reasonable answer, but we would really appreciate users' feedback if they encounter problems. In practice and to simplify, we first try to recognize exact gene symbols, locusID, GenBank identifiers and cDNA clones: if we do, we stop the search. If we don't, we proceed by searching all other fields in the database using a general search, allowing word completion and extension both ways. Since all objects in the database point to genes, we then collect the lists of genes. The drawback is that we may bring in genes somewhat distantly related to the query. To compensate, we have added the possibility of recursive searches. The reply to a query comes as a table giving the gene names, their position and a few words describing their function or phenotype (we have done that systematically for worm genes), plus a new query box that will query only inside this list. One possibly irritating feature of the query answer is that it may be difficult to trace back the term among the many documents attached to a gene; often the terms are found in the abstract of one of the papers, and the user has to open and read them all to find why the gene belongs to the list. Yet this is not a bug: computers have many qualities, but they lack imagination and creativity, and we can guarantee that the words typed were found somewhere in this gene's information.
- A new indexing system has been developed to speed up the answers to the queries over the entire database. It indexes on strings of three characters and has been a lot of fun to develop; maybe it is original, maybe not. In any event, we hope you enjoy the outcome. Too few of you use the box to query AceView. Please try it: you will be impressed!
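For readers curious about what an index on strings of three characters can look like, here is a minimal sketch of a trigram index with a search restricted to a previous result list, in the spirit of the recursive searches described above. It is not the AceView implementation, which is not described in this note: the class, its methods and the toy gene descriptions are all assumptions made for the example.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of `text`, lowercased."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


class TrigramIndex:
    """Toy index: trigram -> set of gene names whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.texts = defaultdict(list)

    def add(self, gene, text):
        self.texts[gene].append(text.lower())
        for gram in trigrams(text):
            self.postings[gram].add(gene)

    def search(self, query, within=None):
        """Return the sorted genes whose text contains `query` as a substring.

        `within` optionally restricts the search to an earlier result list.
        """
        q = query.lower()
        grams = trigrams(q)
        if grams:
            # Candidate genes must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Queries shorter than 3 characters cannot use the index.
            candidates = set(self.texts)
        if within is not None:
            candidates &= set(within)
        # Verify candidates, since sharing all trigrams does not imply a match.
        return sorted(g for g in candidates if any(q in t for t in self.texts[g]))


if __name__ == "__main__":
    index = TrigramIndex()
    index.add("tra-1", "sex determination transcription factor")   # toy text
    index.add("ced-4", "programmed cell death activator")          # toy text
    index.add("ced-9", "inhibitor of programmed cell death")       # toy text
    first = index.search("cell death")
    print(first)
    print(index.search("inhib", within=first))
```

Because the index works on character triplets rather than whole words, it supports word completion and extension both ways without any extra machinery, at the cost of a final verification pass over the candidates.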
Statistics on Human build 33 (June 5th, 2003)
Note: build 32 was never made public
Thanks to the increase in mRNA and EST sequences submitted to the public databases and to the improvement of the genome sequence, we align on this new build 20% more mRNAs and ESTs in 7% fewer genes. On this genome sequence, we map 3,449,800 cDNA sequences: 3,313,569 ESTs (67% of all ESTs in GenBank/dbEST, a 9% increase relative to last time) and 117,475 mRNAs from the public databases (93%), as well as 18,756 NCBI RefSeq, representing 99.5% of the current RefSeq set. Although we do not mask for repeats, only 1.5% of the mRNAs and 2.75% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place. As for the current quality of the alignments, mRNAs on average measure 2,017 bp and align over 98.3% of their length with 99.7% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 531 bp, yet align over 93% of their length with over 98% accuracy.
AceView clusters the 3.5 million reads into 30,953 genes with at least one validated [gt-ag] or [gc-ag] spliced intron and on average 4.9 alternative variants per gene (altogether 152,371 different mRNAs), exactly as we had reported in the Lander et al. main genome paper! Another 515 genes do not have confirmed introns, yet encode proteins of more than 300 amino-acids and are hence likely functional intronless genes. The remaining 48,122 genes are unspliced, partial, or non-coding and will require further investigation. Altogether in this build, we annotated 79,590 genes, with 201,359 alternative mRNA variants.
In this release, 74,114 (64%) of the spliced mRNA models are supported by a single identified clone covering the entire CDS; the remaining could represent an inappropriate concatenation of two clones (28%) or more (8%), because, if no conflicts arise, we merge partial cDNAs.
The Pfam motifs used are from version 8.0, downloaded May 2003.
Reference sequences of the nematode genes have been updated at NCBI on July 16th (and on the web on June 23rd).
We describe below
C.elegans researchers, we count on you to help us in the NCBI annotation process, please send us text descriptions for the genes you studied. Note that for us, texts are better than keywords, they are more precise, and because we index all texts and abstracts for searches, the language can evolve and we will not miss new concepts. Also please check the bibliography in the bottom of the Gene on Genome page, and send us corrections: missing papers, and above all papers attributed to a gene that they do not describe create a lot of nuisance. Thank you!
Statistics of C.elegans genes
The complete genome of release WS97 (a fragment of about 10 kb was added since) was provided by the Genome Sequencing Consortium through WormBase. This version of the genome consists of 100,264,081 bp and 22725 CDS/non-coding RNA models, recognizable by their cosmid.number name. We have used all GenBank mRNAs as of Feb 10, 2003 to replace, whenever possible, models of genes by complete mRNAs: this happened for 1271 reviewed RefSeq mRNAs, from 1106 genes. Through collaboration with Yuji Kohara and the transcriptome project, we were able to indicate which of the predictions are fully and exactly supported by cDNAs, even when the full length cDNA is not yet in GenBank as an mRNA. This happened for 5186 validated RefSeq mRNAs from 4783 genes. The fully confirmed reviewed and validated RefSeqs are drawn in pink to signify their trustworthiness. Another 11555 mRNAs are approximately or partially supported, and 4442 are not supported by expression data or phenotype so far: all these mRNAs are drawn in dubious blue rather than pink.
In terms of genes, this release includes
. 19173 would produce mRNA(s): 14731 have been shown to be expressed, 3065 are associated to a phenotype, by mutation or gene specific RNA interference.
. 1886 produce other RNAs, including 733 tRNA, rRNA, snRNA, scRNA, miRNA, pseudogenes, non-coding polyadenylated RNAs, and unpredicted transcribed genes. 54 have an associated phenotype.
A glossary of C.elegans Gene Names
6029 genes now have an official name such as tra-1 or ced-4 (from the CGC). All molecularly identified genes also have a positional Worm Transcriptome Project name, which gives unambiguously the gene order and the strand: for example, 1C94 is the gene on chromosome number 1, megabase section C (3rd), at or near kilobase 94. In addition, its strand information is encoded as follows: odd genes run downstream, on the direct strand; even genes run upstream, on the reverse strand.
Some nematode genes appear to belong to a complex locus as initially defined by complementation tests in phage or Drosophila. In such a gene, two complete alternative variants may not share a base, although a third variant bridges them and overlaps both. By convention, we denote this property by adding the letter C for complex behind the name (e.g. 1C777C), or by a dot if the gene had acquired two independent gene names (e.g. mai-1.gpd-2.gpd-3). Another frequent case is that of nearby genes expressed from the same strand, but with overlap between the 3' end of the first gene and the 5' end of the second: disputably, but for technical reasons, these are represented as a single gene, with the suffix Co (for common UTR sequence) appended after the gene name (e.g. cul-1Co or 5K225Co), or with the sign AND if both genes have a name (e.g. mev-1ANDced-9).
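The positional names can be decoded mechanically. The sketch below is a hedged illustration of the naming rules described in this glossary, not code from AceView: the function name and the returned fields are hypothetical, we assume that the odd/even rule applies to the kilobase number, that megabase sections are lettered from A for the first megabase, that the kilobase is counted within the section, and that chromosome X is written with the letter X.

```python
import re

# chromosome, megabase section letter, kilobase number, optional C / Co suffix
POSITIONAL_NAME = re.compile(
    r"(?P<chrom>[1-5X])(?P<section>[A-Z])(?P<kb>\d+)(?P<suffix>Co|C)?"
)


def decode_positional_name(name):
    """Decode a Worm Transcriptome Project positional gene name.

    Returns None for names that do not follow the positional pattern,
    e.g. official names such as tra-1.
    """
    match = POSITIONAL_NAME.fullmatch(name)
    if match is None:
        return None
    kb = int(match.group("kb"))
    section_index = ord(match.group("section")) - ord("A")  # A -> 0, C -> 2
    return {
        "chromosome": match.group("chrom"),
        "section": match.group("section"),
        "kilobase": kb,
        # Assumption: parity of the kilobase number encodes the strand.
        "strand": "+" if kb % 2 == 1 else "-",
        # Assumption: approximate coordinate in bp on the chromosome.
        "approx_position": section_index * 1_000_000 + kb * 1_000,
        "complex_locus": match.group("suffix") == "C",
        "common_utr": match.group("suffix") == "Co",
    }


if __name__ == "__main__":
    for example in ("1C94", "1C777C", "5K225Co", "tra-1"):
        print(example, decode_positional_name(example))
```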
In AceView, the extent of the nematode genes (represented by the named turquoise bar in the graphs on genome) is based on the actual mRNA and EST supporting data rather than on the predicted sequences from WormBase. Most of the RefSeq genes also contain this information, except, for technical reasons, for the genes mispredicted and too long. The advantage is that, although WormBase changed 2620 of their predictions over the last 7 months, the position or extent of only 118 of the AceView genes had to be changed. The same stability is seen in the nematode LocusLink.
As explained above, the sequences of pink mRNAs in AceView are fully supported, unlike sequences of blue mRNAs.
Gene annotation and biology
Our annotation of the worm genes initially relied on a selection from WormBase. We have started to re-annotate the genes, with the help of the Worm Community and of the Worm Transcriptome Project led by Yuji Kohara.
1. Expression data: For this release, we annotated the level and developmental pattern of expression for 11,000 genes (from Acembly analysis of cDNA libraries representation). We list all clones in each gene, point to the best clone and describe the anomalies we see in the anomalous clones. We provide direct access to NextDB, the extraordinary resource provided by Kohara and ShinI to visualize the in situ expression patterns on a partition of all developmental stages. Note that this data is still unpublished: users of the data presented on the NextDB web pages should not publish the information without Kohara's permission and appropriate acknowledgment. The database is still in an embryonic phase, so if you find any discrepancies in the data, please let them know straightaway. They would also appreciate any suggestions and comments.
2. mRNA, protein and gene titles were generated. All gene titles now bear an indication of their function: for all genes where a phenotype has been described, the word essential or phenotype appears in the gene title. Protein and gene titles are being edited by hand. In this release, we have hand annotated from the literature 581/2 gene-gene interactions and 694/2 protein-protein interactions, and have worked on identifying and renaming all collagen genes (187) in the worm (with J Kramer and the genome consortium). Finally, we have hand edited the entire file of phenotypic descriptions from C.elegans II, because it described the phenotypes with a coded vocabulary that made it difficult for the neophyte to understand.
The Pfam motifs used for this lot are from version 8.0, downloaded May 2003.
May 15, 2003 Arabidopsis thaliana new release
On May 15th, we released our reconstruction of the Arabidopsis genes, using all 178,464 ESTs and 28,721 mRNAs available from GenBank on March 5th, 2003 and aligning them on the TIGR genome release (downloaded from the TIGR site November 2002).
AceView aligned 168,076 ESTs (94.2%) and 28,653 mRNAs (99.8%), and from this, reconstructed 21,431 genes, giving rise, by alternative splicing or alternative promoters, to 26,687 mRNAs. The genes look like nematode or fly genes in their overall organization and dimensions. All the proteins were annotated by our pipeline.
The level of alternatives was strikingly and significantly lower than in worm, and a fortiori human, taking into account the high level of coverage: 25% of the mRNAs are alternative forms. Among the genes with at least 3 cDNA clones, only 20% contain alternative variants (2,728 genes out of 13,502). That may be an interesting biological difference.
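The percentages quoted for this Arabidopsis release can be rechecked from the raw counts given above. The short sketch below does that arithmetic. The 25% figure is reproduced only under one assumed reading, namely that it counts the mRNAs in excess of one per gene, relative to the number of genes; the text does not state how it was computed.

```python
# Counts quoted in the Arabidopsis release notes above.
ests_total, ests_aligned = 178_464, 168_076
mrnas_total, mrnas_aligned = 28_721, 28_653
genes, mrnas_reconstructed = 21_431, 26_687
genes_3plus_clones, genes_with_variants = 13_502, 2_728


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


est_rate = pct(ests_aligned, ests_total)        # share of ESTs aligned
mrna_rate = pct(mrnas_aligned, mrnas_total)     # share of mRNAs aligned

# Assumption (not stated in the text): "alternative forms" are the mRNAs
# beyond the first one of each gene, expressed relative to the gene count.
extra_mrna_rate = pct(mrnas_reconstructed - genes, genes)

# Genes with alternative variants among genes with at least 3 cDNA clones.
variant_gene_rate = pct(genes_with_variants, genes_3plus_clones)

print(est_rate, mrna_rate, extra_mrna_rate, variant_gene_rate)
```

Under that reading, the excess comes to roughly a quarter of the gene count, in line with the 25% quoted; dividing by the number of mRNAs instead would give a lower value.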
Another remarkable feature is the extremely high quality of many cDNA and even EST sequences: many do not have a single base different from the genome, and we see very few cDNA clones needing to be tagged as abnormal (only 307 of 192,351, 20 to 40 times fewer than in the other species). In other words, this project just does not look like the other projects we analyzed: worm, Drosophila, and human. It could be that plants are really different, or that the excellent sequences of cDNAs in GenBank are predictions, or at least sequences corrected by genome alignment. In the same vein, there is also a low level of polymorphisms (which came as a surprise to us) and of genome sequencing errors, as monitored by discrepancies between sets of cDNAs and the genome sequence.
The next standard step would have been to get the official gene names and the biology, and we contacted TAIR to this end, but we have not yet found time to proceed. So this Arabidopsis release is rudimentary; it can be queried by sequence accessions or by protein annotation, but that is about all.
The Pfam motifs used for this release of Arabidopsis are from version 8.0, downloaded May 2003.
February 20, 2003 (corresponds to human build 31)
We just released our new Acembly genes, built over the human genome sequence from November 2002 (build 31 / golden path hg13).
We improved the Acembly clustering algorithms to make the genes more precise. We reinforced the mRNAs that already had a fully sequenced clone and reduced the combinatorics. As a result, we may end up with incomplete alternative variants, which should prompt scientists to obtain the clones for full-length re-sequencing. We also display more validated alternatively spliced variants.
The Pfam motifs used are from version 7.7, downloaded December 2002.
The most significant interface improvements for this build are:
Statistics on Human build 31 (February 20th, 2003)
Thanks to code improvements, we now map unambiguously 2,763,401 ESTs (representing 58% of the ESTs currently in GenBank/dbEST) and 83,872 mRNAs from the public databases, as well as 18,000 NCBI RefSeqs, representing 99.3% of the current RefSeq set. AceView clusters these into 83,874 genes, with altogether 210,122 alternative mRNA variants. 33,286 genes have at least one validated gt-ag or gc-ag spliced intron, and on average 4.6 alternatively spliced variants.
We added Caenorhabditis elegans data to our system. This involved revising the web service to correctly display data for multiple species.
August 2002 (corresponds to human build 30)
In this release we replaced our complicated positional identifiers (such as G_t1_Hs1_4478_30_0_2551) by the official Locus name where one exists; otherwise by a Pfam name.number when the protein contains a Pfam motif; and otherwise by an artificially generated name. For generated names, we compose a pseudo-Japanese-sounding name, such as sayuri, kimu and nowara, to label genes where one of the principal clones is Japanese, and a pseudo-English-sounding name, like jawker or sneery, in the other cases.
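The naming priority described above can be sketched as a simple fallback chain. This is only an illustration: the function and parameter names are hypothetical, the "name.number" form is assumed to be the Pfam name and a number joined by a dot, and the way the pseudo-Japanese or pseudo-English names are actually generated is not described here, so the sketch takes them as ready-made inputs.

```python
from typing import Optional


def choose_gene_name(
    locus_name: Optional[str],
    pfam_name: Optional[str],
    pfam_number: int,
    principal_clone_is_japanese: bool,
    japanese_style_name: str,
    english_style_name: str,
) -> str:
    """Pick a gene identifier following the stated order of preference.

    All names here are hypothetical; this is not the AceView code.
    """
    # 1. The official Locus name wins whenever one exists.
    if locus_name:
        return locus_name
    # 2. Otherwise a Pfam-based name (assumed format: "<name>.<number>").
    if pfam_name:
        return f"{pfam_name}.{pfam_number}"
    # 3. Otherwise an artificial name, whose flavor depends on the origin
    #    of one of the principal clones.
    if principal_clone_is_japanese:
        return japanese_style_name
    return english_style_name


# Usage with made-up inputs ("EXAMPLE1" and "SomeMotif" are placeholders):
print(choose_gene_name("EXAMPLE1", "SomeMotif", 3, False, "sayuri", "jawker"))
print(choose_gene_name(None, "SomeMotif", 3, False, "sayuri", "jawker"))
print(choose_gene_name(None, None, 0, True, "sayuri", "jawker"))
```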
Statistics on Human build 30 (August 2002)
In this human genome release, known as NCBI build 30, after filtering, we present on this site 66,830 genes, containing 138,040 reconstructed mRNAs, supported by 1,898,911 mRNAs and ESTs. We currently align 14,970 (95.0%) of the 15,748 NM reference sequences, confirming the near completion of the human genome sequence.
AceView shows the alignment of cDNAs to the genome sequence (see Contig Assembly Process), and the genes and mRNAs reconstructed from these alignments, using the Acembly program developed by Jean and Danielle Thierry-Mieg on top of the Acedb object-oriented database manager.
Protein BLAST hits to the genomic sequence and conserved motifs identified by RPS-BLAST are displayed, but they are not used to generate the model. Complementary views of the genes are shown in LocusLink (All info about the gene), OMIM (Phenotypes), MapView (All maps) and Blink (protein homologies).
In this first release, only the 10,544 RefSeq mRNAs available on October 13th were aligned to NCBI contigs (October 5 freeze). A model is displayed if the mRNA aligns on the genomic sequence, finished or draft, over at least 700 bp or half its length, with less than 3% discrepancy. A discrepancy is either a single base substitution, a base insertion or a base deletion.
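The display criterion above can be written as a small predicate. The sketch below is a minimal illustration, not the Acembly implementation: the function name and arguments are hypothetical, and it assumes that the discrepancy rate is measured relative to the aligned length, which the text does not specify.

```python
def model_is_displayed(aligned_bp: int, mrna_length_bp: int, discrepancies: int) -> bool:
    """Return True if an alignment meets the stated display threshold.

    discrepancies counts single-base substitutions, insertions and deletions.
    Assumption: the 3% limit applies to discrepancies per aligned base.
    """
    if aligned_bp <= 0 or mrna_length_bp <= 0:
        return False
    # "over at least 700 bp or half its length"
    long_enough = aligned_bp >= 700 or aligned_bp >= mrna_length_bp / 2
    # "with less than 3% discrepancy"
    clean_enough = discrepancies / aligned_bp < 0.03
    return long_enough and clean_enough


# A 2,000 bp mRNA aligned over 800 bp with 10 discrepancies (1.25%) passes;
# the same alignment with 30 discrepancies (3.75%) does not.
print(model_is_displayed(800, 2000, 10))
print(model_is_displayed(800, 2000, 30))
```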
Of the 10,544 RefSeqs, 9,409 (89%) aligned as well or better than defined above, in 9,045 genes, containing 9,299 mRNAs (1.03 RefSeq per gene).
Quality of alignment:
Some 5,423 RefSeq mRNAs (58%) produced an excellent alignment to 5,224 genes (by excellent, we mean more than 98% of the length aligned with less than 1% discrepancies from the underlying genome sequence). These models are expected to be complete.
Other RefSeqs match less well: if we allow for 3% base mismatches relative to the current genome (3% single base insertion, deletion or variation between the RefSeq and the underlying genome sequence), 1718 (18%) align over 90 to 98% of their length, 690 (7%) over 80% to 90%, and the remaining 1578 (17%), over 50% to 80% or 700 bp.
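The quality classes described above can be expressed as a tiering function, and the four counts can be checked against the 9,409 aligned RefSeqs. This is a sketch under stated assumptions: the tier labels and function name are hypothetical, boundary values (exactly 90%, exactly 80%) are assigned to the higher tier, and an alignment covering more than 98% of the length with 1% to 3% discrepancies is placed in the 90-98% class, a case the text does not address.

```python
def alignment_tier(fraction_aligned: float, discrepancy_rate: float, aligned_bp: int) -> str:
    """Assign an alignment to one of the quality classes quoted in the text."""
    if discrepancy_rate >= 0.03:
        return "below threshold"
    if fraction_aligned > 0.98 and discrepancy_rate < 0.01:
        return "excellent"
    if fraction_aligned >= 0.90:
        return "90-98%"
    if fraction_aligned >= 0.80:
        return "80-90%"
    if fraction_aligned >= 0.50 or aligned_bp >= 700:
        return "50-80% or 700 bp"
    return "below threshold"


# Counts per class as quoted in the text, and their shares of the total.
counts = {"excellent": 5423, "90-98%": 1718, "80-90%": 690, "50-80% or 700 bp": 1578}
total = sum(counts.values())
shares = {tier: round(100 * n / total) for tier, n in counts.items()}
print(total, shares)
```

The four classes add up to exactly the 9,409 RefSeqs reported as aligned, and the rounded shares match the percentages given.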
These results, showing that 89% of the genes represented in RefSeqs can be mapped at good stringency in the genome, confirm the quality of both the genome sequence and the RefSeqs, and may be used to improve both.
The few redundant RefSeqs revealed by alignment to the genome are currently being reviewed.
Gene duplications or assembly errors: After selecting the best hits and keeping only hits of similar quality, 179 RefSeqs map to more than one place in the genome, hence contributing to multiple gene models. Some correspond to genome assembly errors, currently under review, some to actual paralogs and some to retroposons. Note that there are fewer gene duplications than expected (at most 2%), and most genes are uniquely assigned.
On the other hand, more genes than expected map uniquely in single exons. This happens for 7% of the RefSeqs, and the length of the gene is up to 3477 bp (Nuclear receptor interacting protein).
408 genes in this sample have more than one mRNA because they are alternatively spliced. This proportion is well below the actual fraction of alternatively spliced genes, since the average number of protein products is above 2 per gene (from aligning all mRNAs and ESTs in GenBank).
Unfortunately, a technical problem prevented export of 164 genes, some excellent, often in draft regions, to ASN.1, so that, in this release, only 9305 RefSeqs (NM accessions) provided models for 8881 genes (Locus ID), containing 9222 mRNAs and proteins (XM and XP accessions).
Using draft genome sequence and only RefSeq mRNAs in this release introduces natural limitations: because of the draft nature of the genomic sequence, of polymorphisms, of imperfections in the RefSeq sequences or in the Acembly program, many models are partial. 42% are shorter than the RefSeq they originated from, usually by failure to align at the 3' or 5' end, while 13% have internal gaps.
On the positive side, 77% of the model mRNAs (7296 of the 9409 aligned) appear to contain the entire CDS.
Draft quality of the genome sequence (end of contigs as well as pieces of sequence not fully ordered and oriented (see Contig Assembly Process)) is the major cause of problems: AceView provides a graphical display of both elements, facilitating evaluation of the quality of the area.
3' UTRs align less well than the coding region, due to a higher level of polymorphisms and to less careful cDNA sequencing. Acembly may then align the mRNA but refuse to add the divergent sequence to the model if too many discrepancies accumulate locally.
Missing 5' exons have two major causes: the first is the draft state of the sequence; the second affects about 100 genes in which a bug prevented the program from finding the first exon. This bug has now been fixed.
Future releases will include all mRNAs and ESTs from GenBank. The RefSeqs will be improved and probable genome sequencing errors will be flagged. Using the entire mRNA/EST set from GenBank allows Acembly to reconstruct 77518 mRNAs encoding more than 80 amino acids, from 36453 genes, on the October 5 draft.
The AceView web pages present three bubbling graphics of each gene model:
The AceView gene models can be accessed from the Map Viewer, LocusLink (from the av symbol) and the XM record.
Data Input
In the current release, the sequence data used to develop the models included:
The mRNA sequences were aligned to genomic contigs using the Acembly program, described in the next section. The genomic contigs were also BLASTed against the vertebrate, non-human proteins and RPS-Blasted against a motif database to annotate the homologies.
The Acembly program was used to produce the gene models. It can align all human mRNA and EST sequences to the genomic sequence. If ESTs or mRNAs produce different models because of alternative splicing, all models are displayed. If an mRNA aligns to multiple locations on the genomic sequence, Acembly keeps only the best alignment. But if two of the alignments are of similar quality, Acembly keeps both, and shows that mRNA in bold (as described in item 6 under Data displayed).
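The "keep the best alignment, or both when they are of similar quality" rule can be illustrated as follows. This is a hedged sketch rather than Acembly's actual logic: the `Alignment` record, the positive numeric `score` and the 2% `tolerance` used to define "similar quality" are all assumptions introduced for the example.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    contig: str    # hypothetical identifier of the genomic contig hit
    score: float   # hypothetical positive quality score, higher is better


def keep_alignments(hits: List[Alignment], tolerance: float = 0.02) -> List[Alignment]:
    """Keep the best hit plus any hit scoring within `tolerance` of it."""
    if not hits:
        return []
    best = max(h.score for h in hits)
    kept = [h for h in hits if h.score >= best * (1.0 - tolerance)]
    return sorted(kept, key=lambda h: h.score, reverse=True)


def is_ambiguous(kept: List[Alignment]) -> bool:
    """An mRNA kept at more than one location would be flagged (shown in bold)."""
    return len(kept) > 1


hits = [Alignment("contig_A", 0.990), Alignment("contig_B", 0.985), Alignment("contig_C", 0.800)]
kept = keep_alignments(hits)
print([h.contig for h in kept], is_ambiguous(kept))
```

With these made-up scores, the two near-equal hits are both retained and the mRNA is flagged as ambiguous, while the clearly weaker third hit is dropped.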
A model might be incomplete, if the available mRNA and EST sequence data for a gene is incomplete or if it does not match the genomic sequence over its entire length.