AceView News and release updates (includes AceView statistics, progress report and archives).

Last updated January 12th, 2007

o   January 2007 (human 36.2/hg18)

o   August 2005 (human 35.4/hg17; updated WormGenes, still on WS140)

o   June 2005 (human 35.3/hg17; WormGenes on WS140 since April 10, updated June)

o   November 2004 (WormGenes on WS130, Human 35.2)

o   October 2004 (Human 35.1)

o   September 2004 (Human 35)

o   May 2004 (Arabidopsis)

o   January 2004 (human 34)

o   July 2003 (Human 33, nematode WS97)

o   May 2003 (Arabidopsis)

o   February 2003 (Human 31)

o   August 2002 (Human 30)

o   December 2000

If you want to receive an announcement when we update the data on the site, please e-mail us. If you want to unsubscribe, please click here.

 January 12, 2007: new release for human


We are pleased to present, at long last, a new release of the human genes, built in AceView on the current genome (NCBI 36.2/ UCSC hg18), using RefSeqs NM and NR entries, GenBank mRNAs and ESTs from January 8, 2007.

In this release, as detailed below, we substantially enriched the content: we now systematically annotate alternative promoters and alternative polyadenylation sites, in addition to alternative splicing patterns, and we provide a more careful selection of the putative coding sequences. We also developed a new compact display where all the alternative variants are shown co-aligned, their introns being spliced out, so that the details of protein annotations and microarray expression results can be analyzed at a glance. We participated in the analysis of the microarray quality control project, and present our view of the results in a new display: many genes outside of the Entrez Gene catalog appear as differentially expressed in brain versus cell lines. We also show that cases of contradictory differential expression across platforms often result from probes assessing different sets of alternative variants: alternative transcripts are regulated differentially!

Finally, we continued to improve the code, fine-tuned the definitions and filters, and manually treated the data so as to obtain a quality of transcript reconstruction as good as or better than the best manual annotation.

The large increase in data, mainly the 1.8 million new 5’-complete cDNA clones from the FLJ project (Kimura,…, Isogai, Sugano 2006), led to an increase by 4% of the number of genes with introns, to 36,812, and an increase by 31% of the number of alternatively spliced transcripts, to 192,671 (an average of 5.2 alternative mRNA per spliced gene, or N per spliced gene with more than 5 cDNAs). Altogether, we annotate in this AceView release a total of 60,371 main genes (36,812 with introns, of which N are protein coding, and 23,559 intronless genes also potentially protein-coding). There are also N ‘putative’ genes and N cloud elements.

 

Since the last release, we were happy to observe that AceView transcripts, in the ENCODE regions, are extremely similar to the Havana/Vega hand curated transcripts that were selected as the Gold standard by NHGRI. The other methods participating in the Gencode project see at best one third of the transcripts seen by Havana, whereas AceView has 80% exactly identical transcripts. In Genome Biology August 2006, in ‘AceView: a comprehensive cDNA-supported gene and transcripts annotation’, we published this, the principles governing our gene reconstruction, and our main results.

 

The Pfam motifs and associated InterPro and GO descriptions used to annotate the proteins are version 20.0 of Pfam, downloaded from St Louis September 20th, 2006. 

Statistics are below.

Conceptual and code improvements for this release

Selecting open reading frames to annotate

We witness important progress in peptide identification, and large proteome datasets have started to become available. But we should keep in mind that most annotated protein sequences are predicted in view of the mRNA sequence (or the genome): even the current Swissprot/Uniprot is in majority composed of predicted rather than experimentally validated proteins.

A new way to select the best encoded predicted protein.

We have devised a new score system to choose the most likely product from each mRNA, and to help decide if a transcript is potentially protein-coding or not.

-         We score the length of the predicted CDS:

o       CDS above 100 aminoacids: every 100 aminoacids stretch scores 1 point

o       between 60 and 100 aminoacids: 0 point

o       below 60 aminoacids, -1 point

-         if the initiator codon Met is atypical (fits Kozak’s rule, but starts at a non-ATG codon) and the CDS is below 80 aminoacids, -0.85

-         We count introns within the CDS and introns outside the CDS. “Good looking CDSs” have all or almost all intron scars within the CDS.

o       If all introns are within the CDS, each intron scores 1.

o       If 1 intron is outside the CDS, each intron in the CDS scores 1 point, except if there are only two introns total, then the score is 0. If a transcript has a unique intron outside the CDS, the score is -1.

o       If 2 or more introns are outside the CDS, we score +1 per intron inside the CDS and -1 per intron outside the CDS. However, some rare transcripts have an operon-like structure (e.g. BAGE4andTPTE.aNov06): unlike pre-messengers, they may have lots of introns, but potentially encode multiple CDSs, located successively on the transcript and possibly molecularly distinct (i.e. not clearly belonging to a unique protein and encoded by a partly unspliced mRNA).  To avoid losing these significant CDSs that could all end up having negative scores and be all dismissed, we score a maximum of 4 introns outside the CDS (maximum penalty of 4 for introns outside the coding region).

Note that NMD has been suggested to lead to rapid degradation of mRNAs with introns located downstream of ~55 bp up from the last intron scar, but the large number of cDNA sequences from scores of cDNA libraries that share this property indicates that NMD does not act on all genes, all transcripts, in all tissues or at all times, or else that it is quite inefficient.

-          We examine BlastP homologies with an expect value below 10^-3 and run TaxBlast. Existence of any BlastP hit(s) to a species other than self scores 1 point.

-          We consider the Pfam significant hits, with thresholds as recommended by Sean Eddy. We exclude hits to frequent retrotransposons and retroposons, not to rescue these products too actively: DDE, gag_*, GP36, rve, rvp, rvt_1, transposase_* and ribosomal_*. Existence of any other Pfam hit(s) scores 1 point.

-          We examine the motifs defined by Psort2 and exploit the predicted cellular localization. We score a maximum of 1 point if any of the following domains is found:

o         transmembrane domain

o         coiled-coil region

o         ER retention domain

o         Golgi transport domain

o         N-myristoylation domain

o         Prenylation domain

o         With high probability (>=50%, except when indicated in parenthesis for each compartment), the NH2 and COOH complete CDS of more than 70 aminoacids is predicted to be localized either in the plasma membrane, the mitochondria, the endoplasmic reticulum, the Golgi (40%), the cytoskeleton, peroxisomes, lysosomes, secretory vesicles, or is secreted or extracellular

-          If the CDS is NH2 and COOH complete and already has at least one point, it scores an extra 0.8 points if it is the first 5’ encoded peptide.

-          To decide between very close CDS, we add 0.001 points per aminoacid, so that the longest CDS wins over a shorter one with equal annotation grade.

A CDS with 1 to 5 points is ‘good’, a CDS with 5 points or more is ‘very good’. For each mRNA, we select the CDS with the highest score and kill all other CDSs unless they are of ‘very good’ grade. The integral part of the score is given in the text.
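
As a rough illustration, the scoring rules listed above can be sketched in Python as follows. This is not the AceView code: the `Cds` record and all function names are hypothetical, the treatment of a CDS of exactly 100 aminoacids (scored 0) is an assumption, and so is the case-insensitive matching of the excluded Pfam family names.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Pfam families that do not earn the Pfam point (patterns taken from the list
# above, lower-cased here; case-insensitive matching is an assumption).
EXCLUDED_PFAM = ("dde", "gag_*", "gp36", "rve", "rvp", "rvt_1",
                 "transposase_*", "ribosomal_*")


@dataclass
class Cds:
    """Hypothetical record holding the evidence for one candidate CDS."""
    length_aa: int
    introns_in_cds: int = 0
    introns_outside_cds: int = 0
    atypical_start: bool = False      # fits Kozak's rule but starts at a non-ATG codon
    foreign_blastp_hit: bool = False  # BlastP hit in a species other than self
    pfam_hits: list = field(default_factory=list)
    psort_feature: bool = False       # any Psort2 motif or localization listed above
    complete: bool = False            # NH2 and COOH complete
    first_5prime: bool = False        # first encoded peptide from the 5' end


def length_score(length_aa):
    if length_aa > 100:
        return length_aa // 100       # one point per full stretch of 100 aminoacids
    if length_aa >= 60:
        return 0
    return -1


def intron_score(inside, outside):
    if outside == 0:
        return inside
    if outside == 1:
        if inside == 0:
            return -1                 # unique intron, located outside the CDS
        if inside == 1:
            return 0                  # only two introns in total
        return inside
    return inside - min(outside, 4)   # penalty for outside introns capped at 4


def counts_for_pfam(name):
    lowered = name.lower()
    return not any(fnmatchcase(lowered, pattern) for pattern in EXCLUDED_PFAM)


def cds_score(cds):
    score = length_score(cds.length_aa)
    if cds.atypical_start and cds.length_aa < 80:
        score -= 0.85
    score += intron_score(cds.introns_in_cds, cds.introns_outside_cds)
    score += 1 if cds.foreign_blastp_hit else 0
    score += 1 if any(counts_for_pfam(hit) for hit in cds.pfam_hits) else 0
    score += 1 if cds.psort_feature else 0
    if cds.complete and cds.first_5prime and score >= 1:
        score += 0.8
    return score + 0.001 * cds.length_aa  # tie-breaker in favour of the longest CDS


def grade(score):
    if score >= 5:
        return "very good"
    if score >= 1:
        return "good"
    return "partial or non-coding"
```

For each mRNA, one would then keep the candidate with the highest `cds_score` and drop the others unless their grade is "very good".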

All genes with a ‘good’ product are placed either in the main or putative class of genes:

- they go to the main class if they have introns with standard boundaries (gt-ag or gc-ag), or their best product encodes more than 100 aminoacids, or they have more than 30 cDNA supporting clones, or they have an Entrez Gene ID or an OMIM annotation, or finally they include a RefSeq NM or NR.

- other genes with no introns, but with a ‘good’ product or supported by 5 to 29 cDNA clones are ‘putative’ genes.

- genes with none of these properties become ‘cloud genes’; they are supported by 1 to 4 cDNAs, they align with no intron and do not visibly encode a protein. Note that the average length of cloud genes is 500 bp (+-200), leaving open the possibility that a fraction of those genes represent 5’ or 3’UTR of partial transcripts of previously known or new genes. Others may correspond to artefacts, such as DNA contaminations in RNA libraries. Dense microarrays, such as tiling and exon arrays, will teach us more about the properties of these genes and bring some evidence as to their real or artefactual nature.
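
The three classes above can be read as a simple decision cascade. The sketch below is one possible reading, not the AceView implementation: the function and parameter names are hypothetical, the main-class criteria are tested first, and a gene with exactly 30 clones (a case the text leaves open) falls into the putative class.

```python
def classify_gene(has_standard_intron, best_product_aa, n_cdna_clones,
                  has_good_product, has_gene_id_or_omim=False, has_refseq=False):
    """Return 'main', 'putative' or 'cloud' for one gene (illustrative only)."""
    if (has_standard_intron          # gt-ag or gc-ag intron
            or best_product_aa > 100
            or n_cdna_clones > 30
            or has_gene_id_or_omim
            or has_refseq):          # RefSeq NM or NR
        return "main"
    if has_good_product or n_cdna_clones >= 5:
        return "putative"
    return "cloud"                   # few clones, no intron, no visible protein
```

For example, `classify_gene(False, 80, 12, False)` gives "putative", and `classify_gene(False, 0, 3, False)` gives "cloud".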

 

A CDS with a negative score, or with a score below 1, corresponds to either a partial product or a non-coding RNA.

- Promoter sequence: To facilitate the search for promoter elements, we currently provide the 5 kb sequence upstream of each mRNA. We noticed that Xie et al, 2005 “define each promoter region as the non-coding sequence contained within a 4-kilobase (kb) window centered at the annotated transcriptional start site (TSS)”. Please tell us if you would prefer this 4 kb sequence instead of the 5 kb upstream.
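
To make the difference between the two promoter definitions concrete, here is a small sketch computing both windows. The function names are hypothetical, and the coordinate convention (0-based, half-open intervals, with `tss` the index of the first transcribed base) is an assumption; the chromosome end is not checked.

```python
def upstream_window(tss, strand, size=5000):
    """Interval covering `size` bp upstream of the transcription start site."""
    if strand == "+":
        return max(0, tss - size), tss
    # On the minus strand, upstream lies at higher genomic coordinates.
    return tss + 1, tss + 1 + size


def centered_window(tss, strand, size=4000):
    """Interval of `size` bp centered at the transcription start site."""
    half = size // 2
    start = tss - half if strand == "+" else tss + 1 - half
    return max(0, start), start + size
```

With a plus-strand TSS at 10000, `upstream_window` returns (5000, 10000), while `centered_window` returns (8000, 12000): half of the 4 kb window covers transcribed sequence.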

 

Statistics for this release (build 36/January 2007):

This evolution in the number of genes and variants demonstrates that the current transcriptome is far from saturated: if we aim at defining the true complexity of the human genes for applications to human health, there is ample need for more large scale cDNA sequence projects.  We would recommend that all IMAGE clones be re-sequenced in single pass long reads from both the 5’ and 3’ ends, a relatively cheap and certainly cost-efficient endeavor, since most cDNA clones’ inserts are below 2 kb and the current sequences are informative over about 1 kb (i.e. the entire structure of the cDNAs would be made available at once), whereas most have been sequenced only from one end at times when 400-450 bp were the standard amount of information acquired. AceView has no difficulty aligning ESTs over a kb or so: such simple re-sequencing would allow identification of many more alternative variants, and enrich the MGC collection with alternative variants.

 

 

 

 October 7, 2005: The NCBI version of the Acedb/AceView software was made available on the ftp site:


ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/AceView. It includes the Acedb database manager, written in the early 90s by Richard Durbin, now at Sanger, and Jean Thierry-Mieg, now at NCBI. As a database engine, the NCBI version is compatible with the Sanger Center version: the data files can be freely exchanged between the two systems. They can even run from the same disc and they both support AcePerl. However, over the years, the codes have evolved to suit the needs of the two main Acedb authors and their users. We have incorporated in the AceView NCBI version our very powerful cDNA alignment code, and many optimizations and new graphics to support sequencing trace editing, genome assembly, mRNA to genome alignment and biological annotation of the genes. “TableMaker” was expanded to enable the selection of sets of genes with complex combinations of properties, including sequence constraints. The AceView web site is supported by a new version of the Acedb database server and a C language programmers’ interface called AceC, which are part of the present distribution. The human servers currently run on a standard Intel Linux box with 4 GB of RAM and two processors. Support for the NCBI version will be provided by Jean at NCBI. All 32- or 64-bit Unix/Linux platforms are supported, including IBM, Sun, Intel, opteron, alpha ... MacX, and Windows/Cygnus.

We thank Nicolas Thierry-Mieg for his help on this subject.

Sept 14, 2005: Typing a query often yields a list of genes. Genes used to be arranged according to map position; we now offer users the choice of sorting the lists alphabetically (this mode is now the default).

August, 2005:  new release, with updated code and data for human and worm


August 31: We are pleased to announce a new release of the human genome (still build 35, UCSC hg17). We updated RefSeqs and GenBank mRNAs on August 11, and ESTs on April 5, 2005.

Our AceView/WormGenes is now updated every two weeks, to reflect our efforts of annotation of the worm genes. We currently use the latest WS140 genome from WormBase, GenBank mRNAs and ESTs from August 16, 2005 and all raw sequence data, most of them hand edited, from the Kohara, Vidal and Saint Louis cDNA projects. We recently added a pointer to a useful resource, RNAiDB (K. Gunsalus and F. Piano, New York University).

The Pfam motifs and InterPro descriptions used to annotate the proteins are version 17.0 of Pfam, downloaded from St Louis May 30th, 2005. 

 

Since the last release, we made important conceptual and code improvements, stimulated by our users’ feedback, for which we thank them. We believe that this release made a quantum leap in quality and we hope our users will enjoy it! Of course with radical changes, we may have introduced bugs: please tell us if you see any misbehavior.

Statistics are below.

Conceptual and code improvements for this release

1.     We have redefined what a “gene” is for AceView: until now, we used to define genes molecularly by the mere clustering and contiguity of the footprint of the mRNA sequences on the genome. But for people who see a “gene” as a set of mRNAs producing one type of protein, our definition is counter-intuitive, because transcription often leaks from one such “gene” into the next. For reconciliation, we now shrug and shed all intronless contacts between mRNAs, and define the gene as the set of mRNAs sharing at least one intron boundary. This has two main consequences:

o First, we split into two separate overlapping genes many of the genes that are transcribed from the same strand, have some mRNA sequence contact, and produce two families of distinct well-known proteins. Sequence contact was convincing and real, but usually through the 3’/ 5’UTRs. The move should solve some confusion for the biologists who do not like genes with double names, GeneA.and.GeneB: we have split 142 of the previous genes with multiple LocusID, including 87 cases, leading to 174 genes with RefSeq encoding well known proteins. We still have about 1000 genes with more than 1 locus/GeneID, but a majority correspond to provisional/predicted NCBI gene models. Note that our new method should not split unduly the hundreds of real gene complexes, producing at least two types of proteins with no aminoacid in common, A and B, but where a cDNA shares introns with both A and B producing mRNAs: such gene complexes usually produce proteins of type A, B, and AB.

o Second, the new definition leads us to separate and shed from the genes the intronless mRNA variants as well as variants with sequence contact through exons, but only unique introns, not shared with the other variants. This modification is also beneficial, because it removes mRNAs that certainly belong to the gene, but may be incomplete or more dubious. A clear advantage is that unspliced variants no longer contribute to the counts of alternative variants and alternative features. In this way, although we have increased to 4.5 million the number of GenBank/dbEST accessions used in the gene models, the number of alternative variants remains stable at 5 per gene (for 28,501 genes with standard introns and at least 2 cDNA clones, EST or mRNA).

2.   We modified the way we choose an mRNA path through the clusters of cDNA footprints on the genome: we now let the protein length mildly influence the path that we select from among the various possibilities. The main effect is that we now refrain from concatenating two pieces when the second appears to correspond to an alternative promoter, because it contains a Stop upstream of the main open reading frame, and hence it should constitute a 5’ end of its own.

However, let us recall that AceView tries to avoid combinatorial reconstruction: we strictly enforce the rule that a given cDNA cannot be used twice (in different variants), so that AceView necessarily offers a lower bound on the number of alternative variants. By construction, concatenated variants associate structural elements (such as introns) that may be rare, since they cannot be merged in other variants. Hence we expect that, when more sequences become available (in particular from the cDNA clones with an EST in the variant), a number of concatenates will become split into two or more alternative variants.

3.   We ignore more actively the cDNA clones that have non-standard introns or are rearranged: We used to mask all introns with atypical boundaries and whose feet lay inside other exons of the gene, unless they were supported by more than 1 (2000-2004), or 2 clones (2004-today). The rationale was that these non-standard exons result more likely from deletions in the insert of the cDNA than from real non-standard splicing. We were carefully labeling these exceptions in the cDNA, then using the locally masked clone in the reconstructions. Thanks to the Havana team and the Gencode project (displayed at UCSC http://genome.ucsc.edu/encode/encode.html , http://genome.imim.es/gencode/ ) we learnt that most of the anomalies, whether supported by one or many clones, and irrespective of where they lie with respect to exons, do not retest when using RT-PCR. We have therefore taken a radical approach to eradicate non-standard introns: if a given cDNA brings at least one standard gt-ag or gc-ag intron which is unique to the clone, we keep it even if it also contains an anomaly, and we eventually mask the anomaly. But if the clone’s only novel feature is a non-standard intron, a gap in the alignment, or a fuzzy intron incompatible with gt-ag or gc-ag, we list the clone as belonging to the gene, but ignore it and do not show its specific anomalous mRNA model. In the Aug05 build, this applies to 65,889 accessions, or about 1.5% of all cDNAs. Interestingly, among the mRNAs rejected in this way as redundant variants with an anomaly are 5,779 GenBank mRNA (3.3% of all aligned mRNAs) and 411 RefSeq (1.7% of the RefSeq public on Aug. 11, listed here).

This simple treatment has made AceView look more conventional and apparently more correct at very low cost.

4.   Annotating proteins:

-       Our knowledge about translation in higher eukaryotes is still meager: we lack direct protein data. We feel AceView knows how to align cDNAs back on the genome, but we have no pretension with respect to proteins. To choose the initiator Met in NH2 complete proteins, our current approach is naïve.  If we gain 30 aminoacids by starting on an NTG rather than the more standard ATG, we do it, but do not imply this is biological.

-       We catalogued 238 mRNA, from 191 genes, where a selenocysteine or a leaky stop would enlarge the protein considerably. Similarly, we noted 8469 proteins that would largely benefit from a translational (or genomic) frameshift. Quite often, the two contiguous fragments, although in two different reading frames, are homologous to the same protein. We do not know the significance of this observation, if it has one.

-       Multiple proteins per mRNA: We are in fact convinced that discovery of new proteins by researchers, using for instance mass spectrometry, would be greatly accelerated if they were provided with a complete un-amputated list of coding sequence, since their technique relies on recognizing candidate sequences. Their progress will be inhibited if the space of possible coding sequences provided to them is limited by bioinformatics considerations not substantiated by hard experimental evidence.

For these reasons, we do not hesitate to annotate multiple proteins per AceView mRNA. In the August 05 release, we freely annotate more than one putative open reading frame per reconstructed mRNA in 26% of the mRNAs (127,941). This helps us choose the “best” protein when there is some ambiguity as to which CDS/ORF, if any, is most likely to be translated. We are happy to annotate short non-overlapping CDS present in the 5’UTRs (akin to uORFs, see this work for example) or in long 3’UTRs (a frequent happening, not yet validated at the protein level in the literature to our knowledge). We have rationalized the naming system in such instances. If for example mRNA ABC.a may encode 3 putative proteins, we define the “best protein” and call it by the name of the variant ABC.a, other putative proteins are called ABC.a1 and ABC.a2. 

5.   A few graphic improvements:

o  We now identify graphically the variants whose best coding region is fully supported by at least one single cDNA clone in GenBank: their names (below the 3’ end of the pink mRNA, shortened to a, b, c...) are underlined on the diagram. Any concatenated mRNA model, where more than one clone is required to cover the CDS, does not have its name underlined. For these, the annotated CDS could turn out to be a mosaic of different alternative variants. Bench work is needed.

o   Encouraged by the progress we believe we see in the cleanliness and reliability of our proposed variants, we have added, below the RefSeq summary and on top of the pages presenting the complete cDNA annotations, a little summary of the AceView results, with pointers to the main tables: introns-exons, cDNA supporting clones, and mRNA/protein annotation summary.
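
The gene definition introduced in point 1 above (a gene is the set of mRNAs sharing at least one intron boundary) amounts to a connected-components computation. The following is a minimal sketch using a union-find structure; the input layout and the function name are hypothetical, and this is not the AceView code.

```python
from collections import defaultdict


def cluster_by_shared_intron_boundary(mrnas):
    """Group mRNAs into genes when they share at least one intron boundary.

    `mrnas` maps an mRNA name to (chromosome, strand, introns), where introns
    is a list of (donor, acceptor) genomic positions. Hypothetical layout.
    """
    parent = {name: name for name in mrnas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # boundary site -> first mRNA found using it
    for name, (chrom, strand, introns) in mrnas.items():
        for donor, acceptor in introns:
            for site in ((chrom, strand, "donor", donor),
                         (chrom, strand, "acceptor", acceptor)):
                if site in first_seen:
                    parent[find(name)] = find(first_seen[site])
                else:
                    first_seen[site] = name

    genes = defaultdict(list)
    for name in mrnas:
        genes[find(name)].append(name)
    return sorted(sorted(members) for members in genes.values())
```

In this sketch, intronless mRNAs and mRNAs whose introns are all unique end up alone, whatever their exon overlap with other mRNAs, which mirrors the second consequence described in point 1.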

 

Statistics for this release (build 35/August 2005):

 

To reconstruct the AceView mRNAs, we use 91% of the GenBank mRNA (174,129) and 76% of the dbEST human data (4,325,853), a 5% increase relative to the Nov04 version. We also use 23,895 RefSeq. Quality of the alignments remains excellent. Between 1.1% and 2.2% of the sequences map ambiguously, in about 1500 fully or partly repeated genes.

 

About introns:

The 4.5 million cDNA sequences aligned on the genome define 312,932 different standard introns, all well supported (i.e. a cDNA sequence exactly matches the genome over 16 bp, equally split on both sides of the intron boundary).  97.7% are gt-ag, 2.2% gc-ag (a bit high?) and 0.08% at-ac (265 cases). An additional 4485 introns are anomalous, and most probably correspond to defects in the cDNA clones (or the genome).
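
As an illustration of how such a tally can be made, the sketch below classifies introns by their first and last two bases and reports the share of each type. The function names are hypothetical, and the intron sequences are assumed to be given on the transcribed strand.

```python
from collections import Counter

# Boundary types counted in the statistics above.
KNOWN_BOUNDARIES = {("gt", "ag"): "gt-ag", ("gc", "ag"): "gc-ag", ("at", "ac"): "at-ac"}


def classify_intron(intron_seq):
    """Name the boundary type from the first and last two bases of the intron."""
    seq = intron_seq.lower()
    return KNOWN_BOUNDARIES.get((seq[:2], seq[-2:]), "non-standard")


def boundary_percentages(intron_seqs):
    """Percentage of introns of each boundary type."""
    counts = Counter(classify_intron(seq) for seq in intron_seqs)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}
```

The same arithmetic applies to the figures quoted above: 265 at-ac introns out of 312,932 is 100 * 265 / 312932, or about 0.08%.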

We currently annotate 65% of the introns as alternative, but this is an upper bound because we over-count a little, due to a newly introduced bug.

Introns are easy to define, and very reliable markers of expression and of the splicing and alternative splicing patterns. Long primers sitting across exon boundaries could be excellent tools for microarray building, or RT-PCR experiments. We plan to add such sequences,  possibly 80 or 100 bp (40/50 bp on each side?), taken on the reconstructed mRNA model, and to count the support and alternative nature of the primer. We are still thinking about how to present these data, in each gene or mRNA model as well as in bulk, so that they are useful to our users. If you have comments or ideas, please tell us.
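
A possible way to extract such junction-spanning sequences from a reconstructed mRNA model is sketched below. The function name and input layout are hypothetical; with the default flank of 40 bases on each side, each sequence is at most 80 bp long.

```python
def junction_sequences(exons, flank=40):
    """Return one sequence per exon-exon junction of a spliced mRNA model.

    `exons` lists the exon sequences in transcript order. Each result takes up
    to `flank` bases on either side of the junction from the spliced mRNA, so
    a flank longer than a short exon extends into the neighbouring exons, and
    sequences near the mRNA ends are simply shorter.
    """
    mrna = "".join(exons)
    sequences = []
    position = 0
    for exon in exons[:-1]:
        position += len(exon)            # the junction lies just after this exon
        left = max(0, position - flank)
        sequences.append(mrna[left:position + flank])
    return sequences
```

For instance, `junction_sequences(["AAAAA", "CC", "GGGGG"], flank=3)` returns `["AAACCG", "ACCGGG"]`; the second sequence spans the short middle exon and reaches into the first one.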

 

Using this simple measure of standard intron counts, we may ask what is the added value of each type of data:

-   The NM and NR RefSeqs see a total of 171,864 standard introns (and 442 non-standard), or 55% of the total number of introns in the human cDNAs from the public databases. Therefore, despite the small size of the RefSeq database, they already touch more than half of the known standard introns. Their pattern is: 99.07% gt-ag, 0.80% gc-ag, 0.11% U12 at-ac (182 cases).

-    If we considered only the 174,129 GenBank mRNAs, they define a total of 226,014 standard introns, 72% of the total (98.7% gt-ag, 1.17% gc-ag, 0.10% at-ac (228 cases); in addition to 1623 non-standard introns).  

 

About the genes:

AceView clusters the 4.524 million reads into 56,491 genes with standard introns or well-coding (in addition to 40,567 “putative” genes, more partial or dubious, and 251,183 “cloud” genes, not supported by many clones, unspliced and not clearly coding, which may represent intermediates in the transcription process).

-       35,413 genes have fully-supported standard introns: 16,497 of these genes (or 47%) have at least one RefSeq, 22,280 genes with standard introns encode putative proteins above 100 aminoacids, the remaining 13,133 may encode short proteins, be non-coding, or partial. 

-       22,469 genes supported by mRNAs or ESTs do not have introns, yet encode putative CDS of more than 100 aminoacids (1005 are above 300 aminoacids). These intronless genes are possibly functional in the proteome. 1297 of these genes (5.8%) are represented in RefSeq (Aug 11, 05).

-       If we consider the potential for protein-coding of all genes, with and without introns, we find that 44,749 genes in this release encode a putative CDS of more than 100 aa, of which 14,152 encode CDS above 300 residues. 

-       But alternative splicing generates on average of the order of 5 variants, 4 protein isoforms per gene with introns. So the variety of proteins expected to be present in the cells is much greater: AceView annotates 133,744 different protein isoforms of more than 100 aminoacids, 35,536 are above 300 aminoacids.  69% of the AceView CDS above 100 aa, and above 300 aa are fully encoded within a single identified clone in GenBank/dbEST (91,738 different proteins > 100 aa; 24,367 distinct proteins > 300 aa have the coding part fully covered by a known cDNA), bringing the strictly non-dubious protein isoforms to 3.1 per gene with introns on average. The remaining 31% AceView proteins could still result from an inappropriate concatenation of two clones (26%) or more (5%), because, if no conflicts arise, we merge partial mRNAs.

-       6% of the proteins of more than 300 aminoacids are encoded by more than one alternative mRNA variants (38,013 mRNA variants encode only 35,536 different proteins). This is true for 9% of the proteins above 100 aa (185,073 mRNA give only 168,366 distinct proteins). That is because a vast majority (91 to 94%) of the alternative introns, alternative promoter or last exon affect the coding region.

-       Using stringent criteria to define antisense genes (part of our standard annotation, under “regulation”), we identify 10,960 genes with standard introns antisense to another gene with standard introns, i.e. 31% of all genes with standard introns might undergo this type of flip-flop regulation. Also, the function of up to one third of the non-coding genes with introns could be to regulate the gene on the other strand by antisense.

Some basic statistics for AceView on human, build 35/hg17, August 2005 (mRNA and RefSeq from Aug 11, 2005; ESTs from April 5, 2005):

Elementary alignments                                          RefSeq                    mRNA                       EST
Total available in public DB apr/aug05                         23,973                    191,058                    5,699,664
Number aligned                                                 23,895                    174,129                    4,325,853
% aligned                                                      99.67%                    91.14%                     75.9%
Alignment quality: total length aligned, # diff from genome    67.927 Mb, 66,529 diff    343.198 Mb, 748,102 diff
Ave. bp aligned per sequence, %                                (2830 bp) 99.9%           (1929 bp) 99.3%
% bp diff relative to genome                                   0.10%                     0.22%
Multi-aligned sequences (%)                                    227 (0.95%)               2466 (1.42%)               94543 (2.19%)
In # (repeated) genes                                          364                       1712                       19807

To be completed...

June, 2005: 


June 28, 2005: We put a preliminary version of our new build online but did not advertise it, because it had all the code improvements described above, but also a few bugs: we had lost alignments for a million cDNA sequences, usually intronless! Many thanks to the users who noticed this and let us know!

Data from this build were the RefSeqs NM and GenBank mRNAs from June 11, and ESTs from April 5.

January-April, 2005:  minor modifications to the site, as they are made public


April 10: We updated the worm version to the current freeze of the genome (known as WS140).

March 23: We noticed to our dismay that the lists of genes or clones with specific properties were no longer accessible from our GOLD work, and re-established the links from the supplementary material. We do not know how long the outage lasted, but are very sorry for it.

January 18: We fixed the PFAM and AceView mRNA Blast searches, which were not properly connected to the current InterPro definition files for PFAM, and to the current AceView mRNA files for tBlastn.

January 16: On the download site, we added a file providing the relation between all the aligned ESTs/mRNAs, the genes, the locusIds and the alternative variants, with indication of quality of match, tissue of origin, type of AceView gene, for all genes (main, putative and cloud). Help on the format of the accession to gene table has been updated as five more data columns have been added to enrich the data in this table. Meanwhile, we got rid of the 0.5% worst-quality alignments (30,403 alignments/6,428,329) by removing the ESTs or mRNAs aligned at AceView quality 10 or 11 (see the help here). Two files (The way all the ESTs/mRNAs are aligned in each gene, for all non-cloud genes (55.3 Mb) and the Same file as above, but for cloud genes (5.7 MB)) thus became redundant and were removed. The link to download the GOLD alignments was broken and has been restored.

 

November 24, 2004:  New WormGenes AceView version (on the WS130 genome) and improved human genes on human build 35


For this new version of the human genes, we use the same data as in the previous public release of October 24: the genome is NCBI Build 35/UCSC hg17 and the mRNA and EST were collected from GenBank/dbEST on Sept 24, 2004.

If we consider build 35 releases of September, October, November taken together, the main improvements are:

-       We have defined main genes as genes that have either a standard intron or encode a putative protein of more than 100 aminoacids: AceView has about 51,000 human genes of that kind in this release, 40,245 coding for CDS of more than 100 aminoacids and 11,046 with standard introns, but no CDS above 100 aa (see the statistics below for more details).

-       Neighborhood icon: On top of each gene page, we now display the “main genes” in the neighborhood of a given gene, on the strand where they occur (top or bottom of the line). The area shown is of the order of 4 genes on average, i.e. we display 400 kb in human and 20 kb in worm. Clicking on any neighbor in the icon leads to that gene’s description. This new feature, inspired by the Gene section at NCBI (but less beautiful in AceView), provides a natural way to “see” the neighborhood and to move around an area. Our current implementation may not be esthetic, but it seems to be efficient (thanks to Walter Zorn for his javascript graphics library). We are not sure that it is browser independent. Please mail us if you do not see something like this or cannot move by clicking (tell us which browser/machine you use, thank you):

 


 

Note that, in some dense areas, we had to “bump” the overlapping gene names and show only one or a few letters of the real gene name. Tell us if you think this is too confusing: the alternative (given the limited time we have) would be to write the names on top of one another in such cases. Only “main genes” are displayed in this icon.

-       To help protein studies, we would like to generate a catalog of putative proteins, as derived from the reconstructed mRNA model, using as sequence the genome footprint of the mRNA. We have put a version of this on our ftp site, hoping that this may, for example, help decode mass spectrometry spectra. Toward a complete catalog, we authorize multiple proteins per mRNA, because their existence is now well established. For instance, the reality of uORF has been shown on a relatively large scale, thanks to the wonderful work of the Sugano team. Furthermore, we have evidence in the worm that mRNAs encoding non-overlapping well known proteins are not degraded as they should be if the mRNAs were rare errors of the RNA maturing process. This observation lends support to the idea that ribosome reentry, tRNA variants adding an amino acid even when the codon encodes a Stop, and/or translational frameshifts may play a more prominent role in vivo than currently thought. And this might lead to either translation of more than one protein per mRNA, or to translation of a single protein rather than the two predicted on the basis of the mRNA sequence.

-       To be able to show the annotation of a single protein per mRNA on the web, as is conventional, and also to dump AceView genes for UCSC/EBI, we now define the “best” protein in each mRNA by using the protein annotation, not just the protein length: sometimes the most interesting protein is not the longest. In the same vein, we used the annotation of the (multiple) proteins to define “good” proteins. Then we display on the web only the good and the best protein, and to simplify, we annotate only the best so far.

-       We added in each mRNA summary some explicit information on which accessions (and clone) cover the main coding sequence encoded by the mRNA, and how well they match. There are indeed different levels of support:

o       The CDS read on the mRNA footprint on the genome may be exactly encoded by a given clone, with not a single residue difference. This is reported as an “exact match”, and such clones are listed under their GenBank accession name. When there are too many such clones, we give the total count, and we list preferably those from the large scale cDNA projects KIAA, FLJ, DKFZ and MGC, because they a priori make their clones available to the community: users may contact the person listed in the GenBank record to kindly ask for a clone of interest.

o       A cDNA clone may support the CDS almost exactly, and encode a putative protein of the same length as the genome footprint, but differing by one or a few amino acids. Such variations may be due to sequencing errors, mutations or polymorphisms in the cDNA (or more rarely in the genome). The differences in amino acids are indicated explicitly, using the standard code, e.g. P249H : the Proline at position 249 in the genome footprint is replaced by a Histidine in the protein encoded by the clone. X is an unknown amino acid, and corresponds to an uncalled base n in the clone. Because of the genetic code degeneracy, a number of variations at the level of the nucleotide sequence are silent. In some cases, the genome has a mutation or bears a very rare polymorphism; then most clones covering the entire CDS will show the same mutation.

o       A cDNA clone may support the CDS almost exactly but encode a protein of a different length, either smaller or larger, due to a mutation affecting a Stop (e.g. Stop582Tyr) or leading to a Stop (e.g. Gln374Stop), or due to a base insertion/deletion leading to a frameshift (different protein length in addition to a number of consecutive amino acid variations). These clones are usually usable just like the clones above, provided one is willing to fix at least one base by in vitro mutagenesis.

To select a clone when all have some amino acid differences, simple arithmetic does not help, unless you plan to do an in vitro mutagenesis [Note: resequencing is always recommended]. Of course, more similar amino acids (e.g. Ile/Leu) are usually more interchangeable. Knowledge of the protein and careful analysis of the alignments to proteins from the same family should be informative, because it may happen that a protein with three variations in less conserved areas (or in structural domains) will be fully functional while a protein with a single mutation in the active site will be dead! Unfortunately, the impact of the amino acid changes on the function of the protein is beyond our automatic analysis…

o       Finally, for nearly one in four proteins, the AceView mRNA and complete CDS are reconstructed by concatenating more than one compatible cDNA clone, so that no single clone sequence available so far covers the entire CDS. To validate or study such a putative protein, we recommend visually inspecting the image of the mRNA variant and choosing candidate clones that contain the N-terminal part of the CDS and are not yet fully sequenced (these are ESTs, recognizable by their length (< 1 kb) and by the gradient of blue/red point variations, increasing along the read). If they are FLJ or Image clones, you may be able to get them (see procedure above). The alternative strategy is to RT-PCR with the primers we provide, but be aware that the frequency of alternative variants in human is so high that you will have to subclone and sequence many colonies. Our impression, which is not mainstream, is that it is better to use the clones for which there is already some partial information.

We do not yet provide a list of candidates to resequence, but will if there is some interest.
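
The variation codes used above (P249H, Stop582Tyr, Gln374Stop) are regular enough to be read mechanically. Here is a minimal sketch in Python, assuming every code has the form reference residue, position, clone residue, with residues written in one-letter code, three-letter code or as the word Stop. The function name parse_variation and the choice of "*" to represent a Stop are ours, for illustration only, and are not part of AceView.

```python
import re

# Three-letter to one-letter amino acid codes; "Stop" is represented by "*"
# (an assumption made for this sketch).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Stop": "*",
}

# A residue is "Stop", a three-letter code, or a single capital letter.
_RESIDUE = r"(Stop|[A-Z][a-z]{2}|[A-Z])"
_PATTERN = re.compile("^" + _RESIDUE + r"(\d+)" + _RESIDUE + "$")


def _one_letter(residue):
    """Return the one-letter form of a residue (X, unknown, is kept as is)."""
    if len(residue) == 1:
        return residue
    if residue not in THREE_TO_ONE:
        raise ValueError("unknown residue: " + residue)
    return THREE_TO_ONE[residue]


def parse_variation(code):
    """Split a code such as 'P249H' into (genome residue, position, clone residue)."""
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError("not a variation code: " + code)
    ref, position, alt = match.groups()
    return _one_letter(ref), int(position), _one_letter(alt)


for example in ("P249H", "Stop582Tyr", "Gln374Stop"):
    print(example, parse_variation(example))
```

A tuple whose first element is "*" marks a lost Stop (longer protein), and one whose last element is "*" marks a premature Stop (shorter protein).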

-       The interest of hand annotation is clear: one can recognize defects and erase or fix them by attributing them to the most likely cause, for example a sequence was submitted on the wrong strand, or the insert appears partly deleted or rearranged, or a clone appears really 5’ complete and should not be merged on the 5’ side. So we have started to “look” at the human genes and resolve the anomalies, in particular the gaps, by hand annotating the cDNA clones. So far, we have finished one pass over all genes from chromosomes Y, 22, 21, 20 and half of 19. This type of hand annotation is transferred from one release to the next, so it will gradually build up. We don’t have time to do as much as would be needed (we are only a couple!), but we are more than happy to do this for your gene of interest (please mail us) if you believe that would be useful.

 

-       As already described in the October news, we have modified the presentation, in particular the graph, to allow a more in-depth grasp of the alternative features that distinguish the variants. Introns are now colored differently when they differ locally, and the variants are identified by a letter beneath each. The mRNAs themselves are colored according to the strength of their support: pink for the variants whose CDS is structurally supported all the way by single clones (i.e. these structural isoforms do exist in nature), green for those requiring concatenation of 2 or 3 cDNA clones, hence some further validation, usually by resequencing one or two clones. The expected result of resequencing is that the green mRNAs will validate into one, two or more alternative pink mRNA variants, because our clustering method minimizes the number of variants.

 

-       We now propose the links to the excellent resources from LocusLink or gene@ncbi, Genecard and UCSC directly from the first paragraph, and have added new links to H-inv and Unigene. Some links, for example to ensembl at EBI, are still only available from the links page.

 

-       Naming the complex or partly concatenated genes: For a number of genes, including officially named genes, two named genes actually overlap in sequence because some cDNA clones bridge the two genes. In some instances, the overlap is only in the UTR regions, in others, it also affects the coding regions, and a molecular gene including both known genes A and B may include proteins of type A, B and AB. This observation may be interesting biologically, since it indicates that at least in some cases, transcription of one gene may leak through the next gene in cis, so that the two genes may belong to some kind of operon-like or multi-cistronic transcription unit. But because we define a molecular “gene” in AceView by the clustering and contiguity of the footprint of the mRNA sequences on the genome, such known genes are considered a single molecular gene. This creates ambiguity for naming, especially when the two genes have an official name. To solve the problem in these rather frequent instances (there are altogether 626 instances in this build), we have introduced the notation GENEA_and_GENEB. 
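
The effect of defining a gene by the contiguity of its footprint can be illustrated with a small Python sketch. This is a deliberate simplification and not the AceView algorithm: it assumes each named gene is reduced to a single (start, end) footprint on one strand, and that two genes are merged as soon as their footprints overlap. The function name and the coordinates are hypothetical.

```python
def merge_overlapping_genes(genes):
    """Merge named genes whose footprints overlap on the same strand.

    genes: list of (name, start, end) tuples, all on one strand.
    Returns a list of (merged_name, start, end), where merged names are
    joined with '_and_' in genomic order.
    """
    merged = []
    for name, start, end in sorted(genes, key=lambda g: (g[1], g[2])):
        if merged and start <= merged[-1][2]:
            # Footprint overlaps the previous cluster: extend that cluster.
            names, cluster_start, cluster_end = merged[-1]
            merged[-1] = (names + [name], cluster_start, max(cluster_end, end))
        else:
            # No overlap: start a new molecular gene.
            merged.append(([name], start, end))
    return [("_and_".join(names), s, e) for names, s, e in merged]


# Hypothetical footprints: GENEA and GENEB overlap, GENEC stands alone.
toy_genes = [("GENEA", 100, 500), ("GENEB", 450, 900), ("GENEC", 2000, 2500)]
print(merge_overlapping_genes(toy_genes))
```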

-       In nematode, we are manually editing all the genes to complement and enhance the WormBase view. By hand, it is easy to distinguish the “complex locus” case from the “concatenated genes touching through their 3’UTR/5’UTR” case. Complex loci were initially defined on the basis of complementation tests in phage or Drosophila. In such a gene, two complete alternative variants may not share a base, although a third variant “bridges” them and overlaps both. By convention, we denote this property by adding the letter C for complex behind the name (e.g., 1C777C) or by a + sign if the gene had acquired two independent official gene names (e.g., mai-1+gpd-2+gpd-3). This convention follows that chosen by Ed Lewis for the Drosophila complex genes (e.g. BXC for bithorax complex locus, of which bx, Ubx, Cbx and pbx for example are alleles). In the case of nearby genes expressed from the same strand but with overlap between the 3’ end of the first gene and the 5’ end of the second, we use the suffix Co (for sequence in COmmon) appended after the gene name (e.g., cul-1Co or 5K225Co). We may also use the sign AND if both genes have a name (e.g., mev-1ANDced-9).

 

-       The main difference between this release and the previous public release we put up October 24th on the web is that we have fixed a gene naming problem affecting chromosomes 10 and 11 and some bugs reported by users (Our special thanks for the careful reports of Drs Elke Weiler and Christine Gosden). Also we found that two areas of chromosomes X and 2 had been lost. The data being stable enough, we are putting it on the ftp site for bulk use and for display at the UCSC Genome Browser and at EBI ensembl (Acembly tracks). Please check your gene of interest and please complain if something does not look right!

For the new nematode genes, the huge difference is that we now display the actual mRNA sequences derived from the cDNA sequences that we have hand-edited, and we treat this data in AceView the very same way we do the human cDNA sequences. We used to display the predictions made by the WormBase consortium (we still show them on the side, in blue-gray), but they are still too often unsupported and incorrect. However, predictions may be useful when there are absolutely no cDNAs available, or when the sequences available from the cDNAs do not allow complete reconstruction of the most likely mRNA. In such cases, we may fill the gaps in the sequences of the actual mRNAs with a predicted bit, if the predictions and the actual mRNA are compatible at the border. Also, to get a complete catalog of the putatives, we import in full the predictions from WormBase not supported by any cDNA data, and annotate those as predictions. Graphically, the predicted mRNAs or their “stolen” parts are displayed in blue, while the fully supported mRNAs appear as usual in pink.

October 24, 2004  New version on human build 35 : new features

back to top

This new release is also on build 35, but the EST and mRNA data are now from September 24th, 2004. We have not yet updated the statistics, but please refer to the September release on build 35, which describes in detail the properties of the genes.

But as of today, we know of a trivial bug in the gene name appearing in the title (by error, the official name does not always appear in the title). There is also another bug, affecting 0.5% of the genes, which are split in the middle. We will fix these bugs before we replace the current default version on build 34 with this one.

We thank our users for their thoughtful input on AceView.

September 23, 2004  News on human build 35 : statistics

back to top

September 23, 2004: The AceView human genes, using mRNAs and ESTs from August 2nd, 2004, aligned on the June 2004 version of the human genome (NCBI Build 35/UCSC hg17) are now available.

The Pfam motifs used to annotate the proteins from the human build 35/July 2004 are version 14.0 of Pfam, downloaded from St Louis August 3rd, 2004. 

 

Statistics for this release (build 35/July 2004):

The percentage of mRNAs and ESTs that we align increased only by 0.55% and 0.64% respectively, because the genome sequence gained in quality more than in quantity over the past year, from build 34 to 35. Yet, thanks to the increase in mRNA and EST sequences submitted to the public databases, we now identify a total of 96,797 well supported transcribed genes, a net increase of 11,839 relative to the release of July 2003 (on build 34). In particular, we gained 1,248 genes with standard introns, bringing the total number of independent genes with standard introns to 32,748.

 

On the current genome sequence, we reliably map 4,297,980 cDNA sequences present in GenBank August 2, 2004, a net increase of 269,318 mRNA or EST sequences from last year. That includes 4,131,646 ESTs (73.37% of all ESTs in GenBank/dbest) and 166,334 mRNAs from GenBank (94.96%). In addition, this build includes alignments for 21,565 NCBI RefSeq: the increase of 1,622 RefSeq mRNA mostly corresponds to “creations” by the NCBI RefSeq group, which generates this useful resource.

Although we do not mask for repeats, only 1.1% of the mRNAs and 2.2% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place.  The percentages of clones did not change since last time, but AceView has greatly improved in its ability to identify truly repeated genes. To make this more visible graphically, we now draw the clones aligning in multiple genes at the same quality in blue (rather than black), but keep the same color code for point variations relative to the genome: a thin red line for a single base change, transition or transversion, blue line for a single base insertion or deletion.  (from build 34, to be updated: As for the current quality of the alignments, mRNAs on average measure 2,132 bp and align over 98.8% of their length with 99.74% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 532 bp, and align over 93.3% of their length with 98.16% accuracy.)

 

AceView clusters the 4.30 million reads into a total of 96,797 genes (and 248,728 “cloud” genes, not supported by many clones, unspliced and not clearly coding, which may represent intermediates in the transcription process). If we push in the direction of the current trend for low numbers of genes, there are a total of 46,220 genes in this release that either encode a putative CDS of more than 100 aa or are spliced with standard introns. 32,618 genes encode a putative CDS of more than 100 amino acids and 32,748 genes have at least one validated gt-ag or gc-ag spliced intron.

Using stringent criteria to define antisense genes, we find that 8,572 genes with standard introns in this release are antisense to another gene with standard introns, i.e. 26% of all genes with standard introns might undergo this type of flip-flop regulation. Also, the function of up to one third of the non-coding genes with introns could be to regulate the gene on the other strand by antisense. Finally, 13,472 genes have no standard introns, yet they encode a putative CDS of more than 100 amino-acids (346 are above 300 aa), hence are possibly functional intronless genes.
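
The counts quoted in the two paragraphs above are linked by simple arithmetic, which can be checked in a few lines of Python. Only the input numbers come from the text. The variable names are ours, and the overlap (genes that are both spliced and coding) is not stated in these notes: it is derived here by inclusion-exclusion.

```python
# Counts quoted for the build 35 release.
coding = 32_618      # genes with a putative CDS of more than 100 aa
spliced = 32_748     # genes with at least one standard (gt-ag or gc-ag) intron
either = 46_220      # genes that are coding or spliced (or both)
antisense = 8_572    # spliced genes antisense to another spliced gene

# Inclusion-exclusion: |coding and spliced| = |coding| + |spliced| - |either|.
both = coding + spliced - either

# Coding genes without a standard intron.
intronless_coding = coding - both

# Share of spliced genes involved in an antisense pair, in percent.
antisense_pct = round(100 * antisense / spliced)

print(both, intronless_coding, antisense_pct)
```

The derived count of coding genes without standard introns matches the 13,472 quoted above, and the antisense share matches the quoted 26%.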

 

The genes with introns have on average 5.37 alternative variants per gene, of which 4.51 have introns.

 

We annotate two proteins in 3.2% of the AceView mRNAs (7,607/239,991), but do not for now make public the uORF annotations, so as not to scare the users away!  

In this AceView release, 69% of the CDS >100 aa are fully supported by a single identified clone covering the entire CDS (64,656 CDS from 30,110 genes encoding > 100 aa), bringing the strictly non-dubious protein isoforms to 2.8 per gene with introns on average. The remaining 31% of AceView proteins could still result from an inappropriate concatenation of two clones (26%) or more (5%, down from 8% last time!), because, if no conflicts arise, we merge partial mRNAs, hoping that more of you will sequence the entire insert of clones that we have to concatenate now, for lack of more complete data...

 

May, 2004  New build for Arabidopsis thaliana

back to top

All Arabidopsis mRNAs and ESTs were extracted from GenBank on April 27, 2004, and a new enhanced AceView release was made public on June 5, 2004. We did not have time to annotate the biology of these genes, in relation with TAIR, but the molecular aspects of gene construction have been fully updated, and the proteins have been annotated with our now standard pipeline, using PFAM, Psort2, BlastP and TaxBlast.

 

News January 2004  on human build 34 and new features

back to top

June 11, 2004: A file describing the association of each mRNA/EST accession satisfactorily aligned in AceView build 34 to the corresponding AceView gene is now available from the downloads page.

 

January 12, 2004: We released our new AceView genes, built over the human genome sequence from July 2003 (build 34/golden path hg16).

The Pfam motifs used for human December 2003 (build 34) are from version 10.0 downloaded September 24th, 2003.

 

We have improved the alignments, the clustering, the analysis and the presentation of the results, in part following reports or requests from our users, be they thanked wholeheartedly for their comments!

-         We have given unique names to the mRNA models, so that there will be no confusion from one release to the next: for this release, Dec03 was appended to each mRNA name. It should be expected that variants may change sequence from one release to the next: variant aDec03 will possibly differ from variant aMar04. Indeed, let us recall that the gene names are tracked in AceView, but the variant names are release dependent.

-         In this release, we have allowed multiple proteins per mRNA, although we limited this to cases where

o       The two proteins do not overlap and have similar sizes; this may reflect an actual operon-like precursor (we have noticed a number of confirmed ones even in human), or a bug of AceView missing a 3’ end or a 5’ end and concatenating cDNAs that only exist separately

o       the two proteins are only marginally overlapping (possible reasons for this could be: the genome has a single base deletion or insertion relative to the mRNA creating a frameshift, or more rarely there is a translational frameshift, or else there is a bug in the chaining of cDNA clones by AceView, and a real 3’ or 5’ end was missed)

o       the two proteins overlap, but they have about equal length so we cannot decide which frame is the biological one

o       In all cases, looking at the “Annotated mRNA” usually allows one to “see” if one or both proteins are real, since both have been annotated on the graphical view (but the text explanation in the annotated page has not yet been modified to clearly treat this case, and only one protein, the “best”, is annotated).

-         Sequences of the mRNAs (derived from the genome or from the cDNA consensus .AM), 5’UTR, 3’UTR, coding regions (nucleotide and protein sequences), introns and exons and upstream genomic area (some containing the promoter) are now available by clicking from the corresponding tables (mRNA, Proteins, Introns and exons). We have also added sequences containing the putative promoters. Most sequences are now available from a natural place: the “mRNAs” table, the “Proteins” table or the “introns and exons” table. We were also asked to provide pairs of primers to amplify each and every exon, and that will come for the next release.

-         The GO terms, in the summary table in the top of the gene page, are now comprehensively connected to all genes with the same terms.

-         We have changed the “Table of contents” in both its look and the organization of the data, especially in the “Gene summary” section. Please tell us if you notice some information is missing.

-         We have fixed the bug that used to split some genes vertically (hence the slightly reduced number of spliced genes in this release) and a bug that in 90 genes was creating introns with negative length (thanks to our nice users).

-         We have also fixed a few bugs in the analysis: our counts of alternative promoters were sometimes inaccurate; the length of sequence involved in an antisense was cumulative over the alternative mRNAs; our table of “Main cDNA clones”, which should allow you to get all clones required to fully support all variants, was sometimes incomplete; a sentence about operons was there irrespective of the data.

 

Statistics on Human build 34  Sorry we forgot to update this part until February 23…

Thanks to the increase in mRNA and EST sequences submitted to the public databases and to the amelioration of the genome sequence, we align on this new build 7% more mRNAs and ESTs in 547 more genes with standard introns. On this genome sequence, we map 4,028,662 cDNA sequences: 3,910,845 ESTs (73% of all ESTs in GenBank/dbest, a 7% increase relative to last time) and 117,817 mRNAs from the public databases (94.4%), as well as 19,943 NCBI RefSeqs, representing 99.8% of the current RefSeq collection.

Although we do not mask for repeats, only 1.0% of the mRNAs and 2.2% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place.  As for the current quality of the alignments, mRNAs on average measure 2,132 bp and align over 98.8% of their length with 99.74% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 532 bp, and align over 93.3% of their length with 98.16% accuracy.

AceView clusters the 4.02 million reads into 31,500 genes with at least one validated gt-ag or gc-ag spliced intron and on average 5.36 alternative variants per gene, 5.57 protein isoforms (because we accept a limited number of mRNAs where we annotate two proteins). Another 14,199 genes do not have confirmed introns, yet have open reading frames of more than 100 amino-acids (of which 596 are above 300 aa) and are hence likely functional intronless genes. The remaining 223,435 genes are unspliced, partial, or non-coding and will require further investigation. The reason we have so many genes in this build is that we used to filter the last category to keep only the genes that were long or supported by at least six clones. But people were sometimes searching AceView to locate the best alignment of a clone, and failing to find it because of the filter. We call this class of “genes” the cloud, and indicate this in the gene title. Their biological relevance remains to be established.

 

In this release, 98,629 (68%) of the spliced mRNAs are supported by a single identified clone covering the entire CDS; the remaining could represent an inappropriate concatenation of two clones (24%) or more (8%), because, if no conflicts arise, we merge partial cDNAs.

 

 

News June and July 2003  (posted September 20th)

back to top

We describe below the new releases and a few new features for

-   human (June 5th, 2003)

-   the worm Caenorhabditis elegans (June 23rd, 2003)

-   Arabidopsis (May 15th, 2003)

 

 

News on Human Build 33  (June 5th)   posted September 20th

back to top

October 3: Sorry, we had to remove human build 31, from Nov 2002, because of lack of space. These models should still be available from UCSC. If this causes a problem, please tell us.

We released our new Acembly genes, built over the human genome sequence from April 2003 (build 33 / golden path hg15), on June 5th.

 

 

 

Most significant improvements:  Progress for this build (September 20th, 2003)

back to top

The most significant improvements for this build are:

-         Tables of genes with a given Psort motif are now available from the “Table of Psort motifs”. They used to be more or less accessible by clicking on the red boxes in the mRNA diagram, but most of the time we could not show the whole lot because we had put a limit on the maximal number of genes in a single table. The table of genes with a coiled-coil, which are putatively involved in protein interactions, has been most appreciated by our loudest users.

-         Any AceView table now comes with an indication of the level of expression of each gene in the table, through the number of cDNA clones that Acembly assigns to that gene. We also re-ordered the genes, not by their alphabetical name, but according to their map position on the chromosomes. In human, we now provide in the tables both the known cytolocation and the actual chromosome on which Acembly assembled it, for a quick comparison.

-         Minor detail fixed to avoid spaghetti genes in Acembly. We unplug a very distant first or last exon if it has too many errors at an early step in the program. We may have lost a few small good looking genes in the process. We also have a bug in this release: some genes with very many clones are split into pieces and appear as a cluster of overlapping small genes. This will be fixed in release 34.

-         The query system has been improved. We fused what we used to call extended search and fast search into a new single query box which tries to implement the notoriously impossible 'what you get is what you wish' paradigm. By observing the queries we receive, and the data in the various fields of the database that we want accessible, we ended up ordering and classifying the data to be searched, and have drawn up a set of reasonable rules (= based on common sense) for extending what users type in the query box. The algorithm nearly always gives us a reasonable answer, but we would really appreciate users’ feedback if they encounter problems. In practice and to simplify, we first try to recognize exact gene symbols, locusID, GenBank identifiers and cDNA clones: if we do, we stop the search. If we don’t, we proceed by searching all other fields in the database using a general search, allowing word completion and extension both ways. All objects in the database point to genes, so we then collect the lists of genes. The drawback is that we may bring in genes somewhat distantly related to the query. To compensate, we have added the possibility of recursive searches. The reply to a query comes as a table giving the gene names, their position and a few words describing their function or phenotype (we have done that systematically for worm genes), plus a new query box that will query only inside this list. One possibly irritating feature of the query answer is that it may be difficult to trace back the term among the many documents attached to a gene; often the terms are found in the abstract of one of the papers, and the user has to open and read them all to find why the gene belongs to the list. Yet the system is not buggy: computers have many qualities, but they lack imagination and creativity, and we can guarantee that the words typed were found somewhere in the gene’s information.
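
The two-stage strategy described above (exact identifiers first, general text search second, then optional searching inside a previous answer) can be sketched in Python. This is an illustration under simplifying assumptions, not the AceView code: the database is reduced to two hypothetical dictionaries, exact_ids (identifier to gene name) and gene_texts (gene name to its free text), and word extension is approximated by case-insensitive substring matching.

```python
def search(query, exact_ids, gene_texts, within=None):
    """Return a sorted list of gene names matching the query.

    within: optional list of genes from a previous answer; when given,
    the search is restricted to that list (the 'recursive' search).
    """
    candidates = set(gene_texts) if within is None else set(within)

    # Stage 1: an exact identifier stops the search immediately.
    gene = exact_ids.get(query)
    if gene is not None and gene in candidates:
        return [gene]

    # Stage 2: general search over all text attached to each gene.
    needle = query.lower()
    return sorted(g for g in candidates if needle in gene_texts[g].lower())


# Toy data, for illustration only.
exact_ids = {"ced-4": "ced-4", "tra-1": "tra-1"}
gene_texts = {
    "ced-4": "programmed cell death, apoptosis activator",
    "ced-9": "programmed cell death inhibitor",
    "tra-1": "sex determination, transcription factor",
}

first = search("cell death", exact_ids, gene_texts)
refined = search("inhibit", exact_ids, gene_texts, within=first)
print(first, refined)
```

The second call shows why a query box restricted to the previous list helps: the first answer may be broad, and the refinement narrows it without repeating the full search.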

-         A new indexing system has been developed to speed up the answers to the queries over the entire database. It indexes on strings of three characters and has been a lot of fun to develop; maybe it is original, maybe not. In any event, we hope you enjoy the outcome. Too few of you use the box to query AceView. Please try it… you will be impressed!
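
For readers unfamiliar with the idea, indexing on strings of three characters can be sketched as below. This is a generic trigram index written for illustration. It assumes lowercase matching and a final substring check to discard false positives, it does not claim to reproduce the AceView implementation, and all names are ours.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text, lowercased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(documents):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def lookup(query, index, documents):
    """Return the sorted ids of the documents containing query as a substring."""
    grams = trigrams(query)
    if grams:
        # Only documents containing every trigram of the query can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries shorter than three characters cannot use the index.
        candidates = set(documents)
    needle = query.lower()
    # Sharing all trigrams does not guarantee a match, so verify each candidate.
    return sorted(d for d in candidates if needle in documents[d].lower())


# Toy documents, for illustration only.
documents = {
    "geneA": "coiled-coil protein",
    "geneB": "protein kinase",
    "geneC": "transfer RNA",
}
index = build_index(documents)
print(lookup("protein", index, documents))
```

The index narrows the search to a few candidates before any full text comparison is made, which is where the speed-up comes from.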

Statistics on Human build 33 (June 5th, 2003)

Note: build 32 was never made public


Thanks to the increase in mRNA and EST sequences submitted to the public databases and to the amelioration of the genome sequence, we align on this new build 20% more mRNAs and ESTs in 7% fewer genes. On this genome sequence, we map 3,449,800 cDNA sequences: 3,313,569 ESTs (67% of all ESTs in GenBank/dbest, a 9% increase relative to last time) and 117,475 mRNAs from the public databases (93%), as well as 18,756 NCBI RefSeq, representing 99.5% of the current RefSeq set. Although we do not mask for repeats, only 1.5% of the mRNAs and 2.75% of the ESTs are ambiguous and match the genome with indistinguishable quality in more than one place.  As for the current quality of the alignments, mRNAs on average measure 2,017 bp and align over 98.3% of their length with 99.7% accuracy. ESTs, especially prone to sequencing errors near the end of the read, on average measure 531 bp, yet align over 93% of their length with over 98% accuracy.

 

AceView clusters the 3.5 million reads into 30,953 genes with at least one validated [gt-ag] or [gc-ag] spliced intron and on average 4.9 alternative variants per gene (altogether 152,371 different mRNAs), exactly as we had reported in the Lander et al. main genome paper! Another 515 genes do not have confirmed introns, yet encode proteins of more than 300 amino-acids and are hence likely functional intronless genes. The remaining 48,122 genes are unspliced, partial, or non-coding and will require further investigation. Altogether in this build, we annotated 79,590 genes, with 201,359 alternative mRNA variants. 

 

In this release, 74,114 (64%) of the spliced mRNA models are supported by a single identified clone covering the entire CDS; the remaining could represent an inappropriate concatenation of two clones (28%) or more (8%), because, if no conflicts arise, we merge partial cDNAs.

 

The Pfam motifs used are from version 8.0, downloaded May 2003.



 

 

Caenorhabditis elegans: New RefSeq release of genes and chromosomes July 16th

back to top

 

Reference sequences of the nematode genes have been updated at NCBI on July 16th (and on the web on June 23rd).

We describe below

1.    The statistics of this release

2.    A glossary of C.elegans Gene Names, as found in RefSeq, LocusLink and AceView

3.    AceView gene representation, gene annotation and biology

 

C.elegans researchers, we count on you to help us in the NCBI annotation process: please send us text descriptions for the genes you studied. Note that for us, texts are better than keywords: they are more precise, and because we index all texts and abstracts for searches, the language can evolve and we will not miss new concepts. Also please check the bibliography at the bottom of the “Gene on Genome” page, and send us corrections: missing papers, and above all papers attributed to a gene that they do not describe, create a lot of nuisance. Thank you!

 

Statistics of C.elegans genes

The complete genome of release WS97 (a fragment of about 10 kb was added since) was provided by the Genome Sequencing Consortium through WormBase. This version of the genome consists of 100 264 081 bp and 22725 CDS/non coding RNA models, recognizable by their cosmid.number name. We have used all GenBank mRNAs as of Feb 10, 2003 to replace, whenever possible, models of genes by complete mRNAs: this happened for 1271 ‘reviewed’ RefSeq mRNAs, from 1106 genes. Through collaboration with Yuji Kohara and the transcriptome project, we were able to indicate which of the predictions are fully and exactly supported by cDNAs, even when the full length cDNA is not yet in GenBank as an mRNA. This happened for 5186 ‘validated’ RefSeq mRNAs from 4783 genes. The fully confirmed reviewed and validated RefSeqs are drawn in pink to signify their trustworthiness. Another 11555 mRNAs are approximately or partially supported, and 4442 are not supported by expression data or phenotype so far: all these mRNAs are drawn in dubious blue rather than pink.

 

In terms of genes, this release includes

  1. 1736 genes defined by mutant phenotypes and not yet cloned
  2. 21059 genes defined molecularly:

                .  19173 would produce mRNA(s):  14731 have been shown to be expressed, 3065 are associated to a phenotype, by mutation or gene specific RNA interference.

                .  1886 produce other RNAs, including 733 tRNA, rRNA, snRNA, scRNA, miRNA, pseudogenes, non coding polyadenylated RNAs, and unpredicted transcribed genes. 54 have an associated phenotype.

A glossary of C.elegans Gene Names

6029 genes now have an official name such as tra-1 or ced-4 (from the CGC). All molecularly identified genes also have a positional Worm Transcriptome Project name, which gives unambiguously the gene order and the strand: for example, 1C94 is the gene on chromosome number 1, megabase “section” C (3rd), at or near kilobase 94. In addition, its strand information is encoded as follows: odd genes run downstream, on the direct strand; even genes run upstream, on the reverse strand.
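
The positional names can be decoded mechanically. The sketch below follows the description above (chromosome, section letter, number whose parity gives the strand). It assumes that chromosomes are written 1 to 5 or X and that the name carries no suffix (names such as 1C777C or 5K225Co are not handled). The function name and the returned field names are ours.

```python
import re

# Chromosome (1-5 or X, an assumption), one section letter, then a number.
_POSITIONAL = re.compile(r"^([1-5X])([A-Z])(\d+)$")


def parse_positional_name(name):
    """Decode a positional gene name such as '1C94'."""
    match = _POSITIONAL.match(name)
    if match is None:
        raise ValueError("not a plain positional name: " + name)
    chromosome, section, number = match.group(1), match.group(2), int(match.group(3))
    return {
        "chromosome": chromosome,
        # A is the 1st megabase section, B the 2nd, C the 3rd, and so on.
        "section_index": ord(section) - ord("A") + 1,
        "kilobase": number,
        # Odd numbers: direct strand; even numbers: reverse strand.
        "strand": "+" if number % 2 == 1 else "-",
    }


print(parse_positional_name("1C94"))
```

With this reading, 1C94 carries an even number and therefore lies on the reverse strand.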

Some nematode genes appear to belong to a “complex locus” as initially defined by complementation tests in phage or Drosophila. In such a gene, two complete alternative variants may not share a single base, although a third variant “bridges” them and overlaps both. By convention, we denote this property by adding the letter C (for complex) after the name (e.g. 1C777C), or by a dot if the gene had acquired two independent gene names (e.g. mai-1.gpd-2.gpd-3). Another frequent case is that of nearby genes expressed from the same strand, but with overlap between the 3’ end of the first gene and the 5’ end of the second: disputably, but for technical reasons, these are represented as a single gene, with the suffix Co (for common UTR sequence) appended after the gene name (e.g. cul-1Co or 5K225Co), or with the sign AND if both genes have a name (e.g. mev-1ANDced-9).
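
The positional naming convention described above can be illustrated with a short Python sketch. This is not AceView code: the function name, the regular expression, the assumption that chromosomes are written as digits or X, and the reading of "odd/even genes" as the parity of the trailing number are all assumptions made for this example. Names that do not follow the positional pattern (official names, dotted names, AND names) are simply not decoded.

```python
import re

# Hypothetical pattern: chromosome (digits or X), one section letter,
# a kilobase number, then an optional "Co" or "C" suffix.
NAME_RE = re.compile(r"^(\d+|X)([A-Z])(\d+)(Co|C)?$")

def decode_positional_name(name):
    """Decode a positional gene name such as 1C94; return None if it does not fit."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    chromosome, section, kilobase, suffix = match.groups()
    number = int(kilobase)
    return {
        "chromosome": chromosome,
        # Section letters are counted from A, so C is the 3rd megabase section.
        "megabase_section": ord(section) - ord("A") + 1,
        "kilobase": number,
        # Assumption: "odd genes" means the trailing number is odd.
        "strand": "direct" if number % 2 == 1 else "reverse",
        "complex_locus": suffix == "C",
        "common_utr": suffix == "Co",
    }
```

Under these assumptions, decode_positional_name("1C94") reports chromosome 1, section 3, kilobase 94 and the reverse strand, while "1C777C" is flagged as a complex locus and "5K225Co" as sharing a common UTR sequence.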
 

Gene representation
In AceView, the extent of the nematode genes (represented by the named turquoise bar in the graphs on genome) is based on the actual mRNA and EST supporting data rather than on the predicted sequences from WormBase. Most of the RefSeq genes also contain this information, except, for technical reasons, for the genes mispredicted and too long. The advantage is that, although WormBase changed 2620 of their predictions over the last 7 months, the position or extent of only 118 of the AceView genes had to be changed. The same stability is seen in the nematode LocusLink.
As explained above, the sequences of pink mRNAs in AceView are fully supported, unlike sequences of blue mRNAs.


Gene annotation and biology
 Our annotation of the worm genes initially relied on a selection from WormBase. We have started to re-annotate the genes, with the help of the Worm Community and of the Worm Transcriptome Project led by Yuji Kohara.

1.    Expression data: For this release, we annotated the level and developmental pattern of expression for 11,000 genes (from Acembly analysis of cDNA library representation). We list all clones in each gene, point to the best clone and describe the anomalies we see in the anomalous clones. We provide direct access to NextDB, the extraordinary resource provided by Kohara and Shin-i to visualize the in situ expression patterns on a partition of all developmental stages. Note that this data is still unpublished: users of the data presented on the NextDB web pages should not publish the information without Kohara's permission and appropriate acknowledgment. The database is still in an embryonic phase, so if you find any discrepancies in the data, please let them know straightaway. They would also appreciate any suggestions and comments.

2.    mRNA, protein and gene titles were generated. All gene titles now bear an indication of their function: for all genes where a phenotype has been described, the word essential or phenotype appears in the gene title. Protein and gene titles are in the process of being edited by hand. In this release, we have hand-annotated from the literature 581/2 gene-gene interactions and 694/2 protein-protein interactions, and have worked on identifying and renaming all collagen genes (187) in the worm (with J Kramer and the genome consortium). Finally, we have hand-edited the entire file of phenotypic descriptions from C. elegans II, because it described the phenotypes with a coded vocabulary that made it difficult for the neophyte to understand.

The Pfam motifs used for this release are from version 8.0, downloaded May 2003.

 

 

May 15, 2003  Arabidopsis thaliana new release


 

On May 15th, we released our reconstruction of the Arabidopsis genes, using all 178,464 ESTs and 28,721 mRNAs available from GenBank on March 5th, 2003 and aligning them on the TIGR genome release (downloaded from the TIGR site November 2002).

 

AceView aligned 168,076 ESTs (94.2%) and 28,653 mRNAs (99.8%), and from this, reconstructed 21,431 genes, giving rise, by alternative splicing or alternative promoters, to 26,687 mRNAs. The genes look like nematode or fly genes in their overall organization and dimensions. All the proteins were annotated by our pipeline.

 

The level of alternatives was strikingly and significantly lower than in worm, and a fortiori human, taking into account the high level of coverage: 25% of the mRNAs are alternative forms. Among the genes with at least 3 cDNA clones, only 20% contain alternative variants (2,728 genes out of 13,502). That may be an interesting biological difference.

 

Another remarkable feature is the extremely high quality of many cDNA and even EST sequences: many do not have a single base different from the genome, and we see very few cDNA clones needing to be tagged as abnormal (only 307 of 192,351, 20 to 40 times fewer than in the other species). In other words, this project just does not “look” like the other projects we analyzed: worm, Drosophila, and human. It could be that plants are really different, or that the excellent sequences of cDNAs in GenBank are predictions, or at least sequences corrected by genome alignment. In the same vein, there is also a low level of polymorphisms (which came as a surprise to us) and of genome sequencing errors, as monitored by discrepancies between sets of cDNAs and the genome sequence.

The next standard step would have been to get the official gene names and the biology, and we contacted TAIR to this effect, but we have not yet found time to proceed. So this Arabidopsis release is rudimentary; it can be queried by sequence accessions or by protein annotation, but that is about all.

 

The Pfam motifs used for this release of Arabidopsis are from version 8.0, downloaded May 2003.

 

 

 

February 20, 2003  (corresponds to human build 31)


 

We just released our new Acembly genes, built over the human genome sequence from November 2002 (build 31 / golden path hg13).

We improved the Acembly clustering algorithms to make the genes more precise. We reinforced the mRNAs that already had a fully sequenced clone and reduced the combinatorics. As a result, we may end up with incomplete alternative variants that should trigger scientists to obtain the clones for full length re-sequencing. We also display more validated alternatively spliced variants.

The Pfam motifs used are from version 7.7, downloaded December 2002.

The most significant interface improvements for this build are:

  1. The query system is more powerful and versatile. You can search the Acembly human and worm genes by sequence, using the "Blast search"; this way you can find your gene, and navigate among the closest homologs in diverse organisms, between Acembly databases. You can also search in natural language, using the "Search all" button, and get lists of genes that have something to do with what you asked (and it's even quite fast! Try kidney function, or over expressed in tumor, or muscular atrophy in human, or orientation of the mitotic spindle in worm). You can search genes by protein families and InterPro-associated Gene Ontology terms using the Pfam search, and refine your query in a new interface. And of course you can still search by name as you used to, with what is now called "Search by name".
  2. The availability of the Acembly mRNA reference sequences (the .AM), generated in the spirit of the NCBI RefSeq, where each AM sequence is a "golden path" composite of cDNAs where we choose, for each sequence segment, the clone fully compatible with the intron structure of the variant that best matches the genome.
  3. The origin of the cDNA clone, i.e. organ or tissue, is now indicated in the "Table of clones", and five grades of level of expression are now reported in the gene summary. The next version will propose a profile of expression for each gene, computed from the tissue, stage, and pathological nature of all the clones belonging to the gene.
  4. We redesigned the intron exon support table, by adding the names of the mRNA variant(s) to which each exon and intron belongs. In this way, you can "see" the alternative splice pattern. Some of these improvements were triggered by our users; we thank them and encourage your comments and feedback.
  5. We have a new Pfam query that searches for keywords in Pfam motif titles, descriptions, etc. From the resulting list of motifs, there are links to lists of genes that produce proteins matching each motif. Currently, the Pfam query only works on the C. elegans database. It should work on the human database when we have the NCBI Build 31 release of the human genome available, some time in early February.

 

Statistics on Human build 31 (February 20th, 2003)

Thanks to code improvements, we now map unambiguously 2,763,401 ESTs (representing 58% of the ESTs currently in GenBank/dbEST) and 83,872 mRNAs from the public databases, as well as 18,000 NCBI RefSeqs, representing 99.3% of the current RefSeq set. AceView clusters these into 83,874 genes, with altogether 210,122 alternative mRNA variants. 33,286 genes have at least one validated gt-ag or gc-ag spliced intron, and on average 4.6 alternatively spliced variants.

 

 

November 2002

 

 

We added Caenorhabditis elegans data to our system. This involved revising the web service to correctly display data for multiple species.

 

August 2002  corresponds to human build 30


 

In this release we replaced our complicated positional identifiers (such as G_t1_Hs1_4478_30_0_2551) with the official Locus name where possible; otherwise with a PFAM name.number, when the protein contains a PFAM motif; and otherwise with an artificially generated name. The generated name is either pseudo-Japanese sounding, such as sayuri, kimu and nowara, for genes where one of the principal clones is Japanese, or pseudo-English sounding, like jawker or sneery.
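
The order of preference among the three kinds of names can be sketched in Python as a simple fallback chain. The function and parameter names are hypothetical, and the sketch assumes that the "number" in a PFAM name.number is an index supplied by the caller; how AceView actually assigns that number is not described here.

```python
def choose_gene_name(locus_name, pfam_motif, pfam_index, generated_name):
    """Pick a display name following the order of preference described above."""
    if locus_name:
        # An official Locus name always wins when one exists.
        return locus_name
    if pfam_motif:
        # Assumption: the number is an index chosen elsewhere.
        return f"{pfam_motif}.{pfam_index}"
    # Last resort: the artificially generated name (e.g. sayuri or jawker).
    return generated_name
```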

Statistics on Human build 30 (August 2002)

In this human genome release, known as NCBI build 30, after filtering, we present on this site 66,830 genes, containing 138,040 reconstructed mRNAs, supported by 1,898,911 mRNAs and ESTs. We currently align 14,970 (95.0%) of the 15,748 NM reference sequences, confirming the near completion of the human genome sequence.

 

 

December 2000


12/12/00

AceView shows the alignment of cDNAs to the genome sequence (see Contig Assembly Process), and the genes and mRNAs reconstructed from these alignments, using the Acembly program developed by Jean and Danielle Thierry-Mieg on top of the acedb object oriented database manager.

Protein BLAST hits to the genomic sequence and conserved motifs identified by RPS-BLAST are displayed, but they are not used to generate the model. Complementary views of the genes are shown in LocusLink (All info about the gene), OMIM (Phenotypes), MapView (All maps) and Blink (protein homologies).

In this first release, only the 10,544 RefSeq mRNAs available October 13th were aligned to NCBI contigs (October 5 freeze). A model is displayed if the mRNA aligns on the genomic sequence, finished or draft, over at least 700 bp or half its length, with less than 3% discrepancy. A discrepancy is either a single base substitution, a base insertion or a base deletion.

Of the 10,544 RefSeqs, 9,409 (89%) aligned as well or better than defined above, in 9,045 genes, containing 9,299 mRNAs (1.03 RefSeq per gene).

 

Quality of alignment:

Some 5,423 RefSeq mRNAs (58%) produced an excellent alignment to 5,224 genes (by excellent, we mean more than 98% of the length aligned with less than 1% discrepancies from the underlying genome sequence). These models are expected to be complete.

Other RefSeqs match less well:  if we allow for 3% base mismatches relative to the current genome (3% single base insertion, deletion or variation between the RefSeq and the underlying genome sequence), 1718 (18%) align over 90 to 98% of their length, 690 (7%) over 80% to 90%, and the remaining 1578 (17%), over 50% to 80% or 700 bp.
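
The display threshold and the quality classes above can be summarized in a short Python sketch. This is an illustration, not the Acembly code: the function name and class labels are hypothetical, the discrepancy rate is assumed to be computed over the aligned bases, and an alignment covering more than 98% of the mRNA with 1% to 3% discrepancy is assumed to fall in the 90 to 98% class, a case the text does not spell out. The reported counts are included so that their total can be checked.

```python
def classify_alignment(aligned_bp, mrna_length, discrepancies):
    """Return a quality label for one mRNA alignment, or None if it is not displayed.

    Assumes mrna_length is positive.
    """
    coverage = aligned_bp / mrna_length
    # Assumption: discrepancies are counted relative to the aligned bases.
    discrepancy_rate = discrepancies / aligned_bp if aligned_bp else 1.0
    # Display threshold: at least 700 bp or half the mRNA, with under 3% discrepancy.
    if discrepancy_rate >= 0.03 or (aligned_bp < 700 and coverage < 0.5):
        return None
    if coverage > 0.98 and discrepancy_rate < 0.01:
        return "excellent"
    # Assumption: >98% coverage with 1-3% discrepancy lands in this class.
    if coverage >= 0.90:
        return "90-98%"
    if coverage >= 0.80:
        return "80-90%"
    return "50-80% or 700 bp"

# Counts reported in the text for the four classes.
REPORTED_COUNTS = {
    "excellent": 5423,
    "90-98%": 1718,
    "80-90%": 690,
    "50-80% or 700 bp": 1578,
}
TOTAL_ALIGNED = sum(REPORTED_COUNTS.values())
```

The four reported counts add up to the 9,409 aligned RefSeqs, which indicates that the percentages quoted for each class are fractions of the aligned set rather than of all 10,544 RefSeqs.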

 

These results, showing that 89% of the genes represented in RefSeqs can be mapped at good stringency in the genome, confirm the quality of both the genome sequence and the RefSeqs, and may be used to improve both.

The few redundant RefSeqs revealed by alignment to the genome are currently being reviewed.

Gene duplications or assembly errors: After selecting the best hits and keeping only hits of similar quality, 179 RefSeqs map to more than one place in the genome, hence contributing to multiple gene models. Some correspond to genome assembly errors, currently being reviewed, some to actual paralogs and some to retroposons. Note that there are fewer gene duplications than was expected (max 2%), and most genes are uniquely assigned.

On the other hand, more genes than expected map uniquely in single exons. This happens for 7% of the RefSeqs, and the length of the gene is up to 3477 bp (Nuclear receptor interacting protein).

408 genes in this sample have more than one mRNA because they are alternatively spliced. The proportion is largely below the number of actually alternatively spliced genes, since the average number of protein products from a gene is above 2 per gene (from aligning all mRNA and EST in GenBank).

 

Unfortunately, a technical problem prevented export of 164 genes, some excellent, often in draft regions, to ASN1, so that, in this release, only 9305 RefSeqs (NM accessions) provided models for 8881 genes (Locus ID), containing 9222 mRNAs and proteins (XM and XP accessions).

 

CAUTION:
Using draft genome sequence and only RefSeq mRNAs in this release introduces natural limitations: because of the draft nature of the genomic sequence, of polymorphisms, and of imperfections in the RefSeq sequences or in the Acembly program, many models are partial. 42% are shorter than the RefSeq they originated from, usually by failure to align at the 3’ or 5’ end, while 13% have internal gaps.

On the positive side, 77% of the model mRNAs (7296 of the 9409 aligned) appear to contain the entire CDS.

The draft quality of the genome sequence (ends of contigs, as well as pieces of sequence not fully ordered and oriented; see Contig Assembly Process) is the major cause of problems: AceView provides a graphical display of both elements, facilitating evaluation of the quality of the area.

3’UTRs align less well than the coding region, due to a higher level of polymorphisms and to less careful cDNA sequencing. Acembly may then align the mRNA but refuse to add the divergent sequence in the model, if too many discrepancies accumulate locally.

Missing 5’ exons have two major causes: the first is the draft quality of the genome sequence; the second affects about 100 genes in which a bug prevented the program from finding the first exon. This bug has now been fixed.

Future releases will include all mRNAs and ESTs from GenBank. The RefSeqs will be improved and probable genome sequencing errors will be signaled. Using the entire mRNA/EST set from GenBank allows Acembly to reconstruct 77518 mRNAs encoding more than 80 amino acids, from 36453 genes, on the October 5 draft.

The AceView web pages present three bubbling graphics of each gene model:

  • whole gene view - one gene per page
  • detailed view - 900 bp of sequence data per page
  • genomic context view - 100 kb upstream and downstream of the gene, including models of surrounding genes

The AceView gene models can be accessed from the Map Viewer, LocusLink (from the av symbol) and the XM record.

 

Data Input

In the current release, the sequence data used to develop the models included:

  • Genomic contigs, frozen Oct 3 and assembled at NCBI from finished and draft high throughput genomic sequence data generated by the human sequencing centers.
  • Human mRNAs from known genes represented in the 10544 RefSeqs of October 13.

The mRNA sequences were aligned to genomic contigs using the Acembly program, described in the next section. The genomic contigs were also BLASTed against the vertebrate, non-human proteins and RPS-Blasted against a motif database to annotate the homologies.

 

Modeling Method

The Acembly program was used to produce the gene models. It can align all human mRNA and EST sequences to the genomic sequence. If there are ESTs or mRNAs that produce different models because of alternative splicing, all models will be displayed. If an mRNA aligns to multiple locations on the genomic sequence, Acembly keeps only the best alignment. But if two of the alignments are of similar quality, Acembly keeps both, and shows that mRNA in bold (as described in item 6 under Data displayed).
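
The rule for mRNAs that align to several places can be sketched in Python as follows. This is a simplified illustration, not the Acembly implementation: the function name, the use of a single numeric score per alignment, and the tolerance that defines "similar quality" are all assumptions, since the text does not say how similarity is measured.

```python
def select_alignments(hits, tolerance=0.01):
    """Keep the best alignment, plus any other of similar quality.

    hits: list of (location, score) pairs, where a higher score is better.
    tolerance: hypothetical cutoff for "similar quality".
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    # Keep every hit whose score is within the tolerance of the best one.
    return [(location, score) for location, score in hits
            if best - score <= tolerance]

def is_multiply_mapped(hits, tolerance=0.01):
    """True when more than one alignment is kept (the mRNA would be shown in bold)."""
    return len(select_alignments(hits, tolerance)) > 1
```

With this sketch, an mRNA with one clearly better hit keeps a single alignment, while two hits with nearly equal scores are both kept and the mRNA is reported as multiply mapped.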

A model might be incomplete, if the available mRNA and EST sequence data for a gene is incomplete or if it does not match the genomic sequence over its entire length.