Help on the “Gene on Genome” page

Last updated March 14th, 2006

 

In this section, we document the features described in the texts, define the terms we use where they might be ambiguous, and explain what information the various result tables contain. Explanations follow the order of the texts and tables on the “Gene on Genome” page.

 

 The Gene on Genome page


 

·        Gene synopsis (collected annotations): RefSeq and AceView summary, alias names, map, links, closest protein homolog between human and the well-annotated nematode (in AceView/WormGenes); functional annotation using Gene Ontology, phenotype, products and protein family, interactions with other genes and proteins; bibliography.

·        AceView synopsis: inferences deduced from AceView cDNA alignments

o       AceView inferences on gene expression level, number of transcript variants and protein isoforms

o       AceView inferences on expression profile and pattern (so far only in the worm)

o       AceView inferences on number of introns, their alternative/constitutive nature and their boundaries

o       AceView inferences on alternative features: about promoters, last exons, cassette exons, retained introns and their effects on protein variations

o       AceView inferences on gene regulation

·        Molecules: all objects in this chapter link to their sequences

 

  

Gene synopsis


 

The first few paragraphs of the “Gene” chapter are annotations inherited from either LocusLink or RefSeq, or from Pfam or Psort2 analysis of the AceView reconstructed transcripts. The data in this section should be considered an entry point, not an end point: we do our best but do not aim at completeness; rather, we provide links to other sources, such as NCBI LocusLink/Gene, GeneCards or UCSC. Direct searches in PubMed and analysis of the literature are always recommended.

 

     The RefSeq, Proteome or AceView summary is copied directly from the source and reported at the top of the page.

 

     Alias names

Name and symbol aliases are taken from multiple sources. The main gene name is preferably the official name and symbol, but if the gene does not have an official name yet, we generate one as explained here. Aliases are extracted from GenBank submissions, from LocusLink in human, or from the worm transcriptome project and WormBase in the nematode. In addition, AceView generates positional names, built from the chromosome and the coordinate on that chromosome, for each new human release.

     Map

-         For human, we rely on the NCBI resources provided by RefSeq, which includes the official HUGO mapping information. We do not compare the NCBI to the EBI map assignments, but this is done by the GeneCards group: if you are interested in maps, you can use the direct link we provide in the Link page.

-         For worm, the official map made by the CGC provides us with the value “measured by recombination”, and we use the genes that have been mapped to deduce an interpolated map, with a value for each and every nematode gene.

More details on our procedure:

By combining the excellent CGC genetic map with the known complete sequence of the genome, we can assign a genetic position in centiMorgans to each molecular gene. By trial and error, we found that a very good fit could be obtained with a polynomial of degree 5 whose coefficients were adjusted by best fit, using the Origin statistical package.

For most genes, the 'measured' and 'interpolated' genetic positions are nearly identical. For some genes, there is a significant difference, but we checked in those cases that the displacement was compatible with the raw experimental data. Finally, we discarded a dozen controversial cases.

As a result, the NCBI/AceView genetic Map is exactly colinear with the chromosome sequence. Notice that in the worm, the zero is assigned by convention to a nearly central gene on each chromosome; hence half the genes have a negative map position.
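As an illustration only, the interpolation step can be sketched as a least-squares polynomial fit. The code below is a plain-Python stand-in, not the fit we ran in Origin: the function names `fit_polynomial` and `evaluate`, the idea of rescaling physical positions to the range [-1, 1] before fitting, and any data passed to it are assumptions made for the example.

```python
def fit_polynomial(xs, ys, degree=5):
    """Least-squares polynomial fit via the normal equations.

    xs: physical positions (assumed rescaled to [-1, 1] for numerical stability)
    ys: measured genetic positions in cM
    Returns coefficients [c0, c1, ..., c_degree] (lowest power first).
    Needs more data points than coefficients.
    """
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i)
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Interpolated genetic position at physical position x (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result
```

Fitting on the genes with a measured position and then calling `evaluate` for every gene gives one interpolated value per gene, which is the behaviour described above.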

 

·        Encoded products and protein family description, if any

We collect here the names of the products as they have been submitted by authors of GenBank records. We also add manual annotation in the worm.

If our Pfam search yielded significant matches, we report the InterPro description of the family, and provide the number of products and genes in the entire database that belong to the same protein family. The list of such genes is accessible by clicking on the number. The list is ordered by map and position on the chromosomes; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).

 

We encourage users to also follow the link on the family name, which will bring them to the Pfam site at Sanger, where additional instructive information, such as 3D structures, is displayed.

 

·        Phenotype

We encourage researchers to contribute notes on phenotypes or any other important biological topic.

In human, the best source of information we link to is the literature report made by the OMIM group. 

In the nematode, we update the data only 3 times a year, except for data contributed by authors, which become visible immediately. We use the C. elegans II book as a reference, we annotate some genes ourselves from papers or from direct author contributions, and we import the limited phenotypic annotation from WormBase. We critically review the RNA interference data, and get the knockout data directly from the KO Consortium. We select some strains from the list available from the CGC. These and other strains can be requested from Theresa Stiernagle and Robert Herman.

Note that pointers to WormBase and to the Avery C. elegans www server are provided in the “Links” page and from the text, by clicking the WormBase gene name: cosmid.number.

 

·        Functional annotation using Gene Ontology

This paragraph and table summarize the phenotypic and functional annotation. They include terms derived directly or indirectly from OMIM in human and from AceView/WormGenes annotation in the nematode. Gene Ontology terms are taken either from Entrez Gene/LocusLink, which currently gets them from the GOA project at EBI, or from Pfam and Psort2 analysis.

The keywords used are preferentially chosen from Ashburner’s Gene Ontology, whenever possible. Of course, this is seldom the case for phenotypic annotations, yet we try to use controlled vocabulary and regularly rationalize the terms.

The table in this section provides links of each keyword to the lists of genes sharing the same annotation in AceView. The list is ordered by map and position on the chromosomes, or by alphabetic order; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).

At the bottom of the page are Gene Ontology terms for cellular localization. Some of these annotations may seem contradictory, usually because annotations for all alternative variants and protein isoforms are collated there: one variant may be predicted membrane bound, another secreted and another cytoplasmic. Psort2 also makes some incorrect predictions, although these are infrequent because of the high thresholds that we impose before reporting a predicted localization.

 

·        Interactions with other genes and proteins

Those have been extracted from the literature or from dedicated Web sites. Interacting genes are traditionally evidenced through epistasis experiments, in which the phenotype of a double mutant (preferably loss of function) is compared to the phenotypes of each mutant taken separately, and the order in which genes act in the pathway can be deduced. The result is a formal network that most often can be understood and validated at the molecular level. Reading the articles would be most enlightening since such formal genetic analysis is such a beautiful and powerful area of science.

Interactions between proteins are inferred from physical experiments, such as co-immunoprecipitation followed by identification of the partners by mass spectrometry or immunological methods; two-hybrid is a complementary method with a different set of strengths and limitations. See for example Marc Vidal’s site on systematic two-hybrid studies in the nematode.

All types of interactions are reported here, with a link to the interacting genes annotated in AceView and usually a pointer to the reference or data source.

 

AceView synopsis: Level of expression and number of transcript variants and protein isoforms


 

Example:

Expression level and number of variants ?

According to AceView, it is expressed at very high level. Its sequence is supported by 217 sequences from 197 cDNA clones and produces, by alternative splicing, 7 different transcripts a, b, c, d, e, g, h altogether encoding 7 different protein isoforms.

 

Note that the word “sequence” in that paragraph links to the sequence page in FASTA format. You may be wondering how AceView approaches the problem, and how we quantify the expression levels and count the alternative variants and the protein isoforms.

 

How does AceView approach the problem?

Data in this and all following paragraphs in the “Gene” chapter describe inferences on gene expression, alternative variants and gene regulation deduced from AceView cDNA alignments. If you use this data, please cite us, so that we get some credit for our work and can continue to improve AceView.

Alignments in AceView do not use a priori knowledge; they do not impose rules for splicing, for translation capacity or for resemblance to similar proteins in the databases. AceView does not inject elements of prediction: we simply aim to align the cDNA sequences on the genome as well as possible. Although this is not the place to explain the details of how we do this, suffice it to say that, in a benchmark alignment test where we compared the main aligning programs that make their alignment data available, AceView came first in its alignments of mRNAs. No program to our knowledge is ambitious enough to try to also align ESTs and cluster the aligned cDNAs into transcripts. But when these programs make their data public, we do not fear competition and will gladly support a benchmark on that more sophisticated aspect.

 

Level of expression:

Upon aligning all available cDNA sequences, EST or mRNA from GenBank, and keeping only the best alignment genomewide for each clone, then filtering to impose a minimal quality of the alignment, AceView uses the number of cDNA clones aligned in any given gene to classify the genes by level of expression in five groups genomewide. The terms used are:

     “at very high level” for genes expressed more than 4 times above the average gene in that release.

     “at high level” for genes expressed between 1.4 and 4 times more than the average gene.

     “well” for genes expressed between 0.4 and 1.4 times the average.

     “moderately” for genes between 0.2 and 0.4 times the average.

     “at low level” for genes below 0.2 times the average.

Note that we do not use any other type of data, such as microarray or chip analyses, for this description.
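The five groups above can be sketched as a small function. This is a minimal illustration, not the AceView code: the name `expression_level` and its parameters are hypothetical, and because the text does not say to which group a value exactly on a boundary belongs, the choice of `>` versus `>=` below is an assumption.

```python
def expression_level(clone_count, average_clone_count):
    """Return the expression label for a gene.

    clone_count: number of cDNA clones aligned in the gene
    average_clone_count: average number of clones per gene in the release
    Boundary handling (which side gets 0.2, 0.4, 1.4, 4) is an assumption.
    """
    if average_clone_count <= 0:
        raise ValueError("average_clone_count must be positive")
    ratio = clone_count / average_clone_count
    if ratio > 4:
        return "at very high level"
    if ratio > 1.4:
        return "at high level"
    if ratio >= 0.4:
        return "well"
    if ratio >= 0.2:
        return "moderately"
    return "at low level"


# Hypothetical usage: a gene with 197 clones in a release averaging 20 clones per gene
label = expression_level(197, 20)
```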

 

Number of transcript variants

Contrary to what is often believed, the AceView models are not predictions. Once all sequences are finely aligned, we cluster them into transcripts, requiring that merging clones that are in contact creates no conflict at the structural level, so that all cDNAs (5’ and 3’ reads or mRNA) that are associated and used to support a given transcript fully match the transcript. A clone that conflicts will be used for an alternative variant and merged with non-conflicting clones.

A main difficulty is to limit the combinatorial increase in putative alternative transcripts: there may be 20 alternative introns in a gene, each fully supported, yet we do not want to generate all possible combinations. We know that this does not occur in the cell, where the choice of exons to thread depends on the promoter and tissue, the introns, the exons, and the last exon choice, so it would not make sense to generate the full combinatorial set in the computer. To reach our goal, we limit to a minimum the number of transcripts a given sequence will belong to, and we prefer to merge sequences into transcripts that are already strongly supported, for example by a full-length mRNA, rather than let them participate in yet another alternative variant. As a result, a number of transcripts that could mentally be extended to resemble a longer one will de facto remain short and possibly incomplete. Yet interested scientists will be able to “see” the incompatibility in structure and ask for a clone to resequence, or else devise an RT-PCR experiment, possibly extending the alternative transcript with solid data.

The important fact here is that we guarantee that all the transcripts presented in AceView have support that makes them non-mergeable with the other variants, so that the number of variants we enumerate is a minimal number of alternative forms. In fact, we even do a final check just before opening the site and we kill the worst-looking transcripts in large genes, even if they are supported in full by cDNAs. This may not be too wise, but AceView already reports a much greater complexity than all other sites; we are sure the complexity exists and we minimize it rather than maximize it, yet we do not want to scare our users away!

The structural features that distinguish the alternative variants can be found and analyzed most easily in the “Table of introns and exons”. In principle, the diagram of the gene should also help, yet because of the scale of human genes on the genome and the fact that our zooming function is unacceptably slow, it is often impossible to see by eye some subtle differences, such as small changes in the length of an intron, or the presence of an extra exon in a close cluster of exons.

 

Number of protein isoforms

Finding the coding region in a transcript can be done most of the time under minimal hypotheses, such as

-       Trust the genome sequence more than the mRNA or EST sequences (this is right 9 times in 10, at least when we compared, on a sample of 400 cases, the consensus of the mRNA and EST sequences to the genome). As a consequence and for simplicity, we systematically analyze by default the genomic sequence underlying the transcripts. However, we also provide users with the consensus sequence of the mRNAs and ESTs that best matches the genome. We call this sequence the AM, for Acembly mRNA reference sequence. Each AM sequence is a "golden path" composite of cDNAs, where we use locally, for each segment, the clone fully supporting the intron structure of the variant that best matches the genome, in effect knitting a consensus by recombining compatible cDNA clones.

-       Pick the longest Open Reading Frame (ORF), especially if it covers exon boundaries, resembles a protein in the nr protein database, is C-terminal complete if the 3’ most clones are polyadenylated and primed on a true polyA rather than an A-rich genomic area, and is N-terminal complete if the 5’ most clone comes from a capped library. By giving scores and penalties to these properties, we can usually select and annotate a single coding sequence (CDS) per mRNA.

-       In the case of a complete protein, how should we select the initiator Met? Since all the proteins we annotate are predicted from the mRNA sequence and not experimentally validated (or at least we do not know which ones have really been completely validated), we have decided to use the codon usage table maintained by the Taxonomy group at NCBI. There we read that three codons are candidates to be used as initiators in most species. Although ATG is used more often than TTG or CTG, we annotate the protein starting at whichever of the three codons gives us the longest predicted protein.

-       If there is no evidence that the messenger is complete, we annotate the protein deduced from the first open codon in the main ORF. This will be the case if there is no upstream Stop in the frame of the main CDS and, in the case of the nematode, if the transcript is not trans-spliced to a leader, or is not trans-spliced but its 5’ end is defined by multiple clones from different cDNA cap-selected libraries.

-        Yet it is not always trivial to select a single coding sequence per mRNA: some transcripts encode two successive proteins (in the manner of operons) or mildly overlapping proteins, while some long and well-spliced transcripts are not obviously protein-coding. In our working database, we like to annotate multiple proteins per transcript if choosing is difficult; in the public view, we showed a maximum of 1 CDS per transcript until January 2004, when we decided to let a few transcripts with two very convincing putative products become public. There are two classes of reasons for such transcripts:

o       artefactual, due to either errors in the genome sequence, such as a single base deletion or insertion leading to a frameshift, or to errors in AceView missing an obvious 3’ end and concatenating transcripts that should not be merged

o       biological, with unambiguous examples that cannot be ignored. There are several good biological explanations for such cases, among them translational frameshift, use of selenocysteine or of leaky stop anticodon tRNAs, and internal ribosomal entry associated with operon-like transcripts.

-       In the latest release, we have allowed multiple products per mRNA, although we limited this to cases where

o       The two products do not overlap and have similar sizes; this may reflect an actual operon-like precursor (we have noticed a number of confirmed ones even in human), or a bug of AceView missing a 3’ end or a 5’ end and concatenating transcripts that should exist separately

o       the two products are only marginally overlapping (possible reasons for this could be: the genome has a deletion or insertion relative to the mRNA, or more rarely there is a translational frameshift, or there is a bug in the chaining of cDNA clones by AceView, and a real 3’ or 5’ end was missed)

o       the two products overlap, but they have about equal length so we cannot decide which frame is the biological one

o       In all cases, looking at the “Annotated mRNA” usually makes it possible to “see” whether one or both products are real, since both have been annotated on the graphical view (although the text explanation in the annotated page has not yet been modified to treat this case clearly).

-       Taking into account all these considerations, we count the number of distinct protein isoforms. Two transcripts may differ in the structure of their 5’ or 3’ UTR, yet encode the very same protein. This happens in human in almost 3% of cases. We therefore report both the total number of transcripts and the number of protein isoforms. A more in-depth comparison of the proteins with one another, showing completeness and inclusion, is available from the first column of the “Proteins” table.
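The rule for choosing the initiator codon can be illustrated with a short sketch. It is deliberately much simpler than what AceView does: it ignores the scores and penalties, the completeness evidence and the multiple-product cases described above, and only shows how allowing ATG, TTG or CTG as initiator and keeping the longest predicted protein interacts with the choice of ORF. The function name `longest_cds` and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
INITIATOR_CODONS = {"ATG", "TTG", "CTG"}  # the three candidate initiators


def longest_cds(mrna):
    """Return (start, end, n_codons) for the longest initiator-to-stop CDS.

    Coordinates are 0-based and half-open; `end` includes the stop codon and
    n_codons excludes it. Returns None if no initiator is followed by an
    in-frame stop. Only the forward strand of the mRNA is scanned.
    """
    seq = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # most upstream initiator since the last in-frame stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                if start is not None:
                    n_codons = (pos - start) // 3
                    if best is None or n_codons > best[2]:
                        best = (start, pos + 3, n_codons)
                start = None
            elif start is None and codon in INITIATOR_CODONS:
                start = pos
    return best


# Hypothetical sequence: a TTG lies upstream of the first ATG in the same frame
cds = longest_cds("CCCTTGAAAATGGGGTAACC")
```

In this toy sequence the annotated protein starts at the TTG rather than at the downstream ATG, because that choice gives the longer product.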

In summary, the laconic little sentence in the paragraph “Level of expression and number of transcript variants and protein isoforms” represents a dense summary of what AceView is best at: transcript reconstruction from marrying cDNA and genome sequences!

 

AceView inferences on expression profile and pattern


 

Once we have identified the set of all clones from the public databases that best match a given gene, we may hypothesize that they actually were transcribed from that gene. Provided that we are able to extract from the public databases (GenBank/EBI/DDBJ) some information about the origin of the clone, the tissue and stage of development, and the normal versus pathological nature of the sample, we should be able to deduce an expression profile. Of course, we would need to normalize with respect to the total number of clones from each tissue/stage/pathology, so we first need to rationalize the descriptions submitted by the authors.

For human, we are in the process of doing that, and hope to have nice profiles of expression available for the release on build 35.

For the nematode C. elegans, there is one main source of good cDNAs, the Kohara lab, and AceView was born as the database to treat that data. Since June 2003, we have annotated the level and developmental profile of expression for 11,000 nematode genes by using the staged libraries information, adding when useful information from other GenBank records. AceView provides direct access to NextDB, the extraordinary resource provided by Kohara and Shin-I to visualize the in situ expression patterns on a partition of all developmental stages.  We also hand annotated some genes by describing the in situ hybridization patterns. Note that these data are still unpublished: Users of the data presented on the NextDB Web pages should not publish the information without Kohara’s permission and appropriate acknowledgment. Finally, we list all cDNA clones in each gene, point to the best clone, and describe the anomalies we saw in the anomalous clones.

 

AceView inferences on number of introns, their alternative/constitutive nature and their feet


 

Example:

Introns ?
The gene contains 50 confirmed introns, 47 of which are alternative. Comparison to the genome sequence shows that 47 introns follow the consensual [gt-ag] rule, 3 are fuzzy or ill defined. See this table for details.

 

You may want help on how confirmed introns are defined by AceView alignments, on the alternative versus constitutive nature of introns, or on intron feet and their graphical representation.

 

o       How are introns defined by Acembly alignments? Alignments in AceView do not impose rules for splicing, yet a cDNA often matches non-contiguous stretches of genomic sequence along its length. Usually, these stretches are exons, and the sequence between two exons is an intron.

     The exact match case: In tens of thousands of cases, the best alignment cannot slide, and there is only one way to get an exact cDNA to genome match around the splice site. By exact match, we mean at least eight unambiguous base pairs matching the exons on each side of the intron. In those cases, we observe that the vast majority of introns have a common sequence of two nucleotides at each end. The most frequent intron feet are, in this order, [gt-ag], [ct-ac] ([gt-ag] on the wrong strand), [gc-ag], [at-ac], or any [other]. But if, for example, the last letter of an exon is identical to the last letter of the next intron, the intron can slide by one base. In those cases, we let introns slide freely to best match the intron feet sequences in the order above. Once this basic work is done, we regularize some of the introns with feet of unusual type, [other] or [fuzzy], by using the strength of cooperative alignment, or “team jump”.

     No “exact match”:

-       Either the cDNA clone sequence matches the genome sequence, but the match is ambiguous or imperfect, because there is at least one base difference or one uncalled base (n) in the 8 bp bordering the intron: the feet are then considered [fuzzy]. It would usually be easy to find a consensus sequence and to regularize the intron to become standard, but this is against the AceView philosophy, which is to stick strictly to the experimental data.

-       Alternatively, there is a gap or a topological problem in the local alignment. That may be because the genome or the cDNA is deleted or rearranged: if multiple cDNAs fail to align locally, the genome is likely faulty, otherwise the cDNA is suspicious. Another case corresponds to a sequence gap, for example a clone was sequenced from both ends, its 5’ and 3’ ESTs make it a unique variant, yet we do not have the central part of the sequence (such a clone should be resequenced).
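The sliding rule of the exact match case can be sketched as follows. This is an illustrative reading of the rule, not the AceView implementation: the function names, the 0-based half-open coordinates and the toy genome are assumptions, and the "team jump" regularization is not modelled.

```python
# Preference order of intron feet, as listed in the exact match case
FEET_PREFERENCE = ["gt-ag", "ct-ac", "gc-ag", "at-ac"]


def feet(genome, start, end):
    """Feet of the intron genome[start:end], e.g. 'gt-ag'."""
    return f"{genome[start:start + 2]}-{genome[end - 2:end]}".lower()


def equivalent_positions(genome, start, end):
    """All intron placements that yield the same spliced sequence.

    Sliding one base to the left is allowed when the last exon base equals
    the last intron base; sliding to the right when the first intron base
    equals the first base of the next exon.
    """
    genome = genome.upper()
    positions = [(start, end)]
    s, e = start, end
    while s > 0 and genome[s - 1] == genome[e - 1]:
        s, e = s - 1, e - 1
        positions.append((s, e))
    s, e = start, end
    while e < len(genome) and genome[s] == genome[e]:
        s, e = s + 1, e + 1
        positions.append((s, e))
    return positions


def slide_to_best_feet(genome, start, end):
    """Pick the equivalent placement whose feet rank best in FEET_PREFERENCE."""
    def rank(position):
        f = feet(genome, *position)
        return FEET_PREFERENCE.index(f) if f in FEET_PREFERENCE else len(FEET_PREFERENCE)
    return min(equivalent_positions(genome, start, end), key=rank)


# Toy genome: exon AAAG, intron GTCCCCAG, exon GTTT; the aligner placed the
# intron one base too far to the left, at (3, 11).
placement = slide_to_best_feet("AAAGGTCCCCAGGTTT", 3, 11)
```

Every placement returned by `equivalent_positions` explains the cDNA equally well, so choosing among them by feet does not contradict the experimental data.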

 

o       Alternative versus constitutive nature of introns, graphical representation

An intron that is found in all transcripts is constitutive; any other intron is alternative. The same holds for exons. The triangles representing constitutive exons in the “Annotated mRNA” diagram are empty, whereas triangles representing alternative exons are filled with color.
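Applied literally, this definition amounts to a set intersection. The sketch below assumes a hypothetical representation in which each transcript is a set of introns given as (start, end) coordinates on the gene; the function name `classify_introns` is also an assumption.

```python
def classify_introns(transcripts):
    """Split introns into (constitutive, alternative).

    transcripts: dict mapping a variant name to a set of (start, end) introns.
    An intron is constitutive only if every transcript contains it.
    """
    intron_sets = [set(introns) for introns in transcripts.values()]
    if not intron_sets:
        return set(), set()
    all_introns = set().union(*intron_sets)
    constitutive = intron_sets[0].intersection(*intron_sets[1:])
    return constitutive, all_introns - constitutive


# Hypothetical gene with two variants sharing their first intron
constitutive, alternative = classify_introns({
    "a": {(100, 200), (300, 400)},
    "b": {(100, 200), (300, 450)},
})
```

Note that under this literal reading a single unspliced or partial variant makes every intron of the gene alternative.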

 

o       Intron feet and their graphical representation

§         A well defined typical intron is an intron supported by at least one clone exactly matching the genome over 8 bp on each side and with typical feet [gt-ag] or [gc-ag]. It is displayed as a pink broken line joining the exons in the “Gene on Genome” view, and as a pink triangle in the “Annotated mRNA” view.

§         A well defined but atypical intron, that is an intron supported by at least one cDNA clone with an 8 base exact match on each side, but with any intron feet other than [gt-ag] or [gc-ag], that is [at-ac] or any [other], is shown as a blue broken line joining the exons in the “Gene on Genome” view, and as a blue triangle in the “Annotated mRNA” view.

§         A [fuzzy] intron is supported at best by a clone with a mismatch or an n in the 8 bp bordering the intron on either side. A fuzzy intron is displayed as a pink straight line between the exons in the “Gene on Genome” view, and as a blue triangle in the “Annotated mRNA” view.

§         A gap in the alignment is shown as a conspicuous straight black line in the “Gene on genome” view. Its representation in the “Annotated mRNA” view is not conspicuous so far (Jan 2004): it just shows as horizontal lines in the pink object.
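The three intron categories in this list can be summarized in a few lines. This is a sketch of the classification only, with hypothetical names; it takes as input whether the best supporting clone matches the genome exactly over the 8 bp on each side, and it does not model alignment gaps or the colors used in the diagrams.

```python
TYPICAL_FEET = {"gt-ag", "gc-ag"}


def classify_intron(feet, exact_match_8bp):
    """Return 'typical', 'atypical' or 'fuzzy' for one intron.

    feet: string such as 'gt-ag', read from the genome at the intron ends
    exact_match_8bp: True if at least one clone matches the genome exactly
        over 8 bp on each side of the intron
    """
    if not exact_match_8bp:
        return "fuzzy"
    if feet.lower() in TYPICAL_FEET:
        return "typical"
    return "atypical"
```

For example, `classify_intron("at-ac", True)` gives an atypical intron, while the same feet supported only by a clone with a mismatch near the boundary would be fuzzy.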

 

AceView inferences on alternative features: about promoters, last exons, cassette exons, retained introns and their effects on protein variations


 

Once the transcripts are reconstructed as explained above, AceView analyses the differences among them by comparing various features: the exon composition of all alternative transcripts, the putative promoters and the confirmed last exons. We then automatically construct a brief summary of the results.

 

Example:

Alternative features  ?
There are 6 probable alternative promotors and 2 non overlapping alternative last exons. The transcripts appear to differ by truncation of the 5' end, truncation of the 3' end, presence or absence of 8 cassette exons, common exons with different boundaries, because an internal intron is not always spliced out.

 

We define a putative promoter as the area upstream of a transcript whose main open reading frame is N-terminal complete, that is, in human, bounded by an upstream Stop and, in the nematode, either bounded by an upstream Stop or trans-spliced, as explained here. If the first exons of two complete CDS transcripts do not overlap, we consider that the two transcripts define 2 probable alternative promoters. Note that we do not count as a promoter the area upstream of a transcript open at the 5’ end, even if it is a RefSeq. To analyze these regions, we provide the sequence of the 5 kb segment upstream of each transcript in the 7th column of the “Table of transcripts”.

Similarly, two last exons are considered alternative if the transcripts they belong to are encoding COOH complete proteins and if the last exons do not overlap.

A cassette exon is an exon that is internal to a transcript, but fully within an intron of another transcript. We compare transcripts 2 by 2, yet we do not overcount cassette exons.

Other terms are self explanatory.
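The cassette exon definition can be sketched as a pairwise comparison of transcripts. The representation is an assumption made for the example: each transcript is a list of exons sorted along the gene, with inclusive (start, end) coordinates, and the function name `cassette_exons` is hypothetical. Collecting the results in a set is one simple way to avoid counting the same exon twice.

```python
def cassette_exons(transcripts):
    """Return the set of cassette exons in a gene.

    transcripts: dict mapping a variant name to a list of exons, each an
        inclusive (start, end) pair, sorted by position on the gene.
    A cassette exon is internal to one transcript (neither first nor last)
    and lies fully within an intron of another transcript.
    """
    def introns(exons):
        # Intron between consecutive exons, inclusive coordinates
        return [(prev_end + 1, next_start - 1)
                for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

    found = set()
    for name, exons in transcripts.items():
        for exon in exons[1:-1]:  # internal exons only
            for other, other_exons in transcripts.items():
                if other == name:
                    continue
                if any(i_start <= exon[0] and exon[1] <= i_end
                       for i_start, i_end in introns(other_exons)):
                    found.add(exon)
    return found


# Hypothetical gene: variant b skips the middle exon shared by a and c
cassettes = cassette_exons({
    "a": [(1, 100), (201, 300), (401, 500)],
    "b": [(1, 100), (401, 500)],
    "c": [(1, 100), (201, 300), (401, 500)],
})
```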

 

AceView inferences on gene regulation


 

AceView reports possible regulatory mechanisms affecting gene expression, as inferred from gene to genome alignment.

We report the presence of an antisense gene if both genes in antisense have confirmed standard introns which allow confirmation of their strandedness. Quite often in human, one of the genes is non-coding, yet in most cases our annotation of antisense is quite reliable.

We also annotate close neighbors and putative operons, complex loci, as well as other possible regulatory mechanisms evidenced by mRNA to genome alignments, including possible RNA editing, translational frameshift, internal entry sites, use of selenocysteine or leaky Stop (so far only a beginning, in the worm). (more details later)

 

The table of “Transcripts”


This table summarizes what is known about the global structure and transcript sequence, completeness, UTRs, level of expression and extent on genome for the mRNA variants reconstructed by AceView. It also provides links to the transcript related sequences.

Example: Transcripts ?

Variant | 5' UTR | Completeness | 3' UTR | # exons | # clones | Transcription unit | coordinates on gene
aDec03, 2201bp | 516bp | 1 exon inferred | 335bp, polyA | 9 | 8 | 4637bp; 5 kb just upstream | 24218 to 28854
bDec03, 2192bp, AM-2186 | 516bp | | 335bp, polyA | 9 | 80 | 4637bp; 5 kb just upstream | 24218 to 28854
AM-2237 | 571bp | | 335bp, polyA | 10 | 80 | 28854bp; 5 kb just upstream | 1 to 28854

       Column 1: Variant  mRNA variants are named and ordered as a function of the size of the conceptual translation product: variant a would produce a protein longer than variant b or c etc. All transcripts in this table have at least one structural element that makes them different and non mergeable with the others. Since January 2004, names include a date, to make the variant sequences unique from release to release. Clicking on the variant name brings you to the annotated mRNA page, where you can see a diagram of the variant once we spliced the introns out, with the protein annotation.

       The length in bp and sequence of the mRNA (a click away), as derived from the genome and annotated in AceView, is given under the transcript name. Below, we give the AM sequence if we were able to build one of good quality. Each AM or Acembly mRNA sequence is a "golden path" consensus of cDNAs, where we use, to calculate the sequence in each position, the clone whose sequence best matches the genome locally. We clip the AM when it gets noisy outside of the coding region, this is why the AM often is slightly shorter than the genome derived transcript.

       Column 2: 5’ UTR This gives the length of the 5’ UTR when it exists, i.e. when the encoded predicted protein is N-terminal complete. The 5’UTR is calculated under the assumptions explained here, in particular concerning the choice of the coding region and initiator Met.

       Column 3: Completeness This column is not always present, since it reports any incompleteness of the transcript sequence. Two types of cases are reported: cases in which there is a gap in the sequence (for example a cDNA clone has been sequenced from both ends, but the sequences do not meet; the length of the gap is estimated from the average length of the cDNA inserts in the library), or cases in which there was a gap that we filled by “stealing” exons from alternative variants sharing sequences at the level of the gap extremities. We then report the number of exons “inferred”, i.e. from which we stole part of the sequence.

       Column 4: 3’ UTR We report here the length and sequence of the longest 3’ UTR associated with the transcript, excluding the bases encoding the stop codon, when the encoded predicted protein is C-terminal complete. The word polyA appears if at least one clone has a clear polyA (drawn as a black circle in the cDNA). Note that in AceView we do not annotate alternative variants that differ only by the length of the 3’ UTR, because alternative polyA addition sites are so prevalent, in human and in worm, that it would not be practical to report all polyadenylation sites. Yet interested users can look at the annotated transcript diagram (by clicking on the variant name in the table), where those are very obvious. The presence of a polyadenylation signal, if any, is annotated in the mRNA page. If you would like one or the other piece of information reported here, please say so.

       Column 5: # exons gives the number of exons of the transcript, including any inferred exons reported in column 3. This may be useful information, for example to “see” unspliced transcripts. All transcripts here differ in their structure and cannot be merged.

       Column 6: # clones reports the number of cDNA clones that match this variant. Note that because of the way we build the transcripts and of our effort to use each cDNA sequence in a minimal number of variants, in order to limit the combinatorial effect, the number of clones indicated here only gives an estimate of the level of expression of any given variant. Yet this is important information to “see” the most and least common forms.

       Column 7: Transcription unit The length of the extent of the transcript on the genome, which should correspond to the premessenger, is given here. By clicking, you can get the corresponding sequence. Below that, the sequence of the genomic piece extending from base -1 of the transcript to base -5000 is made available by clicking on “5 kb just upstream”. In cases where the transcript is complete, this sequence most likely contains the promoter.

       Column 8: coordinates on gene gives the coordinates of the 5’ and 3’ ends of the transcript on the gene. Base 1 of the gene is the 5’ most base of the 5’ most transcript.

The table gives access to sequences by clicking on the bp values of the transcript itself (upper case for coding region including Stop, lower case for UTRs), the 5’UTR and 3’ UTR (excluding Stop codon) sequences when they exist, the transcription unit i.e. the premessenger sequence (or extent covered on the genome), then the sequence of the 5 kb just upstream of the premessenger.
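The case convention described above (upper case for the coding region including Stop, lower case for UTRs) makes it easy to separate the three parts of a downloaded transcript sequence. The following is a minimal sketch, assuming the sequence contains a single contiguous upper-case block; the function name `split_transcript` and the toy sequence are illustrative and not part of AceView.

```python
def split_transcript(seq: str) -> dict:
    """Split a mixed-case transcript into 5' UTR, CDS and 3' UTR.

    Assumption: the coding region (including the Stop codon) is in upper
    case, the UTRs are in lower case, and there is a single upper-case block.
    """
    upper_positions = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper_positions:
        raise ValueError("no upper-case coding region found")
    start = upper_positions[0]          # first coding base (0-based)
    end = upper_positions[-1] + 1       # one past the last base of the Stop
    return {"utr5": seq[:start], "cds": seq[start:end], "utr3": seq[end:]}


# Toy sequence, invented for illustration only.
parts = split_transcript("ggcaATGGCTTAAttgc")
utr_lengths = (len(parts["utr5"]), len(parts["utr3"]))
```

Because the Stop codon is part of the upper-case block, the 3’ UTR returned here excludes it, which matches the way the 3’ UTR length is reported in column 4.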

 

The summary of mRNA and protein annotations


 

Example: Annotation of variants?

mRNA variant

Overview (for structural details see previous table)

aDec03

This complete CDS mRNA is 2201 bp long. We annotate here the sequence derived from the genome, although the best path through the available clones differs from it in 78 positions. It has 9 exons. The premessenger covers 4.64 kb on the NCBI build 34, August 2003 genome. The protein (449 aa, 48.0 kDa, pI 5.7) contains no Pfam motif. It contains 2 coiled coil stretches [Psort2]. Taxblast (threshold 10^-3) tracks ancestors down to Bilateria.

bDec03

This complete CDS mRNA is 2192 bp long…….

We collate in this table the summaries of annotations of the transcript variants and protein isoforms, as they result from our analysis presented in the annotated mRNA page. Although the table looks repetitive, it contains interesting information about completeness, match to genome, motifs, predicted intracellular localization and sequence ancestry that can be used to quickly track functional differences among the variants.

 

The “Proteins” table


 

This table lets you see at a glance, from the last column, whether an isoform has its exonic structure fully supported by a single clone, or whether it requires concatenation of two or more cDNA clones.

 Example: Proteins ?

Protein | Extends from | Coordinates on mRNA | Minimal set of supporting clones
aDec03 complete, 744aa | Met to Stop | 300 to 2534 | AF248482, AF271088
bDec03, 530aa | 1st codon to Stop | 197 to 1789 | AFxxxxx
cDec03 complete, included in b, 500aa | Met to Stop | 189 to 1691 | AFyyyyy

       Column 1: Protein gives the identifier of the protein, which matches the identifier of the mRNA variant (their full names would contain the gene name followed by the variant identifier). Proteins are named and ordered according to their size, given in amino acids. The sequence of the protein (derived from the genome sequence) is available by clicking on the size.

-       The completeness of the protein is indicated. By complete, we mean that the protein lies in an open reading frame bounded by a Stop on each side. If “complete” is not indicated, either there is no convincing evidence that the protein is complete on both the N and C termini, or there is a gap in the sequence or the alignment (either due to a genome or a cDNA problem).

-       Finally a comparison of the protein to the other isoforms is given. It could say a=b, meaning that although the two transcripts differ structurally, the two putatively encoded products are identical: the transcripts differ in their UTR parts. Alternatively it could say “included in b”, as for cDec03 in the example above.

       Column 2: extends from  This column shows the extent of the putative CDS. Met to Stop indicates that the protein is N- and C-terminal complete, although it may have an internal gap. 1st codon to Stop applies to a protein that is C-terminal complete but N-terminal incomplete or uncertain; Met to last codon applies to a protein that is N-terminal complete but C-terminal incomplete, and so on.

       Column 3: coordinates on the mRNA gives the coordinates of the annotated protein on the transcript.

       Column 4: minimal set of supporting clones gives the accessions of the cDNAs necessary and sufficient to support the coding region of the transcript. This brings critical information about the justification of exon chaining: an isoform with a single accession has its structure fully supported. If it is complete and fully sequenced, with no gap in the alignment, it is of NCBI RefSeq quality.
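The protein sizes and the coordinates on the mRNA in the example table are arithmetically linked. The sketch below assumes that the coordinates are 1-based and inclusive and that the annotated range ends with the Stop codon, which is consistent with the three example rows; the function name `protein_length` is illustrative.

```python
def protein_length(cds_start: int, cds_end: int, ends_with_stop: bool = True) -> int:
    """Number of amino acids encoded between two mRNA coordinates.

    Assumption: coordinates are 1-based and inclusive, and the range
    includes the Stop codon when ends_with_stop is True.
    """
    n_bases = cds_end - cds_start + 1
    if n_bases % 3 != 0:
        raise ValueError("coding range is not a multiple of 3 bases")
    codons = n_bases // 3
    # The Stop codon occupies the last three bases but encodes no amino acid.
    return codons - 1 if ends_with_stop else codons


# Coordinates taken from the example "Proteins" table above.
examples = {"aDec03": (300, 2534), "bDec03": (197, 1789), "cDec03": (189, 1691)}
lengths = {name: protein_length(*coords) for name, coords in examples.items()}
# lengths == {"aDec03": 744, "bDec03": 530, "cDec03": 500}
```

For aDec03, for instance, 2534 - 300 + 1 = 2235 bases, i.e. 745 codons, of which the last is the Stop, giving the 744 aa shown in the table.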

Note: We usually annotate only one open reading frame (ORF) per mRNA, choosing the longest and deriving its sequence from the underlying genome. If there is an error in the genome, a better ORF may be derived from the cDNA consensus sequence. It is also possible that the cell uses another frame, or makes more than one product per mRNA. The ORF we annotate on each transcript is shown as a broad solid pink area on the drawing. An open reading frame that does not cover most of the standard gt-ag or gc-ag intron boundaries (both drawn in pink, blue being reserved for atypical splice sites) is in our opinion suspicious: such a gene may be non-coding. If you are interested in the gene, we recommend that you reanalyze all these possibilities yourself, using the sequences given in the transcripts table, in particular the .AM AceView reference sequences, which represent the consensus of cDNA sequences guided by the genome sequence.
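For readers who want to reanalyze the reading frames themselves, the following is a minimal sketch of a longest-ORF search over the three forward frames of an mRNA sequence. It is a simplified illustration, not the AceView procedure: it assumes that an ORF runs from the first ATG after a Stop to the next in-frame Stop, and it ignores N-terminal incomplete proteins and alternative initiation codons. The function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna: str):
    """Return (start, end, frame) of the longest ATG-to-Stop ORF.

    Coordinates are 1-based and inclusive, and the end includes the Stop
    codon. Returns None if no complete ORF is found.
    """
    seq = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # 0-based index of the first ATG since the last Stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if best is None or length > best[1] - best[0] + 1:
                    best = (start + 1, i + 3, frame)
                start = None
    return best


# Toy sequence with a short ORF in frame 2 and a longer one in frame 0.
orf = longest_orf("ccATGAAATAGgATGGCTGCTGCTTAAtt")
```

Running the same search on both the genome-derived sequence and the .AM sequence of a variant is one simple way to spot cases where a genome error shortens the annotated ORF.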

The “introns and exons structure and support” table


 

See the help on introns for more explanations. This table offers a complete summary of the structure of the gene, with links to the relevant sequences: click on the exon or intron length to get the underlying genome sequence.

 

Example:

Intron/exon structure and support

 

 

Exon or intron | in variant | Length & DNA | Coordinates on gene | Supporting clone(s)
Exon 1 | c | 136 | 1 to 136 | BM563894
Alternative intron [gt-ag] | c | 2049 | 137 to 2185 | BM563894 and 2 others
Alternative exon 2 | o | 142 | 394 to 535 | BF686314
Alternative exon 3 | s | 113 | 423 to 535 | AV714035
………

-         Column 1 lists exons and introns, and indicates the intron feet (the boundary dinucleotides, for example [gt-ag]) and the alternative or constitutive nature of each element.

-         Column 2 “in variant” explicitly attributes each exon and intron to the AceView transcripts it belongs to, and provides a link to the annotated transcript.

-         Column 3 “Length & DNA” provides the exon or intron length in bp, as well as a link to the sequence of each element. When the positions and sizes of two alternative exons or introns are close, they will not easily be distinguished in the graphic, but they cannot be missed in the table.

-         Column 4 “Coordinates on gene” provides the coordinates in the gene, in bp. Base 1 in the gene is the first base of the 5’ most transcript.

-         Column 5 “Supporting clones” provides the accession number of an example clone supporting the exon or the intron and the total number of clones supporting the segment. A clone “supports” an exon or an intron if it has exactly the same boundaries. This allows for a quick evaluation of the rarity or frequency of any given intron or exon.

 

All exons and most introns in this table are fully supported by one or more cDNA clones. A clone supports an exon or an intron if it has exactly the same boundaries and no error or uncalled base in the 8 bases bordering the intron on each side. Some supported exons or introns may be shown in this table although the corresponding variants are not displayed, because they were filtered out at the last step in our procedure. If an exon is supported by overlapping clones, they are not listed. This is frequently the case for the last (and first) exon, because alternative polyadenylation is so prevalent that we have chosen to merge and show only the longest 3'UTR. All features in the table (barring programming bugs) are supported by mRNAs or ESTs from the public databases (DDBJ/EMBL/GenBank).
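The support rule for introns can be written as a small check. This is a sketch under stated assumptions: the `CloneAlignment` data model and the clone names are hypothetical, and "the 8 bases bordering the intron on each side" is read here as the 8 exonic bases immediately before the intron and the 8 immediately after it.

```python
from dataclasses import dataclass, field

FLANK = 8  # number of bordering bases that must be error-free on each side


@dataclass
class CloneAlignment:
    """Hypothetical summary of one cDNA clone aligned on the gene."""
    accession: str
    introns: list  # (start, end) pairs, 1-based inclusive gene coordinates
    error_positions: set = field(default_factory=set)  # mismatches or uncalled bases


def supports_intron(clone: CloneAlignment, start: int, end: int) -> bool:
    """True if the clone has exactly this intron and clean flanking bases."""
    if (start, end) not in clone.introns:
        return False
    flanks = set(range(start - FLANK, start)) | set(range(end + 1, end + 1 + FLANK))
    return not (flanks & clone.error_positions)


def supporting_clones(clones, start: int, end: int) -> list:
    return [c.accession for c in clones if supports_intron(c, start, end)]


# Invented clones tested against the intron 137 to 2185 of the example table.
clones = [
    CloneAlignment("clone_A", [(137, 2185)]),
    CloneAlignment("clone_B", [(137, 2185)], {135}),   # error within the 8 flanking bases
    CloneAlignment("clone_C", [(137, 2180)]),          # different acceptor boundary
    CloneAlignment("clone_D", [(137, 2185)], {100}),   # error far from the intron
]
supported = supporting_clones(clones, 137, 2185)
```

The number of accessions returned corresponds to the clone count shown in the “Supporting clones” column, for example “and 2 others” for three supporting clones.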

 

The “Main supporting clones” table


 

Each table of clones indicates the tissue and stage information, the sequence accessions (linked to the GenBank record), and the quality of the match.

This particular table lists the set of clones necessary and sufficient to reconstruct all the AceView reference mRNA variants (the .AM), with sequence best matching the genome.  Each AM sequence is a "golden path" composite of cDNAs, where we choose, for each segment, the clone compatible with the intron structure of the variant that best matches the genome.
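The “golden path” selection can be illustrated with a toy model. The sketch below assumes that the variant has already been cut into segments, that each clone either covers a segment completely or not at all, and that clone and genome sequences have equal length within a segment (no insertions or deletions). These simplifications, the function names and the data are illustrative only and do not describe the actual AceView implementation.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def golden_path(genome_segments, clone_segments):
    """Pick, for each segment, the clone that best matches the genome.

    genome_segments: list of genome sequences, one per segment.
    clone_segments: dict mapping a clone name to a list with one entry per
        segment (the clone sequence, or None if the clone does not cover it).
    Returns the list of chosen clones and the concatenated composite sequence.
    """
    path, composite = [], []
    for idx, genome in enumerate(genome_segments):
        candidates = [
            (mismatches(genome, segs[idx]), name)
            for name, segs in clone_segments.items()
            if segs[idx] is not None
        ]
        if not candidates:
            raise ValueError(f"no clone covers segment {idx}")
        _, best = min(candidates)  # fewest mismatches; ties broken by name
        path.append(best)
        composite.append(clone_segments[best][idx])
    return path, "".join(composite)


# Invented data: two segments, three clones.
genome = ["ACGT", "GGCC"]
clone_data = {
    "clone_A": ["ACGT", "GGTC"],
    "clone_B": ["ACTT", "GGCC"],
    "clone_C": [None, "GGCA"],
}
path, am_like = golden_path(genome, clone_data)
```

The set of distinct clones in `path` plays the role of the "necessary and sufficient" set listed in this table.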

not yet written, sorry

The tables of “All Supporting clones” for the gene and the transcript(s)


Each table of clones indicates the tissue and stage information, the sequence accessions (linked to the GenBank record), and the quality of the match.

not yet written sorry

The Fasta sequence page


 This page contains its own explanations. If you are interested in more complete sequence data, please go to the tables of transcripts for the mRNA (sequence reconstructed from the genome or from the consensus of the cDNAs guided by the genome (.AM)), premessenger, 5’UTR, 3’UTR, and the 5 kb upstream of the transcript on the genome (probably containing the promoter in the case of an mRNA with complete CDS). Go to the protein table for the sequences of proteins (deduced from the genome), and to the introns and exons table for the sequences of introns and exons. The sequences of primers to amplify the CDS are given in each mRNA page. Please tell us if you need other sequences.

   

 Help! How do I read an AceView graph?


Description of the gene display (sorry, not yet written).

The “Gene on Genome” page has a supporting color-coded zoomable diagram that shows the gene aligned on the genome, the AceView reconstructed transcripts with indication of standard and non-standard introns and delineation of the main open reading frame, and the gene’s neighbors.

In AceView, the extent of the nematode genes (represented by the named turquoise bar in the graphs on genome) is based on the actual mRNA and EST supporting data rather than on the predicted sequences from WormBase.

 

In the annotated mRNA page, the AceView transcripts can be viewed as spliced mRNA variants, decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and all supporting mRNAs and ESTs, with color-coded indications of differences from the genome sequence and labeled anomalies.

 


 

