Help on the “Gene on Genome” page
Last updated March 14th, 2006
In this section, we document the features described in the texts, define the words we use when they are ambiguous, and explain which information is contained in the various tables of results. Explanations follow the order of the texts and tables on the “Gene on Genome” page.
The Gene on Genome page |
· Gene synopsis (collected annotations): RefSeq and AceView summary, alias names, map, links, closest protein homolog between human and the well annotated nematode (in AceView/WormGenes); functional annotation using Gene Ontology, phenotype, products and protein family, interactions with other genes and proteins. Bibliography.
· AceView synopsis: inferences deduced from AceView cDNA alignments
o AceView inferences on gene expression level, number of transcript variants and protein isoforms
o AceView inferences on expression profile and pattern (so far only in the worm)
o AceView inferences on number of introns, their alternative/constitutive nature and their boundaries
o AceView inferences on alternative features: about promotors, last exons, cassette exons, retained introns and their effects on protein variations
o AceView inferences on gene regulation
· Molecules: all objects in this chapter link to their sequences
Gene synopsis |
The first few paragraphs of the “Gene” chapter are annotations inherited from either LocusLink or RefSeq, or from Pfam or Psort2 analysis of the AceView reconstructed transcripts. The data in this section should be considered an entry point, not an end point: we do our best but do not aim for completeness; rather, we provide links to other sources, such as NCBI LocusLink/Gene, Genecards or UCSC. Direct searches in PubMed and analysis of the literature are always recommended.
• The RefSeq, Proteome or AceView summaries are copied directly from the source and reported at the top of the page.
• Alias names
Name and symbol aliases are taken from multiple sources. The main gene name is preferably the official name and symbol, but if the gene does not have an official name yet, we generate names as explained here. Aliases are extracted from GenBank submissions, from LocusLink in human, or from the worm transcriptome project and WormBase in the nematode. In addition, positional names, with chromosome_ coordinate on the chromosome, are generated by AceView for each new human release.
• Map
- For human, we rely on the NCBI resources provided by RefSeq, which includes the official HUGO mapping information. We do not compare the NCBI and EBI map assignments, but the GeneCards group does: if you are interested in maps, you can use the direct link we provide in the Link page.
- For worm, the official map made by the CGC provides us with the value “measured by recombination”, and we use the genes that have been mapped to deduce an interpolated map, with a value for each and every nematode gene.
More details on our procedure:
By combining the excellent CGC genetic map with the known complete sequence of the genome, we can assign a genetic position in centiMorgan to each molecular gene. By trial and error, we found that a very good fit could be obtained using a polynomial of degree 5 whose coefficients were adjusted by best fit, using the Origin statistical package.
For most genes, the 'measured' and 'interpolated' genetic positions are nearly identical. For some genes, there is a significant difference, but we checked in those cases that the displacement was compatible with the raw experimental data. Finally we discarded a dozen controversial cases.
As a result, the NCBI/AceView genetic map is exactly colinear with the chromosome sequence. Notice that in the worm, the zero is assigned by convention to a nearly central gene on each chromosome; hence half the genes have a negative map position.
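As an illustration of the interpolation described above, the following Python sketch fits a polynomial by least squares to (physical position, genetic position) pairs and evaluates it at the position of any gene. It is only a sketch under stated assumptions, not the procedure that was run in the Origin package: the function names, the rescaling of physical positions to a small range, and the sample values are all hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x**(i+j)) and b[i] = sum(y * x**i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def interpolate_cm(coeffs, position):
    """Evaluate the fitted polynomial at a physical position (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * position + c
    return result


if __name__ == "__main__":
    # Hypothetical mapped genes: physical position rescaled to [-1, 1], position in cM.
    physical = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
    genetic = [-20.0, -9.0, -3.5, -1.0, 0.0, 0.8, 3.0, 8.5, 21.0]
    coeffs = fit_polynomial(physical, genetic, degree=5)
    print(interpolate_cm(coeffs, 0.1))  # interpolated cM for an unmapped gene
```

Rescaling the physical coordinates before fitting keeps the normal equations well conditioned; with raw base-pair coordinates a degree 5 fit written this way would be numerically fragile.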
· Encoded products and protein family description, if any
We collect here the names of the products as they have been submitted by authors of GenBank records. We also add manual annotation in the worm.
If our Pfam search yielded significant matches, we report the InterPro description of the family, and provide the number of products and genes in the entire database that belong to the same protein family. The list of such genes is accessible by clicking on the number. The list is ordered by map and position on the chromosomes; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).
We encourage users to also follow the link on the family name, which will bring them to the Pfam site at Sanger, where additional instructive information, such as 3D structures, is displayed.
· Phenotype
We encourage researchers to contribute notes on phenotypes or any other important biological topic.
In human, the best source of information we link to is the literature report made by the OMIM group.
In the nematode, we update the data only three times a year, except for data contributed by authors, which become visible immediately. We use the C. elegans II book as a reference, we perform some annotation ourselves from papers or use direct author contributions, and we import the limited phenotypic annotation from WormBase. We critically review the RNA interference data, and get the knockout data from the KO Consortium directly. We select some strains from the list available from the CGC. These and other strains can be requested from Theresa Stiernagle and Robert Herman.
Note that pointers to WormBase and to the Avery C. elegans www server are provided in the “Links” page and from the text, by clicking the WormBase gene name: cosmid.number.
· Functional annotation using Gene Ontology
This paragraph and table summarize the phenotypic and functional annotation. They include terms derived directly or indirectly from OMIM in human and from AceView/WormGenes annotation in the nematode. Gene Ontology terms are taken either from Entrez Gene/LocusLink, which currently gets them from the GOA project at EBI, or from Pfam and Psort2 analysis.
The keywords used are chosen from Ashburner’s Gene Ontology whenever possible. Of course, this is seldom possible for phenotypic annotations, yet we try to use controlled vocabulary and regularly rationalize the terms.
The table in this section links each keyword to the list of genes sharing the same annotation in AceView. The list is ordered by map and position on the chromosomes, or by alphabetic order; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).
At the bottom of the page are Gene Ontology terms for cellular localization. Some of this information may seem contradictory, but that is usually because annotations for all alternative variants and protein isoforms are collated there, so one variant may be predicted membrane bound, another secreted and another cytoplasmic. There is also a low rate of incorrect prediction from Psort2, although this is not frequent due to the high thresholds that we impose before reporting a predicted localization.
· Interactions with other genes and proteins
These have been extracted from the literature or from dedicated Web sites. Interacting genes are traditionally evidenced through epistasis experiments, in which the phenotype of a double mutant (preferably loss of function) is compared to the phenotypes of each mutant taken separately, so that the order in which genes act in the pathway can be deduced. The result is a formal network that most often can be understood and validated at the molecular level. Reading the articles is most enlightening, since formal genetic analysis is such a beautiful and powerful area of science.
Interactions between proteins are inferred from physical experiments, such as co-immunoprecipitation followed by mass spectrometry (or immuno-) identification of the partners; two-hybrid is a complementary method with a different set of strengths and limitations. See for example Marc Vidal’s site on systematic 2-hybrid studies in the nematode.
All types of interactions are reported here, with a link to the interacting genes annotated in AceView and usually a pointer to the reference or data source.
AceView synopsis: Level of expression and number of transcript variants and protein isoforms |
Example:
Expression level and number of variants ?
According to AceView, it is expressed at very high level. Its sequence is supported by 217 sequences from 197 cDNA clones and produces, by alternative splicing, 7 different transcripts a, b, c, d, e, g, h altogether encoding 7 different protein isoforms.
Note that the word “sequence” in that paragraph links to the sequence page in FASTA format. You may be wondering how AceView approaches the problem, how we quantify the expression levels, and how we count the alternative variants and the protein isoforms.
How does AceView approach the problem?
Data in this and all following paragraphs in the “Gene” chapter describe inferences on gene expression, alternative variants and gene regulation deduced from AceView cDNA alignments. If you use these data, please cite us, so that we get some credit for our work and can continue to improve AceView.
Alignments in AceView do not use a priori knowledge; they do not impose rules for splicing, for translation capacity or for resembling similar proteins in the databases. AceView does not inject elements of prediction: we simply aim at aligning the cDNA sequences on the genome as well as possible. Although this is not the place to explain the details of how we do this, suffice it to say that, in a benchmark alignment test where we compared the main aligning programs making their alignment data available, AceView came first in its alignments of mRNAs. No program to our knowledge is ambitious enough to try to also align ESTs and cluster the aligned cDNAs into transcripts. But when these programs make their data public, we do not fear competition and will gladly support a benchmark on that more sophisticated aspect.
Upon aligning all available cDNA sequences, EST or mRNA from GenBank, and keeping only the best alignment genomewide for each clone, then filtering to impose a minimal quality of the alignment, AceView uses the number of cDNA clones aligned in any given gene to classify the genes by level of expression in five groups genomewide. The terms used are:
• “at very high level” for genes expressed more than 4 times above the average gene in that release.
• “at high level” for genes expressed between 1.4 and 4 times more than the average gene.
• “well” for genes expressed between 0.4 and 1.4 times the average.
• “moderately” for genes between 0.2 and 0.4 times the average.
• “at low level” for genes below 0.2 times the average.
Note that we do not use any other type of data, such as microarray or chip analyses, for this description.
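The five expression classes listed above can be sketched in a few lines of Python. This is a hedged illustration, not AceView code: the function name and the average clone count used in the example are hypothetical, and the text does not say which class a gene falling exactly on a threshold belongs to, so the boundary handling below is an assumption.

```python
def expression_level(clone_count, average_clone_count):
    """Return the AceView wording for a gene, given its number of aligned cDNA clones
    and the average number of clones per gene in the same release."""
    ratio = clone_count / average_clone_count
    if ratio > 4:
        return "at very high level"
    if ratio > 1.4:  # assumption: exactly 4 times the average counts as "high"
        return "at high level"
    if ratio > 0.4:
        return "well"
    if ratio >= 0.2:
        return "moderately"
    return "at low level"


if __name__ == "__main__":
    # 197 clones as in the example above; the average of 20 clones per gene is made up.
    print(expression_level(197, 20))
```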
Contrary to what is often believed, the AceView models are not predictions. Once all sequences are finely aligned, we cluster them into transcripts by requiring that merging clones that are in contact creates no conflict at the structural level, so that all cDNAs (5’ and 3’ reads or mRNA) that get associated and used to support a given transcript fully match the transcript. A clone that conflicts will be used for an alternative variant and merged with non-conflicting clones. A main difficulty remains: limiting the combinatorial increase in putative alternative transcripts. There may be 20 alternative introns in a gene, each fully supported, yet we do not want to generate all possible combinations: we know that this does not occur in the cell, where the choice of exons to thread depends on the promotor and tissue, the introns, the exons, and the last exon choice, so it would not make sense to enumerate all combinations in the computer. To reach our goal, we limit to a minimum the number of transcripts a given sequence will belong to, and we prefer to merge sequences into transcripts that are already strongly supported, for example by a full length mRNA, rather than let them participate in yet another alternative variant. As a result, a number of transcripts that could mentally be extended to resemble a longer one will de facto remain short and possibly incomplete. Yet interested scientists will be able to “see” the incompatibility in structure and ask for a clone to resequence, or else devise an RT-PCR experiment, possibly extending the alternative transcript with solid data.
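To make the idea of conflict-free merging concrete, here is a toy Python sketch. It is emphatically not the AceView clustering algorithm, whose details are not given on this page: the data model (a clone as a genomic span plus a set of introns), the conflict test, and the greedy longest-first order are all assumptions chosen for the illustration. Each clone joins at most one transcript, and longer clones such as full length mRNAs seed the transcripts.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    name: str
    start: int           # first aligned genomic base
    end: int             # last aligned genomic base
    introns: frozenset   # set of (first intron base, last intron base)


def in_contact(a, b):
    """Two clones are in contact if their genomic spans overlap."""
    return max(a.start, b.start) <= min(a.end, b.end)


def conflict(a, b):
    """Clones in contact conflict if they disagree on the introns within their overlap."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)

    def shared(clone):
        return {i for i in clone.introns if i[0] <= hi and i[1] >= lo}

    return in_contact(a, b) and shared(a) != shared(b)


def cluster(clones):
    """Greedy merge: longest clones first, each clone joins at most one transcript."""
    transcripts = []
    for clone in sorted(clones, key=lambda c: c.end - c.start, reverse=True):
        for members in transcripts:
            touches = any(in_contact(clone, m) for m in members)
            if touches and not any(conflict(clone, m) for m in members):
                members.append(clone)
                break
        else:
            transcripts.append([clone])  # conflicting clone seeds an alternative variant
    return transcripts


if __name__ == "__main__":
    clones = [
        Clone("mRNA1", 1, 400, frozenset({(101, 200)})),
        Clone("est_a", 50, 250, frozenset({(101, 200)})),
        Clone("est_b", 50, 250, frozenset()),   # same region, intron retained
        Clone("est_c", 220, 380, frozenset()),
    ]
    for members in cluster(clones):
        print([c.name for c in members])
```

In this toy example the EST that reads through the intron cannot be merged with the spliced mRNA and therefore supports a second variant, while the two compatible ESTs are absorbed by the mRNA-supported transcript.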
The important fact here is that we guarantee that all the transcripts presented in AceView have support that makes them non-mergeable with the other variants, so that the number of variants we enumerate is a minimal number of alternative forms. In fact, we even do a final check just before opening the site and we remove the worst looking transcripts in large genes, even if they are supported in full by cDNAs. This may not be too wise, but AceView already reports a much greater complexity than all other sites; we are sure the complexity exists and we minimize it rather than maximize it, yet we do not want to scare our users away!
The structural features that distinguish the alternative variants can be found and analyzed most easily in the “Table of introns and exons”. In principle, the diagram of the gene should also help, yet because of the scale of human genes on the genome and the fact that our zooming function is unacceptably slow, it is often impossible to see by eye some subtle differences, such as small changes in the length of an intron, or the presence of an extra exon in a close cluster of exons.
Finding the coding region in a transcript can be done most of the time under minimal hypotheses, such as:
- Trust the genome sequence more than the mRNA or EST sequences (this is right 9 times in 10, at least when we compared, on a sample of 400 cases, the consensus of the mRNA and EST sequences to the genome). As a consequence and for simplicity, we systematically analyze by default the genomic sequence underlying the transcripts. However, we also provide users with the consensus sequence of the mRNAs and ESTs that best matches the genome. We call this sequence the AM, for Acembly mRNA reference sequence. Each AM sequence is a "golden path" composite of cDNAs, where we use locally, for each segment, the clone fully supporting the intron structure of the variant that best matches the genome, in effect knitting a consensus by recombining compatible cDNA clones.
- Pick the longest Open Reading Frame (ORF), especially if it covers exon boundaries, resembles another protein in the nr protein database, is C terminal complete if the 3’ most clones are polyadenylated and primed on a true polyA rather than an A rich genomic area, and is N terminal complete if the 5’ most clone comes from a capped library. By giving scores and penalties to these properties, we usually can select and annotate a single coding sequence (CDS) per mRNA.
- In the case of a complete protein, how should we select the initiator Met? Since all the proteins we annotate are predicted from the mRNA sequence and not experimentally validated (or at least we do not know which ones have really been completely validated), we have decided to use the codon usage table maintained by the Taxonomy group at NCBI. There we read that three codons are candidates to be used as initiators in most species. Although ATG is used more often than TTG or CTG, we annotate the protein starting at whichever of the three codons gives us the longest predicted protein.
- If there is no evidence that the messenger is complete, we annotate the protein deduced from the first open codon in the main ORF. This will be the case if there is no upstream Stop in the frame of the main CDS and, in the case of the nematode, if the transcript is not trans-spliced to a leader, or is not trans-spliced but its 5’ end is defined by multiple clones from different cDNA cap-selected libraries.
- Yet it is not always a trivial enterprise to select a single coding sequence per mRNA: some transcripts encode two successive proteins (in the manner of operons) or mildly overlapping proteins, while some long and well spliced transcripts are not obviously protein-coding. In our working database, we like to annotate multiple proteins per transcript if choosing is difficult; in the public view, we chose to show a maximum of 1 CDS per transcript until January 2004, when we decided to let a few transcripts with two very convincing putative products become public. There are two classes of reasons for those:
o artefactual, due either to errors in the genome sequence, such as a single base deletion or insertion leading to a frameshift, or to errors in AceView missing an obvious 3’ end and concatenating transcripts that should not be merged
o biological, and there are unambiguous examples that cannot be ignored. There are multiple good biological explanations for such cases, among which translational frameshift, use of selenocysteine or of leaky stop anticodon tRNAs, and internal ribosomal entry associated with operon-like transcripts.
- In the latest release, we have allowed multiple products per mRNA, although we limited this to cases where
o the two products do not overlap and have similar sizes; this may reflect an actual operon-like precursor (we have noticed a number of confirmed ones even in human), or a bug of AceView missing a 3’ end or a 5’ end and concatenating transcripts that should exist separately
o the two products are only marginally overlapping (possible reasons for this could be: the genome has a deletion or insertion relative to the mRNA, or more rarely there is a translational frameshift, or there is a bug in the chaining of cDNA clones by AceView, and a real 3’ or 5’ end was missed)
o the two products overlap, but they have about equal length so we cannot decide which frame is the biological one
o In all cases, looking at the “Annotated mRNA” usually allows one to “see” if one or both products are real, since both have been annotated on the graphical view (although the text explanation in the annotated page has not yet been modified to clearly treat this case).
- Taking into account all these considerations, we count the number of distinct protein isoforms. Two transcripts may differ in the structure of their 5’ or 3’ UTR part, yet encode the very same protein. This happens in human in almost 3% of cases. We therefore report both the total number of transcripts and the number of protein isoforms. A more in depth comparison of the proteins, showing completeness and inclusion, is available from the first column of the “Proteins” table.
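The initiator rule given in the list above (start at whichever of ATG, TTG or CTG yields the longest predicted protein) can be illustrated with a short Python sketch. It covers only that rule for ORFs closed by a Stop: the scores and penalties for exon boundaries, database similarity and clone completeness are not modelled, and the function name and the test sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
INITIATORS = {"ATG", "TTG", "CTG"}  # the three candidate initiator codons named in the text


def longest_cds(mrna):
    """Return (start, end) of the longest initiator-to-Stop ORF over the three forward
    frames, as 0-based half-open coordinates including the Stop, or None if there is none."""
    seq = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # first initiator seen since the previous Stop in this frame
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOPS:
                if start is not None and (best is None or pos + 3 - start > best[1] - best[0]):
                    best = (start, pos + 3)
                start = None
            elif start is None and codon in INITIATORS:
                start = pos
    return best


if __name__ == "__main__":
    # Made-up sequence: a TTG lies in frame upstream of the first ATG.
    print(longest_cds("CCCTTGAAACCCATGGGGTAACC"))
```

Taking the first initiator after each in-frame Stop is what makes the predicted protein as long as possible within a given open segment; an ATG-only rule would pick a shorter protein in the example sequence.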
In summary, the laconic little sentence in the paragraph “Level of expression and number of transcript variants and protein isoforms” represents a dense summary of what AceView is best at: transcript reconstruction from marrying cDNA and genome sequences!
AceView inferences on expression profile and pattern |
Once we have identified the set of all clones from the public databases that best match a given gene, we may hypothesize that they actually were transcribed from that gene. Provided that we are able to extract from the public databases (GenBank/EBI/DDBJ) some information about the origin of the clone, the tissue and stage of development, and the normal versus pathological nature of the sample, we should be able to deduce an expression profile. Of course, we would need to normalize with respect to the total number of clones assigned to a gene from each tissue/stage/pathology, so we first need to rationalize the descriptions submitted by the authors.
For human, we are in the process of doing that, and hope to have nice profiles of expression available for the release on build 35.
For the nematode C. elegans, there is one main source of good cDNAs, the Kohara lab, and AceView was born as the database to treat those data. Since June 2003, we have annotated the level and developmental profile of expression for 11,000 nematode genes by using the staged libraries information, adding, when useful, information from other GenBank records. AceView provides direct access to NextDB, the extraordinary resource provided by Kohara and Shin-I to visualize the in situ expression patterns on a partition of all developmental stages. We also hand annotated some genes by describing the in situ hybridization patterns. Note that these data are still unpublished: users of the data presented on the NextDB Web pages should not publish the information without Kohara’s permission and appropriate acknowledgment.
Finally, we list all cDNA clones in each gene, point to the best clone, and describe the anomalies we saw in the anomalous clones.
AceView inferences on number of introns, their alternative/constitutive nature and their feet |
Example:
Introns ?
The gene contains 50 confirmed introns, 47 of which are alternative. Comparison to the genome sequence shows that 47 introns follow the consensual [gt-ag] rule, 3 are fuzzy or ill defined. See this table for details.
You may want help on how confirmed introns are defined by AceView alignments, on the alternative versus constitutive nature of introns, or on intron feet and their graphical representation.
o How are introns defined by Acembly alignments? Alignments in AceView do not impose rules for splicing, yet a cDNA often matches non-contiguous stretches of genomic sequence along its length. Usually, these stretches are exons, and the sequence between two exons is an intron.
• The exact match case: In tens of thousands of cases, the best alignment cannot slide, and there is only one way to get an exact cDNA to genome match around the splice site. By exact match, we mean at least eight unambiguous base pairs matching the exons on each side of the intron. In the vast majority of those cases, we observe a common sequence of two nucleotides at each end of the intron. The most frequent intron feet are, in this order, [gt-ag], [ct-ac] ([gt-ag] on the wrong strand), [gc-ag], [at-ac], or any [other]. But if for example the last letter of an exon is identical to the last letter of the next intron, the intron can slide by one base. In those cases, we let introns slide freely to best match the intron feet sequences in the order above. Once this basic work is done, we regularize some of the introns with feet of unusual type, [other] or [fuzzy], by using the strength of cooperative alignment, or “team jump”.
• No “exact match”:
- Either the cDNA clone sequence matches the genome sequence, but the match is ambiguous or imperfect, because there is at least one base difference or one uncalled base (n) in the 8 bp bordering the intron: the feet are then considered [fuzzy]. It would usually be easy to find a consensus sequence and to regularize the intron to become standard, but this is against the AceView philosophy, which is to stick strictly to the experimental data.
- Alternatively, there is a gap or a topological problem in the local alignment. That may be because the genome or the cDNA is deleted or rearranged: if multiple cDNAs fail to align locally, the genome is likely faulty, otherwise the cDNA is suspicious. Another case corresponds to a sequence gap: for example, a clone was sequenced from both ends, its 5’ and 3’ ESTs make it a unique variant, yet we do not have the central part of the sequence (such a clone should be resequenced).
o Alternative versus constitutive nature of introns, graphical representation
An intron which is found in all transcripts is constitutive; any other intron is alternative. Similarly for exons. The triangles representing constitutive exons in the “Annotated RNA” diagram are empty whereas triangles representing alternative exons are filled with color.
o Intron feet and their graphical representation
§ A well defined typical intron is an intron supported by at least one clone exactly matching the genome over 8 bp on each side and with typical feet [gt-ag] or [gc-ag]. It is displayed as a pink broken line joining the exons in the “Gene on Genome” view, and as a pink triangle in the “Annotated mRNA” view.
§ A well defined but atypical intron, that is an intron supported by at least one cDNA clone with an 8 base exact match on each side, but with any intron feet other than [gt-ag] or [gc-ag], that is [at-ac] or any [other], is shown as a blue broken line joining the exons in the “Gene on Genome” view, and as a blue triangle in the “Annotated mRNA” view.
§ A [fuzzy] intron is supported at best by a clone with a mismatch or an n in the 8 bp bordering the intron on either side. A fuzzy intron is displayed as a pink straight line between the exons in the “Gene on Genome” view, and as a blue triangle in the “Annotated mRNA” view.
§ A gap in the alignment is shown as a conspicuous straight black line in the “Gene on genome” view. Its representation in the “Annotated mRNA” view is not conspicuous so far (Jan 2004): it just shows as horizontal lines in the pink object.
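The intron categories and their display conventions described above can be summarized in a small Python sketch. It is only an illustration: the function names, the 0-based coordinate convention and the two flags are assumptions, and whether the 8 bp on each side match exactly is supplied by the caller rather than computed from an alignment.

```python
# (style in the "Gene on Genome" view, style in the "Annotated mRNA" view), as stated in the text
DISPLAY = {
    "typical": ("pink broken line", "pink triangle"),
    "atypical": ("blue broken line", "blue triangle"),
    "fuzzy": ("pink straight line", "blue triangle"),
    "gap": ("black straight line", "horizontal lines in the pink object"),
}


def intron_feet(genome, first, last):
    """Feet of the intron spanning genome[first..last] (0-based, inclusive), e.g. 'gt-ag'."""
    return f"{genome[first:first + 2]}-{genome[last - 1:last + 1]}".lower()


def classify_intron(genome, first, last, exact_match=True, gap=False):
    """Classify an intron on the strand of the transcript.

    exact_match: at least one clone matches the genome exactly over 8 bp on each side.
    gap: the alignment has a gap here instead of a supported intron.
    """
    if gap:
        return "gap"
    if not exact_match:
        return "fuzzy"
    if intron_feet(genome, first, last) in {"gt-ag", "gc-ag"}:
        return "typical"
    return "atypical"  # [at-ac] or any [other]


if __name__ == "__main__":
    genome = "AAAAGTCCCCAGTTTT"  # made-up sequence with an intron at positions 4..11
    kind = classify_intron(genome, 4, 11)
    print(intron_feet(genome, 4, 11), kind, DISPLAY[kind])
```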
AceView inferences on alternative features: about promotors, last exons, cassette exons, retained introns and their effects on protein variations |
Once the transcripts are reconstructed as explained above, AceView analyses the differences among them by comparing various features: the exon composition of all alternative transcripts, the putative promotors and the confirmed last exons. We then automatically construct a brief summary of the results.
Example:
Alternative features ?
There are 6 probable alternative promotors and 2 non overlapping alternative last exons. The transcripts appear to differ by truncation of the 5' end, truncation of the 3' end, presence or absence of 8 cassette exons, common exons with different boundaries, because an internal intron is not always spliced out.
We define a putative promotor as the area upstream of a transcript whose main open reading frame is N-terminal complete, that is, in human, bounded by an upstream Stop, and in the nematode, either bounded by an upstream Stop or trans-spliced, as explained here. If the first exons of two complete CDS transcripts do not overlap, we consider that the two transcripts define 2 probable alternative promotors. Note that we do not count as a promotor the area upstream of a transcript open at the 5’ end, even if it is a RefSeq. To allow analysis of these regions, we provide the sequence of the 5 kb segment upstream of each transcript in the “Table of transcripts”, in the 7th column.
Similarly, two last exons are considered alternative if the transcripts they belong to encode COOH complete proteins and if the last exons do not overlap.
A cassette exon is an exon that is internal to a transcript, but fully within an intron of another transcript. We compare transcripts 2 by 2, yet we do not overcount cassette exons.
Other terms are self explanatory.
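The definition of a cassette exon given above translates directly into a pairwise comparison. The Python sketch below is an illustration under assumptions, not the AceView implementation: transcripts are represented as ordered lists of (start, end) exon coordinates, the names are hypothetical, and a set is used as one simple way of not counting the same exon twice.

```python
def introns(exons):
    """Introns implied by an ordered list of (start, end) exons."""
    return [(a_end + 1, b_start - 1) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def cassette_exons(transcripts):
    """Internal exons of one transcript lying fully within an intron of another transcript.

    transcripts: dict mapping a variant name to its ordered list of (start, end) exons.
    Returns a set, so an exon shared by several variants is counted only once.
    """
    found = set()
    for name, exons in transcripts.items():
        for exon in exons[1:-1]:  # internal exons only
            for other, other_exons in transcripts.items():
                if other == name:
                    continue
                if any(i_start <= exon[0] and exon[1] <= i_end
                       for i_start, i_end in introns(other_exons)):
                    found.add(exon)
    return found


if __name__ == "__main__":
    # Made-up gene: variant b skips the exon 201-300 present in variants a and c.
    variants = {
        "a": [(1, 100), (201, 300), (401, 500)],
        "b": [(1, 100), (401, 500)],
        "c": [(1, 100), (201, 300), (401, 500), (601, 700)],
    }
    print(cassette_exons(variants))
```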
AceView inferences on gene regulation |
AceView reports possible regulatory mechanisms affecting gene expression, as inferred from gene to genome alignment.
We report the presence of an antisense gene if both genes in antisense have confirmed standard introns which allow confirmation of their strandedness. Quite often in human, one of the genes is non-coding, yet in most cases our annotation of antisense is quite reliable.
We also annotate close neighbors and putative operons, complex loci, as well as other possible regulatory mechanisms that are evidenced from mRNA to genome alignments, including possible RNA editing, translational frameshift, internal entry sites, use of selenocysteine or leaky Stop (so far only a beginning, in the worm). (more details later)
The table of “Transcripts” |
This table summarizes what is known about the global structure and transcript sequence, completeness, UTRs, level of expression and extent on genome for the mRNA variants reconstructed by AceView. It also provides links to the transcript related sequences.
Example Transcripts?
Variant | 5' UTR | Completeness | 3' UTR | # exons | # clones | Transcription | coordinates
aDec03 | 516bp | 1 exon inferred | 335bp, polyA | 9 | 8 | 4637bp. | 24218 to
 | 516bp | | 335bp, polyA | 9 | 80 | 4637bp. | 24218 to
AM-2237 | 571bp … | | 335bp, polyA | 10 | 80 | 28854bp. | 1 to
• Column 1: Variant. mRNA variants are named and ordered as a function of the size of the conceptual translation product: variant a would produce a protein longer than variant b or c etc. All transcripts in this table have at least one structural element that makes them different and non mergeable with the others. Since January 2004, names include a date, to make the variant sequences unique from release to release. Clicking on the variant name brings you to the annotated mRNA page, where you can see a diagram of the variant once we spliced the introns out, with the protein annotation.
• The length in bp and sequence of the mRNA (a click away), as derived from the genome and annotated in AceView, is given under the transcript name. Below, we give the AM sequence if we were able to build one of good quality. Each AM or Acembly mRNA sequence is a "golden path" consensus of cDNAs, where we use, to calculate the sequence in each position, the clone whose sequence best matches the genome locally. We clip the AM when it gets noisy outside of the coding region; this is why the AM often is slightly shorter than the genome derived transcript.
• Column 2: 5’ UTR. This gives the length of the 5’ UTR when it exists, i.e. when the encoded predicted protein is N-terminal complete. The 5’ UTR is calculated under the assumptions explained here, in particular concerning the choice of the coding region and initiator Met.
• Column 3: Completeness. This column is not always present, since here we report possible incompleteness of the transcript sequence. Two types of cases are reported: cases in which there is a gap in the sequence (for example a cDNA clone has been sequenced from both ends, but the sequences do not meet; the length of the gap is estimated from the average length of the cDNA inserts in the library), or cases where there was a gap that we filled by “stealing” exons from alternative variants sharing sequences at the level of the gap extremities. We then report the number of exons “inferred”, i.e. from which we stole part of the sequence.
• Column 4: 3’ UTR. We report here the length and sequence of the longest 3’ UTR associated with the transcript, excluding the bases encoding the stop codon, when the encoded predicted protein is C-terminal complete. The word polyA appears if at least one clone has a clear polyA (drawn as a black circle in the cDNA). Note that in AceView we do not annotate alternative 3’ UTRs that differ only by their length, because alternative polyA addition sites are so prevalent, in human and in worm, that it would not be practical to report all polyadenylation sites. Yet interested users can look at the annotated transcript diagram (by clicking on the variant name in the table), where those are very obvious. The presence of a polyadenylation signal, if any, is annotated in the mRNA page. If you would like one or the other information reported here, please say so.
• Column 5: # exons gives the number of exons of the transcript, including any inferred exons reported in column 3. This may be useful information to “see” unspliced transcripts for example. All transcripts here differ in their structure and cannot be merged.
• Column 6: # clones reports the number of cDNA clones that match this variant. Note that because of the way we build the transcripts and of our effort to use each cDNA sequence in a minimal number of variants, in order to limit the combinatorial effect, the number of clones indicated here only gives an estimate of the level of expression of any given variant. Yet this is important information to “see” the most and least common forms.
• Column 7: Transcription unit. The length of the extent of the transcript on the genome, which should correspond to the premessenger, is given here. By clicking, you can get the corresponding sequence. Below that, the sequence of the genomic piece extending from base -1 of the transcript to base -5000 is made available by clicking on “5 kb just upstream”. In cases where the transcript is complete, this sequence most likely contains the promotor.
•
Column 8:
coordinates on gene gives the
coordinates of the 5’ and 3’ ends of the transcript on the gene. Base 1 of the
gene is the 5’ most base of the 5’ most transcript.
The table gives
access to sequences by clicking on the bp values of the transcript itself
(upper case for coding region including Stop, lower case for UTRs), the 5’UTR
and 3’ UTR (excluding Stop codon) sequences when they exist, the transcription
unit i.e. the premessenger sequence (or extent covered on the genome), then the
sequence of the 5 kb just upstream of the premessenger.
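As a small illustration of the case convention just described, a downloaded transcript sequence can be split back into its parts. This is only a sketch: the function name `split_by_case` and the toy sequence are ours, not part of AceView, and it assumes the sequence contains one upper-case coding region (Stop included) flanked by optional lower-case UTRs.

```python
import re


def split_by_case(transcript: str) -> dict:
    """Split a transcript into 5' UTR, coding region and 3' UTR by letter case.

    Assumes lower case for UTRs and upper case for the coding region,
    Stop codon included (hypothetical helper, not an AceView tool).
    """
    match = re.fullmatch(r"([a-z]*)([A-Z]*)([a-z]*)", transcript)
    if match is None:
        raise ValueError("expected lower-case UTRs around one upper-case coding region")
    utr5, cds, utr3 = match.groups()
    return {"utr5": utr5, "cds": cds, "utr3": utr3}


# Toy sequence, invented for the example.
parts = split_by_case("ggcaATGGCTTAAttcg")
print(parts)
```

Because the Stop codon is upper case, it stays with the coding region and is excluded from the 3’ UTR, which matches the way the table reports the UTR sequences.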
The summary of mRNA and protein annotations |
Example: Annotation of variants
mRNA variant | Overview (for structural details see previous table)
aDec03 | This complete CDS mRNA is 2201 bp long. We annotate here the sequence derived from the genome, although the best path through the available clones differs from it in 78 positions. It has 9 exons. The premessenger covers 4.64 kb on the NCBI build 34, August 2003 genome. The protein (449 aa, 48.0 kDa, pI 5.7) contains no Pfam motif. It contains 2 coiled coil stretches [Psort2]. Taxblast (threshold 10^-3) tracks ancestors down to Bilateria.
bDec03 | This complete CDS mRNA is 2192 bp long…….
We collate in this table the summaries of annotations of the transcript variants
and protein isoforms, as they result from our analysis presented in the annotated mRNA page. Although the table looks
repetitive, it contains interesting information about completeness, match to genome, motifs,
predicted intracellular localization and sequence ancestry that can be used to
quickly track functional differences among the variants.
The “Proteins” table |
This table shows at a glance, in the last column, whether an isoform has its
exonic structure fully supported by a single clone or requires the
concatenation of two or more cDNA clones.
Example: Proteins
Protein | Extends from | coordinates | minimal set of supporting clones
aDec03 complete | Met to Stop | 300 to 2534 | AF248482
bDec03 | 1st codon to Stop | 197 to 1789 | AFxxxxx
cDec03 complete included in b | Met to Stop | 189 to 1691 | AFyyyyy
• Column 1: Protein gives the identifier of the protein, which matches the
identifier of the mRNA variant (their full names would contain the gene name
followed by the variant identifier). Proteins are named and ordered according
to their size, given in amino acids. The sequence of the protein (derived from
the genome sequence) is available by clicking on the size.
- The completeness of the protein is indicated. By complete, we mean that the
protein lies in an open reading frame bounded by a Stop on each side. If
“complete” is not indicated, either there is no convincing evidence that the
protein is complete on both the N and C termini, or there is a gap in the
sequence or the alignment (due either to a genome or to a cDNA problem).
- Finally, a comparison of the protein to the other isoforms is given. It could
say a=b, meaning that although the two transcripts differ structurally, the two
putatively encoded products are identical: the transcripts differ in their UTR
parts. Alternatively it could say “included in b”, as for cDec03 in the example
above.
• Column 2: extends from. This column shows the extent of the putative CDS. Met
to Stop indicates that the protein is N and C terminal complete, although it
may have an internal gap. 1st codon to Stop would apply to a protein that is
C-complete but N-terminal incomplete or uncertain; Met to last codon to a
protein that is N-complete but C-incomplete, and so on.
• Column 3: coordinates on the mRNA gives the coordinates of the annotated
protein on the transcript.
• Column 4: minimal set of supporting clones gives the accessions of the cDNAs
necessary and sufficient to support the coding region of the transcript. This
brings critical information about the justification of the chaining of exons:
an isoform with a single accession has its structure fully supported. If it is
complete and fully sequenced, with no gap in the alignment, it is of NCBI
RefSeq quality.
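The “extends from” labels can be approximated from the mRNA sequence and the coordinates in column 3. The sketch below is an assumption-laden simplification, not the AceView procedure: it only looks at the first and last codons of the CDS, it takes the coordinates as 1-based and inclusive with the Stop codon included, and the name `extent_label` and the fourth label (“1st codon to last codon”) are ours.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def extent_label(mrna: str, start: int, end: int) -> str:
    """Label a CDS given 1-based inclusive coordinates on the mRNA.

    Hypothetical helper: checks only the first and last codons.
    """
    cds = mrna[start - 1:end].upper()
    if len(cds) < 3 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    first = "Met" if cds[:3] == "ATG" else "1st codon"
    last = "Stop" if cds[-3:] in STOP_CODONS else "last codon"
    return f"{first} to {last}"


# Toy sequences, invented for the example.
print(extent_label("ccATGGCTTAAgg", 3, 11))
print(extent_label("GCTGCTTAA", 1, 9))
```

With the coordinates of the example table, 300 to 2534 spans 2235 bases, a multiple of 3, which is consistent with coordinates that include the Stop codon.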
Note: We usually annotate
only one open reading frame (ORF) per mRNA,
choosing the longest, and deriving its sequence from the underlying genome. If
there is an error in the genome, a better ORF may be derived from the cDNA
consensus sequence. It is also possible that the cell uses another frame, or
makes more than one product per mRNA. The ORF we annotate on each transcript is
shown as a broad solid pink area on the drawing. An open reading frame that
does not cover most of the standard gt-ag or gc-ag intron boundaries (both
drawn in pink, blue being reserved for atypical splice sites) is in our opinion
suspicious: such a gene may be non-coding. If you are interested in the gene,
we recommend that you reanalyze all these possibilities yourself, using the
sequences given in the transcripts table, in
particular the .AM AceView reference sequences, which represent the consensus
of cDNA sequences guided by the genome sequence.
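For readers who want to reanalyze a sequence themselves, a minimal sketch of a longest-ORF search is given below. It is not the AceView annotation code: it scans only the three forward frames, defines an ORF as running from the first ATG after a Stop to the next Stop, and returns 1-based inclusive coordinates; the name `longest_orf` is ours.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def longest_orf(seq: str):
    """Return (start, end), 1-based inclusive, of the longest ATG-to-Stop ORF
    in the three forward frames, or None if there is none."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        orf_start = None  # index of the first ATG seen since the last Stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and orf_start is None:
                orf_start = i
            elif codon in STOP_CODONS:
                if orf_start is not None:
                    length = i + 3 - orf_start
                    if best is None or length > best[1] - best[0] + 1:
                        best = (orf_start + 1, i + 3)
                orf_start = None
    return best


# Toy sequence, invented for the example.
print(longest_orf("ccATGGCTGCTTAAgATGTAA"))
```

Running the same search on both the genome-derived sequence and the .AM consensus is one simple way to spot cases where a genome error shortens the annotated ORF.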
The “introns and exons structure and support” table |
See the help on introns for more explanations. This table
offers a complete summary of the structure of the gene, with links to the
relevant sequences: click on the exon or intron length to get the underlying
genome sequence.
Example: Intron/exon structure and support
Exon or intron | in variant | Length & DNA | Coordinates on gene | Supporting clones
Exon 1 | c | 136 | 1 to 136 | BM563894
Alternative intron [gt-ag] | c | 2049 | 137 to 2185 | BM563894
Alternative exon 2 | o | 142 | 394 to 535 | BF686314
Alternative exon 3 | s | 113 | 423 to 535 | AV714035
……… | | | |
- Column 1 lists exons and introns, and indicates their feet (the intron
boundaries, e.g. gt-ag) and their alternative/constitutive nature.
- Column 2 “in variant” explicitly attributes each exon and intron to the
AceView transcripts it belongs to, and provides a link to the annotated
transcript.
- Column 3 “Length & DNA” provides the exon or intron length in bp, as well as
a link to the sequence of each element. When the positions and sizes of two
alternative exons or introns are close, they will not easily be distinguished
in the graphic, but they cannot be missed in the table.
- Column 4 “Coordinates on gene” provides the coordinates in the gene, in bp.
Base 1 in the gene is the first base of the 5’ most transcript.
- Column 5 “Supporting clones” provides the accession number of an example
clone supporting the exon or the intron and the total number of clones
supporting the segment. A clone “supports” an exon or an intron if it has
exactly the same boundaries. This allows for a quick evaluation of the rarity
or frequency of any given intron or exon.
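The lengths in column 3 follow directly from the coordinates in column 4, since both ends are included: length = end - start + 1. The short check below applies this to the rows of the example table; the variable and function names are ours.

```python
# (element, length in bp, start, end) copied from the example table above
rows = [
    ("Exon 1", 136, 1, 136),
    ("Alternative intron [gt-ag]", 2049, 137, 2185),
    ("Alternative exon 2", 142, 394, 535),
    ("Alternative exon 3", 113, 423, 535),
]


def length_from_coordinates(start: int, end: int) -> int:
    """Length of a segment whose start and end are both included."""
    return end - start + 1


for name, length, start, end in rows:
    assert length_from_coordinates(start, end) == length, name
```

Note that alternative exons 2 and 3 share the same 3’ end (535) and differ only by their 5’ boundary, which is the kind of close pair that is easier to spot in the table than in the graphic.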
All exons and most introns in this table are fully supported by one or more
cDNA clones. A clone supports an exon or an intron if it has exactly the same
boundaries and no error or uncalled base in the 8 bases bordering the intron on
each side. Some supported exons or introns may be shown in this table although
the corresponding variants are not displayed, because those variants were
filtered out at the last step of our procedure. If an exon is supported by
overlapping clones, those clones are not listed. This is frequently the case
for the last (and first) exon, because alternative polyadenylation is so
prevalent that we have chosen to merge and show only the longest 3'UTR. All
features in the table (barring programming bugs) are supported by mRNAs or ESTs
from the public databases (DDBJ/EMBL/GenBank).
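The support rule for introns can be sketched as follows. Everything here is hypothetical and for illustration only: the `CloneAlignment` structure, its field names and the function `supports_intron` are ours, coordinates are 1-based gene positions, and we assume that the “8 bases bordering the intron” are the 8 exonic bases on each side.

```python
from dataclasses import dataclass, field

FLANK = 8  # bases checked on each side of the intron, as stated in the text


@dataclass
class CloneAlignment:
    """Hypothetical summary of one cDNA clone aligned on the gene."""
    introns: set  # {(start, end)} in gene coordinates, 1-based inclusive
    bad_positions: set = field(default_factory=set)  # mismatched or uncalled bases


def supports_intron(clone: CloneAlignment, start: int, end: int) -> bool:
    """True if the clone has exactly this intron and clean flanking bases."""
    if (start, end) not in clone.introns:
        return False
    flank = set(range(start - FLANK, start)) | set(range(end + 1, end + 1 + FLANK))
    return not (flank & clone.bad_positions)


# Toy clones, invented for the example; the intron is the one from the table.
clean = CloneAlignment(introns={(137, 2185)}, bad_positions={120})
noisy = CloneAlignment(introns={(137, 2185)}, bad_positions={130})
print(supports_intron(clean, 137, 2185), supports_intron(noisy, 137, 2185))
```

Counting the clones for which such a test succeeds gives the kind of total reported in the “Supporting clones” column.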
In each table of clones are indicated the tissue and stage information, the sequence accessions (linked to the GenBank record), and the quality of the match.
This particular table lists the set of
clones necessary and sufficient to reconstruct all the AceView reference mRNA variants (the .AM), with
sequence best matching the genome. Each
AM sequence is a "golden path" composite of cDNAs, where we choose,
for each segment, the clone compatible with the intron structure of the variant
that best matches the genome.
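The idea of a "golden path" can be pictured with the toy sketch below. It is not the AceView implementation: we assume that the candidate clones have already been restricted to those compatible with the intron structure of the variant, that "best matches the genome" means the fewest mismatches over the segment, and that ties are broken by accession; the function name, segment names and accessions are invented.

```python
def golden_path(segments, mismatches):
    """For each segment, pick the compatible clone with the fewest mismatches.

    segments:   segment names in transcript order
    mismatches: segment -> {clone accession: mismatch count}, already limited
                to clones compatible with the variant's intron structure
    """
    path = []
    for segment in segments:
        candidates = mismatches[segment]
        # Fewest mismatches first; accession as a deterministic tie-breaker.
        best = min(candidates, key=lambda acc: (candidates[acc], acc))
        path.append((segment, best))
    return path


# Toy data, invented for the example.
example = golden_path(
    ["segment1", "segment2"],
    {"segment1": {"cloneA": 2, "cloneB": 0}, "segment2": {"cloneA": 1, "cloneB": 3}},
)
print(example)
```

The set of distinct accessions appearing in such a path is what a table of "necessary and sufficient" clones would list for the variant.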
not yet written, sorry
The tables of “All Supporting clones” for the gene and the transcript(s) |
In each table of clones are indicated the tissue and stage information, the sequence accessions (linked to the GenBank record), and the quality of the match.
not yet written sorry
The Fasta sequence page |
This page contains its own explanations. If you are interested in more complete sequence data, please go to the tables of transcripts for mRNA (sequence reconstructed from the genome or from the consensus of the cDNAs guided by the genome (.AM)), premessenger, 5’UTR, 3’UTR, and the 5 kb upstream of the transcript on the genome (probably containing the promoter in the case of an mRNA with complete CDS). Go to the protein table for the sequences of proteins (deduced from the genome), and to the introns and exons table for the sequences of introns and exons. The sequences of primers to amplify the CDS are given in each mRNA page. Please tell us if you need other sequences.
Help! How do I read an AceView graph? |
Description of the gene display (sorry, not yet written).
The “Gene on Genome” page has a supporting color-coded zoomable diagram that shows the gene aligned on the genome, the AceView reconstructed transcripts with indication of standard and non-standard introns and delineation of the main open reading frame, and the gene’s neighbors.
In AceView, the extent of the nematode genes (represented by the named
turquoise bar in the graphs on genome) is based on the actual mRNA and EST
supporting data rather than on the predicted sequences from WormBase.
In the annotated mRNA page, the AceView transcripts can be viewed as spliced mRNA variants, decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and all supporting mRNAs and ESTs, with color-coded indications of differences from the genome sequence and labeled anomalies.