Help on the annotated mRNA page
January 27, 2004
• Viewing the mRNA page, tabs and menus
• How to access the alternative variants annotations
• Choice of the sequence to annotate: genome versus cDNA consensus
• cDNA clones supporting this mRNA
· Annotation of the mRNA and protein variant
  - Summary annotation of the mRNA variant
  - Primers
  - Predicted cellular localization and motifs (Psort)
  - Protein family classification (Pfam)
  - Protein homologies (BlastP)
  - Lineage and Closest homologues (TaxBlast)
· Graphical representation of the AceView-derived mRNA and protein
Viewing the mRNA page, tabs and menus |
To see this page and the rest of AceView correctly, please enable
JavaScript and StyleSheets in your browser: Tools -> Internet Options.
The Annotated mRNA(s) page is accessible by clicking on the gray tab at the
top of the page; the tab then turns blue.
On the left, you now see a text describing the annotation of the specific
mRNA, starting with a menu and a mouse-over submenu. The menu and submenu are
transcript dependent: only paragraphs that have content for the particular
transcript appear in the menus. Titles in the menu and submenu are clickable
(blue) and take you to the corresponding chapter or paragraph in the text, or
to a linked document.
In turn, each paragraph in the text has an associated help, accessible by
clicking on the conspicuous question mark (?). The help gives details on what
is described and on how the information is gathered, generated or evaluated in
AceView. Please complain if things remain unclear.
Example of menu in the mRNA page:
mRNA summary | Gene summary | Protein annotation | mRNA structure | cDNA clones | Sequences | Diagram |
On
the right is a diagram depicting the spliced variant,
decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and up to 60 supporting mRNAs
and ESTs. Reconstructed consensus sequences are highlighted: the .AM AceView
reference sequence in pale yellow and the NCBI RefSeq in light turquoise. For
each sequence aligned in the mRNA variant, differences from the genome sequence
are color-coded, polyA or unaligned sequences are drawn and cDNA clone
anomalies are labeled.
For
the worm, we indicate the stage and/or tissue and the presence and type of
trans-spliced leader.
How to access the alternative variants annotations |
If there are alternative variants, you may view them in either of two ways:
by using the toggle at the top left of the graphic mRNA window, which lists
all the annotated variants, or by clicking on the letter identifying the
variant in the gene summary in the text window on the left.
Choice of the sequence to annotate: underlying genome versus cDNA consensus |
We have chosen to annotate the mRNA sequence derived from the genome sequence
(by concatenating the exon sequences) rather than the consensus of the cDNAs,
which we call .AM (itself depicted alongside the transcript as a cDNA
highlighted in yellow). We compared the average quality of the genome sequence
with that of individual mRNAs from GenBank and found that the average
frequency of “errors” is 8 to 10 times greater in mRNAs than in finished
genomic sequences (on a sample of 400 differences, in the worm). Hence we
believe that the genome-derived sequence is overall of better quality than the
cDNA consensus. We do, however, give in the transcript summary the explicit
count of differences between the two sequences, and we provide the cDNA
consensus sequence (.AM) in the “Fasta sequences” list of each transcript and
in the “Transcripts” table of the Gene page.
Warning: there are a number of cases where a single base deletion or
insertion in the genome has led to a frameshift breaking the open reading
frame. These cases are easy to see graphically: a line of blue errors in the
cDNAs coincides with a shift in the black/green open reading frames. Users
should then be aware that there is a problem, and they had better reannotate
the transcript themselves, preferring the cDNA consensus sequence .AM that we
provide in the Fasta sequences page.
cDNA clones supporting this mRNA |
Selected cDNA clones supporting this mRNA ?
The detailed table about these clones is here.
Complete CDS clones: A21353, NM_002286, X51985.
The table of all clones supporting the mRNA has the standard format for
tables of clones in AceView. If you need more explanations, please write to us.
The second paragraph points to the accessions of the most interesting clones;
clicking on one brings up the NCBI mRNA or EST accession record.
Summary annotation of the mRNA |
Example: This complete CDS mRNA is 1,035 bp long. It
is supported by 4 cDNA clones. We annotate here the
sequence derived from the genome, although the best path through the available clones
differs from it in 4 positions. The pre-messenger has 5 exons. It covers 9.66
kb on the NCBI build 33, April 2003 genome. The protein (229 amino acids, 26.8
kDa, pI 9.0) contains no Pfam
motif. It contains 2 coiled-coil
stretches and an endoplasmic reticulum membrane domain [PSORT II]. It is predicted to localize in the
nucleus [PSORT II]. TaxBlast (threshold 10^-3) tracks ancestors back to Eutheria.
An overview is provided here; it typically contains a summary
of all the analyses detailed in the following paragraphs. It is computer
generated; therefore it sounds like a cold frog. The information is always
presented in the same order:
1. Is the transcript complete or truncated, and if so, on which side?
2. What is its observed length? This links to the sequence, with lower case
for the UTRs and upper case for the coding region; exons are indicated in
alternating colors.
3. How many clones support it?
4. Does the sequence derived from the cDNA consensus guided by the genome
differ from the underlying genome, and if so, how many point differences are
there? Look at the yellow highlighted clone-like object in the diagram to see
whether the sequence differences, indicated by blue marks for insertions or
deletions and red marks for single point mutations, are in the coding region
(wide dark pink area) or in the non-coding UTRs.
5. How many exons are there, and what is the size of the genomic piece under
the transcript? This links to the sequence, with lower case for the UTRs and
upper case for the coding region; exons are indicated in alternating colors,
and introns in black lower case.
6. Then about the derived protein: how many amino acids does it contain? What
are the molecular weight and pI (provided by ExPASy)?
Note the influence here of the choice of the initiator Met, and remember that
the proteins we annotate are deduced from the mRNA sequence, not observed
directly. Translation usually starts at an ATG (Met), but at least three other
codons (GTG, TTG, or CTG) are candidates to be used as initiators in most
species (see the codon usage table maintained by the Taxonomy group at NCBI).
Confronted with the choice of annotating a protein possibly too long or
possibly too short, we decided to accept any of the possible initiator codons
as Start: we simply pick whichever codon gives us the longest predicted
protein. Using this simple rule with no further constraint, we find that 3.3%
of the proteins are annotated starting from one of these “rare” sites, not
from ATG.
7. Does the protein contain reliable type A Pfam motifs?
8. Does it contain motifs searched by PSORT II (see details below)?
9. What is its predicted cellular localization?
10. Finally, are there hits by BlastP at 10^-3, and if so, how far back in
evolution can we trace the protein?
Primers and temperature
conditions to amplify the CDS are
calculated by Osp (Hillier
L, Green P. PCR Methods Appl. 1991
Nov;1(2):124-8). Usually, an annealing
temperature 2 or 3 degrees Celsius below the lowest T indicated gives good
results.
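As a small worked illustration of the rule of thumb above, the sketch below subtracts an offset from the lowest temperature reported for the primer pair. The function name and the default offset of 2.5 degrees (the midpoint of "2 or 3") are our own illustrative choices, not part of Osp.

```python
def suggested_annealing_temp(primer_temps_celsius, offset=2.5):
    """Return a starting annealing temperature, a few degrees Celsius
    below the lowest temperature indicated for the primers.

    The offset default is an assumption (midpoint of the 2 to 3 degree
    range quoted in the text); adjust it to your own conditions.
    """
    if not primer_temps_celsius:
        raise ValueError("at least one primer temperature is required")
    return min(primer_temps_celsius) - offset


# Hypothetical primer pair with indicated temperatures of 58.0 and 61.5
print(suggested_annealing_temp([58.0, 61.5]))
```

This only gives a starting point; the best annealing temperature still has to be confirmed experimentally.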
Predicted cellular localization and motifs (Psort) |
·
PSORT II by Kenta Nakai gives the a priori probability for a protein to be
found in the various subcellular compartments. The localization cannot be
reliably assessed if the protein is incomplete, because it may be missing
dominant signals that would influence its localization, for example a signal
peptide. We therefore limit the main localization annotation to complete
proteins. In the mRNA summary, or when generating a title for a variant, we
require the Psort probability of the most likely localization to be above 50%
for all compartments except the nucleus, for which we demand above 60%. In the
text, however, we report the a priori probabilities as they come out of the
program, even for partial proteins.
The
PSORT II principles are explained in this very useful document and cited papers
therein. PSORT II is based on physical properties of the protein and on the
recognition of addressing and other motifs; inferences on the subcellular
localization are derived by training the program on a set of 1,531 yeast
proteins whose localization is known (thanks to YPD/Proteome).
Validation:
This problem is difficult, although extremely important. In C. elegans, with our thresholds and
conditions, between 75 and 85% of the proteins whose localization is known are
predicted correctly by PSORT II. Large proteins, in particular from the
cytoskeleton or near the cell periphery, are often predicted to be nuclear.
A PSORT
predicted localization example:
PSORT II analysis (K. Nakai http://psort.nibb.ac.jp),
trained on yeast data, predicts that the subcellular location of this protein
is most likely in the Golgi (33%) or in the endoplasmic reticulum (33%). Less
likely possibilities are in the plasma membrane (22%) or secreted (11%).
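The thresholds described above can be sketched as a small decision function. This is only an illustration of the stated rule (complete protein, above 50%, or above 60% for the nucleus); the function name, the dictionary layout and the compartment labels are assumptions, not the AceView code.

```python
def main_localization(probabilities, protein_is_complete,
                      default_threshold=0.50, nucleus_threshold=0.60):
    """Return the compartment to report as the main localization, or None.

    probabilities: dict mapping a compartment name to its PSORT II
    a priori probability (0 to 1). The layout is hypothetical.
    """
    if not protein_is_complete or not probabilities:
        return None  # partial proteins get no main localization
    compartment, prob = max(probabilities.items(), key=lambda kv: kv[1])
    threshold = nucleus_threshold if compartment == "nucleus" else default_threshold
    return compartment if prob > threshold else None


# The example above: no compartment exceeds 50%, so nothing is promoted
example = {"Golgi": 0.33, "endoplasmic reticulum": 0.33,
           "plasma membrane": 0.22, "secreted": 0.11}
print(main_localization(example, protein_is_complete=True))
```

With the example probabilities, the most likely compartments reach only 33%, so the probabilities would appear in the text but no localization would be used in the summary or the title.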
·
PSORT II also
provides the coordinates and the sequences of the motifs recognized, as
calculated by a variety of programs, some written or modified by Nakai’s team,
some made available by others. These usually short motifs include signal
peptides, transmembrane domains, coiled-coils, nuclear localization domains,
endoplasmic reticulum retention signals, and N-myristoylation and prenylation
domains. The sequences are displayed in the table, and they are shown as wide
red bars in the graphic.
It may be useful to access
all proteins sharing a given motif, and we provide this function by clicking on
the name of the motif in the table. The result table has the usual structure
for a gene list and allows iterative refinements by query. The list is ordered
by map and position on the chromosomes; it provides the gene title, often with
phenotypic or functional indications, and an idea of the expression level,
through the total number of clones coming from that gene (according to
AceView).
A PSORT domain table example:
From aa | Domain | Sequence |
1 to 17 | N_terminal_Signal_domain | MSLSFLLLLFFSHLILS |
234 to 240 | Possible nuclear localization | PEKKKPP |
236 to 239 | Possible nuclear localization | KKKP |
264 to 267 | ER_membrane_domain | KFRF |
Protein family classification (Pfam) |
·
This paragraph
reports attempts to classify the protein as a member of a previously recognized
and described protein family. We use Pfam and HMMER, as detailed below. If the
protein convincingly belongs to a family, clicking on the family name brings
the more complete description at the Sanger Institute Pfam site,
often with the 3D structure of a member. The AceView genes that belong to the
same family are enumerated and listed a click away.
·
Details:
Proteins are classified into protein families, known as the Pfam A
protein family motifs, defined by a
very active international collaboration (Bateman A, Birney E, Cerruti L, Durbin R, Etwiller
L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL. Nucleic Acids Res. 2002 Jan
1;30(1):276-80).
They are searched using the HMMER program provided by Sean Eddy. We keep only
the highly significant hits according to the Pfam cut-offs. We systematically
download and use the latest release of the Pfam motifs, about 2 weeks before
the release date. Version 2.8g of the program is indeed slow and greedy in
computer resources, but in our tests it performs significantly better than the
much faster CD search at NCBI. More recently, we tested HMMER version 3, and
it is considerably faster. The Pfam motifs used for the worm and Arabidopsis
are from version 8.0, downloaded May 2003; for human December 2003 (build 34),
they are from version 10.0, downloaded September 24th, 2003.
·
The Pfam
motifs are displayed as orange boxes; clicking on the box, or on the number
given in the text, produces the list of all other genes in that species that
belong to the same family, in the standard format for genes tables, which
allows further refinement of the query.
·
As a bonus of
Pfam, and thanks to the work of the EBI Apweiler team (
Protein homologies (NCBI BlastP) |
·
The protein is decorated with NCBI BlastP homologies against the nr database.
We only consider hits that would arise by chance at less than 1 per thousand
(E < 10^-3). We report the total number of hits in the text but limit the
display and the “Table of BlastP hits” to the best 30 hits. The results are
described in the text under “Protein homologies” and are shown as blue
overlapping boxes in the graphical display.
·
Clicking on the blue box on the graph or in the text under BlastP brings up
the BlastP table of results, calculated a few days before the release (so it
may be up to 3-4 months old).
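The selection described above (keep hits with E < 10^-3, report their total, display at most the best 30) can be sketched as follows. The tuple layout and the choice of ranking "best" by lowest E-value are assumptions made for illustration; the text does not say how the hits are ranked.

```python
def select_blastp_hits(hits, max_evalue=1e-3, max_displayed=30):
    """Return (total number of significant hits, hits to display).

    hits: list of (accession, evalue) tuples; this layout is hypothetical.
    Hits are ranked by increasing E-value (an assumption).
    """
    significant = [h for h in hits if h[1] < max_evalue]
    significant.sort(key=lambda h: h[1])
    return len(significant), significant[:max_displayed]


hits = [("A", 1e-50), ("B", 0.5), ("C", 1e-4), ("D", 1e-3)]
total, displayed = select_blastp_hits(hits)
print(total, displayed)
```

Note that a hit with E exactly equal to the threshold is rejected here, since the text asks for E strictly below 10^-3.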
Lineage and closest homologues (abbreviated as Taxonomy in the Table of Contents) |
BlastP at NCBI comes with a very useful associated “Taxonomy report”, also
known as TaxBlast. From this document, we extract the number of hits in each
of a number of selected branches of the tree. The entire tree is represented,
but some branches are purposely merged (for example as “other amniota”). The
accession of the closest homolog in the most studied species is given,
preferring the RefSeq (NM or XM) to GenBank, SwissProt or PIR whenever there
is a choice, because we then gain for free a connection to the nicely
annotated NCBI LocusLink/Gene database.
The data are represented as a (hopefully self-explanatory) tree.
As always, some caution in the interpretation is recommended, especially when
the number of hits in a large branch is small, as in the example below.
Example of a possibly ancient gene
(nematode spd-5 spindle defective gene).
Based on Taxblast, homologues to this gene are found in
the following organism(s):
Archaea 2 hits.
Bacteria 3 hits.
--Other Bacteria 3 hits.
Eukaryota 38 hits.
--Mycetozoa 1 hit.
----Dictyostelium discoideum 1 hit, best hit: AAO52027.1.
--Fungi Metazoa group 35 hits.
----Pseudocoelomata 7 hits.
------Caenorhabditis elegans 7 hits, best hit: NP_491539.1.
----Deuterostomia 28 hits.
------Amniota 28 hits.
--------Mus musculus 6 hits, best hit: NP_032507.1.
--------Rattus norvegicus 8 hits, best hit: XP_230851.1.
--------Homo sapiens 11 hits, best hit: P24043.
--------Other Amniota 3 hits.
--Other Eukaryota 2 hits.
Note that it was not found in E. coli or the most studied bacteria, and only
2 archaea are positive at 10^-3. If I were interested in this gene, I would
make sure that the few bacterial hits are not contaminations.
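For readers who want to process such a tree, here is a minimal sketch that parses the text layout of the example above (two dashes per level) and checks that the hits of each branch equal the sum of its direct sub-branches. The parsing rules are inferred from this single example and are an assumption, not a description of the TaxBlast format.

```python
import re

REPORT = """\
Archaea 2 hits.
Bacteria 3 hits.
--Other Bacteria 3 hits.
Eukaryota 38 hits.
--Mycetozoa 1 hit.
----Dictyostelium discoideum 1 hit, best hit: AAO52027.1.
--Fungi Metazoa group 35 hits.
----Pseudocoelomata 7 hits.
------Caenorhabditis elegans 7 hits, best hit: NP_491539.1.
----Deuterostomia 28 hits.
------Amniota 28 hits.
--------Mus musculus 6 hits, best hit: NP_032507.1.
--------Rattus norvegicus 8 hits, best hit: XP_230851.1.
--------Homo sapiens 11 hits, best hit: P24043.
--------Other Amniota 3 hits.
--Other Eukaryota 2 hits.
"""

# leading dashes, taxon name, number of hits
LINE = re.compile(r"^(-*)(.+?) (\d+) hits?\b")


def parse_taxblast(text):
    """Return a list of (depth, taxon, hits), one entry per report line."""
    rows = []
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if m:
            dashes, name, hits = m.groups()
            rows.append((len(dashes) // 2, name.strip(), int(hits)))
    return rows


def inconsistent_nodes(rows):
    """Taxa whose hit count differs from the sum of their direct children."""
    bad = []
    for i, (depth, name, hits) in enumerate(rows):
        child_sum, has_children = 0, False
        for d, _, h in rows[i + 1:]:
            if d <= depth:
                break  # left the subtree of this taxon
            if d == depth + 1:
                child_sum += h
                has_children = True
        if has_children and child_sum != hits:
            bad.append(name)
    return bad


rows = parse_taxblast(REPORT)
print(inconsistent_nodes(rows))
```

In this example every branch is consistent: for instance the 38 eukaryotic hits are the sum of 1 (Mycetozoa), 35 (Fungi Metazoa group) and 2 (Other Eukaryota).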
The mRNA structure paragraph gives information on the 5’UTR, the splicing
pattern, and the 3’UTR, including a report on the polyadenylation signal. The
various sequences are a click away. Tell us if something needs documentation.
Example: mRNA Structure ?
The sequence of the 5kb upstream of the transcript is here.
The 5'UTR contains about 156 bp. The
CDS is complete since there is an in frame stop in the
5'UTR 27 bp before the Met.
Splicing: Comparison to the genome sequence shows that the 24/24 introns
follow the consensus [gt-ag] rule. Coordinates of the introns in the template
genomic DNA, where 1 denotes the first genomic base matching the RNA, are:
[ type ] | start | end | length |
[ gt-ag ] | 409 | 118864 | 118456 bp |
The
3'UTR contains about 1613 bp
followed by the polyA. The standard AATAAA polyadenylation
signal is seen about 30 bp before the polyA. This 3'UTR 1614 bp is among
the 5% longest we have seen, it may serve a regulatory
function. It contains 23% A, 30% T, 23% G, 21% C.
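The numbers in the example above can be checked with a short sketch. The intron row is consistent with 1-based coordinates that include both end points, so the length is end - start + 1; the base composition is a simple count. Both function names are ours, chosen for illustration.

```python
def intron_length(start, end):
    """Length in bp of an intron given 1-based, inclusive coordinates."""
    return end - start + 1


def base_composition(seq):
    """Percentage of A, T, G and C in a sequence, rounded to integers."""
    seq = seq.upper()
    return {base: round(100 * seq.count(base) / len(seq)) for base in "ATGC"}


# The intron row of the example table: start 409, end 118864
print(intron_length(409, 118864))
# A toy sequence, not the 3'UTR of the example
print(base_composition("AATTGGCC"))
```

Because each percentage is rounded independently, the four values need not add up to exactly 100.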
Graphical representation of the AceView-derived mRNA and protein |
Each
mRNA and protein variant is depicted in its page, decorated with BlastP
homologies, Pfam and Psort motifs, Stops and Met(AUG) in
the three frames, and up to 60 supporting mRNAs and ESTs. Special sequences are
highlighted in this diagram: the .AM AceView reference sequence in pale yellow
and the NCBI RefSeq in light turquoise. For each sequence aligned in the mRNA
variant, differences from the genome sequence are color-coded, polyA or
unaligned sequences are indicated with specific icons and cDNA clone anomalies
are labeled.
For
the worm, we indicate the stage and/or tissue and the presence and type of
trans-spliced leader.
[Diagram: elements numbered 1 to 11 (5 and 6 together). Recoverable labels: Scale in bp; 1st 2nd frame; 3rd frame; from databases decorated with errors.]
1: Scale is in basepairs; base 1 is the first base of the transcript.
When there is a gap in the sequence and a clone with a 5’ and a 3’ read
bordering the gap, we estimate the size of the gap from the average insert
size of the clones in the libraries.
2: Pfam
homologies to known protein families are shown in their exact position according to the
HMMER program. Click on the orange box to get the list of all genes in this
species that contain significant matches to this Pfam family.
Click on the link in the
title above for more information on the procedure, and carefully read the text
associated to the diagram of your transcript, under Protein
family classification for more results, including the InterPro
description of the protein family and a link to the Pfam description.
3: Psort2 motifs are usually short addressing motifs: they are
depicted by wide but usually short red bars. Clicking on some, such as
transmembrane domains, coiled coils, myristoylation or prenylation domains, ER
retention signals, zinc fingers for example brings the list of all genes in the
database with a similar motif detected by the same program (indeed a collection
of programs assembled and tuned by Kenta Nakai, from
Click on the link in the
title above for more information on the Psort2 procedure, and carefully read
the text associated to the diagram of your transcript, under Predicted cellular localization and motifs (Psort)
for more results, in particular the important predicted localization paragraph
not summarized in the diagram, but which is actually the main aim and
originality of the Psort2 program.
4: BlastP/TaxBlast results are shown compressed in a single line, so that the
complete extent of all BlastP homologies in the nr protein database, with
cut-off 10^-3, is visible. Clicking on the blue box brings up a table of
BlastP hits, with up to the 20 best hits and, for each, all the
characteristics provided by Blast: coordinates, scores and links to the
GenBank records. This table was calculated a few days before the release
(so it may be up to 3 months old).
Click on either of the two
links in the title above for more information on the BlastP or TaxBlast
analyses, and carefully read the texts associated to the diagram of your
transcript, under Protein homologies and most of all under Lineage and closest homologues. The latter
diagram provides you with links to the closest homologues in other species.
Each mRNA/protein variant is graphically represented in tyrian pink as a
spliced mRNA. The position of each intron is indicated by a triangle along the
mRNA, a kind of scar of where the intron was. A well defined intron (by
AceView criteria) has at least one cDNA clone exactly matching the exons over
8 bp on both sides of the intron, and the type of intron feet is color coded
accordingly: pink if well defined and typical [gt-ag], [gc-ag], or [at-ac];
blue if atypical or not well defined (note that this is less precise than in
the gene on genome view, where introns that are not well defined are
represented by a straight line rather than a broken blue one). An alternative
intron is filled, and a common intron is open. Gaps are not well represented
in this view yet.
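The color rule for intron feet can be summarized in a few lines. This is a sketch of the rule as stated above, not the AceView drawing code; whether an intron is well defined is taken as an input, and the function name is hypothetical.

```python
# Typical intron feet (first two and last two bases of the intron)
TYPICAL_FEET = {("gt", "ag"), ("gc", "ag"), ("at", "ac")}


def intron_feet_color(intron_seq, well_defined):
    """Return 'pink' for a well defined intron with typical feet, else 'blue'.

    intron_seq: the genomic sequence of the intron, 5' to 3'.
    well_defined: True if at least one cDNA clone exactly matches the
    exons over 8 bp on both sides (to be established by the caller).
    """
    feet = (intron_seq[:2].lower(), intron_seq[-2:].lower())
    return "pink" if well_defined and feet in TYPICAL_FEET else "blue"


print(intron_feet_color("GTAAGTTTCAG", well_defined=True))   # typical gt-ag
print(intron_feet_color("CTAAGTTTCAG", well_defined=True))   # atypical ct-ag
```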
The protein is represented
by the wide pink area in the diagram, while the narrow areas correspond to the
UTRs.
The three reading frames are schematized, with Stops as black lines and
Met(AUG) as green lines: open reading frames (ORF) lie between two Stops, ORFs
above 80 amino acids are outlined as black rectangles, and coding sequences
(CDS) are classically considered to go from the first Met to the Stop in the
longest open reading frame.
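To make the notion of ORF used in the diagram concrete, here is a minimal sketch that scans the three frames of an mRNA and lists the stop-free stretches longer than 80 codons. It assumes the standard genetic code for the Stops and treats the ends of the sequence as boundaries, which the text does not specify; the function name and the coordinate convention (0-based, end excluded) are ours.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def open_reading_frames(mrna, min_aa=80):
    """Return (frame, start, end) for each stop-free stretch above min_aa codons.

    Coordinates are 0-based; 'end' is the position of the closing Stop,
    or the end of the last complete codon when the sequence ends first.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if (pos - orf_start) // 3 > min_aa:
                    orfs.append((frame, orf_start, pos))
                orf_start = pos + 3
        # stretch still open when the sequence ends
        last_codon_end = frame + 3 * ((len(seq) - frame) // 3)
        if (last_codon_end - orf_start) // 3 > min_aa:
            orfs.append((frame, orf_start, last_codon_end))
    return orfs


# Toy sequence: ATG, 100 alanine codons, then a Stop
toy = "ATG" + "GCT" * 100 + "TAA"
print(open_reading_frames(toy))
```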
Choice of the initiator Met: When working with mRNA and genome sequences, we
never manipulate real protein sequences: all of the proteins we annotate are
predicted from the mRNA sequence. To avoid biasing the analysis, we have
decided to use the codon usage table maintained by the Taxonomy group at NCBI.
There we read that three codons are candidates to be used as initiators in
most species, and although ATG is probably used more often than TTG or CTG, we
take whichever of these three codons gives us the longest predicted protein.
Only actual protein sequencing can say definitively which was the actual
initiator, and apart from detecting NH2-terminal signals such as signal
peptides, it does not hurt much to annotate a protein that may be too long at
the NH2 terminus.
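The rule "take whichever candidate initiator gives the longest predicted protein" can be sketched as below. The candidate set defaults to the three codons named in this paragraph (ATG, TTG, CTG) and is a parameter; the Stops are those of the standard genetic code, and only coding sequences closed by a Stop are considered. These choices, and the function name, are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def longest_cds(mrna, initiators=("ATG", "TTG", "CTG")):
    """Return (length_aa, start, stop, initiator_codon) of the longest CDS.

    In each frame, a CDS runs from the first candidate initiator after a
    Stop (or after the start of the sequence) up to the next Stop.
    Coordinates are 0-based; returns None if no CDS is found.
    """
    seq = mrna.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOPS:
                if start is not None:
                    length = (pos - start) // 3
                    if best is None or length > best[0]:
                        best = (length, start, pos, seq[start:start + 3])
                start = None
            elif start is None and codon in initiators:
                start = pos
    return best


toy = "CTGAAAATGCCCTAA"
print(longest_cds(toy))                        # starts at the upstream CTG
print(longest_cds(toy, initiators=("ATG",)))   # ATG only: shorter protein
```

On the toy sequence, allowing CTG extends the predicted protein by two residues at the NH2 terminus, which is the kind of difference discussed above.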
Aligned supporting clones
The alignments of the supporting mRNAs and ESTs are represented on the right
of the diagram, with a down arrow for 5’ reads and an up arrow for 3’ reads,
and with color-coded indications of differences from the genome sequence: red
for a single basepair difference (transition or transversion), blue for a
single-base insertion or deletion, and green for an uncalled base (n). Too
many errors may merge as brown. Anomalies are also labeled by red dots (minor)
or big squares (major) at the bottom of the clones.
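The color code for differences can be written as a small lookup on one aligned position. The sketch assumes an alignment in which a gap is written as "-"; that representation and the function name are ours.

```python
def difference_color(genome_base, cdna_base):
    """Return the color marking one aligned position, or None if identical.

    A gap on either side is written '-' (an assumed convention).
    """
    g, c = genome_base.lower(), cdna_base.lower()
    if c == "n":
        return "green"   # uncalled base in the cDNA
    if g == "-" or c == "-":
        return "blue"    # single-base insertion or deletion
    if g != c:
        return "red"     # substitution (transition or transversion)
    return None


print(difference_color("a", "g"), difference_color("a", "-"),
      difference_color("a", "n"))
```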
Warning: there are a number of cases where a single base deletion or
insertion in the genome has led to a frameshift breaking the open reading
frame. These cases are easy to see graphically: a line of blue errors in the
cDNAs coincides with a shift in the black/green open reading frames. Users
should then be aware that there is a problem, and they had better reannotate
the transcript themselves, preferring the cDNA consensus sequence .AM that we
provide in the Fasta sequences page.
In addition, in the worm we indicate, along the aligned cDNAs, extra data we
have about stage or tissue and trans-spliced leader. The stage is shown below
the clone: e, embryo; 1, L1 larva; 2, L2 larva; 4, L4 larva; m, mixed stages
enriched in larvae and adults. Similarly for the tissue: o, ovary; sp, sperm
enriched; cd, cadmium induced. The trans-spliced leader is indicated on top of
the clone: 1, SL1; 2, SL2, …, 12, SL12. A suffix ’ or m is added to low
frequency variants of the standard SL1 to SL12 that we consider to be mutant
SLs rather than new SL types, because the corresponding genes cannot be found
in the genome sequence (e.g., SL1’).