We are pleased to present, at
long last, a new release of the human genes, built in AceView on the current
genome (NCBI 36.2/ UCSC hg18), using
Since the
last release, we were happy to observe that AceView transcripts, in the ENCODE
regions, are extremely similar to the Havana/Vega hand curated transcripts that
were selected as the Gold standard by NHGRI. The other methods participating in
the Gencode project see at best one third of the transcripts seen by
The Pfam motifs and associated
InterPro and GO descriptions used to annotate the proteins are version 20.0 of Pfam, downloaded from
-
Promoter sequence: To facilitate the search for promoter elements, we
currently provide the 5 kb sequence upstream of each mRNA. We noticed that Xie
et al, 2005 “define each promoter region as the non-coding sequence
contained within a 4-kilobase (kb) window centered at the annotated
transcriptional start site (TSS)”. Please tell
us if you would prefer this 4 kb sequence instead of the 5 kb upstream.
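To make the two conventions concrete, here is a minimal sketch of the coordinates each one would select around a transcription start site. The function names, the 1-based inclusive coordinates and the clamping at position 1 are assumptions made for illustration; they are not taken from the AceView code. The sketch also does not remove coding sequence from the centered window, which the definition of Xie et al. would require.

```python
def upstream_window(tss, strand, length=5000):
    """Coordinates (start, end), 1-based inclusive, of the `length` bases
    immediately upstream of the TSS, on the genome's direct-strand axis."""
    if strand == "+":
        return max(1, tss - length), tss - 1
    # On the reverse strand, "upstream" lies at higher genomic coordinates.
    return tss + 1, tss + length


def centered_window(tss, strand, width=4000):
    """Coordinates (start, end) of a `width`-base window centered at the TSS.
    Half of the window is upstream; the other half starts at the TSS itself."""
    half = width // 2
    if strand == "+":
        return max(1, tss - half), tss + half - 1
    return max(1, tss - half + 1), tss + half


if __name__ == "__main__":
    print(upstream_window(10000, "+"))   # 5 kb upstream of a direct-strand TSS
    print(centered_window(10000, "+"))   # 4 kb window centered at the same TSS
```

The centered window covers only 2 kb of upstream sequence but also includes the first 2 kb downstream of the TSS, which is the practical difference between the two choices.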
Statistics for this release (build 36/January 2007):
The Pfam motifs and InterPro
descriptions used to annotate the proteins are version 17.0 of Pfam,
downloaded from
Statistics for this release
(build 35/August 2005):
To
reconstruct the AceView mRNAs, we use 91% of the GenBank mRNA (174,129) and 76%
of the dbEST human data (4,325,853), a 5% increase relative to the Nov04
version. We also use 23,895 RefSeq.
Quality of the alignments remains excellent.
Between 1.1% and 2.2% of the sequences map ambiguously, in about 1500 fully or
partly repeated genes.
The
4.5 million cDNA sequences aligned on the genome define 312,932
different standard introns, all well supported (i.e. a cDNA sequence exactly
matches the genome over 16 bp, equally split on both sides of the intron
boundary). 97.7% are gt-ag, 2.2% gc-ag (a bit high?) and 0.08% at-ac (265 cases). An
additional 4485 introns are anomalous, and most probably correspond to defects
in the cDNA clones (or the genome).
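The intron classes and the support criterion described above can be sketched as follows. This is an illustrative reading of the text, not the AceView implementation: the function names are hypothetical, the intron is assumed to be given as its genomic sequence on the strand of the gene, and "16 bp, equally split" is taken to mean 8 bp on each side of the junction.

```python
# Boundary dinucleotides (first two, last two bases) of the standard classes.
STANDARD_BOUNDARIES = {
    ("GT", "AG"): "gt-ag",
    ("GC", "AG"): "gc-ag",
    ("AT", "AC"): "at-ac",
}


def classify_intron(intron_seq):
    """Return 'gt-ag', 'gc-ag', 'at-ac' or 'non-standard' for an intron sequence."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have two boundary dinucleotides")
    return STANDARD_BOUNDARIES.get((seq[:2], seq[-2:]), "non-standard")


def is_well_supported(cdna, junction, upstream_exon, downstream_exon, flank=8):
    """True if the cDNA matches the genomic exons exactly over `flank` bases
    on each side of the exon-exon junction (index `junction` in the cDNA)."""
    if junction < flank or junction + flank > len(cdna):
        return False
    if len(upstream_exon) < flank or len(downstream_exon) < flank:
        return False
    left = cdna[junction - flank:junction].upper()
    right = cdna[junction:junction + flank].upper()
    return (left == upstream_exon[-flank:].upper()
            and right == downstream_exon[:flank].upper())
```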
We
currently annotate 65% of the introns as alternative, but this is an upper
bound because we over-count a little, due to a newly introduced bug.
Introns
are easy to define, and very reliable markers of expression and of the splicing
and alternative splicing patterns. Long primers sitting across exon boundaries
could be excellent tools for microarray building, or RT-PCR experiments. We
plan to add such sequences, possibly 80
or 100 bp (40/50 bp on each side?), taken on the reconstructed mRNA model, and
to count the support and alternative nature of the primer. We are still
thinking about how to present these data, in each gene or mRNA model as well as
in bulk, so that they are useful to our users. If you have comments or ideas, please
tell us.
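As a sketch of what such junction-spanning sequences could look like, the snippet below cuts a window of 40 bp on each side of every exon-exon boundary of a reconstructed mRNA. The function name and the representation of the model as an mRNA string plus a list of exon lengths are assumptions for illustration only; junctions closer than `flank` bases to either end of the mRNA are skipped, and a window may run into a neighbouring exon when an exon is shorter than `flank`.

```python
from itertools import accumulate


def junction_probes(mrna, exon_lengths, flank=40):
    """Return a list of (junction_position, sequence) pairs, one per
    exon-exon boundary, where sequence spans `flank` bases on each side."""
    if sum(exon_lengths) != len(mrna):
        raise ValueError("exon lengths do not add up to the mRNA length")
    # Cumulative exon ends, excluding the 3' end of the mRNA itself.
    boundaries = list(accumulate(exon_lengths))[:-1]
    probes = []
    for pos in boundaries:
        if pos >= flank and pos + flank <= len(mrna):
            probes.append((pos, mrna[pos - flank:pos + flank]))
    return probes
```

Calling `junction_probes(mrna, exon_lengths, flank=50)` would give the 100 bp variant mentioned above.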
Using
this simple measure of standard intron counts, we may ask what is the added
value of each type of data:
-
The NM and NR RefSeqs see a
total of 171,864 standard introns (and 442 non-standard), or 55% of the
total number of introns in the human cDNAs from the public databases.
Therefore, despite the small size of the RefSeq database, they already touch
more than half of the known standard introns. Their pattern is: 99.07% gt-ag,
0.80% gc-ag, 0.11% U12 at-ac (182 cases).
-
If we
consider only the 174,129 GenBank mRNAs, they define a total of 226,014
standard introns, 72% of the total (98.7% gt-ag, 1.17% gc-ag, 0.10% at-ac (228 cases)), in addition to 1623
non-standard introns.
About the genes:
AceView clusters the 4.524 million reads into 56,491 genes
with standard introns or well-coding (in addition to 40,567 “putative” genes,
more partial or dubious, and 251,183 “cloud” genes, not supported by many
clones, unspliced and not clearly coding, which may represent intermediates in
the transcription process).
-
35,413
genes have fully supported standard introns. Of these, 16,497 (47%) have at least one RefSeq, and 22,280
encode putative proteins above 100 amino acids; the
remaining 13,133 may encode short proteins, be non-coding, or be partial.
-
22,469
genes supported by mRNAs or ESTs do not have introns, yet encode a putative CDS
of more than 100 amino acids (1005 are
above 300 amino acids). These intronless genes are possibly functional in the
proteome. 1297 of these genes (5.8%) are represented in RefSeq (Aug 11, 05).
-
If we consider
the potential for protein-coding of all genes, with and without introns, we
find that 44,749
genes in this release encode a putative CDS of more than 100 aa, of which 14,152 encode CDS above 300 residues.
-
But
alternative splicing generates on average about 5 variants and 4 protein
isoforms per gene with introns, so the variety of proteins expected to be
present in the cells is much greater: AceView annotates 133,744 different protein isoforms
of more than 100 amino acids, of which 35,536 are above 300 amino acids. 69% of the AceView CDS, both above 100 aa and above
300 aa, are fully encoded within a single identified
clone in GenBank/dbEST (91,738 different proteins > 100 aa and
24,367 distinct proteins > 300 aa have the coding
part fully covered by a known cDNA), bringing the strictly non-dubious protein
isoforms to 3.1 per gene with introns on average. The remaining 31% of AceView
proteins could still result from an inappropriate concatenation of two clones
(26%) or more (5%), because, if no conflicts arise, we merge partial mRNAs.
-
6% of the
proteins of more than 300 amino acids are encoded by more than one alternative
mRNA variant (38,013 mRNA variants encode only 35,536 different proteins).
This is true for 9% of the proteins above 100 aa
(185,073 mRNAs give only 168,366 distinct proteins). The redundancy is low because the vast
majority (91 to 94%) of the alternative introns, alternative promoters or last
exons affect the coding region.
-
Using
stringent criteria to define antisense genes (part of our
standard annotation, under “regulation”), we identify 10,960 genes with
standard introns antisense to another gene with standard introns, i.e. 31% of all genes with standard
introns might undergo this type of
flip-flop regulation. Also, the function of up to one third of the non-coding
genes with introns could be to regulate the gene on the other strand by
antisense.
Elementary alignments | RefSeq | mRNA | EST
Total available in public DB (Apr/Aug 05) | 23,973 | 191,058 | 5,699,664
Number aligned (%) | 23,895 (99.67%) | 174,129 (91.14%) | 4,325,853 (75.9%)
Alignment quality: total length aligned | 67.927 Mb | 343.198 Mb |
Alignment quality: # diff from genome | 66,529 | 748,102 |
Ave. bp aligned per sequence (%) | 2830 bp (99.9%) | 1929 bp (99.3%) |
% bp diff relative to genome | 0.10% | 0.22% |
Multi-aligned sequences (%) | 227 (0.95%) | 2466 (1.42%) | 94543 (2.19%)
In # (repeated) genes | 364 | 1712 | 19807
April 10:
We updated
the worm version to the current freeze of the genome (known as WS140).
March 23:
We noticed to our dismay that the
lists of genes or clones with specific properties were no longer accessible
from our GOLD work, and
re-established the links from the supplementary material. We do not know how long
the outage lasted, but are very sorry for it.
January
18: We fixed the PFAM and AceView
mRNA Blast searches, which were not properly connected to the current InterPro
definition files for PFAM, and to the current AceView mRNA files for tBlastn.
January 16: On the download site, we added a file providing the relation
between all the aligned ESTs/mRNAs, the genes, the locusIds
and the alternative variants, with indication of quality of match, tissue
of origin, and type of AceView gene, for all genes (main, putative and cloud). Help on the
format of the accession to gene table has been updated, as five more data
columns have been added to enrich this table. Meanwhile, we got rid
of the 0.5% worst quality alignments (30,403
alignments out of 6,428,329) by removing the ESTs or mRNAs aligned at AceView
quality 10 or 11 (see the help). Two files thus became redundant and were
removed: “The way all the ESTs/mRNAs are aligned in each gene, for all non-cloud genes” (55.3
Mb) and the “Same file as above, but for cloud
genes” (5.7 MB). The link to download the GOLD alignments was broken and has been
restored.
-
In nematode,
we are manually editing all the genes to complement and enhance the
WormBase view. When editing manually, it is easy to distinguish a “complex locus” from
“concatenated genes touching through their 3’UTR/5’UTR”. Complex
loci were initially defined on the basis of complementation tests in
phage or Drosophila. In such a gene, two complete alternative variants may not
share a base, although a third variant “bridges” them and overlaps both. By
convention, we denote this property by adding the letter C for complex behind the name (e.g., 1C777C) or by a
+ sign
if the gene had acquired two independent official gene names (e.g., mai-1+gpd-2+gpd-3).
This convention follows that chosen by Ed Lewis for the Drosophila complex
genes (e.g. BXC for bithorax complex locus, of which bx, Ubx, Cbx
and pbx for example are alleles). In the case of
nearby genes expressed from the same strand but with overlap between the 3’ end
of the first gene and the 5’ end of the second, we use the suffix Co (for sequence in COmmon)
appended after the gene name (e.g., cul-1Co or 5K225Co). We may also use the
sign AND if both genes have a name (e.g., mev-1ANDced-9).
The Pfam motifs used to annotate the proteins from the human build 35/July
2004 are version 14.0 of Pfam, downloaded from
Statistics for this release (build 35/July 2004):
The
percentage of mRNAs and ESTs that we align increased only by 0.55% and 0.64%
respectively, because the genome sequence gained in quality more than in
quantity over the past year, from build 34 to 35. Yet, thanks to the increase in mRNA and EST sequences
submitted to the public databases, we now identify a total of 96,797
well supported transcribed genes, a net increase of 11,839 relative to the
release of July 2003 (on build 34). In particular, we gained 1,248 genes with
standard introns, bringing the total number of independent genes with
standard introns to 32,748.
On the current genome sequence, we reliably map 4,297,980 cDNA sequences present in GenBank August 2,
2004, a net increase of 269,318 mRNA or EST sequences from last year. That
includes 4,131,646 ESTs (73.37% of all ESTs in GenBank/dbest) and 166,334 mRNAs from GenBank (94.96%). In addition, this build includes alignments
for 21,565 NCBI RefSeq: the increase of
1,622 RefSeq mRNA mostly corresponds to “creations” by the NCBI RefSeq group, which generates
this useful resource.
Although we do not mask for repeats, only 1.1% of the mRNAs and 2.2% of the ESTs are ambiguous and
match the genome with indistinguishable quality in more than one place. The
percentages of clones did not change since last time, but AceView has greatly
improved in its ability to identify truly repeated genes. To make this more
visible graphically, we now draw the clones aligning in multiple genes at the
same quality in blue (rather than black), but keep the same color code for
point variations relative to the genome: a thin red line for a single base
change, transition or transversion, blue line for a single base insertion or
deletion. (from build 34, to be updated: As for the current quality of the
alignments, mRNAs on average measure 2,132 bp and align over 98.8% of their
length with 99.74% accuracy. ESTs, especially prone to sequencing errors near
the end of the read, on average measure 532 bp, and align over 93.3% of their
length with 98.16% accuracy.)
AceView clusters the 4.30 million reads into a total of
96,797 genes (and 248,728 “cloud” genes, not supported by many clones,
unspliced and not clearly coding, which may represent intermediates in the
transcription process). If we push in the direction of the current trend for
low numbers of genes, there are a total of 46,220 genes
in this release that either encode a putative CDS of more than 100 aa or are spliced with standard introns. 32,618 genes
encode a putative CDS of more than 100 aminoacids and 32,748 genes have at
least one validated gt-ag or gc-ag spliced intron.
Using stringent criteria
to define antisense genes, we find
that 8,572 genes with standard introns in this release are antisense to another
gene with standard introns, i.e. 26% of all genes with standard introns might
undergo this type of flip-flop regulation. Also, the function of up to one
third of the non-coding genes with introns could be to regulate the gene on the
other strand by antisense. Finally, 13,472 genes have no standard introns, yet
they encode a putative CDS of more than 100 amino-acids (346 are above 300 aa), hence are possibly functional intronless genes.
The genes with introns
have on average 5.37 alternative variants per gene, of which 4.51 have introns.
We annotate two proteins
in 3.2% of the AceView mRNAs (7,607/239,991), but do not for now make public
the uORF annotations, so as not to scare the users
away!
In this AceView release,
69% of the CDS >100 aa are fully supported by a
single identified clone covering the entire CDS (64,656 CDS from 30,110 genes
encoding > 100 aa), bringing the strictly
non-dubious protein isoforms to 2.8 per gene with introns on average. The
remaining 31% AceView proteins could still result from an inappropriate
concatenation of two clones (26%) or more (5%, down from 8% last time!), because,
if no conflicts arise, we merge partial mRNAs, hoping that more of you will
sequence the entire insert of clones that we have to concatenate now, by lack
of more complete data...
June 11, 2004: A file describing the association of each mRNA/EST
accession satisfactorily aligned in AceView build 34 to the corresponding
AceView gene is now available from the downloads page.
January 12,
2004: We released our new AceView genes,
built over the human genome sequence from July 2003 (build 34/golden path
hg16).
The Pfam motifs used for human December 2003 (build 34) are from
version 10.0 downloaded September 24th, 2003.
We have improved the
alignments, the clustering, the analysis and the presentation of the results,
in part following reports or requests from our users, be they thanked
wholeheartedly for their comments!
-
We have given unique
names to the mRNA models, so that there will be no confusion from one
release to the next: for this release, Dec03 was appended to each mRNA name. It
should be expected that variants may change sequence from one release to the
next: variant aDec03 will possibly differ from variant aMar04. Indeed, let us
recall that the gene names are tracked in AceView, but the variant names are
release dependent.
-
In this
release, we have allowed multiple proteins per mRNA, although we limited this to cases where
o The two proteins do not overlap and have similar
sizes; this may reflect an actual operon-like
precursor (we have noticed a number of confirmed ones even in human), or a bug
of AceView missing a 3’ end or a 5’ end and concatenating cDNAs that only exist
separately
o the two proteins are only marginally overlapping
(possible reasons for this could be: the genome has a single base deletion or
insertion relative to the mRNA creating a frameshift, or more rarely there is a
translational frameshift, or else there is a bug in the chaining of cDNA clones
by AceView, and a real 3’ or 5’ end was missed)
o the two proteins overlap, but they have about equal
length so we cannot decide which frame is the biological one
o In all cases, looking at the “Annotated mRNA”
usually lets you “see” whether one or both proteins are real, since both are
annotated on the graphical view (the text explanation in the annotated page
has not yet been modified to treat this case clearly, and only one protein, the
“best”, is annotated there).
-
Sequences
of the mRNAs (derived from the genome or from the cDNA consensus .AM), 5’UTR, 3’UTR,
coding regions (nucleotide and protein sequences), introns and exons and
upstream genomic area (some containing the promoter) are now available by
clicking from the corresponding tables (mRNA, Proteins, Introns and exons). We
have also added sequences containing the putative promoters.
Most sequences are now available from a natural place: the “mRNAs” table, the “Proteins” table or the “introns and exons” table. We were
also asked to provide pairs of primers to amplify each and every exon, and
that will come in the next release.
-
The GO
terms, in the summary table in the
top of the gene page, are now comprehensively connected to all genes with the
same terms.
-
We have
changed the “Table of contents” in both its
look and the organization of the data, especially in the “Gene summary”
section. Please tell us if you
notice some information is missing.
-
We have fixed
the bug that used to split some genes vertically (hence the slightly reduced
number of spliced genes in this release) and a bug that in 90 genes was
creating introns with negative length (thanks to our nice users).
-
We have also
fixed a few bugs in the analysis: our counts of alternative promoters
were sometimes inaccurate; the length of sequence involved in an antisense was
cumulative over the alternative mRNAs; our table of “Main cDNA clones”, which
should allow you to get all clones required to fully support all variants, was
sometimes incomplete; a sentence about operons was there irrespective of the
data.
Statistics on Human build 34
(Sorry, we forgot to update this part until February 23.)
Thanks to the increase in mRNA and EST sequences
submitted to the public databases and to the improvement of the genome
sequence, we align on this new build 7% more mRNAs and ESTs in 547 more genes
with standard introns. On this genome sequence, we map 4,028,662 cDNA
sequences: 3,910,845 ESTs (73% of all ESTs in GenBank/dbEST,
a 7% increase relative to last time) and 117,817 mRNAs from the public
databases (94.4%), as well as 19,943 NCBI RefSeqs,
representing 99.8% of the current RefSeq collection.
Although we do not mask for repeats, only 1.0% of the
mRNAs and 2.2% of the ESTs are ambiguous and match the genome with
indistinguishable quality in more than one place. As for the current
quality of the alignments, mRNAs on
average measure 2,132 bp and align over 98.8% of their length with 99.74% accuracy. ESTs, especially
prone to sequencing errors near the end of the read, on average measure 532 bp,
and align over 93.3% of their length with 98.16% accuracy.
AceView clusters the 4.02
million reads into 31,500 genes with at least one validated gt-ag or gc-ag spliced intron and on average 5.36 alternative
variants per gene and 5.57 protein isoforms (because we accept a limited number of
mRNAs where we annotate two proteins). Another 14,199 genes do not have
confirmed introns, yet have open reading frames of more than 100 amino acids
(of which 596 are above 300 aa) and are hence likely
functional intronless genes. The remaining 223,435 genes are unspliced,
partial, or non-coding and will require further investigation. We have
so many genes in this build because we used to filter the last
category to keep only the genes that were long or supported by at least six clones. But
people were sometimes searching AceView to locate the best alignment of a clone,
and failing to find it because of the filter, so we removed the filter. We call this class of “genes” the
cloud, and indicate this in the gene title. Their biological relevance remains
to be established.
In this release, 98,629
(68%) of the spliced mRNAs are supported by a single identified clone covering
the entire CDS; the remaining could represent an inappropriate concatenation of
two clones (24%) or more (8%), because, if no conflicts arise, we merge partial
cDNAs.
We describe below the new releases
and a few new features for
- human (June 5th,
2003)
- the worm Caenorhabditis
elegans (June 23rd, 2003)
- Arabidopsis
(May 15th, 2003)
News on Human
Build 33 (June 5th), posted
September 20th
October 3: Sorry, we had to remove human build 31, from Nov 2002, because of
lack of space. These models should still be available from UCSC. If this causes a problem, please tell us.
We released our new Acembly genes, built over the
human genome sequence from April 2003 (build 33 / golden path hg15), on June
5th.
Most significant improvements: Progress for this build (September 20th,
2003)
The most significant improvements
for this build are:
-
Tables
of genes with a given Psort motif are now available from the “Table of Psort motifs”. They used to be
partly accessible by clicking on the red boxes in the mRNA diagram, but most of the
time we could not show the whole list because we had put a limit on the maximal
number of genes in a single table. The table of genes with a coiled-coil, which
are putatively involved in protein interactions, has been most appreciated by
our loudest users.
-
Any AceView
table now comes with an indication of the level of expression of each gene
in the table, through the number of cDNA clones that Acembly assigns to that
gene. We also re-ordered the genes, not by their alphabetical name, but
according to their map position on the chromosomes. In human, we now provide in
the tables both the known cytolocation and the actual
chromosome on which Acembly assembled it, for a quick comparison.
-
Minor
detail fixed to avoid spaghetti genes in Acembly. We unplug a very distant first or last exon if it has too many
errors at an early step in the program. We may have lost a few small good
looking genes in the process. We also have a bug in this release: some genes
with very many clones are split into pieces and appear as a cluster of
overlapping small genes. This will be fixed in release 34.
-
The
query system has been
improved. We fused what we used to call extended search and fast search into a
new single query box, which tries to implement the notoriously impossible 'what
you get is what you wish' paradigm. By observing the queries we receive, and
the data in the various fields of the database that we want accessible, we
ended up ordering and classifying the data to be searched, and have drawn up a set
of reasonable rules (based on common sense) for extending what users type in
the query box. The algorithm nearly always gives us a reasonable answer, but we
would really appreciate users’ feedback if they encounter problems. In practice,
and to simplify, we first try to recognize exact gene symbols, locusIDs, GenBank identifiers and
cDNA clones: if we do, we stop the search. If we don’t, we proceed by searching
all other fields in the database using a general search, allowing word
completion and extension both ways. Since all objects in the database point to genes,
we then collect the lists of genes. The drawback is that we may bring up genes
somewhat distantly related to the query. To compensate, we have added the
possibility of recursive searches. The reply to a query comes as a table giving
the gene names, their position and a few words describing their function or
phenotype (we have done that systematically for worm genes), plus a new query
box that will query only inside this list. One possibly irritating feature of
the query answer is that it may be difficult to trace back the term among the
many documents attached to a gene; often the terms are found in the abstract of
one of the papers, and the user has to open and read them all to find why the
gene belongs to the list. Yet this is not a bug: computers have many
qualities, but they lack imagination and creativity, and we can guarantee that the words
typed were found somewhere in this gene’s information.
-
A new
indexing system has been
developed to speed up the answers to the queries over the entire database. It
indexes on strings of three characters and has been a lot of fun to develop;
maybe it is original, maybe not. In any event, we hope you enjoy the outcome.
Too few of you use the box to query AceView. Please try it… you will be
impressed!
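For readers curious about how an index on strings of three characters can serve the two-step query described in the list above, here is a minimal sketch. It is not the AceView code: the class and function names, the in-memory dictionaries and the lower-casing of all text are assumptions made for illustration. The index maps every three-character string to the genes whose text contains it, so a substring query (word completion and extension both ways) only needs to verify the few genes that share all of its trigrams.

```python
from collections import defaultdict


class TrigramIndex:
    """Toy inverted index from 3-character strings to gene names."""

    def __init__(self):
        self._postings = defaultdict(set)  # trigram -> set of gene names
        self._texts = {}                   # gene name -> indexed text

    def add(self, gene, text):
        text = text.lower()
        self._texts[gene] = text
        for i in range(len(text) - 2):
            self._postings[text[i:i + 3]].add(gene)

    def search(self, query):
        """Return the genes whose text contains `query` as a substring."""
        q = query.lower()
        if len(q) < 3:
            # Too short for the index: fall back to a plain scan.
            return {g for g, t in self._texts.items() if q in t}
        grams = [q[i:i + 3] for i in range(len(q) - 2)]
        candidates = set.intersection(
            *(self._postings.get(g, set()) for g in grams))
        # Sharing all trigrams is necessary but not sufficient: verify.
        return {g for g in candidates if q in self._texts[g]}


def run_query(term, exact_ids, index):
    """First try exact identifiers (gene symbols, clone names, ...);
    only if none matches, fall back to the general substring search."""
    gene = exact_ids.get(term.lower())
    if gene is not None:
        return {gene}
    return index.search(term)
```

Here `exact_ids` is a hypothetical dictionary from lower-cased identifiers to gene names; a recursive search, as described above, would simply rerun `search` on an index restricted to the genes of the previous answer.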
Statistics
on Human build 33 (June 5th, 2003)
Note: build 32 was never made public
Thanks to the increase in mRNA and EST sequences submitted to the public
databases and to the improvement of the genome sequence, we align on this new
build 20% more mRNAs and ESTs in 7% fewer genes. On this genome sequence, we
map 3,449,800 cDNA sequences: 3,313,569 ESTs (67% of all ESTs in GenBank/dbEST, a 9% increase relative to last time) and 117,475
mRNAs from the public databases (93%), as well as 18,756 NCBI RefSeqs,
representing 99.5% of the current RefSeq set. Although we do not mask for
repeats, only 1.5% of the mRNAs and 2.75% of the ESTs are ambiguous and match
the genome with indistinguishable quality in more than one place. As for
the current quality of the alignments, mRNAs on average measure 2,017 bp and
align over 98.3% of their length with 99.7% accuracy. ESTs, especially prone to
sequencing errors near the end of the read, on average measure 531 bp, yet
align over 93% of their length with over 98% accuracy.
AceView clusters the 3.5
million reads into 30,953 genes with at least one validated [gt-ag] or [gc-ag] spliced intron and on average 4.9 alternative
variants per gene (altogether 152,371 different mRNAs), exactly as we had
reported in the Lander
et al. main genome paper! Another 515 genes do not have confirmed introns,
yet encode proteins of more than 300 amino-acids and are hence likely
functional intronless genes. The remaining 48,122 genes are unspliced, partial,
or non-coding and will require further investigation. Altogether in this build,
we annotated 79,590 genes, with 201,359 alternative mRNA variants.
In this release, 74,114
(64%) of the spliced mRNA models are supported by a single identified clone
covering the entire CDS; the remaining could represent an inappropriate
concatenation of two clones (28%) or more (8%), because, if no conflicts arise,
we merge partial cDNAs.
The Pfam motifs used are from
version 8.0, downloaded May 2003.
Caenorhabditis elegans: New RefSeq release of genes and chromosomes July
16th
Reference sequences of the
nematode genes have been updated
at NCBI on July 16th (and on the web on June 23rd).
We describe below
1.
The statistics of this release
2.
A glossary of C.elegans Gene Names, as found in
RefSeq, LocusLink and AceView
3.
AceView gene representation, gene
annotation and biology
C.elegans researchers,
we count on you to help us in the NCBI annotation process: please send us text descriptions for the genes you studied. Note that for us, texts are better than keywords:
they are more precise, and because we index all texts and abstracts for searches,
the language can evolve and we will not miss new concepts. Also, please check the bibliography at the bottom of the “Gene on
Genome” page, and send us corrections: missing papers and, above all, papers
attributed to a gene that they do not describe create a lot of nuisance. Thank
you!
Statistics of C.elegans
genes
The complete genome of
release WS97 (a fragment of about 10 kb
was added since) was provided by the Genome Sequencing Consortium through
WormBase. This version of the genome consists of 100 264 081 bp and 22725
CDS/non coding RNA models, recognizable by their cosmid.number
name. We have used all GenBank mRNAs as of Feb 10,
2003 to replace, whenever possible, models of genes by complete mRNAs: this
happened for 1271 ‘reviewed’ RefSeq mRNAs, from 1106 genes. Through
collaboration with Yuji Kohara and the transcriptome project, we were able to
indicate which of the predictions are fully and exactly supported by cDNAs,
even when the full length cDNA is not yet in GenBank
as an mRNA. This happened for 5186 ‘validated’ RefSeq
mRNAs from 4783 genes. The fully confirmed reviewed and validated RefSeqs are drawn in pink to signify their trustworthiness.
Another 11555 mRNAs are approximately or partially supported, and 4442 are not
supported by expression data or phenotype so far: all these mRNAs are drawn in
dubious blue rather than pink.
In
terms of genes, this release includes
-
19173 genes that would
produce mRNA(s): 14731 have been shown
to be expressed, and 3065 are associated with a phenotype, by mutation or gene
specific RNA interference.
-
1886 genes that produce
other RNAs, including 733 tRNA, rRNA,
snRNA, scRNA, miRNA,
pseudogenes, non coding polyadenylated RNAs, and unpredicted
transcribed genes. 54 have an associated phenotype.
A glossary of C.elegans
Gene Names
6029 genes now have an official
name such as tra-1 or ced-4 (from the CGC).
All molecularly identified genes also have a positional Worm Transcriptome
Project name, which gives unambiguously the gene order and the strand: for
example 1C94 is the gene on chromosome number 1, megabase “section” C (3rd),
at or near kilobase 94. In addition, its strand information is encoded as
follows: odd genes run downstream, on the direct strand; even genes run
upstream, on the reverse strand.
Some nematode genes appear
to belong to a “complex locus” as initially defined by complementation
tests in phage or Drosophila. In such a gene, two complete alternative variants
may not share a base, although a third variant “bridges” them and overlaps
both. By convention, we denote this property by adding the letter C for complex behind the name (e.g. 1C777C), or by a
dot if the gene had acquired two independent gene names (e.g. mai-1.gpd-2.gpd-3).
Another frequent case is that of nearby genes expressed from the same strand,
but with overlap between the 3’ end of the first gene and 5’ end of the second:
disputably, but for technical reasons, these are represented as a single gene,
with the suffix Co (for common
UTR sequence) appended after the gene name (e.g. cul-1Co or 5K225Co), or with
the sign AND if both genes have a name (e.g. mev-1ANDced-9).
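The positional naming scheme above lends itself to a small decoder. The sketch below is only one reading of the convention and rests on several assumptions that the text does not spell out: that chromosomes are written 1 to 5 or X, that the letter counts megabases from A, that the number is the kilobase within that megabase, and that "odd" and "even" refer to this number. The function and class names are hypothetical, and names built with a dot or with AND are not handled.

```python
import re
from dataclasses import dataclass

# chromosome, megabase letter, kilobase number, optional C / Co suffix
_NAME = re.compile(r"^([1-5X])([A-Z])(\d+)(Co|C)?$")


@dataclass
class PositionalName:
    chromosome: str
    section: str      # megabase "section" letter
    kilobase: int
    strand: str       # "+" direct strand, "-" reverse strand
    suffix: str       # "", "C" (complex locus) or "Co" (common UTR sequence)
    approx_bp: int    # approximate coordinate implied by the name


def parse_positional_name(name):
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a positional gene name: {name!r}")
    chromosome, section, kb_text, suffix = match.groups()
    kilobase = int(kb_text)
    megabase = ord(section) - ord("A")          # A -> 0, B -> 1, C -> 2, ...
    strand = "+" if kilobase % 2 else "-"       # odd: direct, even: reverse
    approx_bp = megabase * 1_000_000 + kilobase * 1_000
    return PositionalName(chromosome, section, kilobase, strand,
                          suffix or "", approx_bp)
```

Under these assumptions, 1C94 decodes to a reverse-strand gene on chromosome 1 near 2.094 Mb, and 5K225Co to a direct-strand gene on chromosome 5 that shares UTR sequence with a neighbour.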
Gene representation
In AceView, the extent of the nematode genes (represented by the named
turquoise bar in the graphs on genome) is based on the actual mRNA and EST
supporting data rather than on the predicted sequences from WormBase. Most of
the RefSeq genes also contain this information, except, for technical reasons,
for the genes mispredicted and too long. The
advantage is that, although WormBase changed 2620 of their predictions over the
last 7 months, the position or extent of only 118 of the AceView genes had to
be changed. The same stability is seen in the nematode LocusLink.
As explained above, the sequences of pink mRNAs in AceView are fully supported,
unlike sequences of blue mRNAs.
Gene
annotation and biology
Our annotation of the worm genes
initially relied on a selection from WormBase. We have started to re-annotate
the genes, with the help of the Worm Community and of the Worm Transcriptome
Project led by Yuji Kohara.
1.
Expression data: For this release, we annotated the level and
developmental pattern of expression for 11,000 genes (from Acembly
analysis of cDNA libraries
representation). We list all clones in each
gene, point to the best clone and describe the anomalies we see in the
anomalous clones. We provide direct access to NextDB,
the extraordinary resource provided by Kohara and Shin’I to visualize the in situ expression patterns on a partition of all developmental
stages. Note that
these data are still unpublished: users of the data presented on the NextDB web pages should not publish the information without
Kohara’s permission and appropriate acknowledgment. The database is still
in an embryonic phase, so if you find any discrepancies in the data, please let
them know straightaway. They would also appreciate any suggestions and
comments.
2.
mRNA, protein and gene titles were generated. All gene titles now bear an
indication of their function: for all genes where a phenotype has been
described, the word essential or phenotype appears in the gene title. Protein
and gene titles are in the process of hand editing. In this release, we have hand annotated from
the literature 581/2 gene-gene interactions
and 694/2 protein-protein interactions, and
have worked on identifying and renaming all collagen
genes (187) in the worm (with J Kramer and the genome consortium).
Finally, we have hand edited the entire file of phenotypic descriptions from C.elegans II, because it described
the phenotypes with a coded vocabulary that made it difficult for the neophyte
to understand.
The Pfam motifs used for this release are from version 8.0, downloaded
May 2003.
May 15, 2003
Arabidopsis thaliana new release
On May 15th, we released our reconstruction of the
Arabidopsis genes, using all 178,464 ESTs and 28,721 mRNAs available from
GenBank on March 5th, 2003 and aligning them on the TIGR genome release
(downloaded from the TIGR site November 2002).
AceView aligned 168,076 ESTs (94.2%) and 28,653 mRNAs
(99.8%), and from this, reconstructed 21,431 genes, giving rise, by alternative
splicing or alternative promoters, to 26,687 mRNAs. The genes look like
nematode or fly genes in their overall organization and dimensions. All the
proteins were annotated by our pipeline.
The level of alternative variants was strikingly and
significantly lower than in worm and, a fortiori, in human,
even taking into account the high level of coverage: 25% of the
mRNAs are alternative forms. Among the genes with at least 3 cDNA clones, only
20% contain alternative variants (2,728 genes out of 13,502). That may be an
interesting biological difference.
Another remarkable feature is the extremely high
quality of many cDNA and even EST sequences: many do not have a single base
different from the genome, and we see very few cDNA clones needing to be tagged
as abnormal (only 307 of 192,351, 20 to 40 times fewer than in the other
species). In other words, this project just does not “look” like the other
projects we analyzed: worm, Drosophila, and human. It could be that plants are
really different, or that the excellent sequences of cDNAs in GenBank are
predictions, or at least sequences corrected by genome alignment. In the same
vein, there is also a low level of polymorphisms (which came as a surprise to
us) and of genome sequencing errors, as monitored by discrepancies between sets
of cDNAs and the genome sequence.
The next standard step
would have been to get the official gene names and the biology, and we
contacted TAIR to this effect, but we have not yet
found time to proceed. So this Arabidopsis release is rudimentary; it can be
queried by sequence accessions or by protein annotation, but that is about all.
The Pfam motifs used for this release of
Arabidopsis are from version 8.0, downloaded May 2003.
February 20, 2003
(corresponds to human build 31)
We just released our new
Acembly genes, built over the human genome sequence from November 2002 (build
31 / golden path hg13).
We improved the Acembly clustering algorithms to make
the genes more precise. We reinforced the mRNAs that already had a fully
sequenced clone and reduced the combinatorics. As a
result, we may end up with incomplete alternative variants that should prompt
scientists to obtain the clones for full-length resequencing. We also display
more validated alternatively spliced variants.
The Pfam motifs used are from version 7.7, downloaded
December 2002.
The most significant interface improvements for this
build are:
Statistics on Human build 31 (February 20th, 2003)
Thanks to code improvements, we now map unambiguously
2,763,401 ESTs (representing 58% of the ESTs currently in GenBank/dbEST) and 83,872 mRNAs from the public databases, as well
as 18,000 NCBI RefSeqs, representing 99.3% of the
current RefSeq set. AceView clusters these into 83,874 genes, with altogether
210,122 alternative mRNA variants. 33,286 genes have at least one validated
gt-ag or gc-ag spliced intron, and on average 4.6
alternatively spliced variants.
November 2002
We added Caenorhabditis elegans data to our system. This involved
revising the web service to correctly display data for multiple species.
August 2002
(corresponds to human build 30)
In this release we replaced our complicated positional
identifiers (such as G_t1_Hs1_4478_30_0_2551) with the official Locus name if
possible; otherwise with a Pfam name.number, when the
protein contains a Pfam motif; and otherwise with an artificially generated name.
For generated names, we compose either a pseudo-Japanese-sounding name, such as sayuri,
kimu and nowara, to label
genes where one of the principal clones is Japanese, or else a pseudo-English-sounding
name, like jawker or sneery.
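The naming fallback described above can be sketched as follows. This is only an illustration under stated assumptions, not the AceView code: the function name, its parameters, the example locus name "ABC1" and motif name "kinase", and the way the Pfam number and the invented names are supplied are all hypothetical.

```python
from typing import Iterator, Optional


def choose_gene_name(
    locus_name: Optional[str],
    pfam_name: Optional[str],
    pfam_number: int,
    principal_clone_is_japanese: bool,
    japanese_style_names: Iterator[str],
    english_style_names: Iterator[str],
) -> str:
    """Pick a gene name using the three-level fallback from the text.

    Assumption: pfam_number is assigned elsewhere; the source does not
    say how the number in "name.number" is chosen.
    """
    # 1. Prefer the official Locus name when one exists.
    if locus_name:
        return locus_name
    # 2. Otherwise use the Pfam motif name plus a number.
    if pfam_name:
        return f"{pfam_name}.{pfam_number}"
    # 3. Otherwise draw an invented name from the appropriate pool.
    pool = japanese_style_names if principal_clone_is_japanese else english_style_names
    return next(pool)


if __name__ == "__main__":
    jp = iter(["sayuri", "kimu", "nowara"])
    en = iter(["jawker", "sneery"])
    print(choose_gene_name("ABC1", "kinase", 3, False, jp, en))
    print(choose_gene_name(None, "kinase", 3, False, jp, en))
    print(choose_gene_name(None, None, 0, True, jp, en))
```

The order of the checks is the point of the sketch: an invented name is only drawn when neither an official name nor a Pfam motif is available.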
Statistics on Human build 30 (August 2002)
In this human genome
release, known as NCBI build 30, after filtering, we present on this site
66,830 genes, containing 138,040 reconstructed mRNAs, supported by 1,898,911
mRNAs and ESTs. We currently align 14,970 (95.0%) of the 15,748 NM reference
sequences, confirming the near completion of the human genome sequence.
12/12/00
AceView shows the alignment of cDNAs to the genome sequence (see Contig Assembly
Process), and the genes and mRNAs reconstructed from these alignments,
using the Acembly program developed by Jean and Danielle Thierry-Mieg on top of
the acedb object oriented database manager.
Protein BLAST hits to the
genomic sequence and conserved motifs identified by RPS-BLAST are displayed,
but they are not used to generate the model. Complementary views of the genes
are shown in LocusLink (All info about the gene), OMIM (Phenotypes), MapView (All maps) and Blink (protein homologies).
In this first release,
only the 10,544 RefSeq mRNA available October 13th were aligned to
NCBI contigs (October 5 freeze). A model is displayed if the mRNA aligns on the
genomic sequence, finished or draft, over at least 700bp or half its length,
with less than 3% discrepancy. A discrepancy is either a single base
substitution, a base insertion or a base deletion.
Of the 10,544 RefSeqs, 9,409 (89%) aligned as well or better than defined
above, in 9,045 genes, containing 9,299 mRNAs (1.03 RefSeq per gene).
Quality of alignment:
Some 5,423 RefSeq mRNAs
(58%) produced an excellent alignment to 5,224 genes (by excellent, we mean
more than 98% of the length aligned with less than 1% discrepancies from the
underlying genome sequence). These models are expected to be complete.
Other RefSeqs
match less well: if we allow for 3% base
mismatches relative to the current genome (3% single base insertion, deletion
or variation between the RefSeq and the underlying genome sequence), 1718 (18%)
align over 90 to 98% of their length, 690 (7%) over 80% to 90%, and the
remaining 1578 (17%), over 50% to 80% or 700 bp.
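The display threshold and the quality tiers described above can be sketched as a small classifier. This is only an illustration, not the Acembly code: the function names, the tier labels, the use of aligned bases as the denominator of the discrepancy rate, and the placement of alignments that cover more than 98% of the length with 1% to 3% discrepancies (put here in the second tier) are all assumptions.

```python
def is_displayed(aligned_bp: int, mrna_length: int, discrepancies: int) -> bool:
    """Display rule from the text: at least 700 bp or half the mRNA
    length aligned, with less than 3% discrepancy."""
    if aligned_bp <= 0 or mrna_length <= 0:
        return False
    long_enough = aligned_bp >= 700 or aligned_bp >= mrna_length / 2
    # Assumption: the discrepancy rate is measured over the aligned bases.
    return long_enough and discrepancies / aligned_bp < 0.03


def quality_tier(aligned_bp: int, mrna_length: int, discrepancies: int) -> str:
    """Assign one of the four tiers quoted in the text (labels are ours)."""
    if not is_displayed(aligned_bp, mrna_length, discrepancies):
        return "not displayed"
    coverage = aligned_bp / mrna_length
    rate = discrepancies / aligned_bp
    if coverage > 0.98 and rate < 0.01:
        return "excellent"  # >98% of length, <1% discrepancies
    if coverage >= 0.90:
        return "good"       # 90 to 98% of length, <3% discrepancies
    if coverage >= 0.80:
        return "fair"       # 80 to 90% of length
    return "partial"        # 50 to 80% of length, or at least 700 bp


if __name__ == "__main__":
    print(quality_tier(1000, 1000, 5))
    print(quality_tier(950, 1000, 20))
    print(quality_tier(700, 3000, 7))
```

As a consistency check on the figures quoted above, the four tier counts (5,423, 1,718, 690 and 1,578) sum to the 9,409 aligned RefSeqs.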
These results, showing
that 89% of the genes represented in RefSeqs can be
mapped at good stringency in the genome, confirm the quality of both the genome
sequence and the RefSeqs, and may be used to
improve both.
The few redundant RefSeqs revealed by alignment to the genome are currently
being reviewed.
Gene duplications or
assembly errors: After selecting the best hits and keeping only hits of similar
quality, 179 RefSeqs map to more than one place in
the genome, hence contributing to multiple gene models. Some correspond to
genome assembly errors, currently being reviewed, some to actual paralogs and some to retroposons.
Note that there are fewer gene duplications than was expected (max 2%), and
most genes are uniquely assigned.
On the other hand, more
genes than expected map uniquely in single exons. This happens for 7% of the RefSeqs, and the length of the gene is up to 3477 bp
(Nuclear receptor interacting protein).
408 genes in this sample
have more than one mRNA because they are alternatively spliced. This proportion
is well below the true proportion of alternatively spliced genes, since the
average number of protein products is above 2 per gene (from
aligning all mRNAs and ESTs in GenBank).
Unfortunately, a technical
problem prevented export of 164 genes, some excellent, often in draft regions,
to ASN.1, so that, in this release, only 9305 RefSeqs
(NM accessions) provided models for 8881 genes (Locus ID), containing 9222
mRNAs and proteins (XM and XP accessions).
CAUTION:
Using draft genome sequence and only RefSeqs mRNAs in
this release introduces natural limitations: because of the draft nature of the
genomic sequence, of polymorphisms, of imperfections in the RefSeq sequences or
in the Acembly program, many models are partial. 42% are shorter than the
RefSeq they originated from, usually by failure to align at the 3’ or 5’ end,
while 13% have internal gaps.
On the positive side, 77%
of the model mRNAs (7296 of the 9409 aligned) appear to contain the entire CDS.
Draft
quality of the genome sequence (end of contigs as well as pieces of sequence
not fully ordered and oriented (see Contig Assembly
Process)) is the major cause of problems:
AceView provides a graphical display of both
elements, facilitating evaluation of the quality of the area.
3’UTRs align less well
than the coding region, due to a higher level of polymorphisms and to less
careful cDNA sequencing. Acembly may then align the mRNA but refuse to add the
divergent sequence in the model, if too many discrepancies accumulate locally.
Missing 5’ exons have two
major causes: the first is the draft quality of the genome sequence; the second affects about 100 genes in
which a bug prevented the program from finding the first exon. This bug has
now been fixed.
Future releases will
include all mRNAs and ESTs from GenBank. The RefSeqs will be
improved and probable genome sequencing errors will be flagged. Using the
entire mRNA/EST set from GenBank allows Acembly to reconstruct 77518 mRNAs
encoding more than 80 amino acids, from 36453 genes, on the October 5 draft.
The AceView web pages
present three bubbling graphics of each gene model:
The AceView gene models
can be accessed from the Map Viewer, LocusLink (from the av
symbol) and the XM record.
Data Input
In the current release, the sequence data
used to develop the models included:
The mRNA sequences were aligned to genomic contigs
using the Acembly program, described in the next section. The genomic contigs
were also BLASTed against the vertebrate, non-human
proteins and RPS-Blasted against a motif database to annotate the homologies.
Modeling Method
The Acembly program
was used to produce the gene models. It can align all human mRNA and EST
sequences to the genomic sequence. If there are ESTs or mRNAs that produce
different models because of alternative splicing, all models will be displayed.
If an mRNA aligns to multiple locations on the genomic sequence, Acembly keeps
only the best alignment. But if two of the alignments are of similar quality,
Acembly keeps both, and shows that mRNA in bold (as described in item 6 under
Data displayed).
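The rule for mRNAs that align to several places can be sketched as below. This is a minimal illustration under assumptions, not Acembly's actual logic: the source does not define "similar quality", so the `Hit` type, the positive `score` field and the 98% `similarity` tolerance are all hypothetical. The sketch also keeps every hit within the tolerance, whereas the text only mentions the case of two similar alignments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hit:
    location: str  # hypothetical label for a genomic position
    score: float   # hypothetical positive alignment quality score


def select_alignments(hits: List[Hit], similarity: float = 0.98) -> Tuple[List[Hit], bool]:
    """Keep the best alignment, plus any others of similar quality.

    Returns the kept hits and a flag telling whether the mRNA maps to
    more than one place (such mRNAs are shown in bold in the display).
    """
    if not hits:
        return [], False
    best = max(h.score for h in hits)
    # Assumption: "similar quality" means within a fixed fraction of the best score.
    kept = [h for h in hits if h.score >= similarity * best]
    return kept, len(kept) > 1


if __name__ == "__main__":
    hits = [Hit("site A", 1000.0), Hit("site B", 990.0), Hit("site C", 600.0)]
    kept, ambiguous = select_alignments(hits)
    print([h.location for h in kept], ambiguous)
```

A relative tolerance is only one possible choice; an absolute score difference or a combined length and discrepancy criterion would fit the description equally well.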
A model might be
incomplete, if the available mRNA and EST sequence data for a gene is
incomplete or if it does not match the genomic sequence over its entire length.