A. Intron and Exon Length
Figure 1.
Intron and exon length distribution. Each bar represents the number of
introns or exons in each size class. The survey includes 669 introns and
862 exons. (Inset) Expanded plot of the small introns.
Each bar represents introns of a specific length: 0–20,
21–40, 41–60, etc.
C. elegans introns have some unusual properties (
Blumenthal and Thomas 1988). First of
all, they tend to be much shorter than vertebrate, or even yeast, introns. Figure 1 shows the size distribution of
introns and exons in
C. elegans. More than half of all
C. elegans introns in the survey are shorter than 60
nucleotides, much too small to be spliced in vertebrates (
Wieringa et al. 1984;
Ogg et al. 1990). The inset in Figure 1 (top) shows the size distribution of the very short
introns. It appears to represent a skewed distribution with a peak at 48
nucleotides but with greater than expected numbers of introns at 44 and 52
nucleotides. The reason for this uneven distribution is unknown, but it may
reflect some constraint imposed by spliceosome formation. The shortest intron in
the survey is 30 nucleotides long (in the α2[IV] collagen gene [
Sibley et al. 1993]), but only 4 of 659
are shorter than 40 nucleotides. It is also worth noting that the
C.
elegans splicing machinery does retain the ability to splice very
long introns. For example, the first intron of the
unc-7 gene is 18 kb (
Starich et al.
1993).
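The size classes used in Figure 1 (0–20, 21–40, 41–60, and so on) and the fraction of introns below a given length can be tabulated with a few lines of code. The following is a minimal sketch only: the function names and the example lengths are hypothetical and are not the survey data.

```python
from collections import Counter


def size_class_counts(lengths, bin_width=20):
    """Count lengths per size class: 0-20, 21-40, 41-60, ..."""
    counts = Counter()
    for n in lengths:
        # Class 0 covers 0..bin_width; class k >= 1 covers
        # k*bin_width + 1 .. (k + 1)*bin_width.
        k = 0 if n <= bin_width else (n - 1) // bin_width
        low = 0 if k == 0 else k * bin_width + 1
        high = (k + 1) * bin_width
        counts[(low, high)] += 1
    return dict(sorted(counts.items()))


def fraction_shorter_than(lengths, cutoff):
    """Fraction of lengths strictly below the cutoff."""
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Hypothetical intron lengths in nucleotides (illustrative only).
example_lengths = [30, 44, 48, 48, 52, 59, 75, 320, 18000]
classes = size_class_counts(example_lengths)
short_fraction = fraction_shorter_than(example_lengths, 60)
```

With real survey data, `fraction_shorter_than(lengths, 60)` would give the proportion of introns shorter than 60 nucleotides discussed above.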
C. elegans introns are not the shortest among free-living
nematodes. Caenorhabditis briggsae introns tend to be somewhat
shorter than those of C. elegans. For example, the introns of
the C. elegans
ges-1 gene are much longer than those of the C. briggsae
homolog (Kennedy et al. 1993). An
informal survey of 39 homologous introns from the two species shows that 17 are
substantially longer in C. elegans, whereas only 4 are longer
in C. briggsae (T. Blumenthal, unpubl.), and the 4 known
introns of another rhabditid nematode, CEW1, are between 38 and 41 nucleotides
long (Winter et al. 1996). Other
organisms have also been reported to have short introns:
Schizosaccharomyces pombe introns have a median length of
63 bp (Zhang and Marr 1994);
Drosophila introns show a sharp distribution around a modal
length of 63 bp (Mount et al. 1992);
introns in the flatworm, Schistosoma mansoni, are
31−42 bp long (Craig et al.
1989); and Paramecium introns all seem to be between
20 and 33 nucleotides in length (Russell et
al. 1994). This is in sharp contrast to the much longer introns of
vertebrates, plants, and even Saccharomyces cerevisiae.
Figure 2.
Intron/exon structure of C. elegans genes. Each bar
represents the number of genes having the number of introns shown. The
survey includes 117 genes.
C. elegans exons are most frequently about 80−250 bp
in length (Figure 1, bottom), similar to
vertebrate exons. However, they can be much longer. Several are larger than 4
kb, and the longest identified exon is 9 kb, in
unc-22 (
Benian et al. 1989). The
survey also shows that the typical
C. elegans gene has a
relatively small number of introns and exons (Figure 2; median = 3 introns/gene). The 100 fully characterized genes
in the survey range in their exon content from 21% to 95%
(median = 71%).
B. Sequences That Signal Splicing
Figure 3.
C. elegans intron consensus sequences. Splicing occurs
at the vertical lines. Positions are numbered with respect to the splice
sites, and the percentage of each nucleotide at each position is given.
All 669 introns surveyed for Figure 1 are included in the calculations. The derived
C.
elegans consensus sequence is given below. For comparison,
the general consensus derived from a large data set including all
organisms (
Mount 1982) is
included below the
C. elegans consensus.
The vertebrate and
C. elegans splice site consensus sequences
are compared in Figure 3.
C.
elegans introns obey the GU-AG rule (with very rare use of GC as a
5′splice site), and their 5′splice site consensus is
essentially the same as that for vertebrates. However, several interesting
differences exist between
C. elegans and other species.
C. elegans introns have an extended, very highly conserved
3′splice site consensus sequence, UUUUCAG, in which the U at the
−5 position is almost perfectly conserved, whereas introns from most
other organisms have only a combination of an upstream polypyrimidine tract and
a YAG consensus at their 3′boundaries. This suggests that the
3′intron boundary may be a more important element in
C.
elegans intron recognition than in other organisms. This
supposition has recently gained experimental support (see below). On the other
hand,
C. elegans introns have no obvious polypyrimidine tract
(other than the 3′splice site consensus), nor have they any convincing
branch site consensus similar to consensus sequences of yeast or mammalian
introns. Since branch sites mapped in other species generally occur at A
residues, it is reasonable to suppose that branching in
C.
elegans also tends to occur at A residues. It has been observed
that although the entire intron is rich in U residues, A residues are more
frequent at positions −16, −17, and −18 from the
3′splice site (T. Blumenthal and K. Steward, unpubl.), so it is
reasonable to propose that branching may occur at these A residues.
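Position-specific percentages of the kind reported in Figure 3, and the enrichment of A residues upstream of the 3′splice site, come from aligning introns at the splice site and tallying each column. The sketch below illustrates that tally under stated assumptions: the function name is hypothetical, and the three sequences are invented toy examples, not surveyed introns.

```python
from collections import Counter


def position_frequencies(intron_ends, positions):
    """Percent of each nucleotide at positions counted back from the
    3' splice site; position -1 is the last nucleotide of the intron."""
    table = {}
    for pos in positions:
        # Skip sequences too short to reach this position.
        column = Counter(seq[pos] for seq in intron_ends if len(seq) >= -pos)
        total = sum(column.values())
        table[pos] = (
            {nt: 100.0 * column[nt] / total for nt in "GAUC"} if total else {}
        )
    return table


# Invented 3' ends of introns (RNA, 20 nucleotides each), for illustration.
toy_intron_ends = [
    "UUAA" + "U" * 13 + "CAG",
    "AUAUUUAUUUUAUUUUUCAG",
    "UAAUUUCUUUAUUUUUUCAG",
]
freqs = position_frequencies(toy_intron_ends, [-18, -17, -5, -2, -1])
```

In this toy set, position −5 is U in every sequence, mirroring the near-perfect conservation described in the text, whereas A is scored at −18 and −17 only because the examples were constructed that way.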
Although C. elegans introns conform to the GU-AG rule, the
C. elegans splicing machinery can recognize variants of
this sequence at both ends of the intron (Aroian et al. 1993). Several mutations in the let-23 and dpy-10 genes were caused by mutation of the AG at the 3′splice
site. Since these mutations resulted in only a partial loss-of-function
phenotype, the RNA products of these mutant genes were examined, and at least a
portion of the RNAs were found to be spliced at the normal site, now consisting
of AA in place of AG. Furthermore, splicing occurred at other non-AG sites in
the surrounding sequence, including UG, AU, and GG. In another study, it was
discovered that mutations caused by insertion of the transposon Tc1 often
produced weak phenotypes because the entire transposon was spliced out utilizing
cryptic sites at the ends of the transposon or in surrounding DNA (Rushforth et al. 1993; Rushforth and Anderson 1996). In several
instances, these sites did not conform to the GU-AG rule: UU and AU were used as
5′splice sites, and UG, AC, GC, and GG were used as 3′splice
sites.
These results demonstrate that AG is not an obligatory component of the
3′splice site recognition process in C. elegans,
although the fact that it is so highly conserved indicates that it certainly
contributes information to the process. To determine where the additional
information for 3′splice site choice is located, a mutational analysis
of the site was performed (Zhang and
Blumenthal 1996). When a 3′splice site, UUUCAG/AAG, was
mutated to UUUCAAAAA, splicing occurred 100% of the time at the
second A in the string of five A residues, which is the position where splicing
occurs in the wild-type sequence. When additional single-base changes were
introduced into the UUUC portion of the sequence, splicing failed to occur, or
it occurred at a different site. Thus, the highly conserved UUUC that precedes
the AG also provides important information to the spliceosomal machinery. It may
be that this short stretch of pyrimidines replaces the polypyrimidine tract
(which is typically 15−20 nucleotides long in vertebrates). If so, one
would expect that it would serve as the recognition site for U2AF. It is worth
noting that U to C changes in this short sequence clearly reduce its
effectiveness, suggesting that if this is a polypyrimidine tract, it is not just
a random sequence of pyrimidines (the strong UUUC consensus also supports this
observation). However, U and C do not have equivalent functions in vertebrate
polypyrimidine tracts either (Roscigno et al.
1993). Furthermore, the optimal binding sequence for mammalian U2AF,
which acts at the polypyrimidine tract, has recently been shown to require
several contiguous U residues (Singh et al.
1995).
Table 1
Base composition of C. elegans introns and exons (percent of nucleotides)
| Base | Introns, short | Introns, long | Exons, coding | Exons, untranslated | Exons, untranslated |
|---|---|---|---|---|---|
| G | 14 | 16 | 24 | 18 | 14 |
| A | 32 | 33 | 30 | 37 | 29 |
| U | 42 | 35 | 24 | 24 | 39 |
| C | 12 | 16 | 22 | 21 | 18 |
| A + U | 74 | 68 | 54 | 61 | 68 |
C. elegans introns are very A + U-rich (~70%)
compared with surrounding exons (~54%) (
Table 1), a property they share with
introns of other invertebrates and plants (
Goodall and Filipowicz 1989;
Csank
et al. 1990). In plants, A + U richness has been demonstrated to
represent an important aspect of intron recognition: Insertion of an A + U-rich
sequence within an exon, even without splice sites, has been shown to result in
splicing of the inserted sequence utilizing fortuitous matches to the splice
site consensus sequences present in the surrounding exon (
Luehrsen and Walbot 1994). In worms, A + U richness has
also been shown to be an important feature of 3′splice site
recognition. In one case where two alternative 3′splice sites 20
nucleotides apart were available, the downstream site was always utilized;
however, when the region between the two splice sites was made more G + C-rich,
the upstream site was chosen, indicating that the border between A + U-rich and
more G + C-rich RNA is one of the criteria
C. elegans
spliceosomes use in choosing splice sites (
Conrad et al. 1993a). The data in
Table 1 also suggest that recognition of short and long introns may
not be identical processes. The short introns appear to be richer in U
nucleotides than are large introns. Furthermore, an analysis of information
content of short and long introns of
C. elegans showed that
they were significantly different at both splice boundaries (
Fields 1990). The data in
Table 1 also show that the regions of the
exons specifying untranslated regions tend to be A + U-rich compared with
protein-coding regions.
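The A + U figures quoted above follow directly from the base percentages in Table 1, as the short check below shows. The table values are copied from Table 1; the helper `au_percent` is a hypothetical convenience for computing the same quantity on an arbitrary sequence and is not part of the original analysis.

```python
# Percent base composition from Table 1, one list entry per table column
# (two intron columns followed by three exon columns).
composition = {
    "G": [14, 16, 24, 18, 14],
    "A": [32, 33, 30, 37, 29],
    "U": [42, 35, 24, 24, 39],
    "C": [12, 16, 22, 21, 18],
}

# A + U content per column is the sum of the A and U rows.
a_plus_u = [a + u for a, u in zip(composition["A"], composition["U"])]

# Sanity check: the four bases in each column should total 100 percent.
column_totals = [sum(col) for col in zip(*composition.values())]


def au_percent(rna):
    """A + U content of a sequence, in percent (T is treated as U)."""
    rna = rna.upper().replace("T", "U")
    return 100.0 * sum(rna.count(base) for base in "AU") / len(rna)
```

Here `a_plus_u` reproduces the bottom row of the table, and every column totals 100.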
In summary, intron recognition in C. elegans appears to involve
recognition of the general boundaries of the intron based on A + U richness,
recognition of the 5′splice site (by U1 snRNP, as in all other
organisms), and recognition of the 3′splice site by currently
undefined components of the spliceosome that interact with the UUUUCAG sequence.
Since there is neither a good match to the branch-site consensus, which would
base pair with U2 snRNA, nor a polypyrimidine tract that would interact with
U2AF, it seems most likely that the information provided in vertebrates by those
consensus sequences, resulting in U2AF-assisted binding of U2-snRNP to the
branch site, is provided in C. elegans by the highly conserved
UUUUCAG instead. Tight binding of U2AF (or some other component) could result in
recruitment of U2 to a very loosely defined branch-site sequence and
simultaneous definition of the 3′splice site.
C. The Splicing Machinery
Those portions of the splicing machinery that have been characterized so far are
very highly conserved throughout the eukaryotes. This includes C.
elegans, which has been found to have all the spliceosomal snRNAs
(Thomas et al. 1988). The sequences
and lengths of these RNAs are quite similar to those of other animals. In
particular, the sequence in U1 that interacts with the 5′splice site
and the sequence in U5 that interacts with both exon borders have both been
perfectly conserved between vertebrates and C. elegans (Thomas et al. 1990). The sequence in U2
that base pairs with the branch site has been perfectly conserved in C.
elegans, even though the branch-site sequence in C.
elegans introns is so loosely defined that it is unrecognizable.
When it was found that C. elegans introns were unusually short,
it was thought that the snRNAs which catalyze their splicing might also be
shorter than in other species. However, the C. elegans snRNAs
are about the same lengths as their homologs in vertebrates. It may be that
constraints imposed by snRNA size do not determine minimum or optimal intron
length.
The genes that encode the snRNAs are also similar to those found in other species
(Thomas et al. 1990). The
spliceosomal snRNAs are each encoded by small multigene families of
6−12 members each. A few of the genes are clustered, but in general
they are spread throughout the genome. These genes are transcribed in other
animals, and probably C. elegans as well, by RNA polymerase II,
except U6, which is transcribed by RNA polymerase III. The snRNA genes are each
preceded at their 5′ends by a very highly conserved sequence called
the proximal sequence element (PSE). In vertebrates, the PSE has been shown to
be the site where a transcriptional activation complex called SNAPc forms. The
C. elegans PSE sequence has diverged totally from the
vertebrate PSE, but it is very highly conserved among the different C.
elegans snRNA genes and occupies the same position as the
vertebrate PSE, so it is likely to perform the same function.
Several genes that encode protein components of the C. elegans
splicing machinery have been identified. These genes encode U2AF (Zorio et al. 1997); PRP21, a
U2-associated protein (Spikes et al.
1994); PRP8, a U5 snRNP-associated protein involved in
3′splice site recognition (Hodges et
al. 1995; Umen and Guthrie
1995); and several SR proteins (M.L. Morrison et al., pers. comm.).
Presumably, C. elegans homologs to the many other proteins
involved in splicing also exist, and many have been identified in the partial
C. elegans genomic sequence and in sequenced cDNA
clones.
The pre-mRNA for the U2AF large subunit is alternatively spliced in a potentially
interesting way (see Zorio et al.
1997). In addition to the RNA that encodes the full-length protein, an
alternatively spliced RNA results from choice of a more proximal
3′splice site and removal of a short intron. This alternative splicing
results in insertion of an approximately 300-bp exon, containing an in-frame
stop codon, and a similar alternative splice occurs in the C.
briggsae U2AF homolog. The C. briggsae gene is
quite highly conserved in the exons surrounding the alternative splice, but it
is essentially unrelated in the introns and in the exon inserted by the
alternative splice. However, downstream from the alternatively used splice site,
the C. elegans insert contains 10 good matches to the
3′splice site consensus, UUUUCAG/(A or G), and the C.
briggsae insert contains 18 such matches. There is no evidence that
these sequences serve as splice sites, so they may instead play a part in
regulating U2AF levels. This autogenous regulation might occur by binding of
excess U2AF to the 3′splice-site-like sequences in the pre-mRNA or in
the alternative mRNA.
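Counting matches to the 3′splice site consensus, UUUUCAG/(A or G), in a sequence is a simple pattern search. The sketch below counts exact matches only, which is an assumption: the "good matches" described above may have allowed mismatches. The function name and the example sequence are hypothetical.

```python
import re

# Exact match to UUUUCAG followed by A or G. The lookahead makes the
# search report every start position without consuming the sequence.
CONSENSUS_3PRIME = re.compile(r"(?=UUUUCAG[AG])")


def consensus_match_positions(sequence):
    """Return 0-based start positions of exact consensus matches.
    DNA input is accepted; T is converted to U before searching."""
    rna = sequence.upper().replace("T", "U")
    return [m.start() for m in CONSENSUS_3PRIME.finditer(rna)]


# Invented sequence for illustration: two sites are followed by A or G,
# and a third UUUUCAG is followed by C and therefore is not counted.
toy_sequence = "AAUUUUCAGAUCCUUUUCAGGUUUUCAGC"
match_positions = consensus_match_positions(toy_sequence)
match_count = len(match_positions)
```

Applied to the inserted exon of the alternatively spliced RNA, a count of this kind would correspond to the number of splice-site-like sequences discussed above, subject to how strictly a match is defined.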
D. Alternative Splicing
Figure 4.
The unc-17/cha-1 polycistronic cluster. The two genes share a noncoding first
exon, and the two products are generated by alternative splicing. The
gene structures are redrawn to scale from the data of Alfonso et al. (1994b). Exons are
shown as bars joined as shown.
Alternative splicing is a frequently used mechanism for producing multiple mRNAs
and protein products from a single gene. In
C. elegans, there
are numerous examples of alternative splicing, in which alternative splice sites
are chosen (e.g.,
xol-1 [
Rhind et al. 1995],
flp-1 [
Rosoff et al. 1992],
hlh-1 [
Krause et al. 1990],
IFa1 [
Dodemont et al.
1994]), entire exons are skipped (e.g.,
lin-14 [
Wightman et al. 1991],
tmy-1 [
Kagawa et al. 1995]), or
groups of exons are skipped (e.g.,
unc-52 [
Rogalski et al. 1993,
1995] and
bli-4 [
Thacker et al. 1995]). One
particularly interesting example is the case of
unc-17, which encodes a synaptic vesicle-associated acetylcholine
transporter, and
cha-1, which encodes choline acetyltransferase (see
Rand and Nonet, this volume). These
two independently identified genes are encoded by a single polycistronic cluster
(
Alfonso et al. 1994b). They share
a first noncoding exon, but the coding regions of the two genes are entirely
separate (Figure 4). The coding exons of
unc-17 are all contained within the unusually long first intron of
cha-1. Hence, a given transcript can encode either the
unc-17 product or the
cha-1 product, but never both. This appears to be a unique use of
alternative splicing to ensure that the products of the two genes are not
produced simultaneously. This kind of arrangement also has been reported for the
unc-60 gene, which encodes two separate, but homologous, actin-depolymerizing
proteins (
McKim et al. 1994). The two
share a first exon, which encodes only the methionine at which translation
initiates.
The mechanisms by which these documented cases of alternative splicing are
regulated remain mysterious. In one case, however, a
trans-acting factor that regulates a variety of alternative
splicing events has been identified genetically (Lundquist and Herman 1994). The mec-8 gene was originally identified by mutations that cause defects in
mechanosensation. Strong mutations at this locus resulted in a variety of
seemingly unrelated defects including larval lethality. These mutations were
shown to enhance the phenotype of mutations in the unc-52 gene (which encodes perlecan), causing paralysis and embryonic arrest
(see Fire and Moerman, this
volume). The unc-52 pre-mRNA undergoes a complex array of alternative splicing events
involving exon skipping (Rogalski et al.
1993, 1995). Recent results
have demonstrated that mec-8 mutations alter the ratio of the unc-52 alternatively spliced products (Lundquist et al. 1996). Specifically, mec-8 mutations appear to reduce the level of exon skipping in the unc-52 pre-mRNA. In addition, mec-8 itself produces several RNA products, and mec-8 mutations affect levels of these products as well, suggesting that its
product is a splicing factor that autogenously regulates its own expression at
the level of splicing. This idea is supported by characterization of the MEC-8
protein, which has two RNA recognition motifs (RRMs). These highly conserved
sequences have been found previously in many proteins that bind RNA, especially
proteins that are involved in catalysis of splicing or alternative splice site
choice. Thus, it is reasonable to suppose that the MEC-8 protein is involved in
splice site choice in C. elegans and that, based on its
pleiotropic mutant phenotype, its targets may include pre-mRNAs from a wide
variety of genes.
Alternative cis-splicing is not the only means by which
C. elegans creates diverse products from single genes.
Several genes have been reported to have alternative first exons that may arise
through promoters contained within introns or by alternative
trans-splicing (see below) (e.g., unc-87 [Goetinck and Waterston
1994b]; unc-5 [Leung-Hagesteijn et al.
1992]; and sdc-1 [Nonet and Meyer 1991]). The unc-33 gene produces a shortened transcript due to either a second promoter
or trans-splicing following a long fourth intron (Li et al. 1992). In the case of the her-1 gene, the existence of a second promoter within the large second
intron has been demonstrated (Perry et al.
1993). So far, no function has been ascribed to this shorter
transcript. The tmy-1 gene, which encodes several different tropomyosin isoforms, contains a
promoter within the large third intron, and this promoter has a developmental
specificity different from the upstream promoter (Kagawa et al. 1995). It is also possible for genes to be
contained entirely within other genes: The spe-26 gene, which is itself composed of six exons, is contained entirely
within the first intron of a gene encoded on the other strand (Varkey et al. 1995).