Section II Cis-Splicing in Worms

A. Intron and Exon Length

C. elegans introns have some unusual properties (Blumenthal and Thomas 1988). First of all, they tend to be much shorter than vertebrate, or even yeast, introns. Figure 1 shows the size distribution of introns and exons in C. elegans. More than half of all C. elegans introns in the survey are shorter than 60 nucleotides, much too small to be spliced in vertebrates (Weiringa et al. 1984; Ogg et al. 1990). The inset in Figure 1 (top) shows the size distribution of the very short introns. It appears to represent a skewed distribution with a peak at 48 nucleotides but with greater than expected numbers of introns at 44 and 52 nucleotides. The reason for this uneven distribution is unknown, but it may reflect some constraint imposed by spliceosome formation. The shortest intron in the survey is 30 nucleotides long (in the α2[IV] collagen gene [Sibley et al. 1993]), but only 4 of 659 are shorter than 40 nucleotides. It is also worth noting that the C. elegans splicing machinery does retain the ability to splice very long introns. For example, the first intron of the unc-7 gene is 18 kb (Starich et al. 1993).

Figure 1. Intron and exon length distribution. Each bar represents the number of introns or exons in each size class. The survey includes 669 introns and 862 exons. (Inset) Expanded plot of the small introns.

C. elegans introns are not the shortest among free-living nematodes. Caenorhabditis briggsae introns tend to be somewhat shorter than those of C. elegans. For example, the introns of the C. elegans ges-1 gene are much longer than those of the C. briggsae homolog (Kennedy et al. 1993). An informal survey of 39 homologous introns from the two species shows that 17 are substantially longer in C. elegans, whereas only 4 are longer in C. briggsae (T. Blumenthal, unpubl.), and the 4 known introns of another rhabditid nematode, CEW1, are between 38 and 41 nucleotides long (Winter et al. 1996). Other organisms have also been reported to have short introns: Schizosaccharomyces pombe introns have a median length of 63 bp (Zhang and Marr 1994); Drosophila introns show a sharp distribution around a modal length of 63 bp (Mount et al. 1992); introns in the flatworm, Schistosoma mansoni, are 31−42 bp long (Craig et al. 1989); and Paramecium introns all seem to be between 20 and 33 nucleotides in length (Russell et al. 1994). This is in sharp contrast to the much longer introns of vertebrates, plants, and even Saccharomyces cerevisiae.

C. elegans exons are most frequently about 80−250 bp in length (Fig. 1, bottom), similar to vertebrate exons. However, they can be much longer. Several are larger than 4 kb, and the longest identified exon is 9 kb, in unc-22 (Benian et al. 1989). The survey also shows that the typical C. elegans gene has a relatively small number of introns and exons (Fig. 2; median = 3 introns/gene). The 100 fully characterized genes in the survey range in their exon content from 21% to 95% (median = 71%).
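As a rough illustration of the exon-content figure quoted above, the following Python sketch computes the fraction of a gene's length that lies in exons. It assumes that exon content means exon bases divided by exon plus intron bases, which the text does not define explicitly. The function name and the example lengths are hypothetical and are not taken from the survey.

```python
def exon_content(exon_lengths, intron_lengths):
    """Return the fraction of a gene's length that is exon sequence.

    Assumption: exon content = exon bases / (exon bases + intron bases).
    """
    exon_total = sum(exon_lengths)
    gene_total = exon_total + sum(intron_lengths)
    if gene_total == 0:
        raise ValueError("gene has zero length")
    return exon_total / gene_total


# Hypothetical gene with four exons and three short introns (lengths in bp).
example_exons = [120, 200, 150, 180]
example_introns = [48, 52, 60]
print(f"exon content: {exon_content(example_exons, example_introns):.0%}")
```

With short introns like these, exons dominate the gene, in line with the high median exon content reported for the surveyed genes.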

Figure 2. Intron/exon structure of C. elegans genes. Each bar represents the number of genes having the number of introns shown. The survey includes 117 genes.

B. Sequences That Signal Splicing

The vertebrate and C. elegans splice site consensus sequences are compared in Figure 3. C. elegans introns obey the GU-AG rule (with very rare use of GC as a 5′ splice site), and their 5′ splice site consensus is essentially the same as that for vertebrates. However, several interesting differences exist between C. elegans and other species. C. elegans introns have an extended, very highly conserved 3′ splice site consensus sequence, UUUUCAG, in which the U at the −5 position is almost perfectly conserved, whereas introns from most other organisms have only a combination of an upstream polypyrimidine tract and a YAG consensus at their 3′ boundaries. This suggests that the 3′ intron boundary may be a more important element in C. elegans intron recognition than in other organisms. This supposition has recently gained experimental support (see below). On the other hand, C. elegans introns have no obvious polypyrimidine tract (other than the 3′ splice site consensus), nor have they any convincing branch site consensus similar to consensus sequences of yeast or mammalian introns. Since branch sites mapped in other species generally occur at A residues, it is reasonable to suppose that branching in C. elegans also tends to occur at A residues. It has been observed that although the entire intron is rich in U residues, A residues are more frequent at positions −16, −17, and −18 from the 3′ splice site (T. Blumenthal and K. Steward, unpubl.), so it is reasonable to propose that branching may occur at these A residues.
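The sequence features described above can be checked mechanically. The Python sketch below is only an illustration: it tests a single intron sequence for the GU-AG rule, the rare GC 5′ splice site, the extended UUUUCAG 3′ consensus, the U at position −5, and A residues at positions −16 to −18. The function names and the example intron are hypothetical, and the sketch assumes the input is the intron alone, so that its last nucleotide is position −1.

```python
def classify_intron(intron_rna):
    """Summarize splice-site features of one intron (5' to 3', intron only)."""
    seq = intron_rna.upper().replace("T", "U")  # accept DNA or RNA letters
    return {
        "gu_ag": seq.startswith("GU") and seq.endswith("AG"),
        "gc_ag": seq.startswith("GC") and seq.endswith("AG"),
        "extended_3prime": seq.endswith("UUUUCAG"),
        # Position -5 counts back from the last intron nucleotide (-1).
        "u_at_minus_5": len(seq) >= 5 and seq[-5] == "U",
    }


def candidate_branch_positions(intron_rna, positions=(-18, -17, -16)):
    """Return those positions (relative to the 3' splice site) that hold an A."""
    seq = intron_rna.upper().replace("T", "U")
    return [p for p in positions if len(seq) >= -p and seq[p] == "A"]


# Hypothetical 30-nucleotide intron built to match the consensus.
example_intron = "GU" + "U" * 10 + "AAA" + "U" * 8 + "UUUUCAG"
print(classify_intron(example_intron))
print(candidate_branch_positions(example_intron))
```

An A at one of these positions is only a candidate: the text proposes, but does not demonstrate, that branching occurs there.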

Figure 3. C. elegans intron consensus sequences. Splicing occurs at the vertical lines. Positions are numbered with respect to the splice sites, and the percentage of each nucleotide at each position is given.

Although C. elegans introns conform to the GU-AG rule, the C. elegans splicing machinery can recognize variants of this sequence at both ends of the intron (Aroian et al. 1993). Several mutations in the let-23 and dpy-10 genes were caused by mutation of the AG at the 3′ splice site. Since these mutations resulted in only a partial loss-of-function phenotype, the RNA products of these mutant genes were examined, and at least a portion of the RNAs were found to be spliced at the normal site, now consisting of AA in place of AG. Furthermore, splicing occurred at other non-AG sites in the surrounding sequence, including UG, AU, and GG. In another study, it was discovered that mutations caused by insertion of the transposon Tc1 often produced weak phenotypes because the entire transposon was spliced out utilizing cryptic sites at the ends of the transposon or in surrounding DNA (Rushforth et al. 1993; Rushforth and Anderson 1996). In several instances, these sites did not conform to the GU-AG rule: UU and AU were used as 5′ splice sites, and UG, AC, GC, and GG were used as 3′ splice sites.

These results demonstrate that AG is not an obligatory component of the 3′ splice site recognition process in C. elegans, although the fact that it is so highly conserved indicates that it certainly contributes information to the process. To determine where the additional information for 3′ splice site choice is located, a mutational analysis of the site was performed (Zhang and Blumenthal 1996). When a 3′ splice site, UUUCAG/AAG, was mutated to UUUCAAAAA, splicing occurred 100% of the time at the second A in the string of five A residues, which is the position where splicing occurs in the wild-type sequence. When additional single-base changes were introduced into the UUUC portion of the sequence, splicing failed to occur, or it occurred at a different site. Thus, the highly conserved UUUC that precedes the AG also provides important information to the spliceosomal machinery. It may be that this short stretch of pyrimidines replaces the polypyrimidine tract (which is typically 15−20 nucleotides long in vertebrates). If so, one would expect that it would serve as the recognition site for U2AF. It is worth noting that U to C changes in this short sequence clearly reduce its effectiveness, suggesting that if this is a polypyrimidine tract, it is not just a random sequence of pyrimidines (the strong UUUC consensus also supports this observation). However, U and C do not have equivalent functions in vertebrate polypyrimidine tracts either (Roscigno et al. 1993). Furthermore, the optimal binding sequence for mammalian U2AF, which acts at the polypyrimidine tract, has recently been shown to require several contiguous U residues (Singh et al. 1995).

C. elegans introns are very A + U-rich (∼70%) compared with surrounding exons (∼54%) (Table 1), a property they share with introns of other invertebrates and plants (Goodall and Filipowicz 1989; Csank et al. 1990). In plants, A + U richness has been demonstrated to represent an important aspect of intron recognition: Insertion of an A + U-rich sequence within an exon, even without splice sites, has been shown to result in splicing of the inserted sequence utilizing fortuitous matches to the splice site consensus sequences present in the surrounding exon (Luehrsen and Walbot 1994). In worms, A + U richness has also been shown to be an important feature of 3′ splice site recognition. In one case where two alternative 3′ splice sites 20 nucleotides apart were available, the downstream site was always utilized; however, when the region between the two splice sites was made more G + C-rich, the upstream site was chosen, indicating that the border between A + U-rich and more G + C-rich RNA is one of the criteria C. elegans spliceosomes use in choosing splice sites (Conrad et al. 1993a). The data in Table 1 also suggest that recognition of short and long introns may not be identical processes. The short introns appear to be richer in U nucleotides than are large introns. Furthermore, an analysis of information content of short and long introns of C. elegans showed that they were significantly different at both splice boundaries (Fields 1990). The data in Table 1 also show that the regions of the exons specifying untranslated regions tend to be A + U-rich compared with protein-coding regions.
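The base-composition comparison above amounts to a simple calculation. The following Python sketch, offered as an illustration rather than as the method used to build Table 1, computes the A + U fraction of a sequence so that an intron can be compared with its flanking exons. The function name and the example sequences are hypothetical and were invented to show the contrast.

```python
def au_fraction(seq):
    """Return the fraction of A and U residues in an RNA (or DNA) sequence."""
    seq = seq.upper().replace("T", "U")  # treat DNA T as RNA U
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in ("A", "U") for base in seq) / len(seq)


# Hypothetical sequences, not taken from the survey.
example_exon = "AUGGCCAAGCUGGAUCCGAAG"
example_intron = "GUAAGUUUAAAUUAUUUUCAG"
print(f"exon   A+U: {au_fraction(example_exon):.0%}")
print(f"intron A+U: {au_fraction(example_intron):.0%}")
```

Applying the same function to successive windows along a pre-mRNA would show where composition shifts from A + U-rich to more G + C-rich sequence, the kind of border the text identifies as one criterion for splice site choice.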

Table 1. Base composition of C. elegans introns and exons.

In summary, intron recognition in C. elegans appears to involve recognition of the general boundaries of the intron based on A + U richness, recognition of the 5′ splice site (by U1 snRNP, as in all other organisms), and recognition of the 3′ splice site by currently undefined components of the spliceosome that interact with the UUUUCAG sequence. Since there is neither a good match to the branch-site consensus, which would base pair with U2 snRNA, nor a polypyrimidine tract that would interact with U2AF, it seems most likely that the information provided in vertebrates by those consensus sequences, resulting in U2AF-assisted binding of U2-snRNP to the branch site, is provided in C. elegans by the highly conserved UUUUCAG instead. Tight binding of U2AF (or some other component) could result in recruitment of U2 to a very loosely defined branch-site sequence and simultaneous definition of the 3′ splice site.

C. The Splicing Machinery

Those portions of the splicing machinery that have been characterized so far are very highly conserved throughout the eukaryotes. This includes C. elegans, which has been found to have all the spliceosomal snRNAs (Thomas et al. 1988). The sequences and lengths of these RNAs are quite similar to those of other animals. In particular, the sequence in U1 that interacts with the 5′ splice site and the sequence in U5 that interacts with both exon borders have both been perfectly conserved between vertebrates and C. elegans (Thomas et al. 1990). The sequence in U2 that base pairs with the branch site has been perfectly conserved in C. elegans, even though the branch-site sequence in C. elegans introns is so loosely defined that it is unrecognizable. When it was found that C. elegans introns were unusually short, it was thought that the snRNAs that catalyze their splicing might also be shorter than in other species. However, the C. elegans snRNAs are about the same lengths as their homologs in vertebrates. It may be that constraints imposed by snRNA size do not determine minimum or optimal intron length.

The genes that encode the snRNAs are also similar to those found in other species (Thomas et al. 1990). The spliceosomal snRNAs are each encoded by small multigene families of 6−12 members each. A few of the genes are clustered, but in general they are spread throughout the genome. These genes are transcribed in other animals, and probably C. elegans as well, by RNA polymerase II, except U6, which is transcribed by RNA polymerase III. The snRNA genes are each preceded at their 5′ ends by a very highly conserved sequence called the proximal sequence element (PSE). In vertebrates, the PSE has been shown to be the site where a transcriptional activation complex called SNAPc forms. The C. elegans PSE sequence has diverged totally from the vertebrate PSE, but it is very highly conserved between the different C. elegans snRNA genes and occupies the same position as the vertebrate PSE, and thus it is likely to perform the same function.

Several genes that encode protein components of the C. elegans splicing machinery have been identified. These genes encode U2AF (Zorio et al. 1997); PRP21, a U2-associated protein (Spikes et al. 1994); PRP8, a U5 snRNP-associated protein involved in 3′ splice site recognition (Hodges et al. 1995; Umen and Guthrie 1995); and several SR proteins (M.L. Morrison et al., pers. comm.). Presumably, C. elegans homologs to the many other proteins involved in splicing also exist, and many have been identified in the partial C. elegans genomic sequence and in sequenced cDNA clones.

The pre-mRNA for the U2AF large subunit is alternatively spliced in a potentially interesting way (see Zorio et al. 1997). In addition to the RNA that encodes the full-length protein, an alternatively spliced RNA results from choice of a more proximal 3′ splice site and removal of a short intron. This alternative splicing results in insertion of an approximately 300-bp exon, containing an in-frame stop codon, and a similar alternative splice occurs in the C. briggsae U2AF homolog. The C. briggsae gene is quite highly conserved in the exons surrounding the alternative splice, but it is essentially unrelated in the introns and in the exon inserted by the alternative splice. However, downstream from the alternatively used splice site, the C. elegans insert contains 10 good matches to the 3′ splice site consensus, UUUUCAG/(A or G), and the C. briggsae insert contains 18 such matches. There is no evidence that these sequences serve as splice sites, so they may instead play a part in regulating U2AF levels. This autogenous regulation might occur by binding of excess U2AF to the 3′ splice-site-like sequences in the pre-mRNA or in the alternative mRNA.
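Counting consensus matches of the kind reported above can be sketched in a few lines of Python. This example is an assumption-laden illustration: it treats a "good match" as an exact match to UUUUCAG followed by A or G, whereas the criterion actually used for the U2AF inserts is not stated in the text and may have tolerated mismatches. The function name and the test sequence are hypothetical.

```python
import re

# UUUUCAG followed by A or G; exact matching is an assumption of this sketch.
CONSENSUS = re.compile(r"UUUUCAG[AG]")


def consensus_match_positions(rna):
    """Return 0-based start positions of exact UUUUCAG(A/G) matches."""
    seq = rna.upper().replace("T", "U")  # accept DNA or RNA letters
    return [m.start() for m in CONSENSUS.finditer(seq)]


# Hypothetical sequence: two exact matches, plus one UUUUCAG followed by C.
example_rna = "ACG" + "UUUUCAGA" + "UUCC" + "UUUUCAGG" + "CA" + "UUUUCAGC"
print(consensus_match_positions(example_rna))
```

Because the pattern cannot overlap itself, a non-overlapping scan with `finditer` finds every exact occurrence.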

D. Alternative Splicing

Alternative splicing is a frequently used mechanism for producing multiple mRNAs and protein products from a single gene. In C. elegans, there are numerous examples of alternative splicing, in which alternative splice sites are chosen (e.g., xol-1 [Rhind et al. 1995], flp-1 [Rosoff et al. 1992], hlh-1 [Krause et al. 1990], IFa1 [Dodemont et al. 1994]), entire exons are skipped (e.g., lin-14 [Wightman et al. 1991], tmy-1 [Kagawa et al. 1995]), or groups of exons are skipped (e.g., unc-52 [Rogalski et al. 1993, 1995] and bli-4 [Thacker et al. 1995]). One particularly interesting example is the case of unc-17, which encodes a synaptic vesicle-associated acetylcholine transporter, and cha-1, which encodes choline acetyltransferase (see Rand and Nonet, this volume). These two independently identified genes are encoded by a single polycistronic cluster (Alfonso et al. 1994b). They share a first noncoding exon, but the coding regions of the two genes are entirely separate (Fig. 4). The coding exons of unc-17 are all contained within the unusually long first intron of cha-1. Hence, a given transcript can encode either the unc-17 product or the cha-1 product, but never both. This appears to be a unique use of alternative splicing to ensure that the products of the two genes are not produced simultaneously. This kind of arrangement also has been reported for the unc-60 gene, which encodes two separate, but homologous, actin-depolymerizing proteins (McKim et al. 1994). The two share a first exon, which encodes only the methionine at which translation initiates.

Figure 4. The unc-17/cha-1 polycistronic cluster. The two genes share a noncoding first exon, and the two products ...

The mechanisms by which these documented cases of alternative splicing are regulated remain mysterious. In one case, however, a trans-acting factor that regulates a variety of alternative splicing events has been identified genetically (Lundquist and Herman 1994). The mec-8 gene was originally identified by mutations that cause defects in mechanosensation. Strong mutations at this locus resulted in a variety of seemingly unrelated defects including larval lethality. These mutations were shown to enhance the phenotype of mutations in the unc-52 gene (which encodes perlecan), causing paralysis and embryonic arrest (see Fire and Moerman, this volume). The unc-52 pre-mRNA undergoes a complex array of alternative splicing events involving exon skipping (Rogalski et al. 1993, 1995). Recent results have demonstrated that mec-8 mutations alter the ratio of the unc-52 alternatively spliced products (Lundquist et al. 1996). Specifically, mec-8 mutations appear to reduce the level of exon skipping in the unc-52 pre-mRNA. In addition, mec-8 itself produces several RNA products, and mec-8 mutations affect levels of these products as well, suggesting that its product is a splicing factor that autogenously regulates its own expression at the level of splicing. This idea is supported by characterization of the MEC-8 protein, which has two RNA recognition motifs (RRMs). These highly conserved sequences have been found previously in many proteins that bind RNA, especially proteins that are involved in catalysis of splicing or alternative splice site choice. Thus, it is reasonable to suppose that the MEC-8 protein is involved in splice site choice in C. elegans and that, based on its pleiotropic mutant phenotype, its targets may include pre-mRNAs from a wide variety of genes.

Alternative cis-splicing is not the only means by which C. elegans creates diverse products from single genes. Several genes have been reported to have alternative first exons that may arise through promoters contained within introns or by alternative trans-splicing (see below) (e.g., unc-87 [Goetinck and Waterston 1994b]; unc-5 [Leung-Hagesteijn et al. 1992]; and sdc-1 [Nonet and Meyer 1991]). The unc-33 gene produces a shortened transcript due to either a second promoter or trans-splicing following a long fourth intron (Li et al. 1992). In the case of the her-1 gene, the existence of a second promoter within the large second intron has been demonstrated (Perry et al. 1993). So far, no function has been ascribed to this shorter transcript. The tmy-1 gene, which encodes several different tropomyosin isoforms, contains a promoter within the large third intron, and this promoter has a developmental specificity different from the upstream promoter (Kagawa et al. 1995). It is also possible for genes to be contained entirely within other genes: The spe-26 gene, which is itself composed of six exons, is contained entirely within the first intron of a gene encoded on the other strand (Varkey et al. 1995).