Copyright © 2008 Murray et al.; licensee BioMed Central Ltd.

Identification of motifs that function in the splicing of non-canonical introns

Institute of Molecular Biology and Department of Chemistry, University of Oregon, Eugene, Oregon, USA

Corresponding author. Jill I Murray: jill.i.murray/at/gmail.com; Rodger B Voelker: rvoelker/at/molbio.uoregon.edu; Kristy L Henscheid: henscheid/at/molbio.uoregon.edu; M Bryan Warf: mwarf/at/molbio.uoregon.edu; J Andrew Berglund: aberglund/at/molbio.uoregon.edu

Received September 20, 2007; Revised December 27, 2007; Accepted June 12, 2008.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

While the current model of pre-mRNA splicing is based on the recognition of four canonical intronic motifs (5' splice site, branchpoint sequence, polypyrimidine (PY) tract and 3' splice site), it is becoming increasingly clear that splicing is regulated by both canonical and non-canonical splicing signals located in the RNA sequence of introns and exons that act to recruit the spliceosome and associated splicing factors. The diversity of human intronic sequences suggests the existence of novel recognition pathways for non-canonical introns. This study addresses the recognition and splicing of human introns that lack a canonical PY tract. The PY tract is a uridine-rich region at the 3' end of introns that acts as a binding site for U2AF65, a key factor in splicing machinery recruitment.

Results

Human introns were classified computationally into low- and high-scoring PY tracts by scoring the likely U2AF65 binding site strength.
Biochemical studies confirmed that low-scoring PY tracts are weak U2AF65 binding sites while high-scoring PY tracts are strong U2AF65 binding sites. A large population of human introns contains weak PY tracts. Computational analysis revealed many families of motifs, including C-rich and G-rich motifs, that are enriched upstream of weak PY tracts. In vivo splicing studies show that C-rich and G-rich motifs function as intronic splicing enhancers in a combinatorial manner to compensate for weak PY tracts.

Conclusion

The enrichment of specific intronic splicing enhancers upstream of weak PY tracts suggests that a novel mechanism for intron recognition exists, which compensates for a weakened canonical pre-mRNA splicing motif.

Background

Pre-mRNA splicing is an essential processing step where non-coding intervening sequences (introns) are removed from the initial RNA transcript and coding sequences (exons) are ligated together to produce mature mRNA. Pre-mRNA splicing is mediated by the spliceosome, a multi-component complex composed of small nuclear ribonucleoproteins (snRNPs) and over 100 accessory proteins [1]. The splicing machinery assembles on the pre-mRNA in a highly regulated fashion to carry out the process of removing the intron and ligating the two adjoining exons [2,3]. Pre-mRNA splicing relies on the accurate recognition of the splice junctions that define introns and exons. This is underlined by the observation that incorrect pre-mRNA splicing is a major contributor to human genetic diseases [4-6]. Not only is splicing a crucial step in the accurate transfer of genetic information from DNA to RNA to protein, it is also a step that allows for regulation of gene expression as well as increased protein diversity through alternative splicing decisions [7].
Several canonical intronic sequences define an intron and recruit the spliceosome to the pre-mRNA: the 5' splice site (5'ss, AG/GURAGU), the branchpoint sequence (CURAY), the polypyrimidine (PY) tract (a run of pyrimidines located between the 3' splice site and the branchpoint), and the 3' splice site (3'ss, YAG). These four canonical intronic sequences are recognized by specific components of the spliceosome or associated splicing factors. In the initial stage of splicing, when the decision to remove an intron is made, the U1 snRNP recognizes the 5'ss [8,9], splicing factor 1 (SF1, also known as BBP) recognizes the branchpoint sequence [10,11], and U2AF65 (U2AF (U2 snRNP auxiliary factor), 65 kDa subunit) recognizes the PY tract [12,13] while its heterodimer partner U2AF35 (U2AF 35 kDa subunit) recognizes the 3'ss [14-16]. After these initial recognition events, U2AF65 interacts with the U2 snRNP in order to recruit it to the branchpoint sequence, where it displaces SF1 [17,18]. Although canonical splice elements are located within the intron, the exon is generally considered to be the unit that is first recognized and defined by the spliceosome. This is known as exon definition and is thought to be a dominant mode of recognition in human genes where the exons are small and the introns are large [19]. In the exon definition model, the exon and flanking upstream and downstream splice junctions are recognized and bridging interactions across the exon are important for accurate splicing. Conversely, according to the intron definition model, the splice junctions within the intron are recognized and bridging interactions across the intron mediate accurate splicing [19,20]. Intron definition is proposed to be the dominant mode of recognition for small introns [19]. It has become clear that the four canonical splice elements do not contain adequate sequence information to ensure accurate splicing [3].
Additional cis-elements appear to be essential for accurate identification of many splice sites, and various cis-splicing elements have been identified in both exonic and intronic regions. Based upon their locations and effects upon splicing, these have been categorized as exonic and intronic splicing enhancers (ESEs and ISEs, respectively) or exonic and intronic splicing silencers (ESSs and ISSs, respectively) (for reviews see [21-26]). We are interested in the question of how introns that lack a canonical splice element are recognized and spliced. We have focused on introns that lack a canonical PY tract. In humans, U2AF65 binding to the PY tract is believed to be critical for intron recognition and splicing. In vitro selection studies have determined that U2AF65 binds with highest affinity to continuous runs of uridines interrupted by cytidines [27]. This agrees with the general observation that good PY tracts contain runs of uridines. We have observed that many human introns lack these canonical PY tracts. This leads to the question of how introns lacking strong U2AF65 binding sites are recognized and are able to recruit the U2 snRNP. One model predicts that U2AF65 is not essential for the splicing of these introns. Several human introns have been shown to be spliced when U2AF65 levels are significantly reduced by RNA interference [28]. U2AF65 may not be required because another splicing factor is functioning to recognize the PY tract region. For example, PUF60 has been shown to substitute for U2AF65 in vitro for some substrates [29]. There is the potential that other, yet unidentified, U2AF65-like proteins may function to promote 3'ss selection of non-canonical PY tracts. In a second model, U2AF65 is required for splicing but strong U2AF65-PY tract interactions are not. It has recently been observed in fission yeast that introns lacking PY tracts require U2AF for splicing in vivo [30]. 
Alternative pathways for U2AF65 recruitment may function in introns lacking strong PY tracts. For example, additional cis-elements present in the intron could alleviate the need for strong U2AF65-RNA interactions. These cis-elements could include the branchpoint sequence and 3'ss, which recruit SF1 and U2AF35, respectively, both of which can bind U2AF65 cooperatively through protein-protein interactions [11,31,32]. Auxiliary cis-elements such as ESEs and ISEs could function in the recognition of introns containing weak PY tracts. Previous studies have indicated that ESEs located in the downstream exon are able to compensate for weak PY tracts [33,34]. In this model, the ESEs are recognized by SR (serine/arginine-rich) proteins that interact with the U2AF65/35 heterodimer to help recruit U2AF65 to the 3' end of the intron [34-36]. We propose that a similar mechanism exists where ISEs in the region upstream of the PY tract function to compensate for weak U2AF65 binding by helping to recruit either U2AF65 or U2AF65-recruiting proteins or bypassing the need for U2AF65 in recruiting the U2 snRNP to the intron. We have used a computational approach to classify human introns in terms of their U2AF65 binding site strength. We conclude that a significant population of human introns does not contain a strong U2AF65 binding site in the PY tract region. This classification of human PY tract strength enabled us to computationally identify intronic motifs over-represented upstream of weak PY tracts. We propose that these over-represented motifs are putative ISEs that are important for the splicing of introns containing weak PY tracts. LCAT (lecithin cholesterol acyltransferase) intron 4 is a short (83 nucleotide) constitutively spliced intron with a weak PY tract. 
Mutation of the branchpoint sequence U to C (CUGAC) is known to result in intron retention, causing familial LCAT deficiency (complete deficiency) or fish-eye disease (partial deficiency), which can lead to premature atherosclerosis [37]. Intron retention, rather than skipping, suggests an intron definition model of recognition [19]. Therefore, we expected that ISEs might be involved in the recognition of this intron. We present results showing that G-rich and C-rich motifs, similar to those predicted by our computational approach to be enriched upstream of weak PY tracts, are ISEs important for the splicing of LCAT intron 4, which has a weak PY tract. Furthermore, we have observed that the G-rich and C-rich ISEs function in a combinatorial manner to promote the recognition of a weak PY tract-containing intron. Finally, we show another example of an intron, GNPTG (N-acetylglucosamine-1-phosphotransferase gamma subunit) intron 2, in which C-rich ISEs again appear to be compensating for a weak PY tract.

Results

Computational analysis of human intron PY tracts using a U2AF65 binding site scoring method

U2AF65 plays an important role during splicing and is known to bind to the PY tract region located between the branchpoint sequence and the acceptor splice junction [38]. Visual inspection of human introns reveals that, although the PY tract region is enriched in uridines in general, there is a great deal of sequence variation between introns. This degeneracy, at least in part, appears to reflect the low RNA site specificity that U2AF65 displays compared to other RNA binding proteins that evolved to recognize highly specific targets. U2AF65 binds with high affinity to contiguous runs of uridines but appears to tolerate moderate interruptions of other nucleotides [27,39-41]. Despite the ability of U2AF65 to bind to degenerate sites, an effective binding site must still be composed primarily of uridines [40,41].
However, many thousands of human introns contain PY tracts that do not contain any sequences that are likely to be effective binding sites (shown below). Many of these PY tracts either contain contiguous runs of cytidines or contain numerous purines, neither of which is likely to represent a binding site for U2AF65 [40,41]. Therefore, it is likely that individual human intronic PY tracts possess a wide range of affinities towards U2AF65, and that many may possess only weak binding sites for it. It is possible that additional cis-sequence elements augment the role of the PY tract during splicing, and that such elements play crucial roles in splicing in the absence of a strong U2AF65 binding site. Many human introns have been shown to be enriched in motifs containing GGG in the region upstream of the PY tract [42,43] (Figure 1a).
In order to carry out this analysis, we first needed to correlate the composition of the PY tract of introns with likely affinities towards U2AF65. Several theoretical models have been presented that describe the relationship between binding site composition and the ΔG of binding between nucleic acids and nucleic acid binding proteins [44,45]. These models require the use of a positional frequency model representing the preferred binding site. In vitro selection (SELEX) experiments using human U2AF65 did not reveal a well defined consensus motif shared by high affinity RNAs [27,39]. Several computational methods have been developed to define a degenerate consensus motif from a population of sequences that are thought to contain a common, but unknown, motif [46,47]. Though such methods have proven useful, each has its own weaknesses, and all such predictive methods introduce an added level of uncertainty. We decided to develop a computational method to predict the affinity between a short RNA sequence and U2AF65 that is independent of knowledge of a particular consensus binding motif. We refer to this score as an S65 score. The S65 score, for a given intron, is the average degree to which all pentamers (using a sliding window) found in the PY tract region (-30 to -3 relative to the acceptor splice-junction) are themselves enriched within the SELEX derived sequences (see Materials and methods for a complete description). For this analysis, the PY tract was defined as the region from -30 to -3 (relative to the acceptor splice junction). This region is highly enriched in the pentamers that are most abundant within the U2AF65 selected sequences (Figure 1a). The S65 scores for the SELEX RNAs appear to be normally distributed with a mean of 1.5 (Figure 1b). We chose to classify PY tracts that score below the median of 0.811 as 'weak' PY tracts and those above 0.811 as 'strong' PY tracts or likely to have high affinity U2AF65 binding sites.
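The sliding-window idea behind the S65 score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the enrichment of each pentamer is assumed here to be its frequency in the SELEX pool divided by a uniform expectation of 1/4^5, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_frequencies(seqs, k=5):
    """Relative frequency of each k-mer in a pool of sequences (sliding window)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def s65_like_score(py_tract, selex_freqs, k=5):
    """Average enrichment of all k-mers in py_tract.

    Assumption: enrichment = SELEX frequency / uniform expectation (1 / 4**k).
    The published normalization may differ.
    """
    windows = [py_tract[i:i + k] for i in range(len(py_tract) - k + 1)]
    if not windows:
        raise ValueError("sequence is shorter than k")
    expected = 1 / 4 ** k
    return sum(selex_freqs.get(w, 0.0) / expected for w in windows) / len(windows)


# Toy SELEX pool and two hypothetical 28-nt PY tract regions (-30 to -3).
selex_pool = ["UUUUUUCUUUUUU", "UUUCUUUUUCUUU"]
freqs = kmer_frequencies(selex_pool)
u_rich = s65_like_score("UUUUUCUUUUUUUUCUUUUUUUUUUUUU", freqs)
c_rich = s65_like_score("CCCCGCCCCACCCCGCCCCAGCCCCGCC", freqs)
print(u_rich > c_rich)
```

With this toy pool, the uridine-rich tract scores higher than the cytidine-rich tract because none of the latter's pentamers occur in the pool.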
Using this designation, only a single SELEX-derived sequence scores as 'weak'. We are therefore asking whether there are statistically significant differences in the composition of the -80 to -30 region of two types of introns: ones that contain a PY tract with affinities similar to those derived using SELEX, and those with PY tracts with lower affinities.

Binding of U2AF65 to low-scoring PY tracts

In order to assess the relationship between the S65 score and observed U2AF65 binding affinities, we evaluated the binding of recombinant human U2AF65 to several human PY tracts of varying S65 scores using gel-shift mobility assays (Figure 2).
We expected the MBNL1 intron 6 PY tract to represent the weakest U2AF65 binding target and observed no detectable levels of U2AF65 binding at the protein concentrations tested (Figure 2). Overall, there is good agreement between the observed binding affinities for U2AF65 and the predicted affinities based upon the S65 score. Plotting the observed Kd values versus the predicted S65 score revealed that the ln of the Kd appears to be linearly related to the S65 score (Figure 2c).

Introns containing weak PY tracts are enriched in specific motifs upstream of the PY tract

It is possible that introns containing weak U2AF65 binding sites might be enriched in specific sequences that can compensate for the lack of a well-defined PY tract. In order to identify such motifs, we first characterized the relative enrichment of all 4-7 nucleotide n-mers in the 50 nucleotide region from -80 to -30 (relative to the splice-junction) for introns with PY tracts categorized as 'weak' (S65 scores less than 0.811; see Materials and methods) relative to the set of all introns. We were specifically interested in identifying sequences located in the region upstream of the branchpoint itself. Since most branchpoints are located between -17 and -30 (Figure 1a),

Human introns have been shown to fall into two classes based upon GC or AT content [50]. In order to be sure that we were not merely measuring compositional biases between AT-rich and GC-rich introns, we classified introns according to the GC content of the last 100 bases. Introns with greater than 50% GC content were categorized as GC-rich while those with less than 50% GC were categorized as AT-rich. As measured using our criteria, 37% of AT-rich introns were found to have 'weak' PY tracts, and 72% of GC-rich introns were determined to have 'weak' PY tracts.
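The GC/AT classification can be sketched in a few lines of Python. This is an illustrative sketch only. The function name is hypothetical, and it follows the Materials and methods convention (last 100 bases, or the last half of introns shorter than 200 bases, with 50% or more GC counted as GC-rich).

```python
def gc_class(intron, window=100):
    """Classify an intron as 'GC-rich' or 'AT-rich' from the GC content of its 3' end."""
    seq = intron.upper()
    # Use the last `window` bases, or the last half of introns shorter than 2 * window.
    n = window if len(seq) >= 2 * window else len(seq) // 2
    if n == 0:
        raise ValueError("intron is too short to classify")
    tail = seq[-n:]
    gc_fraction = sum(base in "GC" for base in tail) / len(tail)
    # Assumption: exactly 50% GC is treated as GC-rich, as in Materials and methods.
    return "GC-rich" if gc_fraction >= 0.5 else "AT-rich"


print(gc_class("AT" * 150))              # AT-only sequence
print(gc_class("A" * 200 + "GC" * 50))   # last 100 bases are all G or C
```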
Enrichment of n-mers in the -80 to -30 region for introns with weak PY tracts versus all GC or AT-rich introns was determined (see Materials and methods). The entire list of enriched n-mers used in this study is available in Additional data files 2 and 3. According to this analysis, 99 n-mers were determined to be significantly enriched (P < 0.01) in the AT-rich class, and 349 n-mers were determined to be significantly enriched in the GC-rich class. For comparison, we drew random samples of the same size as the corresponding weak PY tract class for both the AT-rich and GC-rich introns, and determined enrichment using the same method as above. The average number of n-mers (four to seven nucleotides) that were determined to be significantly enriched in the randomly drawn samples was ten for the AT-rich and zero for the GC-rich class. Therefore, the enrichment measured appears to be strongly correlated with the composition of the PY tract as measured by the S65 score. It has been proposed that signals that govern splicing of shorter (<200 nucleotides) introns may differ from those governing splicing of longer introns [51]. Therefore, we also evaluated short (<200 nucleotides) and long (≥ 200 nucleotides) AT-rich and GC-rich introns as independent classes. We found that enrichment was similar for both short and long GC-rich introns as evidenced by the observation that the enrichment score for n-mers correlated between these groups (Additional data file 6a). Meanwhile, little correlation was seen between the enrichment scores for long versus short AT-rich introns (Additional data file 6b). This is likely due to the fact that few n-mers were actually determined to be significantly enriched in the short AT-rich population (Additional data file 6b, and data not shown).
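The logic of testing an n-mer for enrichment against a background expectation can be sketched as follows. The study uses the binomial confidence interval method of Voelker and Berglund [52], whose details are not reproduced here. The exact one-sided binomial tail below is a simpler stand-in, and the function names and counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p). Suitable for small n only."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_enriched(weak_count, weak_total, bg_count, bg_total, alpha=0.01):
    """Flag an n-mer whose count in the weak-PY-tract set is improbably high
    given its background frequency (stand-in test, not the method of [52])."""
    p_background = bg_count / bg_total
    return binomial_upper_tail(weak_count, weak_total, p_background) < alpha


# Hypothetical counts: the n-mer fills 10% of background windows.
print(is_enriched(30, 100, 100, 1000))  # 30% of windows in the weak set
print(is_enriched(10, 100, 100, 1000))  # 10% of windows in the weak set
```

Repeating the same test on randomly drawn samples of the same size, as described above, gives an empirical estimate of how many n-mers pass the threshold by chance.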
Together, these data suggest that the compositional biases seen in the region upstream of the PY tract correlate with the potential for U2AF65 binding, especially for GC-rich introns, and that the bias is similar for both long and short introns. To determine motifs, the enriched n-mers were clustered using the graph clustering method and software presented by Voelker and Berglund [52]. Clustering of the n-mers derived from the GC-rich introns yielded 25 clusters (Additional data file 4). These were manually separated into eight groups of compositionally similar motifs (Figure 3a).
Motifs containing three to four contiguous guanosines are greatly enriched upstream of weak PY tracts for both AT-rich and GC-rich introns (Figure 3). In addition, we observed that C-rich motifs (containing three to four contiguous cytidines) are enriched upstream of weak GC-rich PY tracts (Figure 3). We also observed that AT-rich introns with weak PY tracts were enriched in motifs similar to a motif recognized by the protein CUG-BP1 (Figure 3). These analyses demonstrate that certain motifs are statistically over-represented upstream of human introns containing weak PY tracts. We also wanted to assess how prevalent these motifs are among introns in general, and also determine the relative level of enrichment between introns with strong versus weak U2AF65 binding sites. Therefore, for each intron, we determined the percentage of the region from -80 to -30 that matched one or more of the n-mers determined to be enriched in introns with weak PY tracts relative to those with strong PY tracts (see above). We refer to this value as the percent coverage. As an example, 80% coverage indicates that 80% of the -80 to -30 region (or 40 of the 50 nucleotides) matches one or more of the enriched n-mers. This analysis (Additional data file 7) revealed that most introns have at least one match to an enriched n-mer. This is not surprising considering that the n-mers are only four to seven nucleotides in length, and, therefore, are expected to occur by chance with fairly high frequency. However, this analysis also revealed that introns with weak PY tracts are likely to have a greater coverage than introns with strong PY tracts. This is especially true of the GC-rich class of introns. For instance, while only 10% of GC-rich introns with strong PY tracts have 80-100% coverage, 23% of introns with weak PY tracts have this level of coverage (Additional data file 7).
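The percent coverage defined above can be computed with a short Python sketch. The function name and the toy region are hypothetical. Overlapping matches are allowed, and each nucleotide is counted once no matter how many n-mers cover it.

```python
def percent_coverage(region, enriched_nmers):
    """Percentage of positions in `region` covered by at least one enriched n-mer."""
    if not region:
        return 0.0
    covered = [False] * len(region)
    for nmer in enriched_nmers:
        start = region.find(nmer)
        while start != -1:
            for j in range(start, start + len(nmer)):
                covered[j] = True
            # Step by one so that overlapping occurrences are also found.
            start = region.find(nmer, start + 1)
    return 100.0 * sum(covered) / len(region)


# Toy 50-nt region with one GGGG and one CCCC match (8 of 50 nucleotides).
region = "GGGG" + "A" * 6 + "CCCC" + "A" * 36
print(percent_coverage(region, ["GGGG", "CCCC"]))
```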
A smaller difference in coverage is seen between AT-rich introns with strong and weak PY tracts; however, the overall trend is the same (Additional data file 7). In both cases, the enriched n-mers tend to make up a greater portion of the -80 to -30 region for introns with weak PY tracts. Together, these observations indicate that the sequences represented by the enriched n-mers are rather common but they tend to cluster in introns with weak PY tracts.

C-rich and G-rich motifs act as ISEs in an intron containing a weak polypyrimidine tract

LCAT intron 4 contains both C-rich and G-rich motifs upstream of the PY tract; these are similar to those we identified computationally and are also highly conserved. The PY tract of LCAT intron 4 is a low-scoring PY tract and is not well conserved. To investigate the role of C-rich and G-rich motifs present in LCAT intron 4, we used a mini-gene system. We created a mini-gene that contains the last 50 nucleotides of LCAT intron 3, LCAT exon 4, LCAT intron 4, LCAT exon 5 and the first 50 nucleotides of LCAT intron 5. We included the downstream and upstream flanking introns in order to allow exon definition to occur, although short introns are often observed to function by intron definition [19].

Mutation of the G-rich motifs

We examined the role of two G-rich motifs (G-rich motif (GRM)1 and GRM2) present upstream of the PY tract of LCAT intron 4 (Figure 4a).
Mutation of the C-rich motifs

To determine whether the C-rich motifs function as ISEs, we mutated two C-rich motifs: C-rich motif (CRM)1 and CRM2 (Figure 5a),

Cumulative mutation of the G-rich and C-rich motifs

We hypothesized that the G-rich motifs and C-rich motifs could be functioning together in the recognition of LCAT intron 4. We have observed that there are many examples of introns where the G-rich and C-rich motifs are both present (data not shown). Mutation of both GRM1 and CRM1 (MUT 24, Figure 6a)

G-rich and C-rich motifs can functionally replace one another as ISEs

We examined whether the C-rich motifs could function in the place of the G-rich motifs. Mutation of GRM1 to CCC (MUT 27, Figure 7a)

Strengthening the PY tract eliminates the role of the C-rich motifs

We next investigated the role of the PY tract in LCAT intron 4 splicing. We mutated the PY tract to determine whether the C-rich sequences in the PY tract were also being recognized. Mutation of a C-rich sequence in the PY tract (CRM3, MUT 16B, Figure 8a)

C-rich motifs are ISEs in an additional intron containing a weak PY tract

GNPTG intron 2 is an alternatively spliced (intron retention) short intron containing multiple C-rich motifs upstream of a low-scoring PY tract (Figure 9a,
Discussion

The present model of pre-mRNA splicing is based on the recognition of the four canonical intronic motifs (5'ss, branchpoint sequence, PY tract and 3'ss) [3]. However, many introns lack one or more of these motifs and yet they are spliced. The diversity of human intronic sequences suggests that novel recognition pathways exist for non-canonical introns. Using an experimentally validated computational approach, introns lacking a canonical PY tract were isolated and analyzed to identify putative ISEs that functionally compensate in splicing when the PY tract is weak.

U2AF65 binding to PY tracts confirms the U2AF65 SELEX scoring system

Our U2AF65 binding studies using various human intron PY tracts (Figure 2)

For this analysis we also assume that the PY tract is located in the last 30 nucleotides of the intron. While this is a fair assumption for the vast majority of human introns, there are examples of introns where the PY tract and branchpoint sequence are located a further distance from the 3'ss AG [48,62-64]. Some of the human introns that score as having low-scoring PY tracts may actually have high-scoring PY tracts that are distally located. Although there are caveats to our scoring system, the S65 score generally distinguishes low and high affinity U2AF65 binding sites, allowing us to ask questions about the population of human introns with low affinity U2AF65 binding sites.

Intronic motifs enriched upstream of weak PY tracts

We have identified families of motifs that are over-represented upstream of weak PY tracts but not upstream of strong PY tracts (Figure 3). The experimental work presented here has focused on two relatively short introns, but our computational analysis found that the same families of motifs were over-represented in both short and long human introns (Additional data file 6).
Although LCAT intron 4 is constitutively spliced, expressed sequence tag data suggest that GNPTG intron 2 is alternatively spliced, with some expressed sequence tags containing a retained intron 2. We expect to find examples where these motifs may play important roles in both constitutive and alternative splicing for both short and long introns.

Interplay of G-rich and C-rich ISEs in the splicing of LCAT intron 4

G-rich motifs have been shown to be enriched in short mammalian introns [20,65]. The G-rich motif GRM1 is the strongest ISE we have observed in LCAT intron 4 (Figure 4). Our results also show that C-rich motifs can act as ISEs like the G-rich motifs, but that the C-rich motifs may play more of an ancillary role to the G-rich motifs, at least in the case of LCAT intron 4 (Figure 5). Interestingly, the C-rich and G-rich motifs together function additively (Figure 6). We have also observed that the C-rich motifs can only partially compensate in the place of G-rich motifs in LCAT intron 4 splicing, while G-rich motifs appear to fully compensate in the place of C-rich motifs (Figure 7). An examination of the primary sequence and predicted secondary structure (using mfold [68]) of LCAT intron 4 suggests that the intron could be folding into a stem-loop structure with the G-rich and C-rich sequences base-pairing (data not shown). While this is an intriguing model, when the C-rich motifs are replaced with G-rich motifs (a mutation that would abolish stem-loop formation; MUT 29, Figure 7),

Candidate protein factors for the G-rich and C-rich ISEs

There are multiple candidate proteins that could be recognizing the G-rich and C-rich motifs present in LCAT intron 4. A G-rich motif trans-factor, hnRNP H, has been identified and shown to bind G-rich sequences and regulate splicing both positively and negatively [54-56].
Several additional hnRNP proteins, including hnRNPs A1, A2, and F, have also been shown to bind G-rich RNA sequences [54,57-59]. An alternative model for G-rich sequence recognition involves RNA-RNA interactions to promote U1 snRNP binding. G-triplets near the 5'ss have been shown to bind the U1 snRNP by interacting with the U1 snRNA and this interaction was shown to be important for human α-globin splicing in vivo [69]. hnRNP K and the α-CP proteins are the major poly-C-binding proteins identified in mammalian cells [70,71]. Both hnRNP K and several α-CP isoforms have been implicated in post-transcriptional control [70]. There have also been two studies that have implicated these proteins in splicing. hnRNP K was shown to enhance the splicing of a chicken β-tropomyosin intron by binding a C-rich motif near the 5'ss [66]. A recent study has shown that α-CP2 binds a C-rich patch upstream of a weak PY tract in the human α-globin intron 1 transcript and inhibits splicing of this intron in vitro [67]. This is in contrast to our results with LCAT intron 4 where the C-rich motifs function as splicing enhancers, not silencers. Several cis-elements, including G-rich ISEs, have been shown to act as both splicing enhancers and silencers [54-56,72]. The C-rich motifs and their trans-factor may also possess the flexibility to function as silencers and enhancers.

Role of the PY tract in splicing

Our results suggest a model where ISEs present upstream of a weak PY tract compensate for a weakened U2AF65-RNA interaction (Figure 10).
An alternative model to weakened U2AF65-RNA interactions is that a splicing factor other than U2AF65 binds the weakened PY tracts. Recent work has shown that PUF60 plays a role in splicing by interacting with the PY tract [29]. Our observation that many of the weak PY tracts are particularly C-rich leads us to propose that a C-rich binding protein may function in this region. When we mutated a C-rich motif in the LCAT intron 4 PY tract we observed a small effect on splicing, suggesting that the C-rich motifs in the PY tract itself are recognized by a trans-factor (Figure 8).

Conclusion

Novel mechanisms of intron recognition promote splicing of introns with non-canonical PY tracts

The pool of introns containing low-scoring U2AF65 binding sites represents a significant class of human introns lacking a canonical splicing element. The ISEs identified and validated here suggest that novel mechanisms exist in the cell for coping with weakened U2AF65-RNA interactions. Specifically, we have observed that the interplay of multiple cis-elements, in this case the G-rich and C-rich motifs, appears to be crucial for the recognition of non-canonical introns. In the future we plan to explore additional ISEs identified by this study to gain a broader picture of how the splicing machinery functions in the recognition of introns with weak PY tracts. While we have focused our attention here on a single key canonical splicing element, the PY tract, we plan on extending our analyses and expect to find that similar strategies exist for the recognition of other classes of non-canonical pre-mRNAs.

Materials and methods

Computational prediction of strength of U2AF65 binding sites in PY tracts

The March 2006 human reference sequence (NCBI Build 36.1) in conjunction with the UCSC KnownGenes (hg18) annotation database (Release 8 April 2007) [73] was used to create a non-redundant database of human intronic sequences.
After excluding annotated introns that did not begin with G [T/C] [A/G] and end with AG, or were less than 60 bases in length, we were left with 171,475 unique acceptor ends. In order to score PY tracts according to their likely affinity towards U2AF65, we developed a score that reflects the level of similarity of the PY tract sequence to the sequences that were enriched in RNAs derived from in vitro SELEX experiments using human U2AF65. In particular, if we let the frequency of occurrence of an n-mer n of length k within the SELEX sequences be represented by fn, then for a subject sequence of length L the S65 score is determined according to:

where fn represents the frequency (within the SELEX population) of the n-mer found at position i in the subject sequence. The term

Identification of intronic motifs over-represented upstream of weak PY tracts

In order to avoid biases due to long interspersed repetitive elements (LINEs) and short interspersed repetitive elements (SINEs), repetitive elements in the intronic sequence database (obtained as described above) were masked using the masking coordinates associated with the UCSC hg18 annotation database (Release 8 April 2007) [73]. However, simple repeats (many of which resemble known hnRNP binding sites) were not masked. The intronic acceptor sequences were then separated according to their GC content within the last 100 bases (or last half if the intron was less than 200 bases in length). AT-rich introns were defined to be introns containing less than 50% GC content. GC-rich introns were defined to be those containing greater than or equal to 50% GC content. For each of these data sets, the occurrences of all n-mers (4-7 nucleotides) in the 50 nucleotide region from -80 to -30 (relative to the acceptor splice-junction) were determined using a sliding window. These counts were used to determine the background expectations for each n-mer.
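The sliding-window counting of 4-7 nucleotide n-mers and the resulting background frequencies can be sketched in Python as below. This is a minimal illustration, not the authors' pipeline. The function names are hypothetical, the -80 to -30 region is assumed to map onto the slice `intron[-80:-30]`, and repeat masking is not handled.

```python
from collections import Counter


def upstream_region(intron, start=-80, end=-30):
    """Region from -80 to -30 relative to the acceptor splice junction (intron 3' end)."""
    return intron[start:end]


def count_nmers(region, min_len=4, max_len=7):
    """Count every n-mer of length min_len..max_len using a sliding window."""
    counts = Counter()
    for k in range(min_len, max_len + 1):
        for i in range(len(region) - k + 1):
            counts[region[i:i + k]] += 1
    return counts


def background_frequencies(introns):
    """Frequency of each n-mer among all windows of the same length."""
    totals = Counter()
    for intron in introns:
        totals.update(count_nmers(upstream_region(intron)))
    windows_per_length = Counter()
    for nmer, c in totals.items():
        windows_per_length[len(nmer)] += c
    return {nmer: c / windows_per_length[len(nmer)] for nmer, c in totals.items()}


# Toy example: a 100-nt poly(A) intron yields a 50-nt region with 47 tetramer windows.
toy_region = upstream_region("A" * 100)
toy_counts = count_nmers(toy_region)
print(len(toy_region), toy_counts["AAAA"], toy_counts["AAAAAAA"])
```

The same counting, applied only to introns with weak PY tracts, gives the observed counts that are compared against these background frequencies.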
The occurrence of each 4-7 nucleotide n-mer within the equivalent region for all introns possessing 'weak' PY tracts (defined as above) was determined using a sliding window. From these values, n-mers that are enriched upstream of the branchpoint region for introns possessing weak PY tracts were identified using the binomial confidence interval method described in Voelker and Berglund [52]. For the AT-rich class, 99 n-mers were determined to be significantly enriched (P < 0.01), and 349 n-mers were determined to be significantly enriched for the GC-rich class. Enriched n-mers and corresponding counts and statistics are available in Additional data files 2 and 3. Enriched n-mers were used to construct motifs as in Voelker and Berglund [52]. All of the derived motifs and the identities and occurrences of all n-mers that were used to construct the motifs are available in Additional data files 4 and 5.

U2AF65 binding RNA oligonucleotides (listed in Figure 2b,

Cloning of mini-genes and mutants

The WT LCAT intron 4 mini-gene was cloned from HeLa genomic DNA using primers to amplify the region from the last 50 nucleotides of LCAT intron 3 to the first 50 nucleotides of LCAT intron 5 (502 nucleotides). The forward primer included a BamHI site and the reverse primer included an EcoRI site. The amplified genomic DNA was cut with BamHI and EcoRI, inserted into pcDNA3 and sequenced. LCAT intron 4 mutants were made by PCR using the WT LCAT intron 4 mini-gene as template and primers containing the mutation of interest. LCAT intron 4 mutants were also cloned into pcDNA3 using BamHI and EcoRI and sequenced. The WT and mutant GNPTG [ENSEMBL: ENSG00000090581] intron 2 mini-genes were cloned using overlapping primers to create a sequence containing exon 2, intron 2 and exon 3. This sequence was flanked by HindIII and NotI cut sites, cloned into pcDNA3 and sequenced.
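Returning briefly to the computational enrichment step described earlier in this section: the comparison of n-mer counts upstream of weak PY tracts against background expectations can be sketched as follows. The text cites the binomial confidence interval method of Voelker and Berglund [52] without giving its details, so this sketch substitutes a simple one-sided binomial tail test. The function names and the example counts in the usage note are hypothetical.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability P(X = k) for X ~ Binomial(n, p)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if not 0.0 < p < 1.0:
        raise ValueError("background frequency p must lie strictly between 0 and 1")
    return min(1.0, sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1)))


def is_enriched(weak_count: int, weak_total: int,
                background_count: int, background_total: int,
                alpha: float = 0.01) -> tuple[bool, float]:
    """Flag an n-mer as enriched in the weak-PY-tract set.

    The background frequency is estimated from the background counts, and
    the observed count in the weak set is tested against it. This is a
    stand-in test, not the method of reference [52].
    """
    p_background = background_count / background_total
    p_value = binom_upper_tail(weak_count, weak_total, p_background)
    return p_value < alpha, p_value
```

For example, `is_enriched(30, 1000, 100, 10000)` compares 30 observed occurrences in 1,000 windows with a background frequency of 1% and flags the n-mer as enriched at the 0.01 threshold. A real analysis over all 4-7 nucleotide n-mers would also need to account for multiple testing.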
In vivo splicing assays: cell culture, transfection, and harvesting

HeLa cells were grown in monolayers in DMEM with GLUTAMAX (Invitrogen, Carlsbad, CA, USA) supplemented with 10% fetal bovine serum (GIBCO). For the LCAT splicing experiments, 1.5 (± 0.2) × 10^5 cells were plated in 6-well plates and transfected 18-20 h later at approximately 70% confluency. Plasmid (1 μg) was transfected into each well of cells using 5 μl of Lipofectin (Invitrogen, Carlsbad, CA, USA) and 10 μl of Plus reagent (Invitrogen) according to the manufacturer's protocols. For the GNPTG splicing experiments, 2 × 10^5 cells were plated in 6-well plates and transfected with 1 μg plasmid 18-20 h later using 5 μl of Lipofectamine 2000. Cells were harvested 24 h (LCAT experiments) or 16 h (GNPTG experiments) after transfection using TrypLE (GIBCO) and then pelleted by centrifugation. RNA was isolated from the cell pellets using an RNeasy kit (QIAGEN, Valencia, CA, USA).

In vivo splicing assays: DNase treatment, reverse transcription, PCR, and quantifying percent mRNA

Isolated RNA (500 ng) was incubated with 1 unit of RQ1 DNase (Promega, Madison, WI, USA) in a 10 μl reaction for 2 h (LCAT experiments) or 1 h (GNPTG experiments) according to the manufacturer's protocol. DNase-treated RNA (2 μl (100 ng)) was reverse transcribed in a 10 μl reaction (1:5 dilution) using Superscript II and an LCAT-specific reverse primer or, for the GNPTG experiments, a reverse primer to the pcDNA3 SP6 sequence, according to the manufacturer's protocols with the exception that we used half the recommended amount of Superscript II (Invitrogen, Carlsbad, CA, USA). For the LCAT splicing experiments, 2 μl of the reverse transcription reaction was subjected to 20 rounds of PCR amplification in a 20 μl reaction (1:10 dilution) using LCAT-specific primers spiked with a kinased LCAT forward primer (0.4 nM). Twenty rounds of PCR were found to be within the linear range for this PCR experiment (data not shown).
The resulting PCR products were run on an 8% (19:1) polyacrylamide native gel. For the GNPTG splicing experiments, 2 μl of the reverse transcription reaction was subjected to 27 rounds of PCR amplification in a 20 μl reaction (1:10 dilution) using primers specific to the T7 (forward) and SP6 (reverse) sequences of the pcDNA3 plasmid spiked with kinased T7 forward primer. Twenty-seven rounds of PCR were found to be within the linear range for this PCR experiment (data not shown). The resulting PCR products were run on a 10% (19:1) polyacrylamide gel. The gels were dried and exposed overnight to a phosphorimager screen. Quantification of the radioactive bands was performed using ImageQuant software (GE Healthcare, London, UK). The percent pre-mRNA was calculated by dividing the amount of the pre-mRNA band by the total amount of the pre-mRNA and mRNA bands and multiplying by 100%.

Abbreviations

ADML, adenovirus major late; CRM, C-rich motif; ESE, exonic splicing enhancer; ESS, exonic splicing silencer; GNPTG, N-acetylglucosamine-1-phosphotransferase gamma subunit; GRM, G-rich motif; hnRNP, heterogeneous nuclear ribonucleoprotein; ISE, intronic splicing enhancer; ISS, intronic splicing silencer; LCAT, lecithin cholesterol acyltransferase; PY tract, polypyrimidine tract; S65 score, U2AF65 binding site score; SF, splicing factor; snRNP, small nuclear ribonucleoprotein; ss, splice site; U2AF, U2 snRNP auxiliary factor; WT, wild-type.

Authors' contributions

JIM and RBV designed experiments, performed experiments, analyzed data and wrote the paper. KLH and MBW performed experiments and analyzed data. JAB designed experiments, analyzed data and wrote the paper.

Additional data files

The following additional data files are available with the online version of this paper. Additional data file 1 is a table listing the probability of occurrence for pentamers found in U2AF65 SELEX derived sequences.
Additional data file 2 is a table listing the n-mers enriched upstream of weak PY tracts from GC-rich introns. Additional data file 3 is a table listing the n-mers enriched upstream of weak PY tracts from AT-rich introns. Additional data file 4 is a table listing the clusters enriched upstream of weak PY tracts from GC-rich introns. Additional data file 5 is a table listing the clusters enriched upstream of weak PY tracts from AT-rich introns. Additional data file 6 is a figure of the scatterplots of Z-scores for enrichment upstream of weak PY tracts for long versus short introns. Additional data file 7 is a histogram showing the percentage of introns possessing specific n-mers that are enriched upstream of weak PY tracts.

Additional data file 1

Table listing the count and the probability of occurrence (using a sliding window) for all pentamers found in the sequences reported in Singh et al. [27] and both SELEX experiments reported in Banerjee et al. [39].

Additional data file 2

Associated statistics and listing of n-mers (4-7 nucleotides) determined to be enriched in the 50 nucleotide region upstream of weak PY tracts from GC-rich introns.

Additional data file 3

Associated statistics and listing of n-mers (4-7 nucleotides) determined to be enriched in the 50 nucleotide region upstream of weak PY tracts from AT-rich introns.

Additional data file 4

Listing of all clusters derived from n-mers enriched in the 50 nucleotide region upstream of weak PY tracts from GC-rich introns. Included are the individual n-mers and associated statistics used to produce each motif.

Additional data file 5

Listing of all clusters derived from n-mers enriched in the 50 nucleotide region upstream of weak PY tracts from AT-rich introns. Included are the individual n-mers and associated statistics used to produce each motif.
Additional data file 6

The Z-scores for enrichment of all 4-7 nucleotide n-mers in the intronic region upstream (-80 to -30 relative to the acceptor splice-junction) of PY tracts with low S65 scores for short (<200 nucleotide) introns are plotted versus those for long (≥ 200 nucleotide) introns. (a) Data for GC-rich introns. (b) Data for AT-rich introns.

Additional data file 7

The portion of the sequence corresponding to the -80 to -30 region matching one or more of the n-mers enriched in the same region for introns with weak PY tracts (Additional data files 2 and 3) was determined. These values (referred to as the percent coverage) were binned as indicated along the x-axis.

Acknowledgements

JIM was supported by an AHA Pacific Mountain Affiliate pre-doctoral fellowship. KLH was supported by an NSF graduate research fellowship. MBW was supported by NIH training grant GM-07759 to the Institute of Molecular Biology at the University of Oregon. This work was supported by NSF grant 0616264-MCB to JAB. We thank members of the Berglund lab for helpful discussion.

References