Copyright © 2008 The Author(s)

Position-dependent motif characterization using non-negative matrix factorization

1Center for Genome Dynamics, The Jackson Laboratory, Bar Harbor, ME 04609 and 2Bioinformatics Program, Boston University, Boston, MA 02215, USA
*To whom correspondence should be addressed.
Associate Editor: Limsoon Wong
Received May 16, 2008; Revised September 12, 2008; Accepted October 6, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs.

Results: Tests on artificially generated sequences show that NMF can faithfully reproduce both positioning and content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting datasets. Specifically, an analysis of mRNA 3′-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements.
Contact: joel.graber/at/jax.org
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Regulatory sequence identification and characterization remains an important and challenging problem. Many classes of functional sequences, for instance those involved in processing of precursor mRNA (pre-mRNA), are constrained both by sequence content and positioning with respect to a functional site. Standard pattern recognition tools for sequence analysis typically either ignore positioning effects altogether, or else force the positioning to fit a predefined model, such as a Gaussian distribution (Ao et al., 2004). Recent reviews of computational approaches to pattern recognition have pointed out that positioning preference remains an underutilized means of specifying motifs (Li and Tompa, 2006; Tompa et al., 2005). Specific positioning of motifs, especially relative positioning, can be evolutionarily conserved (Vardhanabhuti et al., 2007).

Characterization of positioning dependencies has most commonly been approached through variants of positional word counting (PWC) (Fairbrother et al., 2004; Hu et al., 2005; Salisbury et al., 2006), which tracks both the occurrence and the relative position of all sequence words of a given size k (k-mers). PWC results in a two-dimensional matrix (V), in which rows are indexed by k-mer and columns are indexed by position relative to a functional site. The rows of such a matrix can be interpreted as unnormalized conditional probability distributions, reflecting the probability of observing the specific k-mer at a given position relative to the functional site (and conditional on its presence). Previous analyses based on PWC relied on standard clustering approaches, such as hierarchical or k-means clustering, to identify groups of k-mers with similar positioning probabilities near functional sites (Hu et al., 2005; Loke et al., 2005).
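The PWC matrix described above can be sketched in a few lines. This is a minimal illustration, assuming equal-length sequences already aligned on the functional site and counting every position individually; the function name `positional_word_counts` and the list-of-lists layout are choices made for this example, not taken from the original implementation.

```python
from itertools import product

def positional_word_counts(sequences, k):
    """Count each k-mer at each start position across aligned sequences.

    Returns (kmers, V), where V[i][j] is the number of sequences in which
    k-mer kmers[i] starts at position j (0-based from the left edge).
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and of equal length")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    row = {kmer: i for i, kmer in enumerate(kmers)}
    n_positions = length - k + 1
    V = [[0] * n_positions for _ in kmers]
    for seq in sequences:
        for j in range(n_positions):
            word = seq[j:j + k]
            if word in row:  # skip words containing N or other symbols
                V[row[word]][j] += 1
    return kmers, V

# Toy usage: three aligned 6-nt sequences, dinucleotide words.
seqs = ["ACGTAC", "ACGTTT", "TTGTAC"]
kmers, V = positional_word_counts(seqs, k=2)
print(V[kmers.index("AC")])
```

Each row of `V` is then the unnormalized positioning profile of one k-mer, which is the quantity the clustering approaches cited above operate on.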
Several approaches have been developed to merge the clusters of related k-mers into a single representative positional weight matrix (PWM) (Hu et al., 2005; Loke et al., 2005). These approaches produce reasonable results for independent, non-overlapping motifs; however, functional motifs often overlap significantly in positioning and/or sequence content. A particularly relevant case can be found in the UG- and U-rich elements that occur downstream of vertebrate 3′-processing sites. These elements are quite degenerate in sequence content (Hu et al., 2005; Salisbury et al., 2006; Zhao et al., 1999), and have overlapping but distinct positioning distributions (Salisbury et al., 2006). These conditions result in a number of U-rich k-mers with multi-modal positioning distributions, reflecting the likelihood that the k-mer can appear as part of either the U- or UG-rich motif (e.g. Supplementary Fig. 1).
This defines the challenge to be addressed: given the set of positioning distributions for all k-mers (as defined in the PWC matrix), derive the underlying motifs that are most likely to have generated it. To solve this problem, we adapted the non-negative matrix factorization (NMF) algorithm, a dimension reduction technique first developed for image processing (Lee and Seung, 1999) that has subsequently gained popularity in processing high-dimensional biological data, such as microarrays (Carmona-Saez et al., 2006; Kim and Tidor, 2003; Pascual-Montano et al., 2006b). Like principal components analysis, NMF generates ‘basis vectors’ (referred to here as ‘elements’) to reduce the dimensionality of large datasets. The original authors found that the non-negativity constraint resulted in elements that reflected distinct components of the underlying data, for instance eyes or a nose in facial images. In our analysis, this is analogous to elements representing specific sequence patterns or putative regulatory motifs.

We first define terminology: in the abstract analysis of NMF, we refer to a set of features that are measured in samples. In tangible examples, we have pixels measured in images, gene probesets measured in microarray experiments, or sequence words (k-mers) counted in positions or windows. The successful application of NMF relies on the validity of the following assumptions: (1) the measurement of feature i in sample j can be approximated as a linear superposition of r patterns [equation (1)]; (2) the m-th term in the sum is the product of two probabilities: the probability of observing feature i in pattern m and the probability of observing pattern m in sample j; and (3) the relative probabilities of observing each feature in the m-th pattern are constant across all samples.
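Equation (1) is not reproduced in this text. Under the three assumptions just listed, and using the symbols defined in Section 2.3 (V for the data, W for feature weights, H for pattern activities), it presumably takes the following form; this is a reconstruction from the surrounding definitions rather than a quotation:

```latex
V_{ij} \approx \sum_{m=1}^{r} W_{im} H_{mj}
```

Here the m-th term pairs the weight of feature i in pattern m with the activity of pattern m in sample j, and all entries of W and H are constrained to be non-negative.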
In this article, we investigate the parameters of the NMF method, specifically investigating the robustness of the solutions with variation in the analysis parameters. We demonstrate the use of the NMF algorithm for the detection and characterization of sequence motifs that include constrained positioning, producing models of both sequence content and positioning. We show how variation in the residual sum of squares (RSS) of the approximated data matrix provides a robust estimate of the appropriate number of elements, while simultaneously revealing whether or not an NMF analysis is appropriate for the dataset. Analysis of synthetic datasets demonstrates that NMF can accurately describe both positioning and sequence characteristics of implanted patterns, and that NMF finds patterns that are missed by other approaches to pattern recognition. Finally, we apply the NMF approach to two interesting biological datasets: 3′-processing sites from a phylogenetically broad sampling of eukaryotic organisms and transcription start sites from the fruit fly, Drosophila melanogaster (available in Supplementary Materials).

2 MATERIALS AND METHODS

2.1 Synthetic test matrices and sequences

To test the theoretical basis for the use of the RSS to establish the number of elements r, we generated two artificial matrices. The first matrix was generated from pseudo-random draws from a Gaussian distribution. The second matrix was generated to precisely match the conditions that NMF models. Starting with a random background, patterns to be added were generated by pseudo-random draws that assigned (1) the probability of each feature occurring in each pattern (corresponding to the W matrix, defined below), and (2) the probability of each pattern occurring in each measurement (the H matrix defined below). To test the complete motif characterization procedure, we generated randomized sequences (based on fixed known dinucleotide frequencies) seeded with known motifs (Table 1).
In most tests, motifs were inserted into 95% of the test sequences; however, we also explicitly generated sets of size T = 300 in which all motifs were put into 50% and 25% of the sequences, respectively. To better mimic true conditions, we used sequences of length 400 nt, defining the center point as position 0, and used distinct dinucleotide frequencies for the negative and positive portions of the sequences. To test the performance with changes in the number of sequences, we generated independent training sets with sizes T = 30, 100, 300, 1000, 3000 and 10 000, respectively. Motifs were placed within these sequences according to independent pseudo-random draws from a positioning distribution and a sequence content PWM. All sequence files used in this analysis are available at http://harlequin.jax.org/nmf/.
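The generation procedure above can be sketched as follows. This is a hedged illustration, not the authors' generator: the background is drawn from a first-order Markov chain as one simple way to approximate target dinucleotide frequencies, and the function names, transition tables, and start positions are all hypothetical values chosen for the example.

```python
import random

def sample_background(length, trans, rng):
    """Draw a sequence from a first-order Markov chain (an assumption of this
    sketch); trans[a][b] is the probability of base b following base a."""
    seq = [rng.choice("ACGT")]
    for _ in range(length - 1):
        probs = trans[seq[-1]]
        seq.append(rng.choices("ACGT", weights=[probs[b] for b in "ACGT"])[0])
    return seq

def sample_motif(pwm, rng):
    """Draw one motif instance column by column from a PWM (list of dicts)."""
    return [rng.choices("ACGT", weights=[col[b] for b in "ACGT"])[0] for col in pwm]

def implant(seq, pwm, start_positions, start_weights, rng):
    """Overwrite part of seq with a motif instance at a sampled start index."""
    start = rng.choices(start_positions, weights=start_weights)[0]
    seq[start:start + len(pwm)] = sample_motif(pwm, rng)
    return start

rng = random.Random(0)
# Toy transition tables; real ones would be fitted to the target dinucleotides.
at_rich = {a: {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35} for a in "ACGT"}
uniform = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
# 400 nt in total: the two halves around the center use different backgrounds.
seq = sample_background(200, at_rich, rng) + sample_background(200, uniform, rng)
# A fully specific PWM for AATAAA, placed using a small positioning distribution.
pwm = [{b: 1.0 if b == c else 0.0 for b in "ACGT"} for c in "AATAAA"]
start = implant(seq, pwm, start_positions=[170, 175, 180],
                start_weights=[1, 2, 1], rng=rng)
print(start, "".join(seq[start:start + 6]))
```

Inserting a motif into only a fraction of the sequences (95%, 50% or 25% in the tests described above) would amount to guarding the `implant` call with a random draw.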
2.2 Biological training sequences

We obtained putative 3′-processing sites from our database PACdb, which uses EST-to-genome alignments to assign probable sites as described previously (Brockman et al., 2005). Putative transcription start sites and surrounding sequences for D. melanogaster were obtained from the Supplementary Material from Gershenzon et al. (2006).

2.3 NMF decomposition

Starting with a set of training sequences all containing, and aligned on, a common functional site, we generate the PWC matrix V, a two-dimensional matrix of counts that is indexed by k-mer and relative position, respectively. In practice, k-mers are counted in contiguous windows of size w, where each k-mer is assigned to the window in which it begins. As such, k-mers can span the boundary between adjacent windows, but will only be assigned to one. In the case of small datasets (fewer than a few hundred training sequences), we find it advantageous to smooth each row of V independently, maintaining the total counts across the entire row (for real or artificial data). Pseudocounts are also an available option to compensate for small datasets; however, they have not been explicitly investigated in this study. Using the same update and objective function as the original NMF publication (Lee and Seung, 1999), we decompose the PWC matrix V according to Equation (2), where Vij = count of the i-th sequence word in the j-th position window, Wim = the weight of the i-th k-mer in the m-th element, Hmj = activity of the m-th element in the j-th position window, M is the number of k-mers considered, N is the number of positioning windows, and r is the number of elements created.
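Equation (2) is not reproduced in this text; from the definitions above, the decomposition in matrix form is V ≈ WH, with W of size M × r and H of size r × N. The sketch below implements multiplicative updates for the divergence objective in the form published by Lee and Seung. Whether it matches the authors' implementation in every detail (normalization, stopping rule, restarts) is an assumption, and the names `nmf_divergence`, `divergence` and `rss` are illustrative.

```python
import numpy as np

def nmf_divergence(V, r, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative M x N matrix V into W (M x r) and H (r x N)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, r)) + eps
    H = rng.random((r, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update element activities (rows of H).
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # Update k-mer weights (columns of W).
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def divergence(V, W, H, eps=1e-9):
    """Generalized Kullback-Leibler divergence between V and WH."""
    WH = W @ H + eps
    return float((V * np.log((V + eps) / WH) - V + WH).sum())

def rss(V, W, H):
    """Residual sum of squares between V and its approximation WH."""
    return float(((V - W @ H) ** 2).sum())

# Toy count matrix with two visible blocks (4 'k-mers' x 4 'windows').
V = np.array([[10.0, 8.0, 1.0, 0.5],
              [9.0, 9.0, 1.0, 1.0],
              [1.0, 0.5, 6.0, 7.0],
              [0.5, 1.0, 7.0, 6.0]])
W, H = nmf_divergence(V, r=2)
print(np.round(W @ H, 1))
print("divergence:", round(divergence(V, W, H), 4), "RSS:", round(rss(V, W, H), 4))
```

In this layout, column m of `W` is the weighted k-mer list of element m and row m of `H` is its positioning profile. Because the result depends on the random starting point, a practical analysis would repeat the fit from many seeds and keep the best solution.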
A critical issue in any dimension reduction analysis is the selection of the reduced number of dimensions (r). Previous attempts to optimize selection of r in NMF analysis have focused primarily on the cophenetic correlation coefficient (CCC) (Brunet et al., 2004; Pascual-Montano et al., 2006a). We find that variation in the RSS between the data matrix (V) and the NMF-estimated matrix (WH) provides a natural means for estimation of r, inasmuch as this plot shows an inflection when r matches the proper number of dimensions. In practice, while an optimal NMF solution requires several hundred random restarts, the determination of r can be made with a significantly smaller number of solutions, typically 20–30 for each tested value of r (data not shown). We first investigated variation of RSS with variation of r in two artificial datasets, with and without patterns inserted. The RSS of the purely random matrix shows a roughly linear decrease with increased number of elements (Fig. 1).
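The RSS scan over r can be sketched as below. For brevity this example swaps in the squared-error (Frobenius) multiplicative updates rather than the objective used in the paper, and it uses a synthetic matrix built from three patterns; the function names, matrix sizes and restart count are assumptions made for illustration.

```python
import numpy as np

def nmf_frobenius(V, r, n_iter=300, seed=0, eps=1e-9):
    """Compact NMF using squared-error multiplicative updates (swapped in here)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def rss_curve(V, r_values, n_restarts=20):
    """Best (lowest) RSS over random restarts for each candidate r."""
    curve = {}
    for r in r_values:
        fits = (nmf_frobenius(V, r, seed=s) for s in range(n_restarts))
        curve[r] = min(float(((V - W @ H) ** 2).sum()) for W, H in fits)
    return curve

# Synthetic matrix: three non-negative patterns plus a little noise.
rng = np.random.default_rng(1)
V = rng.random((40, 3)) @ rng.random((3, 30)) + 0.01 * rng.random((40, 30))
curve = rss_curve(V, range(1, 7), n_restarts=5)
for r, value in curve.items():
    print(r, round(value, 4))
```

Plotting `curve` against r and looking for the point where the decrease flattens is the kind of inspection described above; on pattern-free data the curve would be expected to fall roughly linearly instead.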
Further confirmation of this phenomenon was obtained through NMF analysis of three sequence sets: artificial sequences with and without inserted patterns, and 8500 sequences of length 400 nt centered on putative human 3′-processing sites (Salisbury et al., 2006). PWC matrices were generated from each sequence set, then subsequently analyzed via NMF, largely reproducing the results of the artificial matrices (Fig. 1).

The complete NMF analysis requires the selection of a number of free parameters. Of particular note are the selection of window size (w) and k-mer size (k). Larger windows give increased statistical robustness afforded by aggregation but at the cost of less specific positioning. Longer words enable better characterization of long patterns, but at the cost of increased execution time as well as an increased number of sequences needed for training. In the ideal case of a large number of training sequences, k would be at least as long as the smallest expected motif, and w = 1, counting each position individually. However, datasets are finite and can number only a few tens to hundreds. Previous studies have provided estimates of the minimum number of sequences required for reasonable estimation of k-mer probabilities (Fairbrother et al., 2002). In practice, we typically select w and k jointly such that the product wT is at least five times greater than 4^k, the number of k-mers under consideration.

2.4 Converting NMF word lists (W.j) to motifs

As we reported previously (Graber et al., 2007), we assume that the weighted list of k-mers for each element can be interpreted as the expected distribution of k-mer counts for each element. The optimal motif is inferred as the motif with the maximum multinomial probability of observing the expected distribution. The specifics of the probability calculation were described previously (Graber et al., 2007).
Briefly, the probability of observing any specific k-mer within a motif is the sum of the probabilities of observing the motif at all positions within the motif, including those that span the boundaries between the motif and the background. This summation specifically enables characterization of motifs of nearly arbitrary size regardless of the choice of k. To optimize the search of potential motifs, we use a Markov Chain Monte Carlo (MCMC) approach (Gelman et al., 1995). The length of the motif is first sampled from a user-specified range and then nucleotide probabilities at each position are filled randomly. The multinomial distribution for the initial motif is scored against the expected distribution, and then the following procedure is used to update the motif. A specific k-mer (wi) is sampled according to the distribution returned by the NMF analysis. Then a position j within the current motif is sampled according to the match of the k-mer within the motif. Finally, the motif is either made more or less similar to the k-mer at motif positions j to j+k − 1, depending upon whether or not wi is over- or underrepresented in the current model compared with the expected distribution. Following a standard MCMC approach, updated motifs are automatically accepted if the multinomial probability increases, and randomly accepted according to a Boltzmann distribution if the multinomial probability decreases (Gelman et al., 1995). This process iterates until no improvement is observed for a user-specified number of iterations. In the current implementation of the NMF pattern finder, we have made no attempt to optimize performance. A full analysis, including PWC creation, NMF decomposition and motif construction typically lasts tens to hundreds of minutes, depending on values of T, k, w and r. Preliminary tests incorporating the web-based bioNMF resource (Mejia-Roa et al., 2008) showed significant time savings, while also confirming the analysis results. 
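The scoring and acceptance steps of the search described above can be sketched as follows. This is a hedged illustration of a standard Metropolis-style rule, not the authors' code: the `temperature` parameter, the function names, and the dictionary representation of k-mer probabilities are assumptions, and the motif update itself is omitted.

```python
import math
import random

def multinomial_log_prob(expected_counts, model_probs):
    """Log multinomial probability of the expected k-mer counts under a motif
    model, up to the multinomial coefficient (constant across candidate motifs)."""
    return sum(n * math.log(model_probs[kmer])
               for kmer, n in expected_counts.items() if n > 0)

def accept(current_logp, proposed_logp, temperature, rng):
    """Always accept improvements; accept a worse motif with Boltzmann
    probability exp(delta / temperature)."""
    if proposed_logp >= current_logp:
        return True
    return rng.random() < math.exp((proposed_logp - current_logp) / temperature)

rng = random.Random(0)
counts = {"AATA": 2, "ATAA": 1}
current = multinomial_log_prob(counts, {"AATA": 0.25, "ATAA": 0.25})
proposed = multinomial_log_prob(counts, {"AATA": 0.5, "ATAA": 0.25})
print(accept(current, proposed, temperature=1.0, rng=rng))
```

A full search would wrap these two functions in a loop that proposes a modified motif, recomputes its k-mer probabilities, and stops after a user-specified number of iterations without improvement.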
2.5 Comparison with other pattern-finding approaches

NMF pattern identification was compared with several other approaches, including the Gibbs Sampler (Lawrence et al., 1993), the Improbizer (Ao et al., 2004), YMF (Sinha and Tompa, 2003), Weeder (Pavesi et al., 2004) and oligo-analysis tools (van Helden, 2003). As with previous studies of pattern identification tools (Li and Tompa, 2006; Tompa et al., 2005), an artificial dataset (T = 300) was used as the basis for comparison, specifically to remove uncertainty of the exact patterns present in the test sequence set. The Gibbs sampler and the Improbizer were set to find six nucleic acid motifs, searching on single strand only. The Gibbs sampler was allowed to vary motif size between 5 nt and 8 nt. The Improbizer automatically determines optimal motif size. Background frequency estimates (required for YMF, Weeder and oligo-analysis) were generated through analysis of the randomized sequences with no motifs inserted. Hexamers were counted for YMF and oligo-analysis, and the ‘SMALL’ setting was used in Weeder. All other settings were left as defaults.

3 RESULTS AND DISCUSSION

3.1 Characterizing the method and exploring parameters with artificial sequence sets

We previously used the NMF method to characterize sequences involved in controlling 3′-processing and trans-splicing in the nematode, Caenorhabditis elegans (Graber et al., 2007). As described in that work, the complete NMF method is dependent upon several free parameters, including but not limited to, the size of the k-mers in the initial count (k), the position window size (w), the number of elements (r) and the number of training sequences available (T). To investigate the effects of these parameters and specifically to test the robustness of the results with variation in the parameters, we generated an artificial dataset in which motifs with defined sequence content and positioning were inserted into sequences generated from a defined dinucleotide background.
Six different motifs were created with varying levels of specificity in both positioning and sequence (Table 1). To reduce computational complexity, we varied only one parameter at a time, while holding all others fixed. Figure 2
3.1.1 Variation in training set size (T)

The number of training sequences (T) is of critical importance, since it is often out of the researcher's control, yet explicitly determines the statistical power to discern subtle signals. Training sets were generated of varying sizes (T = 30, 100, 300, 1000, 3000 and 10 000) and NMF analysis was performed with parameters fixed at w = 3, k = 4 (tetramers) and r = 8. Note that while six motifs were inserted, the best results are obtained with r = 8. As noted previously (Graber et al., 2007), the NMF analysis identifies both specific motifs and changes in background composition along the length of the sequences. The additional two elements in the NMF decomposition compensate for the change in background at the center point of the test sequences.

The use of an artificial dataset allows explicit comparison of NMF-derived patterns with the motif models. The rows of the W matrix, which represent the frequency of occurrence for each k-mer in each motif, were compared with an exact count of the tetramers that were inserted as part of each motif (including those that span the motif-background boundary) using Pearson's correlation. The best match of each NMF motif to an inserted motif is reported in Table 2 (Panel A). Similarly, the columns of the H matrix were compared with the exact distribution of starting positions for each motif by Pearson's correlation, with the best match reported in Table 2 (Panel B). The NMF approach successfully identifies the positioning for all motifs in the sets with T ≥ 300 (Pearson's r ≥ 0.80), with nearly perfect reproduction for T ≥ 3000 (r ≥ 0.88). Even in the smallest datasets, the worst match (motif 4, r = 0.31) is still statistically significant. The sequence content determination varies in a similar fashion. The strongest motif, with consensus AATAAA, is unambiguously matched in all datasets (r ≥ 0.9).
As with positioning, the tetramer content of all motifs is well matched for all sets with T ≥ 300 (r ≥ 0.62). It is interesting to note that motif 2, which has identical sequence information content to motif 3, but a less specific positioning distribution, is less well identified, indicating that both the sequence and positioning specificity can affect our ability to successfully characterize motifs. Full output from these analyses is available in Supplementary Figure 2.
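The best-match comparison used in this subsection can be sketched as follows, assuming each inferred and known profile is stored as one row of an array (one positioning profile per element, or one k-mer weight vector per element). The function name `best_matches` and the toy profiles are illustrative only.

```python
import numpy as np

def best_matches(inferred, known):
    """For each inferred profile (row), return the index of the known profile
    with the highest Pearson correlation, together with that correlation."""
    n = inferred.shape[0]
    # corrcoef treats each row as one variable; keep the inferred-vs-known block.
    corr = np.corrcoef(np.vstack([inferred, known]))[:n, n:]
    best = corr.argmax(axis=1)
    return [(int(j), float(corr[i, j])) for i, j in enumerate(best)]

# Toy positioning profiles over six windows.
known = np.array([[0.0, 1.0, 4.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 2.0, 5.0, 2.0]])
inferred = np.array([[0.1, 0.1, 0.2, 2.1, 4.8, 2.2],
                     [0.0, 1.2, 3.9, 0.8, 0.1, 0.0]])
for element, (motif, r) in enumerate(best_matches(inferred, known)):
    print(element, motif, round(r, 3))
```

Because the order of elements returned by NMF is arbitrary, a matching step of this kind is needed before inferred and implanted motifs can be compared.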
3.1.2 Variation in other parameters and conditions

As described in Section 2, the choices of word size (k) and window size (w) are compromises between conflicting needs. Tests on the T = 300 dataset with variations of w and k show that the identified motifs are robust with parameter variation. Separate analyses were also performed on datasets with patterns inserted into either half or a quarter of the sequences in the training set. Motifs 3–5 (Table 1) were clearly identified in both sets, motif 2 was identified in the 50% set, while the weakest patterns (1 and 6) were poorly matched at best in both sets. Details of these tests are available in Supplementary Figures 3–5.
3.1.3 Comparison with existing tools

To further assess the utility of the NMF approach, we compared it with a number of popular pattern recognition tools, focusing on the T = 300 test set. The disparate nature of the results returned from these tools complicates comparison of the specific results; we therefore present a summary (Table 3) that focuses primarily on whether or not a reasonable match to each pattern was found. For tools that produce a PWM representation, the match is reported as the sum of the Euclidean distances between columns in the known and inferred motifs (seq d in Table 3) (Gupta et al., 2007), aligned so as to produce the minimum distance. For tools that produce positioning data or models, the match is reported as the Pearson's correlation between the actual and inferred positioning distributions. Best match consensus patterns are reported for pattern-based tools. The NMF approach identifies significant matches in both positioning and sequence content for all six inserted motifs in our test set. The best alternative method was the oligo-analysis tools, which return patterns that reasonably match the consensus patterns of all six inserted motifs; however, only consensus sequences are returned, with no distinct positioning information. No other tool produces matches to more than four of the inserted motifs. Further comparisons are available in Supplementary Materials.
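The PWM distance described above can be sketched as a minimum over ungapped alignments. The text does not say how columns without a partner are treated, so this example compares them with a uniform column; that choice, the function name `pwm_distance`, and the toy motifs are assumptions of the sketch.

```python
import math

UNIFORM = (0.25, 0.25, 0.25, 0.25)  # stand-in for unpaired columns (assumption)

def pwm_distance(known, inferred):
    """Minimum, over ungapped offsets, of the summed Euclidean distance between
    aligned PWM columns. Each column is a tuple of (A, C, G, T) probabilities."""
    best = math.inf
    for offset in range(-len(inferred) + 1, len(known)):
        total = 0.0
        first = min(0, offset)
        last = max(len(known), offset + len(inferred))
        for pos in range(first, last):
            a = known[pos] if 0 <= pos < len(known) else UNIFORM
            j = pos - offset
            b = inferred[j] if 0 <= j < len(inferred) else UNIFORM
            total += math.dist(a, b)
        best = min(best, total)
    return best

A, T = (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0)
known = [A, A, T]
print(pwm_distance(known, known))      # identical motifs
print(pwm_distance(known, known[1:]))  # inferred motif missing the first column
```

Penalizing unpaired columns keeps very short overlaps from scoring artificially well; other conventions, such as requiring a minimum overlap, would be equally defensible.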
The principal benefits of the NMF approach come from the focus on specific positioning of motifs relative to a functional site. Nearly all pattern recognition tools focus on the identification of patterns that occur more frequently than expected according to a background model that is generated either from the input sequences themselves, or from a known background (typically putative promoter regions, reflecting the bias of these studies towards transcription factor binding site identification). The test sequences used here were generated from a relatively AT-rich background. Motifs 3 and 4 (Table 1) do not differ significantly from this background. This is reflected in the poor rate at which these motifs were identified by the alternative tools tested here (Table 3). From these results, we conclude that, for the specific problem of identifying signals with positioning constraint, NMF outperforms alternative approaches.

3.2 Analysis of biologically interesting sequence sets

3.2.1 3′-processing (cleavage and polyadenylation) sites

Thirty years of study have demonstrated that the sequences that control 3′-processing in eukaryotic organisms show very distinct positioning about the 3′-processing site. One recent study postulated up to 15 functional elements in the human 3′-processing signal (Hu et al., 2005). Based on our analysis of mouse (Fig. 3
Previous comparative studies of 3′-processing sites have shown both common and distinct features (Lee et al., 2007; Salisbury et al., 2006) between organisms. We closely examined a number of organisms, and show the results for a phylogenetically broad sampling here (Fig. 3).

We can also use our analysis to highlight where the 3′-processing control sequences diverge. As we previously reported (Salisbury et al., 2006), the downstream UG-rich element is specific to metazoans other than nematode. The organisms that have incorporated a UG-rich element also appear to have a corresponding shift in the positioning of the downstream U-rich element. The downstream G-rich element (mouse element 8 in Fig. 3

Finally, it is informative to contrast our results with the previous large-scale analysis of human 3′-processing sites (Hu et al., 2005). As stated above, the authors of that study identified 15 distinct motifs determining 3′-processing site placement, whereas we find five or six. The previous study used a less specific positioning classification than ours, initially identifying putative functional sequence words on the basis of their overrepresentation in one of four positioning regions, spanning positions −100 to −40, −40 to 0, 0 to +40 and +40 to +100 relative to the 3′-processing site. These words were then subsequently clustered on the basis of sequence similarity, and positional weight matrix representations were generated from the clusters. A number of their final elements, however, show similar positioning distributions, and sequence patterns that are arguably close enough to represent variants of a single, more general, pattern [e.g. Fig. 4 in Hu et al. (2005)]. In contrast, our approach first groups sequence words based on the similarity of positioning at a significantly finer resolution (typically windows of 3–5 nt).
In addition, we benefit from the ‘fuzzy clustering’ nature of NMF, specifically in that each sequence word can contribute to the modeling of multiple elements in a weighted manner. Based on these differences, we believe that our analysis generates a more general, inclusive picture of the constraints of the elements of the 3′-processing site control sequence. We nonetheless leave open the possibility that the subdivision of these elements, such as generated previously (Hu et al., 2005), may represent valid subclassifications of elements.

3.2.2 Fruitfly transcription start sites

The elements that comprise the core promoter in the fruit fly (D. melanogaster) have been the subject of several recent studies (Gershenzon et al., 2006; Ohler et al., 2002). Many of the identified elements displayed positioning specificity, indicating an appropriate problem for NMF. We analyzed 2561 core promoter sequences (Gershenzon et al., 2006) extending from 250 nt upstream to 100 nt downstream of the transcription start site. Our analysis (Supplementary Fig. 8) identifies a number of patterns that are consistent with the previous studies, while also highlighting possible antagonistic relationships.

4 CONCLUSIONS AND FUTURE DIRECTIONS

We have described here a novel approach to the characterization of putative regulatory motifs, using NMF to simultaneously determine both positioning and sequence content. In contrast with other dimension reduction algorithms, the NMF decomposition produces component matrices with direct intuitive interpretations, reflecting the positioning and sequence content of the resulting elements. We also demonstrated that variation of the RSS between the actual and reduced data provides a means of estimating the proper number of elements. Finally, in contrast with other pattern detection algorithms, our analysis is explicitly geared to the detection of motifs with constrained positioning, and consequently outperforms these algorithms for this specific problem.
The motifs generated by the NMF approach are well suited for inclusion in probabilistic predictive models, such as those for transcription start site, splice site, 3′-processing site, or full gene prediction. To this end, we are working to cast the NMF output into a form that can directly be incorporated with the open-source Genie software (Kulp et al., 1997; Reese et al., 1997). We discuss a number of potential improvements to the NMF in Supplementary Materials. The NMF approach provides a robust and novel approach to characterization of putative regulatory elements with positioning specificity. Given the importance of this additional constraint, our approach will provide benefits to analyses beyond the few examples presented here.

Funding: NSF 2010 Project (grant DBI-0331497); NIH/NCRR INBRE Maine (grant 2 P20 RR16463); NIH/NIGMS (grant 1R01GM072706).

Conflict of Interest: none declared.
ACKNOWLEDGEMENTS

The authors thank Yong Woo, Gary Churchill, Elissa Chesler and members of the Graber Group for helpful comments and critiques. The authors thank Michael Brockman, Carol Bult, Hyuna Yang and Joel Richardson for critical review of the paper.