• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
BMC Bioinformatics. 2005; 6: 73.
Published online Mar 24, 2005. doi:  10.1186/1471-2105-6-73
PMCID: PMC1090553

Accelerated probabilistic inference of RNA structure evolution

Abstract

Background

Pairwise stochastic context-free grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignment-and-folding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by pre-processing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered.

Results

We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any score-attributed context-free grammar (e.g. energy-based scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including Waterman-Eggert N-best alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database.

Conclusion

A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License.

Background

As our acquaintance with RNA's diverse functional repertoire develops [1-5], so does demand for faster and more accurate tools for RNA sequence analysis. In particular, comparative genomics approaches hold great promise for RNA, due to the well-behaved basepairing correlations in an RNA gene family with conserved secondary structure (at least, well-behaved compared to protein structures). Whereas the structural signal encoded in a single RNA gene is rather weak and may be barely (if at all) distinguishable from the secondary structure of a random sequence [6], the covariation signal increases with every additional sequence considered.

Many programs for comparative analysis of RNA require the sequences to be prealigned [7-9]. This can be a source of error, since misaligned bases can add noise that swamps the covariation signal. The most recent of these methods allows for some uncertainty in the alignment [7]. More generally, one can view the alignment and structure prediction as a combined problem, to be solved simultaneously. This is the approach taken in this paper, and by earlier programs such as FOLDALIGN [10], DYNALIGN [11], CARNAC [12], QRNA [9] and our dart library, introduced in a previous paper [13] and extended here. In this framework, fixing of the alignment can be viewed as a partial constraint on the simultaneous alignment/folding problem.

A powerful, general dynamic programming algorithm for simultaneously aligning and predicting the structure of multiple RNA sequences was developed by David Sankoff [14]. The energy-based folding of Zuker et al [15] and recent approaches based on Stochastic Context-Free Grammars (SCFGs) [9,13,16-20] are both closely related to Sankoff's algorithm. The method takes time O(L3N) and memory O(L2N) for N sequences of length L. This is prohibitively expensive at the time of writing, except for fairly short sequences, which has motivated the development of various constrained versions of these algorithms [9-11,13,21].

The purpose of this paper is to report our progress on general pairwise constrained versions of Sankoff's algorithm (or, more precisely, constrained versions of some related dynamic programming algorithms for SCFGs). The overall aim is the simultaneous alignment and structure prediction of two RNA sequences, X and Y, subject to an SCFG-based scoring scheme and user-supplied constraints. Additionally, we wish to be able to parameterize the model automatically from training data. Without constraints, the above tasks are addressed by the resource-intensive CYK and Inside-Outside algorithms; here, we present constrained versions of these algorithms that work in reduced space and time (the exact complexity depends nontrivially on the constraints).

Our system of constraints is quite general. Previous constrained versions of Sankoff-like algorithms, such as the programs DYNALIGN [11] and FOLDALIGN [10], have been restricted to "banding" the algorithm e.g. by constraining the maximum insertion/deletion distance between the two sequences or the maximum separation between paired bases. Alternately, constraints on the accessible structures [13] or alignments [9] have been described.

The algorithms described here can reproduce nearly all such banding constraints and, further, can take advantage of more flexible sequence-tailored constraints. Specifically, the fold envelopes determine the subsequences of X and Y that can be considered by the algorithm, while the alignment envelope determines the permissible cutpoints in the pairwise alignment of X and Y. The fold envelopes can be used to prune the search over secondary structures (e.g. by including/excluding specific hydrogen-bonded base-pairings), while the alignment envelopes can be used to prune the search over alignments (e.g. by including/excluding specific residue-level homologies). The fold envelopes can be precalculated for each sequence individually (e.g. by an energy-based folding or a single-sequence SCFG), and the alignment envelope by comparing the two sequences without regard for secondary structure (e.g. using a pairwise Hidden Markov Model); both types of pre-comparison are much more resource-friendly than the unconstrained Sankoff-like algorithms. The design of the constrained algorithms is discussed using concepts from object-oriented programming: the dynamic programming matrix can be viewed as a sparsely populated container, whereas the main loop that fills the matrix is a complex iterator [22]. The algorithms have been implemented in a freely available program for RNA sequence alignment, stemloc, which also includes algorithms to determine appropriate constraints in an automatic fashion. Results demonstrating the program's efficient resource usage are presented.

The stemloc program also implements various familiar extensions to pairwise alignment, including local alignment [23], Waterman-Eggert N-best suboptimal alignments [24] and progressive multiple alignment [25]. Although the envelope framework, rather than these extensions, is the main focus of this paper, implementation of the extensions is straightforward within this framework, and is briefly described.

Results

To investigate the comparative resource usage of the various different kinds of constraint that can be applied using fold and alignment envelopes, stemloc was tested on 22 pairwise alignments taken from version 6.1 of RFAM [37], spanning 7 different families of functional noncoding RNA. Each chosen test family had a consensus secondary structure published independently in the literature, and no two sequences in the test set had higher than 60% identity.

The EMBL accession numbers and co-ordinates of all sequences are listed in Table Table5.5. The table shows the performance of stemloc using the 1000-best fold envelope and the 100-best alignment envelope. The various RFAM families are S15, the ribosomal S15 leader sequence; the U3 and U5 spliceosomal small nucleolar RNAs; IRE, the iron response element from UTRs of genes involved in vertebrate iron metabolism; glmS, the glucosamine-6-phosphate activated mRNA-cleaving ribozyme; Purine, the prokaryotic purine-binding riboswitch; and 6S, the E.coli polymerase-associated transcriptional repressor. The following three test regimes were used, each representing a different combination of fold and alignment envelopes:

Table 5
The subset of RFAM used to test the constrained SCFG algorithms.

N-best alignments, all folds

The alignment envelope containing the N best primary sequence alignments, with the unconstrained fold envelopes (stemloc options: '--nalign N --nfold -1'). This is the red curve in Figures Figures8,8, ,9,9, ,10,10, ,11,11, ,12,12, ,1313

Figure 8
Alignment sensitivity as a function of envelope size parameter N for three different test regimes.
Figure 9
Alignment specificity as a function of envelope size parameter N for three different test regimes.
Figure 10
Fold sensitivity as a function of envelope size parameter N for three different test regimes.
Figure 11
Fold specificity as a function of envelope size parameter N for three different test regimes.
Figure 12
Total running time of stemloc (including envelope generation phases) as a function of envelope size parameter N for three different test regimes.
Figure 13
Peak memory usage of stemloc (i.e. the size of the principal CYK matrix) as a function of envelope size parameter N for three different test regimes.

N-best folds, all alignments

The unconstrained alignment envelope, with the fold envelopes containing the N best single-sequence structure predictions (stemloc options: '--nalign -1 --nfold N'). This is the green curve in Figures Figures8,8, ,9,9, ,10,10, ,11,11, ,12,12, ,1313

N-best folds, 100-best alignments

The alignment envelope containing the 100 best primary sequence alignments, with the fold envelopes containing the N best single-sequence structure predictions (stemloc options: '--nalign 100 --nfold N'). This is the blue curve in Figures Figures8,8, ,9,9, ,10,10, ,11,11, ,12,12, ,1313

In the first two tests, N was varied from 1 to 100; in the latter test, N was varied from 1 to 10000. The lower ceiling for N in the first two tests was imposed by resource limitations. Note that the endpoint of the red curve ("N-best alignments, all folds"), which occurs at N = 100, coincides with the asymptotic limit of the blue curve ("N-best folds, 100-best alignments") at high N.

A range of different values for the parameter N was used to test the above three strategies. As N was increased over the range, the size of the corresponding fold or alignment envelopes was found to be strongly correlated (Figures (Figures6,6, ,7).7). However, the actual size of the fold/alignment envelopes in each particular test case varies widely (see large error bars in Figures Figures6,6, ,7),7), perhaps due to variable factors such as the sequence lengths, compositions and/or identities. Since it is easier to control the envelope construction parameter N than to control the envelope sizes directly, the following section will report performance indicators as a direct function of N, rather than as a function of the strongly-correlated but widely-varying envelope sizes. We report performance indicators for stemloc as follows. Let A and B be alignments of a given pair of sequences, each represented as a set of aligned residue-pairs {(i, k)}. Suppose that A is the alignment according to RFAM, and B is the alignment predicted by stemloc. Then define the alignment sensitivity to be |A B|/|A| and the alignment specificity to be |A B|/|B|. Further, let S and T be possible secondary structures for a given sequence, each represented as a set of base-pairs {(i, j)}. Suppose that S is the published structure, and T is the structure predicted by stemloc. Then the basepair sensitivity is |S T|/|S| and the basepair specificity is |S T|/|T|.

Figure 6
Fold envelope size is highly correlated with N in the N-best fold test, although the variance is large due to the diversity of alignments in the test.
Figure 7
Alignment envelope size is highly correlated with N in the N-best alignment test, although the variance is large due to the diversity of alignments in the test.

These performance indicators are averaged over all 22 pairwise alignments and plotted for the three test regimes in Figure Figure88 (alignment sensitivity), Figure Figure99 (alignment specificity), Figure Figure1010 (basepair sensitivity) and Figure Figure1111 (basepair specificity). As can be seen, the N -best alignment regime empirically seems to achieve an asymptotic maximum performance around N [similar, equals] 100 (possibly even around N [similar, equals] 10), while the N -best fold envelope underperforms compared to the unconstrained fold envelope up to around N [similar, equals] 1000. The tests were performed on a 2.3 GHz Apple PowerPC G5. The resource usages of the test regimes are plotted in Figure Figure1212 (user-mode running time) and Figure Figure1313 (memory usage). The resource usage of the constrained algorithms is substantially reduced when the envelopes are smaller (i.e. at lower N ). This is especially notable when contrasting the resource usage of the "N -best folds, all alignments" test with the more constrained "N -best folds, 100-best alignments" test.

Three main conclusions can be drawn from these data. First, allowing the search to consider more than a single alignment greatly improves structure prediction (the red curve). Second, constraining the alignment search while exhaustively scanning fold space (the red curve) outperforms constraining the fold search while exhaustively scanning alignment space (the green curve). Third, the hybrid strategy (the blue curve), which partially constrains both searches, approaches the alignment-constrained, fold-unconstrained strategy (the red curve) in performance, with a significant saving in CPU and memory resources. Memory is the limiting factor in pairwise RNA alignment, and the primary motivation for constraints. For example, without constraints, alignment of two 16S ribosomal subunits using the stemloc grammar would take approximately 500 terabytes. (Using fold envelope constraints with structures fully specified, it can be done in under 5 gigabytes.)

Based on the results of these tests, the default envelope options for stemloc were chosen to be the 100-best alignment envelope and the 1000-best fold envelope. The performance of stemloc with these envelopes on each of the pairwise test alignments is given in Table Table55.

Discussion

The algorithms presented here include constrained versions of Pair-SCFG dynamic programming algorithms that run in significantly reduced space and time. The primary advance over previous work is the simultaenous imposition of fold and alignment constraints, including alignment constraints that are more general than others previously described. Thes constraints lead to significant reductions in requirements for processor and memory usage, which will increase the length of RNA sequences that can be analyzed on mainstream computer hardware.

These algorithms have been used to implement stemloc, a fast, efficient software tool for multiple RNA sequence alignment implementing numerous extra features such as local alignment, Waterman-Eggert N -best suboptimal alignment and progressive multiple alignment. The source code for the program is freely available from http://www.biowiki.org.

The results given here should be regarded as preliminary. For example, we have only tested the pairwise alignment functionality; full evaluation/optimisation of the multiple alignment algorithm remains. Rather than using the CYK algorithm, one could use the Inside-Outside algorithm with a decision-theoretic dynamic programming step to maximize expected performance [38,39]. As noted in the Parameterization section, it might also be possible to improve on the training procedure. We are also considering ways of elaborating the grammar to include basepair stacking terms. These and other improvements we hope to address in future work.

Conclusion

RNA sequence analysis has generated considerable interest over recent years, as many new roles for RNA in the cell have come to light. RNA genes and regulatory elements are components of many molecular systems and comparative genomics is a powerful way to probe this function, perhaps even more so for RNA than for protein (due to the "well-behaved" statistical correlations found in RNAs with conserved secondary structure). Furthermore, statistical modeling of RNA evolution continues to play a fundamental role in the phylogenetic classification of new forms of life.

These biological motives have driven a demand for RNA sequence analysis tools that are faster, slimmer and more scaleable. It is hoped that the algorithms and approaches described here, together with development and analysis of RNA evolutionary models [36], may expand the applications of RNA informatics.

Methods

We begin our description of the envelope method with an explanatory note regarding our decision to present these constraints in terms of SCFGs, rather than other scoring schemes such as those based solely on energies [15] or on energy/information-theoretic hybrids [11].

The reason for our choice of SCFGs is simple: stochastic grammars are, in our opinion, the most theoretically well-developed of the scoring schemes used for RNA. They come with well-documented algorithms for sequence alignment, structure prediction, parameterization by supervised learning from various kinds of training data, and calculation of posterior probabilities [20]. Discussion of these algorithms is facilitated by a well-developed and widely-understood probabilistic vocabulary. Stochastic grammars are actively researched outside bioinformatics, principally in natural language processing [26]. Crucially, SCFGs are sufficiently general to express virtually all of the features offered by other scoring schemes [19]. We also acknowledge the appeal of free energy-based scoring schemes, which have the advantage that the parameters can be determined experimentally. Energy-based scores can also be used to find posterior probabilities of base-pairings using a partition function [27]. However, to extend energy-based methods to two or more sequences, one must incorporate substitution scores. These are information-theoretic in nature and so are measured in bits, rather than kilocalories-per-mole [28]. Reconciling these two units of score (in a principled way) is an open problem. However, in an SCFG framework, all scores are information-theoretic and so there is no conflict of units.

Despite these arguments, many people continue to find calories preferable to bits as a unit of score. For such readers, we note that the system of constraints described here is entirely applicable to the general score-attributed grammar. This includes energy-based and heuristic scoring schemes as well as (for example) grammars whose rule "probabilities" actually represent log-odds ratios, or which are conditionally normalized with respect to one sequence.

Notation

To implement SCFG dynamic programming algorithms efficiently for RNA, it is convenient to define a simplified (but universal) template for grammars, similar in principle to "Chomsky normal form" [29,30]. Our "RNA normal form" preserves the RNA-optimized efficiency of the Pair SCFG form presented in an earlier paper [13] (based on a single-sequence form due to Durbin et al [20]) by introducing different types of production rule to minimize bifurcations and collect emissions. The form defined here is slightly different from the above forms, in that it classifies only production rules, and not nonterminals, into different types. Let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i1.gif be the "ungapped RNA alphabet", i.e. the set of four possible nucleotides in RNA. Let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i2.gif be the "gapped RNA alphabet", i.e. the ungapped RNA alphabet Ω plus the gap symbol An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i3.gif. Finally, let Ψ = Ω' × Ω' be the "gapped-pair RNA alphabet", i.e. a Cartesian product of two gapped RNA alphabets. We write Ψ-symbols by vertically stacking pairs of Ω'-symbols, like this An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i4.gif or this An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i5.gif.

A pairwise stochastic context-free grammar An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif in RNA normal form consists of a nonterminal symbol alphabet Φ, a terminal symbol alphabet that is the gapped-pair RNA alphabet Ψ, and (for each nonterminal L) a probability distribution over a set of transformation rules (or production rules), An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i7.gif(L R), where R = R1 ... RK is a sequence of nonterminal or terminal symbols, taking one of several stereotypical forms (see below). The nonterminal L is referred to as the left-hand side (LHS) of the production rule and the symbol sequence R as the right hand side (RHS).

Let S, U, V, W [set membership] Φ denote nonterminal symbols. The allowable forms for production rules include terminations, transitions, bifurcations and emissions. These are defined as follows

Terminations: rules of the form L ε.

Transitions: rules of the form L V. The directed graph formed by transition rules on nonterminals must be acyclic, and the list of nonterminals Φ must be topologically reverse-sorted with respect to this graph; i.e. if An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i7.gif(U V) > 0, then V appears before U in Φ.

Bifurcations: rules of the form L VW. There must be no transition-termination path from V or W to ε, i.e. neither V or W can have completely empty inside sequence-pairs (see next section for a formal definition of the "inside sequence-pair").

Emissions: rules of the form An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i8.gif where A', B', C', D' [set membership] Ω' are gapped RNA symbols, at least one of which is a non-gap symbol. For convenience, we also define A, B, C, D [set membership] Ω* to be the corresponding ungapped RNA sequences, as follows: A is the empty string if and only if A' is the gap character; otherwise, A = A'. Similar definitions apply for B, C and D.

The particular RNA normal form described in this section is chosen for ease of presentation. The implementation in the dart library uses the slightly more restrictive form for Pair SCFGs defined in an earlier paper [13].

For presentational purposes, we will generally omit all-gap columns from the pairwise alignment and the grammar. For example, an emission rule having the form An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i9.gif would be written as An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i10.gif instead. All-gap columns are not very interesting to a sequence analyst, and only arise in our formalism because all emission rules have the same form.

Table Table11 is an example of an RNA normal form grammar with two nonterminals, Stem and Loop. The grammar generates simple alignments of stems and loops, using two nonterminals (Stem and Loop); the starting nonterminal is Stem. The rule probabilities are functions of five scalar probability parameters (stemExtend, stemGap, bifurcate, loopExtend and loopGap) and four arrays of probability parameters (baseIndel[4], baseSubstitution[16], basepairIndel[16] and basepairSubstitution [256]). Here we introduce the notation X [N] for an array of N probability parameters normalized so that An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i11.gif. It is also convenient to introduce some notation for ungapped sequences at this stage. Let X, Y [set membership] Ω* denote ungapped RNA sequences, including (possibly) the empty string ε. Let Xi be the i'th symbol of X (counting from 1, so e.g. X3 is the third symbol), let Xij denote the subsequence from Xi+1 to Xj inclusive, or the empty string if i = j (so e.g. X0,3 contains the first three symbols of X, while X3,3 = ε) and let |X| denote the length of X.

Table 1
A stochastic context-free grammar for generating pairwise alignments of RNA structures.

The parse tree and the sequence likelihood

The grammar An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif is a probabilistic model for deriving sequences X, Y from a single nonterminal. This derivation proceeds as follows: start with an initial sequence containing one starting nonterminal, S, then repeatedly apply probabilistically-sampled transformations to the nonterminals in the sequence. Eventually the sequence will contain only terminals from Ψ. This process generates a parse tree, rooted at node S, in which internal nodes are labeled with nonterminals and leaf nodes with terminals, with children of each node ordered left-to-right (Figure (Figure1).1). Sequence X can be obtained by reading off ungapped RNA symbols from the top row of the output, and sequence Y by reading off the bottom row. Note that the subtree rooted at any internal W-labeled node describes a sub-process that generates some pair of subsequences (Xij, Ykl) starting from nonterminal W. We will refer to this subsequence-pair (Xij, Ykl) as the inside sequence-pair of W.

Figure 1
A parse tree for the grammar of Table 1. Each internal node is labeled with a nonterminal (Stem or Loop); additionally, the subsequences (Xij, Ykl) generated by each internal node are shown. The parse tree determines both the structure and alignment of ...

The parse tree likelihood is the product of all the rule probabilities corresponding to the internal nodes. Summing the likelihoods of all parse trees rooted in state S and generating sequences X, Y, one obtains P(X, Y | S,An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif), the sequence likelihood. This sum can be performed efficiently by the Inside algorithm, as will be described below.

Dynamic programming algorithms for Pair SCFGs

The following section describes the constrained and unconstrained dynamic programming (DP) algorithms used for Pair SCFGs.

The Inside algorithm

The Inside algorithm [26] computes P(X, Y | S,An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif) by recursive decomposition via intermediate probabilities An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) [equivalent] P(Xij, Ykl|U,An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif). In RNA normal form, the time-limiting step in the Inside algorithm involves summing contributions to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) from bifurcation rules of the form U VW, such that the bifurcation splits the two sequences (X, Y) between bases (Xm, Yn) and (Xm+1, Yn+1)

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i19.gif

An asymptotically faster step involves summing contributions from matching emission rules of the form An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i20.gif (transition rules are represented as the special case A' = B' = C' = D' = An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i3.gif)

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i21.gif

(Recall that A is the ungapped version of A'. Thus |A| = 0 [left and right double arrow ] A = ε [left and right double arrow ] A' = An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i3.gif and similarly for B, C and D.)

The intermediate probabilities of the Inside algorithm can then be expressed as

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) = Q1 + Q2

Termination of the recursion is provided by matching end rules, U ε, if and only if the subsequences (Xij, Ykl) are empty

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, i, k, k) = An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i7.gif(U ε)

The sequence likelihood is obtained as

P(X, Y | S,An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif) [equivalent] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i22.gif(0,|X|,0,|Y|)

In pseudocode, the Inside algorithm is

• Inputs: X, Y, S, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif

• For i = |X| to 0 (descending)

    • For j = i to |X|

       • For k = |Y| to 0 (descending)

          • For l = k to |Y|

       • For each nonterminal U [set membership] ε

          • Set Q1 ← 0; calculate Q2

          • For m = i to j

             • For n = k to l

                • Calculate QB(m, n) and add to Q1

          • Calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) and store

• Return An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i22.gif(0, |X|, 0, |Y|)

The time-limiting step of the Inside algorithm (computing the QB) involves six indices (i, j, k, l, m, n) and the time complexity of the full recursion is O(|X|3|Y|3). However, the stored intermediate probabilities involve only four indices (i, j, k, l) and so the memory complexity is O(|X|2|Y|2).

In RNA normal form, the emission rules (Q2) account for homologous base-pairings between residues (Xi, Xj) and (Yk, Yl), or unpaired residues at Xi, Xj, Yk or Yl. This may also imply that Xi and Yk are aligned, or that Xj and Yl are (Figure (Figure2).2). The bifurcation rules (QB(m, n)) account for conserved multiloop structures in the RNA, i.e. one homology between substructures Xim and Ykn and another between substructures Xmj and Ynl (Figure (Figure33).

Figure 2
Parsing a pair of sequences (X, Y) using the Inside algorithm involves iterating over subsequence-pairs (Xij, Ykl) specified by four indices (i, j, k, l). In the constrained Inside algorithm, these indices are only valid if the fold envelopes (triangular ...
Figure 3
Bifurcation rules allow a subsequence-pair (Xij, Ykl) to be composed from two adjoining subsequence-pairs (Xim, Ykn) and (Xnj, Yni). For this to be permitted by the constraints, the X-fold envelope (upper triangular grid) must contain subsequences (i ...

Imposing constraints

The high time and memory cost of the Inside and related algorithms motivate the development of slimmer, faster versions. To begin with, we impose constraints that narrow the search space. For example, we might want to pre-parse the sequences individually (using a single-sequence SCFG, or other O(|X|3) RNA-folding method) and identify likely base-pairs (Xi, Xj) or (Yk, Yl), and/or likely unpaired nucleotides Xi or Yk. Even simpler, we could simply throw out basepairs (Xi, Xj) between distant residues (i.e. where j - i exceeds some cutoff) Alternatively, we might want to pre-align the sequences (using a pairwise hidden Markov model, or other O(|X||Y|) alignment method) and identify likely alignment columns (Xi, Yk) and/or likely indels Xi or Yk. Again, more simply, we could simply exclude columns (Xi, Yk) for which |k - i| is too large.

We can combine these various strategies into a generalized constraint on base-pairs, alignment-columns or both. We stipulate that An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) = P(Xij, Ykl|U, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif) = 0 unless the following conditions are satisfied

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i23.gif

Here An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif are sets of permissible co-ordinates for structurally discrete subsequences in X and Y. Fold-related features (basepairs and unpaired residues) can be included or excluded by this set, and so we refer to it as a fold envelope [13]. The set An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif is a set of possible cut-points in the alignment of X and Y, and is referred to as an alignment envelope [31]. Both types of envelope are illustrated in Figure Figure2.2. The fold and alignment envelopes satisfy the following set relations

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i27.gif

If equality holds in all three cases, then we recover the unconstrained Inside algorithm.

Note that the co-ordinates (i, j, k, l) for the cutpoints and subsequences lie between residues of X and Y.

There are (|X| + 1)(|Y| + 1) cutpoints in the maximal alignment envelope An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i28.gif(|X| + 1)(|X| + 2) subsequences in the maximal fold envelope An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif.

As an alternative to the unconstrained Inside algorithm, we can partially initialize the envelopes to limit the maximum subsequence length and/or the maximum deviation of the alignment from the main diagonal (Figure (Figure4).4). More flexibly, we can limit the recursion to a single alignment, a single structure, or a broadly-specified set of alignments or structures (Figure (Figure5).5). Applications such as alignment of two known structures [13,32], alignment of an unstructured sequence to a known structure [33] or structure prediction from a known alignment [9] all reduce to simple application of the appropriate constraints.

Figure 4
These fold envelopes (triangular grids) limit the maximum length of subsequences (black dots), while the alignment envelope (rectangular grid) limits the maximum deviation of cutpoints (short diagonal lines) from the main diagonal.
Figure 5
These fold envelopes (triangular grids) and alignment envelope (rectangular grid) limit the subsequences (black dots) and cutpoints (short diagonal lines) to those consistent with a given alignment and consensus secondary structure (shown). The alignment ...

Further possible constraints

The constraints given here allow the independent imposition of alignment or fold constraints. One can imagine further, even more general constraints. For example, one could exclude subsequence-pairs (Xij, Ykl) of radically different lengths, i.e. for which |(j - i) - (k - l)| exceeds some cutoff. This constraint is employed by the FOLDALIGN program. It is not expressible as a combination of independent alignment and fold constraints, and has not been implemented for the present work, though it would be relatively straightforward to combine it with the other constraints described here [10].

Accelerating the iteration

Simply setting some intermediate probabilities to zero is not sufficient to accelerate the Inside algorithm. We also need to redesign the iteration to avoid visiting zero-probability subsequence-pairs (Xij, Ykl). This is achieved by pre-indexing the fold envelopes An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif and the alignment envelope An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif so that we can quickly locate valid co-ordinates (i, j, k, l). The following is pseudocode for the algorithm with the redesigned iterator

• Inputs: X, Y, S, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif

• For i = |X| to 0 (descending)

    • For each j satisfying (i, j) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif (ascending) (†)

       • For each k satisfying (i, k) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif (descending) (†)

          • For each l satisfying (k, l) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif (ascending) (†)

             • If (j, l) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif then

                • For each nonterminal U [set membership] Φ

                   • Set Q1 ← 0; calculate Q2

                   • For each m satisfying {(i, m), (m, j)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif (†)

                      • For each n satisfying {(k, n), (n, l)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif (†)

                         • If (m, n) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif then

                            • Calculate QB(m, n) and add to Q1

                   • Calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) and store

• Return An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i22.gif(0, |X|, 0, |Y|)

(†) These ordered subsets of An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif can be precomputed for speed.

Alternative designs for the algorithm are possible, and indeed different circumstances may affect the choice of optimal design (e.g. depending on which envelopes are most constrained).

Slimming the container

Memory is the most prohibitively expensive resource demand of the Inside algorithm. In its simplest form, the algorithm stores the intermediate probabilities An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l) using a five-dimensional array indexed by U, i, j, k and l. To get the most benefit out of imposing constraints, it is necessary to replace this multidimensional array with an efficiently-indexed reduced-space container.

This design decision involves a close trade-off between CPU and memory usage. Initially, we tested various combinations of generic containers with O(N)-storage and O(N log N) access-times, such as balanced search trees [22]. These were found to be unbearably slow, and so we settled on a configuration of nested preindexed arrays. This configuration wastes some space, but has the important advantage of constant access time for given indices (U, i, j, k, l).

For fold envelope An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif, we precompute and sort the list An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i29.gif = {(i, j)} [subset, dbl equals] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif of subsequences starting at each position i. For each subsequence (i, j) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif, let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i30.gif be its rank in An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i29.gif. These An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i30.gif are also precomputed and stored. Similar precomputed sets An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i31.gif and ranks An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i32.gif are stored for An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif.

Our DP matrix then uses an inner two-dimensional array nested inside an outer two-dimensional array.

The outer array has dimensions (|X|, |Y|) and is indexed by (i, k). The inner array has dimensions (|An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i29.gif|, |An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i31.gif|) and is indexed by (An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i30.gif, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i32.gif). Cells of this inner array are further sub-indexed by nonterminal U using a standard fixed-length array, yielding An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i18.gif(i, j, k, l).

This particular configuration is efficient when the alignment envelope is densely populated and the fold envelopes are sparsely populated. As with the redesigned iterator, there may be alternative designs that are resource-optimal under various different circumstances, depending on the nature of the envelopes.

The CYK algorithm

The Cocke-Younger-Kasami (CYK) dynamic programming algorithm [20] is related to the Inside algorithm, but replaces "p + q" with "max(p, q)" in all formulae (that is, instead of summing probabilities, the maximum probability is always taken). The entries of the DP matrix, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i33.gif(i, j, k, l), represent the maximum likelihood of any parse tree for (Xij, Ykl). A recursive/stack-based traceback algorithm can be used to recover this parse tree, beginning from subsequence (i, j, k, l) = (0, |X|, 0, |Y|) (for global alignment) or (i, j, k, l) = argmax An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i22.gif(i, j, k, l) (for local alignment) [13].

The Outside and KYC algorithms

The Outside and KYC algorithms widen the applications of probabilistic inference with SCFGs. The Outside algorithm, together with the Inside, can be used to recover posterior probabilities of given basepairs/columns, which can be used as alignment reliability indicators or as update counts in Expectation Maximization parameter training [20,26]. The KYC algorithm can be used with CYK to recover maximum-likelihood tracebacks from given co-ordinates. (We introduce the name "KYC" as a simple reversal of "CYK", reflecting the fact that KYC is to CYK as Outside is to Inside, i.e. the "reverse" version of the algorithm.)

These algorithms use dynamic programming recursions that are related to Inside and CYK. The Outside algorithm calculates intermediate probabilities of the form

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i34.gif(i, j, k, l) = P(X0,i, Y0,k, V, Yl,|Y|, Xj,|X| |S)

representing the sum-over-probabilities of all partial parse trees rooted at S and ending in V without having yet generated sequences Xij and Ykl. Then, for example, the posterior probability that some node V in the parse tree has inside sequence-pair (Xij, Ykl) is

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i35.gif

As CYK is related to Inside, the KYC algorithm is related to the Outside algorithm: the intermediate probabilities An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i36.gif(i, j, k, l) are found using similar recursions (see below), but with all sums "p + q" replaced by maxima "max(p, q)". The CYK and KYC DP matrices can be used to find the maximum likelihood of any parse tree containing a node V with inside sequence-pair (Xij, Ykl): this maximum likelihood is An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i36.gif(i, j, k, l) An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i33.gif(i, j, k, l). A recursive/stack-based traceback algorithm can be used to find the parse tree with this maximum likelihood.

As with the Inside algorithm, we sum contributions to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i34.gif(i, j, k, l) from various matching production rules. In contrast to the Inside algorithm, the nonterminal V that indexes An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i34.gif (...) must now be matched on the right-hand-side, not the left-hand-side, of these production rules. We first consider bifurcations from a source nonterminal U that adjoin adjacent subsequences to left (U WV) or right (U VW), so that the inside sequence-pairs for U are (respectively) (Xmj, Ynl) or (Xim,Ykn)

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i37.gif

Next we consider emission rules, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i20.gif, again representing transitions as the special case A' = B' = C' = D' = An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i3.gif

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i38.gif

The intermediate Outside probabilities are thus

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i34.gif(i, j, k, l) = An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i39.gif + An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i40.gif

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i41.gif(0, |X|, 0, |Y|) = 1 (Termination condition)

Note that the Inside probabilities An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i42.gif(i, j, k, l) are needed to compute the Outside probabilities. We supply the Inside matrix, I, as an input to the Outside algorithm (they are usually calculated at the same time anyway).

In terms of the underlying iteration, the key difference between the Inside and Outside algorithms is as follows. Suppose subsequence-pair INNER = (Xij, Ykl) is enclosed by subsequence-pair OUTER = (Xi'j', Yk'l') (that is, 0 ≤ i' i j j' ≤ |X| and 0 ≤ k' k l l' ≤ |Y|, and INNER ≠ OUTER). Then the Inside iterator visits INNER before OUTER, whereas the Outside iterator visits OUTER before INNER. The order in which the Outside iterator visits nonterminals is topologically forward-sorted with respect to the grammar's transition-rule graph (i.e. the reverse of the order used by the Inside iterator).

• Inputs: X, Y, S, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i6.gif, I

• Initialize An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i41.gif(0, |X|, 0, |Y|)

• For i = 0 to |X| (ascending)

    • For each j satisfying (i, j) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif (descending) (†)

       • For each k satisfying (i, k) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif (ascending) (†)

          • For each l satisfying (k, l) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif (descending) (†)

             • If (j, l) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif then

                • For each nonterminal U [set membership] Φ (reverse order)

                   • Set An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i39.gif ← 0; calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i40.gif

                   • For each m satisfying {(m, j), (m, i)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif (†)

                      • For each n satisfying {(n, l), (n, k)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif (†)

                         • If (m, n) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif then

                            • Calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i43.gif(m, n) and add to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i39.gif

                   • For each m satisfying {(i, m), (j, m)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif (†)

                      • For each n satisfying {(k, n), (l, n)} [subset or is implied by] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif (†)

                         • If (m, n) [set membership] An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif then

                            • Calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i44.gif(m, n) and add to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i40.gif

                   • Calculate An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i45.gif(i, j, k, l) and store

(†) These ordered subsets of An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i24.gif, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i25.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-73-i26.gif can be precomputed for speed.

The reduced-space dynamic programming matrix that was developed above for the constrained Inside algorithm can be re-used for the constrained Outside algorithm.

Implementation

The above-described algorithms were implemented in the C++ dart library. One dart program in particular, stemloc, is an efficient general-purpose RNA multiple-sequence alignment program that can be flexibly controlled by the user from the Unix command line, including re-estimation of parameters from training data as well as a broad range of alignment functions.

The dart libraries provide Inside, Outside, CYK, KYC, traceback and training algorithms for any pairwise SCFG in RNA normal form, whose rule probabilities can be expressed as algebraic expressions of some set of probability parameters (with associated normalization constraints). The operator-overloading features of C++ are utilized in full, so that the syntax of initializing a grammar object involves very few function calls and is essentially declarative.

dart source code releases can be downloaded under the terms of the GNU Public License, from the following URL (which also gives access to the latest development code in the CVS repository)

http://www.biowiki.org/

The grammars and algorithms described in this paper specifically refer to release 0.2 of the dart package, dated October 2004 (although the algorithms are also implemented in release 0.1, dated 10/2003).

Selecting appropriate fold and alignment envelopes

This section offers a non-exhaustive list of possible strategies for choosing appropriate fold/alignment envelopes. (Italicized terms apply to fold envelopes, and bold terms to alignment envelopes.)

• Choose some appropriately simplified grammar, such as a single-sequence SCFG/pair HMM that models RNA folding/primary sequence alignment. Compute posterior probabilities of subsequences/cutpoints using the Inside-Outside/Forward-Backward algorithm for some single-sequence SCFG/pair HMM that models RNA folding/primary sequence alignment. Select all subsequences/cutpoints with posterior probability above some threshold (or select e.g. the top 10 percentile of the posterior probabability distribution). Ensure that each of these subsequences/cutpoints is on a valid traceback path, e.g. by running the CYK-KYC/Viterbi-forward-backward algorithm to find the maximum-likelihood traceback path from any given subsequence/cutpoint. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.

• As above, choose some appropriate single-sequence SCFG/pair HMM that models RNA folding /primary sequence alignment. Compute the maximum traceback path-likelihood from all subsequences/cutpoints using the CYK-KYC/Viterbi-forward-backward algorithm for this grammar. Select all subsequences/cutpoints with maximum traceback likelihood above some threshold (or select e.g. the top 10 percentile of the max-traceback likelihood distribution) and do a traceback from each such cell. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.

• As above, choose some appropriate single-sequence SCFG/pair HMM that models RNA folding /primary sequence alignment. Sample some number of RNA structures /pairwise alignments using the Inside/Forward algorithm with stochastic traceback. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.

The latter two strategies have been implemented in the stemloc package described below. Empirically, the stochastic strategy appears to be less reliable than the deterministic strategies (although in theory the stochastic strategy will eventually find the globally optimal alignment given sufficiently many random repetitions, which may be a useful property).

Multiple sequence alignment

A heuristic algorithm for performing multiple alignment-and-folding of RNA sequences with a pairwise SCFG by progressive single-linkage clustering runs as follows

• Start by making pairwise alignments (with predicted secondary structures) for all pairs of input sequences.

• Mark the highest-scoring pair, and extract the two marked sequences with their predicted secondary structures. This highest-scoring pair is called the seed alignment.

• While some sequences remain unmarked:

    • For each newly-marked sequence:

       • Align the marked sequence, with a fold envelope constrained by its predicted structure, to each unmarked sequence in turn. (The fold envelope can be tailored to allow e.g. extension of local alignments.)

    • Select the highest-scoring of the pairwise (marked-to-unmarked) alignments. Use this alignment to merge the unmarked sequence into the seed alignment, and mark this sequence as newly aligned.

• Return the seed alignment.

The above algorithms have been implemented in stemloc. The multiple alignments produced by this algorithm lack well-defined probabilistic scores unless the pair SCFG is conditionally normalized. It is also straightforward to retrieve the N best non-overlapping alignments by repeatedly applying an incremental Waterman-Eggert-style mask to the alignment envelope [24]. This is implemented in stemloc for the pairwise case.

A grammar for pairwise RNA alignment and structure prediction

After some empirical experimentation, we developed the grammar of Tables Tables2,2, ,3,3, ,44 for the stemloc program. The grammar is split over three tables due to its considerable number of rules. Table Table22 contains rules describing the connectivity of stems, loops and multiloops, and contains the only bifurcation rule. Table Table33 describes the connectivity of bulges and Table Table44 handles emissions (basepaired, unpaired, aligned or gapped).

Table 2
The stemloc grammar, part 1 of 3: stem and loop structures.
Table 3
The stemloc grammar, part 2 of 3: bulges.
Table 4
The Stemloc grammar, part 3 pf 3: emissions.

The starting nonterminal is Start. The nonterminals representing higher-level units of RNA structure are Loop, Stem, LBulge, RBulge and LRBulge. Each of these has associated Match, Ins and Del states (e.g. StemMatch, StemIns and StemDel) and each of these states has an associated emission state, prefixed with x, y or xy superscripts (e.g. xyStemMatch, yStemIns and xStemDel). The Multi state models multiloops, using a bifurcation to LMulti and RMulti.

Tables Tables2,2, ,3,3, ,44 also refer to probabilistic parameters used by the models. Free probability parameters (allowed to range from 0 to 1) include startInStem (determining the probability of ending the alignment with a basepair), loopExtend, stemExtend and multiExtend (determining the geometric distribution over loop and stem lengths, or the number of branches in a multiloop), multiBulgeOpen (determining the probability of bulges in multiloops) and stemGapOpen, stemGapExtend, stemGapSwap, loopGapOpen, loopGapExtend and loopGapSwap (determining the geometric distribution over gap lengths in stems and loops). There are also five parameter arrays: postStem[4] (determining whether a stem is followed by a loop, a bulge, a double-stranded bulge or a multiloop); baseIndel[4] (a probability distribution for a single unaligned, unpaired nucleotide); baseSubstitution[16] (a joint probability distribution for two aligned, unpaired nucleotides); basepairIndel[16] (a joint probability distribution for two unaligned, basepaired nucleotides); and basepairSubstitution [256] (a joint probability distribution for four aligned, basepaired nucleotides). All of the above parameters were automatically estimated from training data by the dart software.

To summarize, the grammar models homologous stems, loops, multiloops and bulges in pairwise RNA alignments, with covariant substitution scores and affine gap penalties (geometric indel length distributions). It has the property that any combined alignment and structure prediction for two RNA sequences has a single, unambiguous parse tree. In our investigations, this unambiguity appeared to improve the accuracy of alignment and structure prediction substantially; see also writings on this topic by Giegerich [34] and Dowell, Eddy et al [35]. The grammar is also designed to minimize the number of bifurcation rules (the only bifurcation is Multi → LMulti RMulti).

The stemloc grammar does not model basepair stacking effects due to π-orbital overlap, nor does it allow low-cost insertion or deletion of whole stems or substructures. Since the algorithms are implemented for any SCFG, it is straightforward to modify the program to experiment with grammars that model such phenomena. An example of a grammar that models the latter type of mutation (whole-substructure indels), and is also fully derived from an evolutionary rate-based model, is presented in a companion paper [36].

Parameterization

Under the SCFG framework, the probability parameters for the grammar can be estimated directly from data using the Inside-Outside algorithm with appropriate constraints, which are easy to supply (e.g. to sum over all parses consistent with a given alignment, one simply uses an appropriate alignment envelope). The parameters were trained from 56376 (non-independent) pairwise alignments from RFAM [37], with each pairwise alignment making a weighted contribution of 1/N to the counts computed during Expectation Maximization training, where N is the number of sequences in the multiple alignment from which the pairwise alignments were derived. Furthermore, the pairwise alignments were binned according to sequence identity, providing four alternative parameterisations; the bin ranges were 30–40%, 50–60%, 70–80% and 90–100%.

stemloc allows the user to re-estimate all parameters from their own personal training set of trusted alignments. This may be a useful feature, since the training procedure described above is probably biased. Since training was performed using all kinds of sequence available in RFAM, including RNA sequences with computationally predicted secondary structure as well as those for which structures were experimentally confirmed, it is possible that the stemloc parameters may be skewed by the parameters of the computational methods used by the RFAM curators to predict structure. These include the homology modeling program INFERNAL [37] and the de novo structure prediction program PFOLD [8].

Authors' contributions

IH designed, programmed, tested and documented the algorithms.

Acknowledgements

The author thanks Sean Eddy for inspiring discussions and three anonymous reviewers for their helpful suggestions. The work was conceived during an NIH-funded workshop on RNA informatics organised by Elena Rivas and Eric Westhof in Benasque, Spain, 2003.

References

  • Eddy SR. Noncoding RNA genes. Current Opinion in Genetics and Development. 1999;9:695–699. doi: 10.1016/S0959-437X(99)00022-2. [PubMed] [Cross Ref]
  • Mandal M, Boese B, Barrick JE, Winkler WC, Breaker RR. Riboswitches Control Fundamental Biochemical Pathways in Bacillus subtilis and Other Bacteria. Cell. 2003;113:577–586. doi: 10.1016/S0092-8674(03)00391-X. [PubMed] [Cross Ref]
  • Sijen T, Plasterk RH. Transposon silencing in the Caenorhabditis elegans germ line by natural RNAi. Nature. 2003;426:310–314. doi: 10.1038/nature02107. [PubMed] [Cross Ref]
  • Ambros V. The functions of animal microRNAs. Nature. 2004;431:350–355. doi: 10.1038/nature02871. [PubMed] [Cross Ref]
  • Baulcombe D. RNA silencing in plants. Nature. 2004;431:356–363. doi: 10.1038/nature02874. [PubMed] [Cross Ref]
  • Rivas E, Eddy SR. Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics. 2000;16:583–605. doi: 10.1093/bioinformatics/16.7.583. [PubMed] [Cross Ref]
  • Coventry A, Kleitman DJ, Berger B. MSARI: Multiple sequence alignments for statistical detection of RNA secondary structure. Proceedings of the National Academy of Sciences of the USA. 2004;101:12102–12107. doi: 10.1073/pnas.0404193101. [PMC free article] [PubMed] [Cross Ref]
  • Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999;15:446–454. doi: 10.1093/bioinformatics/15.6.446. [PubMed] [Cross Ref]
  • Rivas E, Eddy SR. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2001;2:8. doi: 10.1186/1471-2105-2-8. [PMC free article] [PubMed] [Cross Ref]
  • Gorodkin J, Heyer LJ, Stormo GD. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research. 1997;25:3724–3732. doi: 10.1093/nar/25.18.3724. [PMC free article] [PubMed] [Cross Ref]
  • Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology. 2002;317:191–203. doi: 10.1006/jmbi.2001.5351. [PubMed] [Cross Ref]
  • Perriquet O, Touzet H, Dauchet M. Finding the common structure shared by two homologous RNAs. Bioinformatics. 2003;19:108–116. doi: 10.1093/bioinformatics/19.1.108. [PubMed] [Cross Ref]
  • Holmes I, Rubin GM. Pairwise RNA structure comparison using stochastic context-free grammars. Pac Symp Biocomput. 2002:163–174. [PubMed]
  • Sankoff D. Simultaneous solution of the RNA folding, alignment, and protosequence problems. SIAM Journal of Applied Mathematics. 1985;45:810–825. doi: 10.1137/0145048. [Cross Ref]
  • Zuker M, Stiegler P. Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information. Nucleic Acids Research. 1981;9:133–148. [PMC free article] [PubMed]
  • Eddy SR, Durbin R. RNA Sequence Analysis Using Covariance Models. Nucleic Acids Research. 1994;22:2079–2088. [PMC free article] [PubMed]
  • Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D. Stochastic Context-Free Grammars for tRNA Modeling. Nucleic Acids Research. 1994;22:5112–5120. [PMC free article] [PubMed]
  • Brown M, Wilson C. RNA Pseudoknot Modeling Using Intersections of Stochastic Context-Free Grammars with Applications to Database Search. 1995. http://www.cse.ucsc.edu/research/compbio/pseudoknot.html [PubMed]
  • Lefebvre F. A Grammar-Based Unification of Several Alignment and Folding Algorithms. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith RF, Menlo Park, editor. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. CA: AAAI Press; 1996. pp. 143–154.
  • Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
  • Hofacker IL, Bernhart SH, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20:2222–2227. doi: 10.1093/bioinformatics/bth229. [PubMed] [Cross Ref]
  • Austern MH. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. Addison-Wesley; 1999.
  • Smith TF, Waterman MS. Identification of Common Molecular Subsequences. Journal of Molecular Biology. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5. [PubMed] [Cross Ref]
  • Waterman MS, Eggert M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology. 1987;197:723–725. doi: 10.1016/0022-2836(87)90478-5. [PubMed] [Cross Ref]
  • Higgins DG, Sharp PM. Fast and Sensitive Multiple Sequence Alignments on a Microcomputer. Computer Applications in the Biosciences. 1989;5:151–153. [PubMed]
  • Lari K, Young SJ. The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language. 1990;4:35–56. doi: 10.1016/0885-2308(90)90022-X. [Cross Ref]
  • McCaskill JS. The Equilibrium Partition Function and Base Pair Binding Probabilities for RNA Secondary Structure. Biopolymers. 1990;29:1105–1119. doi: 10.1002/bip.360290621. [PubMed] [Cross Ref]
  • Altschul SF. Amino Acid Substitution Matrices from an Information Theoretic Perspective. Journal of Molecular Biology. 1991;219:555–565. doi: 10.1016/0022-2836(91)90193-A. [PubMed] [Cross Ref]
  • Chomsky N. Three Models for the Description of Language. IRE Transactions Information Theory. 1956;2:113–124. doi: 10.1109/TIT.1956.1056813. [Cross Ref]
  • Chomsky N. On Certain Formal Properties of Grammars. Information and Control. 1959;2:137–167. doi: 10.1016/S0019-9958(59)90362-6. [Cross Ref]
  • Holmes I. PhD thesis. The Sanger Centre; 1998. Studies in probabilistic sequence alignment and evolution.
  • Shapiro BA, Zhang KZ. Comparing multiple RNA secondary structures using tree comparisons. Computer Applications in the Biosciences. 1990;6:309–318. [PubMed]
  • Klein R, Eddy SR. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2003;4 [PMC free article] [PubMed]
  • Giegerich R. Explaining and Controlling Ambiguity in Dynamic Programming. In: Giancarlo R, Sankoff D, editor. Combinatorial Pattern Matching: 11th Annual Symposium. Vol. 1848. Springer-Verlag Heidelberg; 2000. pp. 46–59.
  • Dowell RD, Eddy SR. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004;5:71. doi: 10.1186/1471-2105-5-71. [PMC free article] [PubMed] [Cross Ref]
  • Holmes I. A probabilistic model for the evolution of RNA structure. BMC Bioinformatics. 2004;5 [PMC free article] [PubMed]
  • Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Research. 2003;31:439–441. doi: 10.1093/nar/gkg006. [PMC free article] [PubMed] [Cross Ref]
  • Holmes I, Durbin R. Dynamic programming alignment accuracy. Journal of Computational Biology. 1998;5:493–504. [PubMed]
  • Do CB, Brudno M, Batzoglou S. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences [PMC free article] [PubMed]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...