- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# A statistical sampling algorithm for RNA secondary structure prediction

^{*}To whom correspondence should be addressed. Tel: +1 518 486 1719; Fax: +1 518 473 2900; Email: gro.htrowsdaw@gnidy

## Abstract

An RNA molecule, particularly a long-chain mRNA, may exist as a population of structures. Further more, multiple structures have been demonstrated to play important functional roles. Thus, a representation of the ensemble of probable structures is of interest. We present a statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures. The forward step of the algorithm computes the equilibrium partition functions of RNA secondary structures with recent thermodynamic parameters. Using conditional probabilities computed with the partition functions in a recursive sampling process, the backward step of the algorithm quickly generates a statistically representative sample of structures. With cubic run time for the forward step, quadratic run time in the worst case for the sampling step, and quadratic storage, the algorithm is efficient for broad applicability. We demonstrate that, by classifying sampled structures, the algorithm enables a statistical delineation and representation of the Boltzmann ensemble. Applications of the algorithm show that alternative biological structures are revealed through sampling. Statistical sampling provides a means to estimate the probability of any structural motif, with or without constraints. For example, the algorithm enables probability profiling of single-stranded regions in RNA secondary structure. Probability profiling for specific loop types is also illustrated. By overlaying probability profiles, a mutual accessibility plot can be displayed for predicting RNA:RNA interactions. Boltzmann probability-weighted density of states and free energy distributions of sampled structures can be readily computed. We show that a sample of moderate size from the ensemble of an enormous number of possible structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics. Our applications suggest that the sampling algorithm may be well suited to prediction of mRNA structure and target accessibility. The algorithm is applicable to the rational design of small interfering RNAs (siRNAs), antisense oligonucleotides, and *trans*-cleaving ribozymes in gene knock-down studies.

## INTRODUCTION

RNA molecules participate in a variety of important biological processes that include catalysis, RNA splicing, regulation of transcription, translation, and RNA:DNA, RNA:RNA and RNA:protein interactions. The function of an RNA molecule is determined by its structure. However, it is extremely difficult to crystallize large RNA molecules for structural determination. Crystal structure has been determined only for a small number of RNA molecules, although exciting progress has been made for rRNAs in recent years (1). Secondary structures are highly conserved in evolution for most functional RNAs, e.g. tRNAs (2). On the other hand, RNA tertiary structural motifs involve interactions between secondary structure elements. To a large extent, RNA folding is driven by secondary structural features. For these reasons, elucidation of RNA secondary structure is an important step toward determination of three-dimensional structure and function of an RNA. Computational methods are valuable, because determination of secondary structure, particularly for long-chain RNA molecules, is difficult by experimental means.

From the perspective of statistical mechanics, characterization of the full ensemble of secondary structures for a given RNA sequence is of great interest (3,4). For example, an mRNA may exist as a population of different structures (5). On the other hand, multiple structures are known to be involved in a variety of RNA regulatory functions. These include the functioning of the 5S rRNA during prokaryotic protein synthesis (6,7), regulation of translation initiation in prokaryotes (8,9), and attenuation of transcription in enteric bacteria (10).

Free energy minimization is the most popular method for the prediction of RNA secondary structure from a single sequence. Although free energy models for secondary structure motifs have undergone refinements for more accurate characterization of folding thermodynamics, there is still uncertainty in the experimental estimates of the parameters. The free energy computed for a structure is an approximation as well, because of the assumption of free energy additivity and the need to extrapolate to loop sequences and loop sizes in the absence of measured estimates. Slight deviations in the free energy parameters can lead to substantial differences in the computed optimal folding. This ill conditioning of the RNA folding problem by free energy minimization has been well noted (11). Furthermore, the stability of secondary structure motifs can be affected by potential tertiary interactions that are unaccounted for in secondary structure prediction, and little is known about the thermodynamic contributions of tertiary motifs. Hence, the minimum free energy (MFE) structure derived from a folding algorithm may not be the true structure, and the true structure may be a suboptimal folding. These considerations provide additional motivations for us to more fully explore the ensemble of all secondary structures. Existing algorithms have only partially addressed the ensemble. We discuss below three major solutions that have been described.

First, mathematical algorithms (12,13) can predict optimal folding and can present a designed set of suboptimal foldings within any prescribed *P*% (0 ≤ *P* ≤ 100) of the MFE. The suboptimal algorithm is efficient for mitigating the uncertainties in the predictions; however, it has limitations. For each admissible base pair, the suboptimal algorithm generates the constrained optimal folding, with this pair as the constraint. Thus, it will regenerate the optimal folding if a base pair in the optimal folding is the constraint. For a sequence of *n* nucleotides, and *n*_{0} base pairs in the optimal folding, at most *n*(n – 1)/2 – *n*_{0} suboptimal foldings are examined by this algorithm. This set is common for all choices of *P*, and those within *P*% of the MFE are returned by the algorithm. For large *P* and for even moderate *n*, this is a small subset of all the suboptimal foldings within *P*% of the MFE because, as *P* increases, the number of all suboptimal foldings increases exponentially with *n*. Furthermore, if the least stable structure from this set is *Q*% away from the MFE, then for *P* < *Q*, no new suboptimal foldings are produced. A structure that is not one of the constrained optimal foldings generated by its base pairs is in the complementary set of the ‘missing’ suboptimal foldings, i.e. the collection of suboptimal foldings excluded by the suboptimal algorithm. For example, structures specified by removal of one or more base pairs from the optimal folding fall into this set.

Secondly, a recent mathematical algorithm deals with the computation of all suboptimal foldings within any specified increment of the MFE (14), and thus does comprehensively address the issue of ensemble. This represents a more analytical treatment than an earlier attempt (15). However, for this algorithm, because the number of suboptimal foldings grows exponentially, the run time shows an exponential behavior, as the range of the energy interval increases (14). Thus, even for moderate sequence length and moderate free energy interval, enumeration and examination of this huge set of suboptimal foldings become prohibitive.

Thirdly, the calculation of equilibrium partition functions and base-pairing probabilities (3) is an important advance toward the characterization of the Boltzmann ensemble of secondary structures. However, this elegant algorithm does not generate any secondary structures.

The dilemma, that the presentation of suboptimal foldings through a designed set is limited, whereas complete enumeration and examination of suboptimal foldings is difficult, appears to be very hard to resolve by a mathematical treatment. However, as we suggested with prototype algorithms (16,17), the generation of a statistically representative sample of secondary structures may provide a resolution to this dilemma. Here we describe a secondary structure sampling algorithm that incorporates comprehensive structural features and the recent thermodynamic parameters. We show that the algorithm generates a sample that is guaranteed to be representative. We present applications of the algorithm to illustrate statistical characterization and representation of the Boltzmann ensemble of secondary structures. Also, we illustrate how the algorithm can be employed in predictions of accessible target regions for rational design of RNA-targeting nucleic acids.

## ALGORITHM

For an RNA molecule, the secondary structures in the Boltzmann ensemble are not all equally probable. The Boltzmann equilibrium probability of a secondary structure *I* for a sequence *S* is given by

*P*(*I*) = exp[–*E*(*S*, *I*)/*RT*]/*U***1**

where *E*(*S*, *I*) is the free energy of the structure for the sequence, *R* is the gas constant, *T* is the absolute temperature and *U* is the partition function for all admissible secondary structures of the RNA sequence, i.e. *U* = Σ_{I} exp[–*E*(*S*, *I*)/*RT*]. The Boltzmann equilibrium probability distribution gives the probability for every structure, and therefore statistically characterizes the ensemble. However, neither complete enumeration nor usual statistical sampling from a discrete distribution is feasible in this context, because the number of secondary structures grows exponentially with increasing length of the sequence. Here we describe a recursive algorithm that draws a representative sample from the Boltzmann equilibrium distribution.

The algorithm has two basic steps. Its forward step computes equilibrium partition functions for all substrings of an RNA sequence based on recent free energy parameters (18,19). The backward step takes the form of a recursive sampling algorithm to randomly draw secondary structures according to their probabilities given by equation **1**. The equilibrium partition functions are used for calculating conditional sampling probabilities. By applying the algorithm to biological RNA sequences of a wide range of length, we demonstrate that a statistically representative sample of secondary structures can be quickly generated after the calculation of the partition functions.

### Computing equilibrium partition functions

For an RNA molecule of *n* ribonucleotides, let the sequence fragment from the *i*th ribonucleotide from the 5′ end to the *j*th ribonucleotide be denoted by *R*_{ij} = *r*_{i} *r*_{i +1} …*r*_{j}, 1 ≤ *i*, *j* ≤ *n*, where *r*_{i} = *A*, *C*, *G* or *U*. Let *I*_{ij} be a secondary structure on *R*_{ij} that meets the usual constraints of unknotted structure and at least three intervening bases between any base pair. For structures under the constraints, let *IP*_{ij} be a structure on *R*_{ij} with the ends constrained to form a base pair. By summing over all structures on the fragment, the equilibrium partition functions restricted to *R*_{ij} are defined as:

*u*(*i*, *j*) = Σ_{Iij} exp[–*E*(*R*_{ij}, *I*_{ij} )/*RT*]**2**

*up*(*i*, *j*) = Σ_{IPij} exp[–*E*(*R*_{ij}, *IP*_{ij} )/*RT*]**3**

where the sum for *u*(*i*, *j*) is over all possible *I*_{ij}, and the sum for *up*(*i*, *j*) is over all possible *IP*_{ij}. *E*(*R*_{ij}, *I*_{ij} ) is the free energy of structure *I*_{ij} for *R*_{ij}, and for the gas constant *R* and the absolute temperature *T =* 310.15 K, kcal/mol/*RT* = 1.6225. *E*(*R*_{ij}, *IP*_{ij} ) is the free energy of structure *IP*_{ij}. Recursive calculation of partition functions was previously employed for computing base-pairing probabilities (3). The recursions in the Appendix extend this early work by including all but coaxial stacking from the recent free energy parameters (18,19). In particular, free energies for dangling ends have been incorporated. More specifically, the free energy rules and parameters include free energies for stacking in a helix, stacking for a terminal mismatch in a hairpin loop (size ≥ 4 nt) or an interior loop, and penalties for hairpin, bulge, interior and multibranched loops. Free energies for dangling ends are used for exterior and multibranched loops. For hairpins, a bonus for *UU* and *GA* first mismatches (included in the terminal stacking data) and a bonus for *G*·*U* closure preceded by two *G* nucleotides in base pairs are applied, and a penalty for oligo-*C* loops (all unpaired nucleotides are *C*) is used. A table is consulted for tetraloops with four unpaired nucleotides. For a bulge of 1 nt, the stacking energy of the adjacent pairs is added. For interior loops, tables for 1 × 1, 1 × 2 and 2 × 2 loops are consulted, and a penalty for asymmetry is applied. A terminal *A*–*U*, *G*·*U* penalty is explicitly applied to exterior loop, multibranched loops, bulges longer than 1 nt and triloops (hairpin loops with three unpaired nucleotides), while this penalty is included in the terminal stacking data for hairpin loops (size ≥ 4 nt) and interior loops. The free energy parameters are for 37°C and 1 M Na^{+}; however, this algorithm can be used with any set of nearest-neighbor parameters derived for other conditions. The recursions for partition functions are presented in such a fashion that sampling probabilities can be readily derived.

With the partition function *u*(1, *n*) available, the Boltzmann equilibrium probability for a secondary structure *I*_{1 n} of sequence *R*_{1 n} can then be computed. Under the Boltzmann model, *I*_{1 n} is a random variable. When *R*_{1 n} is also considered a random variable, the Boltzmann equilibrium probability is, in fact, a conditional probability of the secondary structure, given the sequence data:

*P*(*I*_{1 n} |*R*_{1 n} ) = exp[–E(*R*_{1 n}, *I*_{1 n} )/*RT*]/*u*(1, *n*)**4**

### Sampling structures from the Boltzmann equilibrium probability distribution

In this section, we first present equations for computing sampling probabilities. Next, we describe the sampling algorithm. At the end, we discuss the features of the sampling algorithm, and illustrate them with examples.

*Equations for computation of sampling probabilities*. We previously described sampling algorithms for RNA secondary structures using a stacking energy model (16,17). The task of structure sampling can also be accomplished for a more comprehensive energy model, because the recursions for restricted partition functions correspond to sampling probabilities for mutually exclusive and exhaustive cases:

*Sampling probability for a case = contribution to partition function by the case/partition function.*

Specifically, consider a fragment *R*_{ij} for which it is unknown whether the ends form a pair. For the five cases (a), (b), (c), (d) and (e) as shown in Figure FigureA2A2 A2 in the derivation of the recursion for *u*(*i*, *j*) (equation **A1** in the Appendix), the sampling probabilities are given by the following equations:

*P*_{0} = 1/*u*(*i*, *j*)

*P*_{ij} = *up*(*i*, *j*)exp[–*etp*(*i*, *j*)/*RT*]/*u*(*i*, *j*)

*P*_{hj} = *up*(*h*, *j*)exp{–[*ed*5(*h*, *j*, *h* – 1) + *etp*(*h*, *j*)]/*RT*}/*u*(*i*, *j*), *i* < *h* < *j*,

*P*_{il} = *up*(*i*, *l*)exp[–*etp*(*i*, *l*)/*RT*]{exp[–*ed*3(*i*, *l*, *l* + 1)/*RT*]*u*(*l* + 2, *j*) + *u*(*l* + 1, *j*) – *u*(*l* + 2, *j*)}/*u*(*i*, *j*), *i* < *l* < *j*

*P*_{s 1 h} = *s*1(*h*, *j*)/*u*(*i*, *j*), *i* < *h* < *j* – 1

*P*_{hl} = *up*(*h*, *l*)exp{–[*ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/*RT*}{exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*(*l*+2, *j*) + *u*(*l* + 1, *j*) – *u*(*l* + 2, *j*)}/*s*1(*h*, *j*), *h* < *l* < *j*

where *P*_{0} is the sampling probability for case (a): *R*_{ij} is single stranded; *P*_{ij} is the sampling probability for case (b): *h* = *i*, *l* = *j*; {*P*_{hj} } are the sampling probabilities for case (c): *i* < *h* < *l* = *j*; {*P*_{il} } are the sampling probabilities for case (d): *h* = *i* < *l*<*j*; {*P*_{s 1 h} } are the probabilities for first sampling *h* for case (e): and *i* < *h* < *l* < *j*; {*P*_{hl} } are the probabilities for sampling *l* after *h* is sampled. Other terms in the equations are defined in the Appendix. Because the probabilities of all mutually exclusive and exhaustive cases sum up to 1, we have *P*_{0} + *P*_{ij} + Σ_{i < h < j} *P*_{hj} + Σ_{i < l < j} *P*_{il} + Σ_{i < h < j – 1} *P _{s}*

_{1 h}= 1, and Σ

_{h < l < j}

*P*

_{hl}= 1. The computation is linear by using

*s*1(

*h*,

*j*) (an auxiliary array defined by equation

**A5**in the Appendix) through {

*P*

_{s 1 h}} and {

*P*

_{hl}}. When the ends are known to form a base pair

*r*

_{i}–

*r*

_{j}, the pair can close a hairpin, or be the exterior pair of a base pair stack, or close a bulge or an interior loop, or close a multibranched loop. The conditional probabilities for these cases, based on the recursion for

*up*(

*i*,

*j*) (equation

**A2**in the Appendix), are given by the following equations:

*Q*_{ijH} = exp[–*eh*(*i*, *j*)/*RT*]/*up*(*i*, *j*)

*Q*_{ijS} = exp[–*es*(*i*, *j*, *i* + 1, *j* – 1)/*RT*]*up*(*i* + 1, *j* – 1)/*up*(*i*, *j*)

*Q*_{ijBI} = {Σ_{i < h < l < j} exp[–*ebi*(*i*, *j*, *h*, *l*)/*RT*]*up*(*h*, *l*)}/*up*(*i*, *j*)

*Q*_{ijM} = *up*_{m} (*i*, *j*)/*up*(*i*, *j*)

*Q*_{hlBI} = exp[–*ebi*(*i*, *j*, *h*, *l*)/*RT*]*up*(*h*, *l*)/{Σ_{i < h ′ < l ′ < j} exp[–*ebi*(*i*, *j*, *h*′, *l*′)/*RT*]*up*(*h*′, *l*′)}, *i* < *h*′ < *l*′ < *j*

where *Q*_{ijH} is the sampling probability for hairpin loop; *Q*_{ijS} is the sampling probability for base pair stack; *Q*_{ijBI} is the sampling probability for a bulge or an interior loop; and *Q*_{ijM} is the sampling probability for a multibranched loop. {*Q*_{hlBI} } are used for sampling *h* and *l* after the case of bulge or interior loop is sampled. Other terms in the equations are defined in the Appendix. For mutually exclusive and exhaustive cases, we have *Q*_{ijH} + *Q*_{ijS} + *Q*_{ijBI} + *Q*_{ijM} = 1, and Σ_{i < h < l < j} *Q*_{hlBI} = 1. *up*_{m} (*i*, *j*) is the contribution to *up*(*i*, *j*) by the case of a multibranched loop (see equations **A2** and **A3** in the Appendix).

In the case of a multibranched loop, the probabilities for sampling the closing base pair *r*_{h 1} –*r*_{l 1} of the first 5′ end internal helix in the loop correspond to the terms in the recursion for *up*_{m} (*i*, *j*) (equation **A3** in the Appendix) with the quartic term expressed in terms of *s*2(*h*, *j*) (an auxiliary array defined by equation **A6** in the Appendix). More specifically, we first sample *h* and *l* according to the following conditional probabilities:

*P*_{ij ( i + 1) l} = *up*(*i* + 1, *l*)exp{–[*a* + 2*c + etp*(*i* + 1, *l*)]/*RT*}{exp[–*ed*3(*i* + 1, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*up*_{m} (*i*, *j*), *i* + 1 < *l* < *j*

*P*_{ij(i + 2)l} = *up*(*i* + 2, *l*)exp{–[*a* + 2*c* + *b* + *ed*3(*j*, *i*, *i* + 1)+ *etp*(*i* + 2, *l*)]/RT}{exp[–*ed*3(*i* + 2, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*up*_{m} (*i*, *j*), *i*+2<*l*< *j*

*P*_{ijs 2 h} = exp{–[*a* + 2*c* + (*h* – *i* – 1)*b* + *ed*3(*j*, *i, i* + 1)]/RT}*s*2(*h*, *j*)/*up*_{m} (*i*, *j*), *i* + 3 ≤ *h* < *j* – 1

*P*_{ijhl} = *up*(*h*, *l*)exp{–[*ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/RT}{exp[–*ed*3{*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*s*2(*h*, *j*), *h* < *l* < *j*

where {*P*_{ij ( i + 1) l} } are sampling probabilities for the cases when *h* = *i* + 1; {*P*_{ij ( i + 2) l} } are sampling probabilities for the cases when *h* = *i* + 2; and {*P*_{ijhl} } are probabilities for sampling *l* after *h* ≥ *i* + 3 is sampled with probabilities {*P*_{ijs 2 h} }. For mutually exclusive and exhaustive cases, we have Σ_{i + 1 < l < j} *P*_{ij ( i + 1) l} + Σ_{i + 2 < l < j} *P*_{ij ( i + 2) l} *+* Σ_{i + 3 ≤ h < j – 1} *P*_{ijs 2 h} = 1, and Σ_{h < l < j} *P*_{ijhl} = 1. *u*1(*k*, *j* – 1) is an auxiliary partition function for the multibranched loop (see discussions on equation **A3** in the Appendix). Once both *h* and *l* are sampled, the closing base pair *r*_{h 1} –*r*_{l 1} of the first internal helix is given by setting *h*1 = *h* and *l*1 = *l*.

For sampling the second internal helix, the sampling probabilities for base pair *r*_{h 2} –*r*_{l 2} of the helix closest to the 5′ end of *R*_{( l 1 + 1)( j – 1)} correspond to terms in the recursion for *u*1(*l*1 + 1, *j* – 1) (equation **A4** in the Appendix, with *i* substituted by *l*_{1} + 1 and *j* substituted by *j* – 1) with the quartic term expressed in terms of *s*3(*h*, *j* – 1) (an auxiliary array defined by equation **A7** in the Appendix).

More specifically, we first sample *h* and *l* according to conditional probabilities:

*Q*_{( l 1 + 1)( j – 1)( l 1 + 1) l} = *up*(*l*1 + 1, *l*)exp{–[*c + etp*(*l*1 + 1, *l*)]/*RT*}{*f*(*j*, *l*1 + 1, *l*) exp[–(*j* – 1 – *1*)*b*/*RT*] + exp[–*ed*3(*l*1 + 1, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*u*1(*l*1 *+* 1, *j* – 1), *l*1 + 1 < *l* ≤ *j* – 1

*Q*_{( l 1 + 1)( j – 1)( l 1 + 2) l} = *up*(*l*1 + 2, *l*)exp{–[*c + b + etp*(*l*1 + 2, *l*)]/*RT*}{*f*(*j*, *l*1 + 2, *l*) exp[–(*j* – 1 – *l*)*b*/*RT*] + exp[–*ed*3(*l*1 + 2, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*u*1(*l*1 *+* 1, *j* – 1), *l*1 + 2 < *l* ≤ *j* – 1

*Q*_{( l 1 + 1)( j – 1) s 3 h} = exp{–[*c +* (*h – l*1 *–* 1)*b*]/*RT*}*s*3(*h*, *j* – 1)/*u*1(*l*1 *+* 1, *j* – 1), *l*1 + 3 ≤ *h* ≤ *j* – 2

*Q*_{( j – 1) hl} = *up*(*h*, *l*)exp{–[*ed*5(*h, l, h* – 1) *+ etp*(*h*, *l*)]/*RT*}{*f*(*j*, *h*, *l*) exp[–(*j* – 1 – *1*)*b*/*RT*] + exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}/*s*3(*h*, *j* – 1), *h* < *l* ≤ *j* – 1

where {*Q*_{( l 1 + 1)( j – 1)( l 1 + 1) l} } are sampling probabilities for cases when *h* = *l*1 + 1; {*Q*_{( l 1 + 1 )( j – 1)( l 1 + 2) l} } are sampling probabilities for cases when *h* = *l*1 + 2; and {*Q*_{( j – 1) hl} } are probabilities for sampling *l* after *h* ≥ *l*1 + 3 is sampled with probabilities {*Q*_{( l 1 + 1)( j – 1) s 3 h} }. For mutually exclusive and exhaustive cases, we have Σ_{l 1 + 1 < l ≤ j – 1} *Q*_{( l 1 + 1)( j – 1)( l 1 + 1) l} + Σ_{l 1 + 2 < l ≤ j – 1} *Q*_{( l 1 + 1)( j – 1)( l 1 + 2) l} + Σ_{l 1 + 3 ≤ h ≤ j – 2} *Q*_{( l 1 + 1)( j – 1) s 3 h} = 1, and Σ_{h < l ≤ j – 1} *Q*_{( j – 1) hl} = 1. Once both *h* and *l* are sampled, the closing base pair *r*_{h 2} –*r*_{l 2} of the second internal helix is given by setting *h*2 = *h* and *l*2 = *l*. Next, we must consider two possibilities: either there is no additional helix in the loop, or there is at least one more helix. These two mutually exclusive cases are addressed by two additive terms in equation **A4** for *u*1(*l*1 + 1, *j* – 1) in the Appendix. Conditional on sampled *h*2 and *l*2, these terms give the following binomial probability for no additional helix between *r*_{l 2 + 1} and *r*_{j – 1} :

*P*_{Bh2l2 ( j – 1)} = *f*(*j*, *h*2, *l*2)exp[–(*j* – 1 *– l*2)*b*/*RT*]/{*f*(*j*, *h*2, *l*2)exp[–(*j* – 1 – *l*2)*b*/*RT*] + exp[–*ed*3(*h*2, *l*2, *l*2 + 1)/*RT*]*u*1(*l*2 + 2, *j* – 1) + *u*1(*l*2 + 1, *j* – 1) – *u*1(*l*2 + 2, *j* – 1)}**5**

and the probability of at least one more helix is 1 – *P*_{Bh2l2 ( j – 1)}. If no additional helix is sampled, sampling is terminated for this multibranched loop; otherwise, the closing base pair of the next internal helix is sampled, followed by another binomial sampling. This process stops whenever no additional helix is sampled. At the end of this process, for *L* sampled internal helices with closing base pairs *r*_{hk} – *r*_{lk}, 1 ≤ *k* ≤ *L*, sampling probabilities for *r*_{hk} – *r*_{lk} are computed with equation **A4** for *u*1[*l*(*k* – 1) + 1, *j* – 1] (2 ≤ *k* ≤ *L*). This computation, and the computation for the binomial sampling with probability *P*_{Bhklk ( j – 1)} are performed (L–1) times on overlapping fragments with decreasing length. *P*_{Bhklk ( j – 1)} is given by equation **5** with *h*2 and *l*2 substituted by *hk* and *lk*, respectively. Similar to the probability computation for equation **A1**, the computation of the sampling probabilities with equations **A3** and **A4** is linear by using *s*2(*h*, *j*) and *s*3(*h*, *j* – 1). When long interior loops are disregarded, the probability computation for equation **A2** is bounded by a constant.

*Description of the sampling algorithm*. Two stacks A and B are used by the sampling algorithm (Fig. (Fig.1).1). Stack A stores fragments {(*i*, *j*, *I*)} for sampling, where, for the fragment from the *i*th base to the *j*th base, *I* = 1 if it is known the ends form a pair, or *I* = 0 if this pair is unknown. Stack B collects base pairs and unpaired bases that will define a sampled secondary structure upon the completion of sampling. At the start, (1, *n*, 0) is the only fragment in stack A. Specifically, a structure is drawn recursively as follows. (i) Starting with *R*_{1n}, draw single-stranded *R*_{1 n} or a base pair according to probabilities *P*_{0}, *P*_{ij}, {*P*_{hj} }, {*P*_{il} } and {*P*_{s 1 h} } for *i* = 1, *j* = *n*; if *h* is sampled for case (e) in the derivation for equation **A1**, then *l* is sampled with {*P*_{hl} }. In case (a), i.e. single-stranded *R*_{1n}, the sampling is completed; in case (b), (1, *n*, 1) is stored in stack A; in case (c), (*h*, *n*, 1) is stored in stack A, and the unpaired bases from the first base to the (h – 1)th base are stored in stack B; in case (d), (1, *l*, 1) and (*l* + 1, *n*, 0) are stored in stack A; in case (e), (*h*, *1*, 1) and (*l* + 1, *n*, 0) are stored in stack A, and the unpaired bases from the first base to the (h – 1)th base are stored in stack B. (ii) For a new fragment *R*_{ij} from stack A, if base pair r_{i} – r_{j} was not sampled previously (i.e. *I* = 0), we sample and store results by the same process for *R*_{1 n}, with 1 and *n* substituted by *i* and *j* respectively. (iii) For a new fragment *R*_{ij} from stack A with ends paired (i.e. *I* = 1), we first sample loop type with probabilities *Q*_{ijH}, *Q*_{ijS}, *Q*_{ijBI} and *Q*_{ijM}, and then proceed as follows. (iiia) For a hairpin loop, the unpaired bases in the loop and the closing pair are stored in stack B as part of a sampled structure, and they are no longer involved in further sampling. (iiib) For stacking, the exterior base pair is stored in stack B, and the interior base pair defines a new fragment (*i* + 1, *j* – 1, 1) to be stored in stack A. (iiic) For a bulge or an internal loop, we sample the interior base pair in the loop with {*Q*_{hlBI} }. The exterior base pair and unpaired bases in the loop are stored in stack B, and the interior base pair defines a new fragment to be stored in stack A. (iiid) For a multibranched loop, we first sample the interior base pair closest to the 5′ end of *R*_{ij} ; we then sample the second interior base pair. Next, we perform a binomial sampling for one of the two cases: no additional helix on the 3′ side of the loop, or at least one more helix. In the latter case, we sample another interior base pair for one more helix. For the remaining fragment on the 3′ side of the loop, we repeat the binomial and interior base pair sampling until no additional helix is sampled. Unpaired bases in the loop and *r*_{i} – *r*_{j} are stored in stack B, and new fragments defined by the interior base pairs are stored in stack A for further sampling.

*i*th base to the

*j*th base,

*I*= 1 for (

*i*,

*j, I*) if it is known that the ends form a pair, or

*I*=

**...**

After the completion of sampling for a fragment from stack A and storage of new fragment(s) in stack A and/or storage of base pair and unpaired bases in stack B, the fragment in the bottom of stack A is selected for subsequent sampling. The process terminates when stack A is empty, and a sampled secondary structure is formed by the base pairs and unpaired bases in stack B. A statistically representative sample of RNA secondary structures is generated by repeating this process.

*Features of the sampling algorithm and examples*. The algorithm samples a structure exactly and rigorously from the Boltzmann equilibrium probability distribution (equation **1**), because the sampling probabilities are computed by Boltzmann conditional probability distributions based on partition functions restricted to fragments. This is obvious for the unfolded state with a free energy of 0, whose sampling probability of 1/*u*(1, *n*) is also its Boltzmann probability by equation **1**.

From a statistical mechanics perspective, there exists an ensemble of probable structures. Furthermore, structure *I* is a random variable that follows the Boltzmann distribution. *I* can be expressed by an upper triangular matrix of random and dependent indicator variables {*I*_{ij} }, 1 ≤ *i* < *j* ≤ *n*. *I*_{ij} = 1 if the *i*th base is paired with the *j*th base, or *I*_{ij} = 0 otherwise. The requirement for at least three unpaired intervening bases between any base pair implies *I*_{ij} = 0 for *j* = *i* + 1, *i* + 2 and *i* + 3, 1 ≤ *i*, *i* + 3 ≤ *n*. The assumption of no pseudoknots implies *I*_{ij} *I*_{i ′ j ′} = 0 for *i*′ < *i* < *j*′ < *j*. Also, when base triples are prohibited, Σ_{1 ≤ i ≤ n} *I*_{ij} ≤ 1, and Σ_{1 ≤ j ≤ n} *I*_{ij} ≤ 1.Thus, *I* is a high-dimensional random variable. Sampling directly from a high-dimensional probability distribution is often difficult. In some cases, however, the difficulty can be overcome by conditional sampling at lower dimension(s). More specifically, given data *y*, if we can sequentially sample *x*_{1} from the conditional distribution *p*(*x*_{1} |*y*), *x*_{2} from *p*(*x*_{2} |*x*_{1,} *y*) and *x*_{k} from p(*x*_{k} |*x*_{1},…, *x*_{k – 1}, *y*) (*k* = 3,…, *m*), then *x* = (*x*_{1}, *x*_{2},…, *x*_{m} ) follows distribution *p*(*x*|*y*), because the joint probability distribution is the product of the conditional distributions. This is the scheme adopted for the secondary structure sampling described above. For given RNA sequence data, the new base pairs and unpaired bases are sampled by conditioning on already formed substructures from previous sampling steps. Upon the completion of the process, the collection of substructures defines a structure sampled according to the Boltzmann equilibrium probability distribution.

The sampling process is similar to the traceback algorithm employed in the dynamic programming algorithms (12,13), but it differs in that base pairing is randomly sampled with Boltzmann conditional probabilities, rather than selected by the minimum energy principle for the fragments. Because the probability of a structure decreases exponentially with increasing free energy, the most likely structure in a sample is the MFE structure. In other words, the MFE structure has the largest sampling probability, because its Boltzmann probability is larger than that for any other structure.

For *Leptomonas collosoma* spliced leader RNA (SL RNA), 56 nt in length, two experimental secondary structures 1 and 2 have been elucidated (20). Neither of these is the MFE structure as computed by the *mfold* server (http://www.bioinfo.rpi.edu/applications/mfold). Based on structures generated by our sampling algorithm, sampling estimates for the MFE structure and the two experimental structures are computed (Table (Table1).1). The MFE structure has the largest observed frequency among all sampled structures. Further more, for each of these three structures, the Boltzmann equilibrium probability is closely estimated by its maximum likelihood estimate (MLE) computed from the sample, and is contained in the 95% confidence interval (CI). This illustrates the feature of our algorithm that secondary structures are sampled by their Boltzmann equilibrium probabilities.

**Maximum likelihood estimate (MLE) and its standard deviation (SD) and 95% confidence interval (CI) for Boltzmann equilibrium probability of a secondary structure for**

*L.collosoma*SL RNA, computed from 1 000 000 independently sampled**...**

Because there are no more than (*n* – 3)/2 base pairs in a secondary structure, and because the time for sampling a pair is at most *O*(*n*) when long interior loops are disallowed, the time of the sampling algorithm is bounded by *O*(*n*^{2}), i.e. quadratic in the worst case. Thus, once the forward recursions for the partition functions are completed in cubic time, a sample of structures can be quickly generated. This is illustrated, in Table Table2,2, for 10 biological sequences having a wide range of lengths.

## APPLICATIONS

In this section, we show that structures generated by our algorithm fall into classes. Thus, classification of sampled structures presents a means for achieving a statistical delineation and an efficient representation of the Boltzmann ensemble of RNA secondary structures. We show that experimentally verified alternative structures are revealed through sampling. The sampling algorithm enables probabilistic prediction of structural motifs. We illustrate this by computing probability profiles for the prediction of accessible regions, for the design of RNA-targeting nucleic acids. Probability profiling for specific loop types is also illustrated. A mutual accessibility plot, in which probability profiles of two RNAs are overlaid for predicting RNA:RNA interaction, and free energy distributions of sampled structures are also illustrated. We show that a sample of moderate size from the ensemble of an enormous number of structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics.

### Class representation of Boltzmann ensemble of secondary structures

*Classification of sampled structures.* Alternative RNA secondary structures are involved in a variety of RNA regulatory functions through conformational switching (see examples given in the Introduction). For the *L.collosoma* SL RNA, two competing secondary structural forms 1 and 2 have been indicated by RNase data, although the roles of the structures have yet to be determined (20). We examined 1000 structures sampled by our algorithm for this sequence, and found that the structures fall into two classes, corresponding to the two experimental structural forms 1 and 2. Class 1 can be further subdivided into classes 1A, 1B and 1C; each of these subclasses has an even higher level of structural similarity among its members. Class 2 can be further broken down into classes 2A and 2B. A group of structures can be displayed by means of a two-dimensional histogram (2Dhist). Distinct patterns in this representation are indicative of common structural features for the group, whereas scattering of the squares would indicate its structural diversity. As illustrated in Figure Figure2A–C,2A–C, structures in classes 1A, 1B and 1C have in common two helices, represented by the two clusters of five squares and four squares, respectively. Specifically, the first helix is formed by base pairs *U*^{16}–*A*^{38}, *G*^{17}·*C*^{37}, *U*^{18}–*A*^{36}, *A*^{19}–*U*^{35}, *G*^{20}·*C*^{34}. The second helix is formed by *U*^{22}–*A*^{32}, *C*^{23}·*G*^{31}, *A*^{24}–*U*^{30} and *G*^{25}·*C*^{29}. On the other hand, the histograms also show that members of these classes have different structural features. Structures in classes 2A and 2B also have in common two helices (Fig. (Fig.3A3A and B), which are different from the two helices common to classes 1A, 1B and 1C. The major difference between class 2A and class 2B is the existence of an additional helix for class 2B. This helix is represented by a cluster of squares in the bottom left portion of the histogram in Figure Figure33B.

*L.collosoma*SL RNA. The 2Dhist shows the frequencies of base pairs, with nucleotide position on both axes. Within each histogram, the sizes of the solid squares are proportional to the

**...**

*Probability of a class and the Boltzmann probability of its representative*. For a class of similar structures, the structure occurring with the highest frequency (i.e. the most probable structure) in the sample is taken as the representative of the class. Class 1A is represented by experimental structural form 1 (Fig. (Fig.4A).4A). The MFE structure from *mfold* shown by Figure Figure4B4B is the representative of class 1B. Class 1C is represented by the structure in Figure Figure4C4C that is the MFE structure with a short helix removed. Experimental structure form 2 (Fig. (Fig.5A)5A) is the representative for class 2A. The representative for class 2B, shown in Figure Figure5B,5B, is experimental structural form 2 with an additional hairpin–helix stem on its long single-stranded 5′ end. The probability of a class is computed by that class’s frequency in the sample, the Boltzmann equilibrium probability of the representative structure is computed by that structure’s free energy, and the partition function is computed from the forward step of our algorithm (equation **1**). The size of a class is reflected by the class probability. It is a surprising observation that the Boltzmann probability of the representative structure is not necessarily reflective of the magnitude of the class probability (Fig. (Fig.6).6). For example, the probability for class 1C is ~13.4% larger than that for class 1B; however, the Boltzmann probability of class 1C’s representative is merely 37.8% that of the representative structure for class 1B.

*L.collosoma*SL

**...**

*Entropic class.* For class 2B, the ratio of the class probability to the Boltzmann probability of its most probable member is 290.70, which is strikingly high. Despite the very small Boltzmann probability for its most probable member, this group contains a substantial number of similar structures such that the collection of these structures has a much higher aggregate probability than the probability of each group member. Such ‘entropic classes’ of structures can be revealed by sampling and classification of sampled structures. However, a structure in an entropic class can be easily overlooked, when it is examined individually on the basis of its free energy or Boltzmann probability.

Table Table33 presents a summary of the above analyses. Although the two experimental structures are 25.2% and 15.9% away from the MFE, respectively, they are both predicted by the sample. This sequence was also folded on the *mfold* server. For suboptimality percentage *P* under 15 (default = 5), only the optimal folding is returned. For a large *P*, e.g. *P* = 30, the two alternative structures are returned, among three suboptimal foldings. This example underscores the importance of examining suboptimal structures. It also shows that important alternative structures and structural motifs can be revealed by a statistical sample of the Boltzmann ensemble. These findings suggest that the Boltzmann ensemble of secondary structures for an RNA molecule can be adequately represented by the classes determined in a sample and the probabilities of the classes, together with the class-representative structures and their Boltzmann probabilities.

### Prediction of alternative structures

The analysis of *L.collosoma* SL RNA suggests that alternative biological structures can be adequately revealed by a statistical sample. We investigate this further by applying the sampling algorithm to prediction of mRNA secondary structures. mRNA secondary structures can play a regulatory role in determining the rate of translation initiation (9). This is explained by a model of co-existing alternative structures: one structure favors the translation initiation while the other inhibits the translation initiation (9). Also, it has been argued that the accessibility of the initiation codon is important for maximization of expression (22). The secondary structure of an mRNA is generally unavailable from experimental methods, because complete structural probing by chemical or enzymatic methods is very difficult for long-chain RNAs. A rare exception is the short mRNA for the cIII gene of bacteriophage λ, for which two conformations A and B (Fig. (Fig.7A7A and B) were elucidated and were demonstrated to co-exist in equilibrium (8). The sequence of 132 nt in the structures spans 46 nt of the coding region and 86 nt upstream from the initiation codon *A*^{0}*UG*^{2}. In structure A, the initiation codon and part of the Shine–Dalgarno sequence *U*^{–13}*AAGGAG*^{–7} are in a closed, base-paired conformation such that the ribosome-binding site is occluded. In structure B, the ribosome-binding site is accessible for interactions. It is speculated that the cIII gene expression is regulated at the level of translation initiation through the ratio of the two structures at equilibrium, and through changes in temperature or Mg^{2+} concentration; perhaps ribosome binding can also shift the equilibrium (8). For the mRNA sequence of the cIII gene, a sample of 100 structures was generated by our algorithm and was manually examined. In this sample, 89 are close variants of structure A. The left-most stem in structure A is precisely predicted in 67 of the 89 structures. The exact right-most stem or a modification with one or both of additional pairs *A*^{–12}–*U*^{42}, *A*^{–11}–*U*^{41} is predicted in 72 of the 89 structures. Appreciable variability in the location of the interior and bulge loops is observed for the middle stem. Structure C in Figure Figure7C7C is one of three structures in the sample that closely resemble structure B. The appreciable modification is the additional short helix involving the Shine–Dalgarno sequence formed by base pairs *G*^{–10}·*C*^{44} and *G*^{–9}·*C*^{43}. The remaining eight structures (structures not shown) in the sample do not resemble either structure A or B, and have diverse structural features. The optimal folding by *mfold* is a modification of structure A with three additional base pairs *C*^{–54}·*G*^{–35}, *A*^{–12}–*U*^{42} and *A*^{–11}–*U*^{4}, with an MFE of Δ*G*°_{37} = – 48.5 kcal/mol. Structure A is well predicted by the optimal folding. Its free energy is Δ*G*°_{37} = – 46.1 kcal/mol, 5% away from the MFE. Structure B has Δ*G*°_{37} = – 40.2 kcal/mol, 17% away from the MFE. Structure C has Δ*G*°_{37} = – 42.9 kcal/mol, 12% away from the MFE. For *P* = 30, neither B nor a variant resembling B as closely as C is predicted by suboptimal foldings from *mfold*, although both structures B and C are well within this range of suboptimality. By using the option for specifying base pair constraint in *mfold*, we verified that structures B and C are indeed present in the ‘missing’ set of suboptimal foldings that are excluded by the algorithm design for *mfold*, as discussed in the Introduction. In contrast to the stability indicated by the free energies, experimental analysis showed that structure B is favored by a factor of ~3 (8). The discrepancy could be explained by tertiary interactions that preferentially stabilize structure B (8). This application not only exemplifies that an important alternative structure can be better predicted by a sample of moderate size, but it also shows that alternative structures of low probability can be biologically important, because stability contributions from potential tertiary interactions are unaccounted for. The finding also suggests that the sampling algorithm is well suited to the prediction of secondary structure of mRNAs, because an mRNA may exist as a population of conformations in an intracellular environment (5).

### Assignment of probabilities to structural motifs

In many applications, certain types of structural motifs are of biological interest. Sampling also enables probabilistic prediction of any motif, with or without specific constraint(s). The probability of a motif can be directly estimated by the frequency of that motif’s occurrence in a sample. For the mRNA of the cIII gene of bacteriophage λ, this is illustrated in Table Table44 for several constrained motifs involving the *AUG* initiation codon or the Shine–Dalgarno sequence, and for a helix, a base pair and a single-stranded fragment of two bases.

### Probability profiling for predicting accessible regions in RNA secondary structure

Single-stranded regions in RNA secondary structure may be accessible for RNA:DNA, RNA:RNA and RNA:protein interactions. For prediction of accessible sites for targeting by antisense oligonucleotides, we developed a probability profiling approach based on the sampling algorithm (23). On a profile for fragment width *W*, the probability that *W* consecutive bases are all unpaired is plotted against the first base of the segment. This approach was shown to make substantially better predictions than the MFE structure. The significance of assigning probability as a measure of confidence in prediction is also highlighted by Figure Figure8A:8A: a single-stranded region (nucleotides 26–37) predicted by both the MFE structure and the ss-count statistic from *mfold* has low probabilities on the probability profile [*W* = 4 bases, as described in Ding and Lawrence (23)]. The ss-count statistic gives the propensity of a base to be unpaired, as measured by the frequency with which it is unpaired in a group of the optimal and suboptimal foldings within a specified increment of the MFE. For this RNA sequence, the same ss-count values were returned for suboptimality percentages *P* = 5 and *P* = 30. These values were averaged for *W* = 4 bases and were converted to the probability scale. Figure Figure8B8B illustrates a case of several substantial probability peaks on the profile (*W* = 4 bases), with little signal from the MFE structure, but more comparable predictions by ss-count. These cases occur when the MFE structure and the majority of the competing structures do not have a similar local structure. For the entire 1346 nt sequence, the correlation between ss-count and the profile probability is 0.6592 for *W* = 4. This correlation decreases to 0.5482 for *W* = 10. The MFE structure has commonly been used by experimentalists for antisense nucleic acid design, although with limited success (24). It is thus important to consider suboptimal foldings in the selection of antisense target sites. The probability profiling approach has the advantage that it fully assesses the uncertainty in the predictions.

### Probability profiling for specific loop types

A probability profile statistically delineates unpaired bases (*W* = 1) or fragments (*W* >1), regardless of the type of loop in which the base or fragment occurs. An extension to account for loop type is straightforward by keeping track of the loop type for unpaired bases returned from the sampling step. Thus, the sampling algorithm also enables probability profiling of a specific loop type.

For *Escherichia coli* tRNA^{Ala}, Figure Figure99 shows the profile plots (*W* = 1) Hplot, Bplot, Iplot, Mplot and Extplot, for hairpin loop, bulge loop, interior (internal) loop, multibranched loop and the exterior loop, respectively. For example, at sequence position *i*, Hplot presents the probability that nucleotide *i* is in a hairpin loop. The probability is computed by the observed frequency in a sample of 1000 structures. The three high probability regions on Hplot correspond to the regions for the D loop (*A*^{14}*GCUGGGA*^{21}), the anticodon loop (*A*^{32}*UGGCAU*^{38}) and the *T*Ψ*C* loop (*U*^{54}UCGAUC^{60}) in the cloverleaf conformation determined phylogenetically for this tRNA. There is no bulge or interior loop in the cloverleaf structure. Accordingly, there is little signal on Bplot, and only weak signal on Iplot. The three high probability regions on Mplot correspond to the three segments of unpaired bases (*U*^{8}*A*^{9}, *G*^{26} and *A*^{44}*GGUC*^{48}) in the multibranched loop of the cloverleaf. The high probability segment on Extplot corresponds to the four free dangling bases (*A*^{73}*CCA*^{76}) on the 3′ end of the cloverleaf structure. Interestingly, Extplot indicates that ~16% of the sampled structures have *A*^{14}*GC*^{16} as the junction that connects two separate folding domains.

*E.coli*tRNA

^{Ala}. (

**A**) Hplot displays the probability that a base lies in a hairpin loop; (

**B**) Bplot displays the probability that a base is in a bulge loop; (

**C**) Iplot displays the probability that a base is in an interior (internal) loop;

**...**

The most intriguing observation is that not only is the anticodon loop (second peak from the left in Hplot) the most conserved among three hairpin loops, but it also is the most conserved accessible region for any loop type. The respective probabilities on Hplot for the three anticodon bases *G*^{34}, *G*^{35} and *C*^{36} are 0.968, 0.961 and 0.962. Thus, possible alternative foldings of this tRNA in the intracellular environment are highly unlikely to affect the accessibility of the anticodon for base pairing with the codon on the mRNA. In other words, for this tRNA, the function of codon recognition is preserved with high certainty even if the folding deviates from the classic cloverleaf. Further analysis of other tRNA sequences from the tRNA database (2) is warranted to assess the degree of generality of such predicted anticodon accessibility. Before such an analysis, an extension of the algorithm for partition functions needs to be developed to allow constraints, because many tRNAs have modified bases that cannot be involved in base pairing.

### Mutual accessibility plot for prediction of RNA:RNA interaction

For RNA:RNA interactions through antisense binding, e.g. between an RNA target and chemically synthesized or natural occurring antisense RNAs, or between an RNA target and *trans*-cleaving ribozymes, the structures of both RNAs are important. Antisense binding is largely dependent on the accessibility of both the bases at the target site and the complementary bases on the antisense RNA or ribozyme. This mutual accessibility between two RNAs can be assessed with an overlay plot of probability profiles for the two RNAs at the target site (Fig. (Fig.1010).

**A**) Mutual accessibility plot obtained by overlaying probability profiles (fragment width

*W*= 4) at the target site for a 60 nt antisense RNA (embedded in a long RNA through an expression vector) and the targeted mRNA of

*H.sapiens*γ-glutamyl

**...**

The mutual accessibility plot provides a new tool to address local accessibility of both RNAs at the target site, in addition to the uncertainty in the RNA folding predictions. This tool can be valuable for the rational design of ribozymes and antisense RNAs for gene inhibition. Based on the mutual accessibility plot, we have tested three hammerhead ribozymes against breast cancer resistance protein (BCRP) in cultured cells. All three ribozymes were successful in substantially reducing the levels of the protein (K.Kowaski, Y.Ding and E.Schneider, unpublished data).

### Boltzmann probability-weighted density of states and free energy distributions

Cupal and co-workers (25) presented a recursive algorithm to compute the free energy distribution of all secondary structures (i.e. the density of states, or DOS). The algorithm is *O*(*n*^{5}) in time with a memory requirement of *O*(*n*^{3}), and is thus computationally prohibitive even for sequences of moderate length. For short sequences, this algorithm is useful for the study of evolution, through comparison of DOS between biological sequences and random sequences of the same composition (26).

A sampling estimate of the free energy distribution of probable structures is readily available from our sampling algorithm and is referred to as the Boltzmann probability-weighted density of states, or BPWDOS (Fig. (Fig.11A),11A), because structures are sampled with Boltzmann equilibrium probabilities. Information for the BPWDOS can be displayed in alternative forms for depiction of the probability that a structure lies within a threshold of the MFE, or in a free energy interval (Fig. (Fig.11B11B and C). For a specified low energy interval, sampling also generates representative structures for further examination. This overcomes the disadvantage of the algorithm by Cupal and co-workers that no information exists about individual structures corresponding to the low energy states. Such distributions could be valuable in evolutionary studies on long sequences and studies of the RNA energy landscape (27). Similarly, other sampling statistics, such as the distribution of the number of base pairs in sampled structures, can be computed.

### Sample size and statistical reproducibility

The calculation of sampling statistics is typically based on a sample of 1000 structures. To assess the adequacy of this sample size, we generated two independent samples for the mRNA of *Homo sapiens N*-acetylglucosamine kinase, 1187 nt in length. Each sample contains 1000 structures. For this sequence, an estimate of the number of all secondary structures is 1.8^{1187} ≈ 10^{303}. From the 2Dhist, the patterns of base pair frequencies are nearly identical for the two samples (Fig. (Fig.12A,12A, first sample; Fig. Fig.12B,12B, second sample). The probability profiles (*W* = 4) for both samples are also computed and are overlaid in a single plot, by plotting against common nucleotide position (Fig. (Fig.13A).13A). The two profile curves are colored red and blue, respectively. However, the overlay plot is overwhelmingly dominated by the blue color; the red curve is hardly visible when the two profiles overlap. The red curve is observed only in the case of distinguishable variation. This statistical phenomenon is better observed by an enlargement of a 200 nt region shown in Figure Figure1313B.

*H.sapiens N*-acetylglucosamine kinase. The histograms (

**A**and

**B**) display nearly identical patterns of base pair probabilities estimated

**...**

*H.sapiens N*-acetylglucosamine kinase. (

**A**) Complete profiles for the entire mRNA. (

**B**

**...**

Furthermore, through a pairwise structural comparison, we found that the two samples do not have a single structure in common. Thus, it might initially appear to be surprising that, for an ensemble of ~10^{303} structures, a sample of only 1000 structures can yield statistical reproducibility of typical sampling statistics, even if samples can be entirely different. However, these results are fully expected, because a sample generated by the algorithm is guaranteed to be statistically representative. A simple analogy is sampling from the US population of 280 million persons. If two random samples of individuals are taken independently with a sample size of 1000, it is highly likely that the two samples will not have a single individual in common, because the population is large, and because sampling is not only random but also independent. However, the two samples would produce highly consistent estimates for demographic characteristics, such as the percentage of males in the population.

Although the precision in the estimates of sampling statistics does increase with sample size, a sample size of 1000 structures is adequate for most applications. Nevertheless, in the case that a rare event of small Boltzmann probability is of interest, the sample size can be increased to improve the precision in the estimates. For example, for the *L.collosoma* SL RNA, a sample size of 1 000 000 structures is used, because the Boltzmann probability for experimental structure 1 is merely 0.0036 (Table (Table11).

### Availability of software *S*fold

Based on the structure sampling algorithm and the novel tools, we have developed a software packaged named *S*fold, for statistical RNA folding and rational design of RNA-targeting nucleic acids. *S*fold is available through Web servers at http://sfold.wadsworth.org and http://www.bioinfo.rpi.edu/applications/sfold.

## DISCUSSION

We have presented a novel statistical algorithm to sample RNA secondary structures exactly and rigorously, according to the Boltzmann equilibrium probability distribution of the secondary structures for a given RNA sequence. The forward step of the algorithm for calculating partitions functions is cubic when long interior loops are prohibited, and the backward sampling step can rapidly generate a large, statistically representative sample of structures. A sample generated by the algorithm is guaranteed to be statistically representative. Characteristics of the Boltzmann ensemble revealed by sampling are statistically reproducible for independent samples of moderate size.

In the full ensemble of secondary structures for an RNA sequence, some structures are very similar, while others are substantially different. Although it is not computationally feasible to enumerate and classify all structures, the results on *L.collosoma* SL RNA show that the ensemble can be partitioned into mutually exclusive classes. Like the individual structures, the classes are not equally probable. The structure sample size of 1000 used in our classification analysis sets a probability threshold of 0.001. Classes with Boltzmann probabilities that are significant relative to the threshold are expected to be represented in the sample. Classes with insignificant probabilities are unlikely to be observed in the sample. Thus, sampling provides an efficient means to statistically delineate the Boltzmann ensemble of secondary structures through significant classes. The development of an algorithm for the determination of distinct classes will thus be an important topic of future research.

The sampling algorithm enables a rigorous statistical description of uncertainties in RNA folding prediction. The demonstrated new predictive insights and novel application tools are unique to this algorithm. The sampling approach avoids the limitation of suboptimal folding presentation by a designed set. It also bypasses the difficulty associated with a complete enumeration and examination of all suboptimal foldings within a specified increment of the MFE.

Prediction of suboptimal foldings is important, because the biological structures may not be well predicted by the MFE structure. Even for RNAs with possibly unique structures, the sampling approach has the advantage of addressing uncertainty in the predictions.

The algorithm is shown to meet the challenges of predicting alternative biological structures and of making an adequate representation of suboptimal foldings. The improved predictions in applications to mRNAs reported both here and in our previous work suggest that the sampling algorithm and the probability profile approach are well suited to the prediction of mRNA secondary structures and to the assessment of target accessibility, because an mRNA may exist as a population of different structures, and a stochastic approach to accessibility evaluation may be appropriate (5). The probability profiling approach reveals target sites that are commonly accessible for a large number of statistically representative structures for the target RNA. Through rigorous assignment of statistical confidence in predictions, this novel approach bypasses the long-standing difficulty of how to select a single structure for accessibility evaluation.

For antisense oligonucleotides and *trans*-cleaving ribozymes, it has been well established that the target accessibility is primarily determined by the secondary structure of the target RNA. Recently, experimental evidence has emerged to support the idea that the potency of small interfering RNAs (siRNAs) is also determined by the target secondary structure and accessibility (28–31). In the post-genomic era, reverse genetic tools based on these RNA-targeting nucleic acids are becoming increasingly important for high throughput functional genomics, drug target validation and development of human therapeutics. The probability profiling method and the mutual accessibility plot are useful tools for the rational design of RNA-targeting nucleic acids for gene knock-down studies, as well as for the design of nucleic acid probes such as molecular beacons for tracing mRNAs (32). Based on probability profiling, we have designed antisense oligonucleotides against *E.coli lacZ*, and siRNAs against the human estrogen receptor. Results from preliminary testing in cell extracts and in cultured cells are highly encouraging (a summary of our unpublished data is available on the *S*fold Web sites). Large-scale experimental testing is warranted to further validate the novel design methodology.

## ACKNOWLEDGEMENTS

We thank D.H. Mathews and D.H. Turner for making free energy parameters available to us and for interpretation of the parameters, and D.H. Mathews for a suggestion that significantly improved the run time for the calculation of partition functions. We also thank K. McDonough, E. Schneider, B. Pentecost, and M. Fasco for preliminary testing of antisense oligonucleotides, hammerhead ribozymes and siRNAs. The Computational Molecular Biology and Statistics Core at the Wadsworth Center is acknowledged for providing computing resources for this work. The secondary structure displays in Figures Figures4,4, ,55 and and77 were produced using the XRNA interactive program (http://rna.ucsc.edu/rnacenter/) developed by Bryn Weiser and Harry Noller at the University of California, Santa Cruz. The two-dimensional histograms for Figures Figures22 and and33 were generated by Histo-Scope and NPlot, an interactive histogramming and plotting program (http://www.fnal.gov/fermitools/abstracts/histoscope/abstract.html) developed by George Dimas and Mark Edel at Fermi National Accelerator Laboratory, Batavia, Illinois. The long-term development of the *S*fold software is supported by National Science Foundation grant DMS-0200970 and National Institutes of Health grant GM068726 to Y.D. Chiyu Chan is responsible for the development of the graphical tools and the Web user interface.

## APPENDIX: RECURSIONS FOR EQUILIBRIUM PARTITION FUNCTIONS

The structural elements in RNA secondary structure include helix, hairpin loop, bulge loop, interior (internal) loop and multibranched loop (Fig. (Fig.A1).A1). The exterior loop consists of free dangling bases on the 5′ end and the 3′ end, and bases in the junction between two separate folding domains.

When an unpaired base is adjacent to two helices, we only consider the 3′ dangling, because it is usually more energetically favorable than 5′ dangling, according to the free energy data for dangling ends (33). The assumed additivity of free energy implies multiplicativity of contributions by structural elements to the partition functions. The contributions to the partition functions by mutually exclusive conformational cases are, however, additive. These features are important in the derivation of a recursive algorithm for partition functions.

#### Recursions

For fragment R_{ij}, it can be either single stranded or there is at least one base pair on the fragment. In the latter case, if we consider the base pair *r*_{h} –*r*_{l} closest to the 5′ end of the fragment [the first (*h*–*i*) bases are thus single stranded], we then have the following mutually exclusive and exhaustive cases, as illustrated by Figure FigureA2:A2: (a) R _{ij} is single stranded; (b) *h* = *i*, *l* = *j*; (c) *i* < *h* < *l* = *j*; (d) *h* = *i* < *l* < *j*; (e) *i* < *h* < *l* < *j*. Thus, *u*(*i*, *j*) is a sum of five terms:

*u*(*i*, *j*) = 1 + *up*(*i*, *j*)exp[– *etp*(*i*, *j*)/*RT*] + Σ_{i < h < j} *up*(*h, j*)exp{– [*ed*5(*h*, *j*, *h* – 1) + *etp*(*h*, *j*)]/*RT*} + Σ_{i < l < j} *up*(*i*, *l*)exp[– *etp*(*i*, *l*)/RT]{exp[– *ed*3(*i*, *l*, *l* + l)/*RT*]*u*(*l* + 2, *j*) + *u*(*l* + 1, *j*) – *u*(*l* + 2, *j*)} + Σ_{i < h < l < j} *up*(*h, l*)exp{– [*ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/*RT*}{exp[– *ed*3(*h*, *l*, *l* + 1)/RT]*u*(*l* + 2, *j*) + *u*(*l* + 1, *j*) – *u*(*l* + 2, *j*)}**A1**

where for base pair *r*_{i} –*r*_{j}, *etp*(*i*, *j*) is the terminal *A*–*U*, *G*–*U* penalty, and *ed*5(*h*, *l*, *h* – 1) is the free energy for 5′ dangling *r*_{h – 1} on *r*_{h} –*r*_{l}, and *ed*3(*h*, *l*, *l* + 1) is the free energy for 3′ dangling *r*_{l + 1} on *r*_{h} –*r*_{l}. When *r*_{i} and *r*_{j} form a base pair, there are the following exclusive and exhaustive cases: (i) *r*_{i} –*r*_{j} closes a hairpin; (ii) *r*_{i} –*r*_{j} is the exterior pair of a base pair stack; (iii) *r*_{i} –*r*_{j} closes a bulge or an interior loop; or (iv) *r*_{i} –*r*_{j} closes a multibranched loop. Thus, *up*(*i*, *j*) is the sum of four contributions:

*up*(*i, j*) = exp[–*eh*(*i*, *j*)/*RT*] + exp[–*es*(*i*, *j, i +* 1*, j* – 1)/RT]*up*(*i* + 1, *j* – 1) + Σ_{i < h < l < j} exp[–*ebi*(*i*, *j*, *h*, *l*)/*RT*]*up*(*h*, *l*) + *up*_{m} (*i*, *j*)**A2**

where *eh*(*i*, *j*), *es*(*i*,*j*, *i* + l, *j* – 1) and *ebi*(*i*, *j*, *h*, *l*) are, respectively, the free energies for a hairpin closed by *r*_{i} –*r*_{j}, for stacking between base pairs *r*_{i} –*r*_{j} and *r*_{i + 1} –*r*_{j – 1}, and for a bulge or an interior loop with exterior base pair *r*_{i} –*r*_{j} and interior base pair *r*_{h} –*r*_{l}, while *up*_{m} (*i*, *j*) is the contribution from case (iv). For case (iv), if we consider the internal helix closest to *r*_{i} with closing pair *r*_{h} –*r*_{l}, the recursion for *up*_{m} (*i*, *j*) is

*up*_{m} (*i*, *j*) = Σ_{i + 1 < l < j} *up*(*i* + 1, *l*)exp{–[*a* + 2*c* + *etp*(*i* + 1, *l*)]/*RT*}{exp[–*ed*3(*i* + 1, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)} + Σ_{i + 2 < l < j} *up*(*i +* 2*, l*)exp{–[*a* + 2*c* + *b* + *ed*3(*j, i*, *i* + 1) + *etp*(*i* + 2, *l*)]/*RT*}{exp[–*ed*3(*i* + 2, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)} + Σ_{i + 3 ≤ h < l < j} *up*(*h, l*)exp{–[*a* + 2*c* + (*h* – *i* – 1)*b* + *ed*3(*j, i*, *i +* 1) + *ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/*RT*}{exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}**A3**

where *a*, *b*, *c* are the offset, free base penalty and helix penalty of the assumed linear penalty for a multibranched loop: loop penalty = *a* + *b*(number of unpaired bases) + *c*(number of helices); the three sums with *h* = *i* + l, *h* = *i* + 2 and *h* ≥ *i* + 3 are for different cases of dangling on *r*_{i} –*r*_{j} and *r*_{h} –*r*_{l}. *u*1(*k*, *j* – 1) is an auxiliary partition function for a multibranched loop with the following properties: there is at least one helix between *r*_{k} and *r*_{j – 1} ; *r*_{k – 1} is the 3′ end of the previous helix in the loop; *r*_{j} is the 3′ end base of the closing pair *r*_{i} –*r*_{j} for the loop. Similar to the derivation of *up*_{m} (*i*, *j*), we consider the closing base pair *r*_{h ′} –*r*_{l ′} of the helix closest to the 5′ end of *R*_{( l + 1) j}, and we take into account both the dangling energy and the terminal penalty. Furthermore, we must consider both the case of no additional helix between *r*_{l ′ + 1} and *r*_{j} and the case of at least one more helix. For *u*1(*i*, *j*), *r*_{j + 1} is the 3′ base of the closing base pair for the multibranched loop, and *r*_{i – 1} is the 3′ end of the previous helix in the loop. The recursion for *u*1(*i*, *j*) is:

*u*1(*i*, *j*) = Σ_{i < l ≤ j} *up*(*i, l*)exp{–[*c* + *etp*(*i*, *l*)]/*RT*}{*f*(*j* + 1, *i*, *l*)exp[–(*j* – *l*)*b*/*RT*] + exp[–*ed*3(*i*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j*) + *u*1(*l* + 1, *j*) – *u*1(*l* + 2, *j*)} + Σ_{i + 1 < l ≤ j} *up*(*i +* 1*, l*)exp{–[*c* + *b* + *etp*(*i +* 1, *l*)]/*RT*}{*f*(*j* + 1, *i +* 1, *l*)exp[–(*j* – *l*)*b*/*RT*] + exp[–*ed*3(*i* + 1, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j*) + *u*1(*l* + 1, *j*) – *u*1(*l* + 2,*j*)} + Σ_{i + 2 ≤ h < l ≤ j} *up*(*h, l*)exp{–[*c* + (*h* – *i*)b + *etp*(*h*, *l*) + *ed*5(*h*, *l*, *h* – 1)]/*RT*}{*f*(*j* + 1, *h*, *l*)exp[–(*j* – *l*)*b*/*RT*] + exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j*) + *u*1(*l* + 1, *j*) – *u*1(*l* + 2, *j*)}**A4**

where *f*(*j* + l, *h*, *l*) = 1 for *l* = *j* and *f*(*j* + l, *h*, *l*) = exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*] for *l* < *j*. The computation is *O*(*n*^{4}) for equations **A1**, **A3** and **A4** as written, and it is *O*(*n*^{3}) for equation **A2** when long interior loops are disallowed. We introduce three additional auxiliary arrays *s*1(*h*, *j*), *s*2(*h*, *j*) and *s*3(*h*, *j*):

*s*1(*h*, *j*) = Σ_{h < l < j} *up*(*h*, *l*)exp{–[*ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/*RT*}{exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*(*l* + 2, *j*) + *u*(*l* + 1, *j*) – *u*(*l* + 2, *j*)}**A5**

*s*2(*h*, *j*) = Σ_{h < l < j} *up*(*h*, *l*)exp{–[*ed*5(*h*, *l*, *h* – 1) + *etp*(*h*, *l*)]/*RT*}{exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j* – 1) + *u*1(*l* + 1, *j* – 1) – *u*1(*l* + 2, *j* – 1)}**A6**

*s*3(*h*, *j*) = Σ_{h < l ≤ j} *up*(*h*, *l*)exp{–[*ed*5(*h*, *l*, *h*-1) + *etp*(*h*, *l*)]/*RT*}{*f*(*j* + 1, *h*, *l*)exp[–(*j* – *l*)*b*/*RT*] + exp[–*ed*3(*h*, *l*, *l* + 1)/*RT*]*u*1(*l* + 2, *j*) + *u*1(*l* + 1, *j*) – *u*1(*l* + 2, *j*)}**A7**

Then, the quartic sum in equation **A1** becomes Σ_{i < h < j – 1} *s*1(*h*, *j*), the quartic sum in equation **A3** becomes exp[(–*ed*3(*j*, *i*, *i* + l)/*RT*] Σ_{i + 3 ≤ h < j – 1} exp [–(*a* + 2*c* + (*h* – *i* – 1)*b*)/ *RT*]*s*2(*h*, *j*), and the quartic sum in equation **A4** becomes Σ_{i + 2 ≤ h ≤ j – 1} exp [–(*c* + (*h* – *i*) *b*)/*RT* ]*s*3(*h*, *j*). At the cost of storage of these arrays, the algorithm is cubic when long interior loops (e.g. size >30) are disregarded.

#### Boundary values

We start the computation with boundary values for short fragments and proceed to longer ones using the recursions. For 1 ≤ *i* ≤ *j* ≤ *i* + 3 ≤ *n*, *u*(*i*, *j*) = 1, *up*(*i*, *j*) = 0, *u*1(*i*, *j*) = 0, *s*1(*i*, *j*) = 0, *s*2(*i*, *j*) = 0, and *s*3(*i*, *j*) = 0; for *j* = *i* + 4 ≤ *n*, *u*(*i*, *i* + 4) = 1 + exp[–(*eh*(3) + *etp*(*i*, *i* + 4))/*RT*], *up*(*i*, *i* + 4) = exp[–*eh*(3)/*RT*], *u*1(*i*, *i* + 4) = exp[–(*c* + *eh*(3) + *etp*(*i*, *i* + 4))/*RT*], *s*1(*i*, *i* + 4) = 0, *s*2(*i*, *i* + 4) = 0, and *s*3(*i*, *i* + 4) = exp[–(*eh*(3) + *etp*(*i*, *i* + 4) + *ed*5(*i*, *i* + 4, *i* – 1 )/*RT*]; for 1 ≤ *i* ≤ *n*, *u*(*i* + l, *i*) = 1, *u*1(*i* + l, *i*) = 0; and for 1 ≤ *i* ≤ *n* – 1, *u*1(*i* + 2, *i*) = 0.

## REFERENCES

*Leptomonas collosoma*spliced leader RNA can switch between two alternate structural forms. Biochemistry, 32, 5301–5311. [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (852K)

- A bayesian statistical algorithm for RNA secondary structure prediction.[Comput Chem. 1999]
*Ding Y, Lawrence CE.**Comput Chem. 1999 Jun 15; 23(3-4):387-400.* - Computing the partition function and sampling for saturated secondary structures of RNA, with respect to the Turner energy model.[J Comput Biol. 2007]
*Waldispühl J, Clote P.**J Comput Biol. 2007 Mar; 14(2):190-215.* - Sfold web server for statistical folding and rational design of nucleic acids.[Nucleic Acids Res. 2004]
*Ding Y, Chan CY, Lawrence CE.**Nucleic Acids Res. 2004 Jul 1; 32(Web Server issue):W135-41.* - Revolutions in RNA secondary structure prediction.[J Mol Biol. 2006]
*Mathews DH.**J Mol Biol. 2006 Jun 9; 359(3):526-32. Epub 2006 Feb 6.* - Prediction of RNA secondary structure by free energy minimization.[Curr Opin Struct Biol. 2006]
*Mathews DH, Turner DH.**Curr Opin Struct Biol. 2006 Jun; 16(3):270-8. Epub 2006 May 19.*

- RNA Secondary Structure Analysis Using RNAstructure[Current protocols in bioinformatics / edito...]
*Mathews DH.**Current protocols in bioinformatics / editoral board, Andreas D. Baxevanis ... [et al.]. 2006 Mar; 0 12Unit-12.6* - Facilitating RNA Structure Prediction with Microarrays[Biochemistry. 2006]
*Kierzek E, Kierzek R, Turner DH, Catrina IE.**Biochemistry. 2006 Jan 17; 45(2)581-593* - CapR: revealing structural specificities of RNA-binding protein target recognition using CLIP-seq data[Genome Biology. 2014]
*Fukunaga T, Ozaki H, Terai G, Asai K, Iwasaki W, Kiryu H.**Genome Biology. 2014; 15(1)R16* - Consecutive GA Pairs Stabilize Medium Size RNA Internal Loops[Biochemistry. 2006]
*Chen G, Turner DH.**Biochemistry. 2006 Mar 28; 45(12)4025-4043* - Systematic design and functional analysis of artificial microRNAs[Nucleic Acids Research. 2014]
*Arroyo JD, Gallichotte EN, Tewari M.**Nucleic Acids Research. 2014 May 1; 42(9)6064-6077*

- Gene (nucleotide)Gene (nucleotide)Records in Gene identified from shared sequence links
- MedGenMedGenRelated information in MedGen
- NucleotideNucleotidePublished Nucleotide sequences
- PubMedPubMedPubMed citations for these articles
- SubstanceSubstancePubChem Substance links
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- A statistical sampling algorithm for RNA secondary structure predictionA statistical sampling algorithm for RNA secondary structure predictionNucleic Acids Research. Dec 15, 2003; 31(24)7280PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...