# Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations

^{*}To whom correspondence should be sent at the present address: Institut Curie, Section de Recherche, Centre National de la Recherche Scientifique–Unité Mixte de Recherche 168, 11 Rue Pierre et Marie Curie, 75005 Paris, France. E-mail: herve.isambert@curie.fr.

## Abstract

*Ab initio* RNA secondary structure predictions have long dismissed helices interior to loops, so-called pseudoknots, despite their structural importance. Here we report that many pseudoknots can be predicted through long-time-scale RNA-folding simulations, which follow the stochastic closing and opening of individual RNA helices. The numerical efficacy of these stochastic simulations relies on an 𝒪(*n*^{2}) clustering algorithm that computes time averages over a continuously updated set of *n* reference structures. Applying this exact stochastic clustering approach, we typically obtain a 5- to 100-fold simulation speed-up for RNA sequences up to 400 bases, while the effective acceleration can be as high as 10^{5}-fold for short, multistable molecules (≤150 bases). We performed extensive folding statistics on random and natural RNA sequences and found that pseudoknots are distributed unevenly among RNA structures and account for up to 30% of base pairs in G+C-rich RNA sequences (online RNA-folding kinetics server including pseudoknots: http://kinefold.u-strasbg.fr).

The folding of RNA transcripts is driven by intramolecular GC/AU/GU base-pair stacking interactions. This primarily leads to the formation of short, double-stranded RNA helices connected by unpaired regions. *Ab initio* RNA-folding prediction restricted to tree-like secondary structures is now well established (refs. 1–7, ref. 8 and references therein, www.bioinfo.rpi.edu/applications/mfold, and www.tbi.univie.ac.at) and has become an important tool to study and design RNA structures, which remain by and large refractory to many crystallization techniques. Yet, the accuracy of these predictions is difficult to assess, despite the precision of stacking interaction tables (7), due to their *a priori* dismissal of pseudoknot helices (Fig. 1*A*).

Fig. 1. (*A*) An RNA secondary structure with pseudoknots. (*B*) Minimum set of helices defined as “pseudoknots” and visualized for convenience by colored, single-stranded regions connected by two straight lines. (*C*) The entropic cost of the actual ...

Pseudoknots are regular double-stranded helices that provide specific structural rigidity to the RNA molecule by connecting different “branches” of its otherwise more flexible, tree-like secondary structure (Fig. 1 *A* and *B*). Many ribozymes, which require a well defined 3D enzymatic shape, have pseudoknots (9–17). Pseudoknots are also involved in mRNA–ribosome interactions during translation initiation and frameshift regulation (18). Still, the overall prevalence of pseudoknots has proved difficult to ascertain from the limited number of RNA structures known to date. This recently has motivated several attempts to include pseudoknots in RNA secondary structure predictions (19–21).

There are two main obstacles to including pseudoknots in RNA structures: a structural modeling problem and a computational efficiency issue. In the absence of databases for pseudoknot energy parameters, their structural features have been modeled at various descriptive levels by using polymer theory (19, 21, 22). From a computational perspective, pseudoknots have proved not easily amenable to classical polynomial minimization algorithms (20) because of their intrinsic nonnested nature. Instead, simulating RNA-folding dynamics has provided an alternative avenue to predict pseudoknots (21, 22) in addition to bringing some unique insight into the kinetic aspects of RNA folding (8, 21).

Yet, stochastic RNA-folding simulations can become relatively inefficient due to the occurrence of short cycles among closely related configurations (22), which typically differ by a few helices only. Not surprisingly, similar numerical pitfalls have been recurrent in stochastic simulations of other trapped dynamical systems (ref. 23 and references therein and refs. 24–27).

To address this computational efficiency issue and capture the slow folding dynamics of RNA molecules, we developed a generic algorithm that greatly accelerates RNA-folding stochastic simulations by exactly clustering the main short cycles along the explored folding paths. The general approach, which may prove useful to simulate other trapped dynamical systems, is discussed in *Theory and Methods*. In *Results*, the efficacy of these exactly clustered stochastic (ECS) simulations is first compared with nonclustered RNA-folding simulations, before being used to predict the prevalence of pseudoknots in RNA structures on the basis of the structural model introduced in ref. 21, and reviewed briefly hereafter.

## Theory and Methods

**Modeling and Visualizing Pseudoknots in RNA Structures.** We model the 3D constraints associated with pseudoknots using polymer theory. The entropy costs of pseudoknots and internal, bulge, and hairpin loops are evaluated on the same basis by modeling the secondary structure (including pseudoknots) as an assembly of stiff rods (representing the helices) connected by polymer springs (corresponding to the unpaired regions) (Fig. 1*C*). In practice, free-energy computations involve the labeling of RNA structures into constitutive “nets” (shown as colored circuits in Fig. 1*C*) to account for the stretching of the unpaired regions linking the extremities of pseudoknot helices (see ref. 21 for details). In addition, free-energy contributions from base-pair stackings, terminal mismatches, and coaxial stackings are taken from the thermodynamic tables measured by the Turner Laboratory (7).

The main limitation of this structural model is the absence of hardcore interactions, which could stereochemically prohibit certain RNA structures with either long pseudoknots (e.g., >11 bp, one helix turn) or a large proportion of pseudoknots (e.g., >30% of formed base pairs). However, we found that such stereochemically improbable structures account for <1–10% of all predicted structures depending on G+C content (see *Results*). Hence, in practice, neglecting hardcore interactions is rarely a stringent limitation except for a few, somewhat-pathological cases.

Although the presence of pseudoknots in an RNA structure is not associated with a unique set of helices, it is convenient for visualization and statistics purposes to define the set of pseudoknots as the minimum set of helices that should be imagined broken to obtain a tree-like secondary structure (Fig. 1*B*). Finding such a minimum set (with respect to the number of base pairs or their free energy) amounts to finding the maximum tree-like set among the formed helices and can be done in polynomial time by using a classical “dynamic programming” algorithm.
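The search for a maximum tree-like set of helices can be sketched with a small interval dynamic program. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each helix is a hypothetical triple `(i, j, length)` whose 5′ strand `i..i+length-1` pairs with the 3′ strand `j-length+1..j`, helices do not share bases, and the score is the number of base pairs. The function name `max_nested_helices` is introduced here for illustration only.

```python
from functools import lru_cache

def max_nested_helices(helices, seq_len):
    """Return (score, kept helix indices, pseudoknot helix indices).

    The kept helices form a maximum-base-pair nested (tree-like) subset;
    every other formed helix is reported as a pseudoknot.
    """
    by_start = {}
    for idx, (i, j, length) in enumerate(helices):
        by_start.setdefault(i, []).append((idx, j, length))

    @lru_cache(maxsize=None)
    def best(a, b):
        # Best nested subset using only helices fully inside [a, b].
        if a >= b:
            return 0, ()
        score, chosen = best(a + 1, b)          # no helix starts at a
        for idx, j, length in by_start.get(a, []):
            if j > b:
                continue
            s_in, c_in = best(a + length, j - length)   # enclosed region
            s_out, c_out = best(j + 1, b)               # region after the helix
            total = length + s_in + s_out
            if total > score:
                score, chosen = total, (idx,) + c_in + c_out
        return score, chosen

    score, chosen = best(0, seq_len - 1)
    kept = set(chosen)
    pseudoknots = [k for k in range(len(helices)) if k not in kept]
    return score, sorted(kept), pseudoknots

# Hypothetical example: helices 0 and 1 cross each other, helix 2 is nested in helix 1.
example = [(0, 20, 5), (8, 30, 4), (21, 26, 2)]
print(max_nested_helices(example, 31))
```

In this toy case the nested set keeps helices 0 and 2 (7 bp), so helix 1 is the one to be "imagined broken". The same recursion could score helices by free energy instead of base-pair count.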

**Modeling RNA-Folding Dynamics and Straightforward Stochastic Algorithm.** RNA-folding kinetics is known to proceed through rare stochastic openings and closings of individual RNA helices (28). The time-limiting step to transit between two structures sharing essentially all but one helix can be assigned Arrhenius-like rates, *k*_{±} = *k*° × exp(–Δ*G*_{±}/*kT*), where *kT* is the thermal energy. *k*°, which reflects only local stacking processes within a transient nucleation core, has been estimated from experiments on isolated stem-loops (28) (*k*° ≃ 10^{8} s^{–1}), whereas the free-energy differences Δ*G*_{±} between the transition states and the current configurations (Fig. 2) can be evaluated by combining the stacking energy contributions and the global coarse-grained structural model described above (Fig. 1*C*).
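As a small numerical illustration of the Arrhenius-like rates above, the sketch below evaluates *k* = *k*° × exp(–Δ*G*/*kT*). The value of *k*° is the one quoted in the text; the thermal energy (about 0.616 kcal/mol at 37 °C) and the two barrier heights are assumptions chosen for the example.

```python
import math

K0 = 1e8      # s^-1, nucleation rate k° quoted in the text
KT = 0.616    # kcal/mol, thermal energy at 37 °C (assumed value)

def arrhenius_rate(delta_g, k0=K0, kt=KT):
    """Rate k = k° * exp(-ΔG/kT) for a free-energy barrier ΔG in kcal/mol."""
    return k0 * math.exp(-delta_g / kt)

# Hypothetical barriers for closing and opening the same helix.
k_close = arrhenius_rate(3.0)
k_open = arrhenius_rate(9.0)
print(k_close, k_open, k_close / k_open)
```

Because the rates depend exponentially on the barrier, a difference of a few kcal/mol between the two barriers already separates the closing and opening time scales by several orders of magnitude.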

Fig. 2. [...] Δ*G*_{±} to close and open an individual helix between two neighbor RNA structures, *i* and *j*. Nucleation of the new helix usually involves some local unzipping of nearby helices at the barrier ...

Simulating a stochastic RNA-folding pathway amounts to following one particular stochastic trajectory within the large combinatorial space of mutually compatible helices (22). Each transition in this discrete space of RNA structures corresponds to the opening or closing of a single helix, possibly followed by additional helix elongation and shrinkage rearrangements to reach the new structure's equilibrium compatible with a minimum size constraint for each formed helix (21) (base-pair zipping/unzipping kinetics occurs on much shorter time scales than helix nucleation/dissociation). For a given RNA sequence, the total number of possible helices (which roughly scales as *L*^{2}, where *L* is the sequence length) sets the local connectivity of the discrete structure space and therefore the number of possible transitions from each particular structure.

Formally, we consider the following generic model. Each structure or “state” *i* is connected to a finite, yet possibly state-to-state varying number of neighboring configurations *j* via transition rates $k_{ji}$ (the right-to-left matrix ordering of indices is adopted hereafter). Because $k_{ji}$ is the average number of transitions from state *i* to state *j* per unit time, the lifetime $t_{i}$ of configuration *i* corresponds to the average time before any transition toward a neighboring state *j* occurs, i.e., $t_{i} = 1/\sum_{\langle j \rangle} k_{ji}$, and the transition probability from state *i* to state *j* is $p_{ji} = k_{ji} t_{i}$, with $\sum_{\langle j \rangle} p_{ji} = 1$, as expected, for all states *i*.

Hence, in the straightforward stochastic algorithm (21, 22), each new transition is picked at random with probability $p_{ji}$ while the effective time is incremented with the lifetime $t_{i}$ of the current configuration *i*. [In principle, the approach can be adapted to stochastically drawn lifetimes from known distributions $P^{i}(t)$ with mean lifetime $t_{i}$. This effectively yields an 𝒪(*n*^{3}) ECS algorithm in this case.] However, as mentioned in the Introduction, the efficiency of this approach is often severely impeded by the existence of kinetic traps consisting of rapidly exchanging states.
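The straightforward algorithm can be sketched in a few lines. The code below is an illustration only: the dictionary layout `rates[i][j]` (rate from state *i* to neighbor *j*), the function names, and the three-state toy landscape are assumptions made for this example.

```python
import random

def lifetimes_and_probs(rates):
    """Return lifetimes t_i = 1/sum_j k_ji and probabilities p_ji = k_ji * t_i."""
    t, p = {}, {}
    for i, out in rates.items():
        t[i] = 1.0 / sum(out.values())
        p[i] = {j: k * t[i] for j, k in out.items()}
    return t, p

def simulate(rates, start, n_steps, rng):
    """Follow one stochastic trajectory; return (final state, elapsed time)."""
    t, p = lifetimes_and_probs(rates)
    state, clock = start, 0.0
    for _ in range(n_steps):
        clock += t[state]                       # add the mean lifetime of the state
        targets = list(p[state])
        weights = [p[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]
    return state, clock

# Hypothetical toy landscape: states 0 and 1 exchange rapidly (a kinetic trap),
# while the escape from state 1 to state 2 is a million times slower.
rates = {0: {1: 1e6}, 1: {0: 1e6, 2: 1.0}, 2: {1: 1e3}}
print(simulate(rates, 0, 1000, random.Random(0)))
```

With these rates, most of the steps are spent cycling between states 0 and 1, which is the kind of short cycle that the clustering described next is meant to integrate out.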

**ECS Simulations.** As in the case of RNA-folding dynamics, the simulation of other trapped dynamical systems generally presents a computational efficiency issue. In particular, powerful numerical schemes have been developed to compute the elementary escape times from traps for a variety of simulation techniques (see ref. 23 and references therein and refs. 24–27). Still, a pervasive problem usually remains for most applications due to the occurrence of short cycles among trapped states, and heuristic clustering approaches have been proposed to overcome these “numerical traps” (29).

To capture the slow folding dynamics of RNA molecules, we developed an exact stochastic algorithm that accelerates the simulation by numerically integrating the main short cycles among trapped states. This approach being quite general, it could prove useful to simulate other small, trapped dynamical systems with coarse-grained degrees of freedom.

In a nutshell, the ECS algorithm aims at overcoming the numerical pitfalls of kinetic traps by “clustering” some recently explored configurations into a single, yet continuously updated cluster *A* of *n* reference states. These clustered configurations are then collectively revisited in the subsequent stochastic exploration of states. Although stochasticity is “lost” for the individual clustered states, its statistical properties are exactly transposed at the scale of the set *A* of the *n* reference states. This is achieved as follows. For each pathway $\mathcal{C}^{A}_{ji}$ on *A*, a statistical weight $W_{\mathcal{C}^{A}_{ji}} = \prod_{\mathcal{C}^{A}_{ji}} p_{lk}$ is defined, where *k* and *l* run over all consecutive states along $\mathcal{C}^{A}_{ji}$ from its “starting” state *i* to its “exiting” state *j* on *A*. The *n* × *n* probability matrix $P^{A}$ that sums the statistical weights over all pathways on *A* between any two states *i* and *j* of *A* is then introduced,

$$P^{A}_{ji} = \sum_{\mathcal{C}^{A}_{ji}} W_{\mathcal{C}^{A}_{ji}},$$

and the exit probability to make a transition outside *A* from the state *j* is noted $p^{e}_{j} = 1 - \sum_{k \in A} p_{kj}$. Hence, starting from state *i*, the probability to exit the set *A* at state *j* is $p^{e}_{j} P^{A}_{ji}$, with $\sum_{j} p^{e}_{j} P^{A}_{ji} = 1$ for all *i* of *A*.

Thus, in the ECS algorithm, one first chooses at random with probability $p^{e}_{j} P^{A}_{ji}$ the reference state *j* of *A* from which a new transition toward a state *k* outside *A* will then be chosen stochastically with probability $p_{kj}/p^{e}_{j}$. Meanwhile, the physical quantities of interest, such as the cumulative time lapse to exit the set *A* from *j* starting at *i*, are exactly averaged over all (future) pathways from *i* to *j* within *A*, as explained in the next subsection. Finally, the new state *k* is added to the reference set *A* while another reference state is removed, so as to update *A*, as discussed in *The* 𝒪(*n*^{2}) *Algorithm*.
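The exit-state draw can be illustrated for a fixed cluster. The sketch below assumes that the sum over all pathways inside *A* is evaluated as a geometric series, $P^{A} = \sum_{k} (p^{AA})^{k} = [I - p^{AA}]^{-1}$, where `p_in[j][i]` is the transition probability from *i* to *j* restricted to *A*. The helper names and the two-state trap are hypothetical, and the direct inversion used here is for clarity only (it is not the 𝒪(*n*^{2}) update described later).

```python
import random

def mat_inv(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if r == c else 0.0 for c in range(n)]
         for r, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        d = a[col][col]
        a[col] = [v / d for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def path_sum_matrix(p_in):
    """P^A = (I - p^AA)^-1, the summed weights of all pathways inside A."""
    n = len(p_in)
    return mat_inv([[(1.0 if r == c else 0.0) - p_in[r][c] for c in range(n)]
                    for r in range(n)])

def exit_probs(p_in):
    """p^e_j = 1 - sum_k p_kj, the probability to leave A directly from state j."""
    n = len(p_in)
    return [1.0 - sum(p_in[k][j] for k in range(n)) for j in range(n)]

def pick_exit_state(p_in, start, rng):
    """Draw the state j from which A is exited, with probability p^e_j * P^A_{j,start}."""
    P, pe = path_sum_matrix(p_in), exit_probs(p_in)
    weights = [pe[j] * P[j][start] for j in range(len(p_in))]
    return rng.choices(range(len(p_in)), weights=weights)[0], weights

# Hypothetical two-state trap: 0 -> 1 always, 1 -> 0 with probability 0.99.
p_in = [[0.0, 0.99],
        [1.0, 0.0]]
print(pick_exit_state(p_in, 0, random.Random(0)))
```

The weights returned by `pick_exit_state` sum to 1 for any starting state, which is the normalization stated in the text.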

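**Exact Averaging over All Future Pathways.** We start the discussion with the path average time lapse to exit the set *A*. Let us introduce the time-lapse transform $\tilde{P}^{A}_{ji}\{t\}$ of $P^{A}_{ji}$, which sums the weighted cumulative lifetimes over all pathways on *A* between any two states *i* and *j* of *A*,

$$\tilde{P}^{A}_{ji}\{t\} = \sum_{\mathcal{C}^{A}_{ji}} \Big( \sum_{h \in \mathcal{C}^{A}_{ji}} t_{h} \Big) W_{\mathcal{C}^{A}_{ji}},$$

where the $t_{h}$ values are summed over all consecutive states *h* (from *i* to *j* included) along each pathway $\mathcal{C}^{A}_{ji}$ of statistical weight $W_{\mathcal{C}^{A}_{ji}} = \prod_{\mathcal{C}^{A}_{ji}} p_{lk}$. Hence, the mean time *to exit A* from any state *j* of *A* starting from configuration *i* is $\sum_{j} p^{e}_{j} \tilde{P}^{A}_{ji}\{t\}$. However, in the context of the ECS algorithm, the time lapse of interest is $\langle t \rangle^{A}_{ji}$, the mean time to exit *A* from a particular state *j*,

$$\langle t \rangle^{A}_{ji} = \frac{\tilde{P}^{A}_{ji}\{t\}}{P^{A}_{ji}}.$$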

The average of any path cumulative quantity of interest $x_{i}$ can be similarly obtained by introducing the appropriate $\tilde{P}^{A}\{x\}$ matrix. In particular, the instantaneous efficiency of the algorithm is well reflected by the average pathway length between any two states of *A*,

$$\langle \ell \rangle^{A}_{ji} = \frac{\tilde{P}^{A}_{ji}\{\ell\}}{P^{A}_{ji}}, \tag{3}$$

where $\tilde{P}^{A}_{ji}\{\ell\} = \sum_{\mathcal{C}^{A}_{ji}} \ell_{\mathcal{C}^{A}_{ji}} W_{\mathcal{C}^{A}_{ji}}$, with $\ell_{\mathcal{C}^{A}_{ji}}$ corresponding to the length of the pathway $\mathcal{C}^{A}_{ji}$ (1 is added at each state along each pathway). Hence, starting from state *i*, $\langle \ell \rangle^{A}_{ji}$ corresponds to the average number of transitions that would have to be performed by the straightforward algorithm before exiting the set *A* at state *j*. As expected, $\langle \ell \rangle^{A}_{ji}$ can be very large for a trapped dynamical system, which accounts for the efficiency of the present algorithm. Because the approach is exact, there is, however, no *a priori* requirement on the trapping condition of the states of *A*, and the algorithm can be used continuously.

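Similarly, the time average of any physical quantity $y_{i}$ (such as the pseudoknot proportion of an RNA molecule) can be calculated by introducing the appropriate time-weighted matrix $\tilde{P}^{A}\{yt\}$. For instance, the time average energy over all pathways between any two states *i* and *j* of *A* is $\langle E \rangle^{A}_{ji} = \tilde{P}^{A}_{ji}\{Et\} / \tilde{P}^{A}_{ji}\{t\}$, where $\tilde{P}^{A}_{ji}\{Et\} = \sum_{\mathcal{C}^{A}_{ji}} \big( \sum_{h} E_{h} t_{h} \big) W_{\mathcal{C}^{A}_{ji}}$.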
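For a fixed cluster, path averages of this kind can be checked numerically. The sketch below relies on an identity that follows from splitting every pathway at each visited state *h*: the matrix of weighted cumulative sums equals $P^{A}\,\mathrm{diag}(x)\,P^{A}$, with $P^{A} = [I - p^{AA}]^{-1}$. This closed form, the helper names, and the two-state trap are assumptions of the example, not the update scheme of the paper.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def mat_inv(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if r == c else 0.0 for c in range(n)]
         for r, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        d = a[col][col]
        a[col] = [v / d for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def path_sum_matrix(p_in):
    """P^A = (I - p^AA)^-1, with p_in[j][i] the probability from i to j inside A."""
    n = len(p_in)
    return mat_inv([[(1.0 if r == c else 0.0) - p_in[r][c] for c in range(n)]
                    for r in range(n)])

def path_average_matrix(P, x):
    """Weighted cumulative sums of x over all pathways: P diag(x) P."""
    n = len(P)
    Px = [[P[r][h] * x[h] for h in range(n)] for r in range(n)]
    return mat_mul(Px, P)

# Hypothetical two-state trap: 0 -> 1 always, 1 -> 0 with probability 0.99.
p_in = [[0.0, 0.99],
        [1.0, 0.0]]
P = path_sum_matrix(p_in)

# Average pathway length from state 0 to an exit at state 1 (x = 1 at every state).
mean_len = path_average_matrix(P, [1.0, 1.0])[1][0] / P[1][0]

# Mean time to exit at state 1 starting from 0, for hypothetical lifetimes t_0, t_1.
lifetimes = [1e-6, 1e-6]
mean_time = path_average_matrix(P, lifetimes)[1][0] / P[1][0]
print(mean_len, mean_time)   # about 200 visited states, about 2e-4 time units
```

The average pathway length of about 200 for this trap is the number of elementary steps that the straightforward algorithm would spend, on average, before leaving the cluster.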

The actual calculation of the probability and path average matrices $P^{C}$ and $\tilde{P}^{C}$ over a set *C* of *N* states will be performed recursively in the next subsection. As an intermediate step, we first consider hereafter the unidirectional connection between two disjoint sets *A* and *B*.

Let us hence introduce the transfer matrix $T^{BA}$ from set *A* to set *B* defined as $T^{BA}_{ji} = p_{ji}$, where $p_{ji}$ is the probability to make a transition from state *i* of *A* to state *j* of *B* ($p_{ji} = 0$ if *i* and *j* are not connected). We will assume that *A* has *n* states and *B* has *m* states and that their probability and path average matrices $P^{A}$, $\tilde{P}^{A}$, $P^{B}$, and $\tilde{P}^{B}$ are known. Starting at state *i* of *A*, we find that the probability to exit on *j* of *B* after crossing once and only once from *A* to *B* is $p^{e}_{j} (P^{B} T^{BA} P^{A})_{ji}$, where we have used matrix notations. Let us consider a particular path from *i* in *A* to *j* in *B* crossing once and only once from *A* to *B*, with statistical weight $(\prod^{B} p_{lk})\, p_{ba}\, (\prod^{A} p_{l'k'})$. Its contribution to the average time to exit somewhere from the union of *A* and *B* is, up to the exit probability $p^{e}_{j}$,

$$\Big( \sum^{B} t_{h} + \sum^{A} t_{h'} \Big) \Big( \prod^{B} p_{lk} \Big) p_{ba} \Big( \prod^{A} p_{l'k'} \Big) = \Big( \sum^{B} t_{h} \prod^{B} p_{lk} \Big) p_{ba} \Big( \prod^{A} p_{l'k'} \Big) + \Big( \prod^{B} p_{lk} \Big) p_{ba} \Big( \sum^{A} t_{h'} \prod^{A} p_{l'k'} \Big),$$

or in matrix form for any “direct” pathway from *A* to *B*

$$\widetilde{P^{B} T^{BA} P^{A}} = \tilde{P}^{B} T^{BA} P^{A} + P^{B} T^{BA} \tilde{P}^{A},$$

which implies that applying the usual differentiation rules to any combination of probability matrices yields the correct combined path average matrices (defining $\tilde{T}^{BA}_{ji} = 0$ for all *i* and *j*). Note that this out-of-equilibrium calculation of path average quantities is reminiscent of the usual equilibrium calculation of thermal averages through differentiation of an appropriate partition function. Indeed, the probability matrices introduced here are “partition functions” over all pathways within a set of reference states.

**The** 𝒪(*n*^{2}) **Algorithm.** With this result in mind, we can now return to the calculation of the probability and path average matrices $P^{C}$ and $\tilde{P}^{C}$ for the union *C* of two disjoint sets *A* and *B*.

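Defining $P^{Ab} = P^{A} T^{AB}$ and $P^{Ba} = P^{B} T^{BA}$, we readily obtain the probability matrix $P^{C}$ as an infinite summation over all possible pathway loops between the sets *A* and *B* (*I* is the identity matrix; blocks are ordered as *A*, *B*),

$$P^{C} = \begin{pmatrix} L^{A} P^{A} & L^{A} P^{Ab} P^{B} \\ L^{B} P^{Ba} P^{A} & L^{B} P^{B} \end{pmatrix}, \tag{6}$$

where $L^{A} = \sum_{k \ge 0} (P^{Ab} P^{Ba})^{k} = [I - P^{Ab} P^{Ba}]^{-1}$ and $L^{B} = \sum_{k \ge 0} (P^{Ba} P^{Ab})^{k} = [I - P^{Ba} P^{Ab}]^{-1}$.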
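The loop summation can be checked on a small example. The sketch below assumes the block structure $P^{C}_{AA} = L^{A} P^{A}$, $P^{C}_{AB} = L^{A} P^{Ab} P^{B}$, $P^{C}_{BA} = L^{B} P^{Ba} P^{A}$, and $P^{C}_{BB} = L^{B} P^{B}$ (obtained by summing loops between *A* and *B*) and compares it with a direct inversion of $I - p$ on the whole set. The helper names and the three-state transition matrix are hypothetical.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def mat_inv(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if r == c else 0.0 for c in range(n)]
         for r, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        d = a[col][col]
        a[col] = [v / d for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def one_minus(m):
    """Return I - m for a square matrix m."""
    n = len(m)
    return [[(1.0 if r == c else 0.0) - m[r][c] for c in range(n)] for r in range(n)]

def union_blocks(PA, PB, TAB, TBA):
    """Blocks (AA, AB, BA, BB) of P^C built from P^A, P^B and the transfer matrices."""
    PAb = mat_mul(PA, TAB)                      # n x m
    PBa = mat_mul(PB, TBA)                      # m x n
    LA = mat_inv(one_minus(mat_mul(PAb, PBa)))  # sums all A -> B -> A loops
    LB = mat_inv(one_minus(mat_mul(PBa, PAb)))  # sums all B -> A -> B loops
    return (mat_mul(LA, PA), mat_mul(mat_mul(LA, PAb), PB),
            mat_mul(mat_mul(LB, PBa), PA), mat_mul(LB, PB))

# Hypothetical transition probabilities p[j][i] (from i to j) on C = {0, 1, 2}.
p = [[0.0, 0.5, 0.4],
     [0.6, 0.0, 0.4],
     [0.3, 0.2, 0.0]]
PA = mat_inv(one_minus([[0.0, 0.5], [0.6, 0.0]]))   # A = {0, 1}
PB = [[1.0]]                                        # B = {2}, a single state
TBA = [[0.3, 0.2]]                                  # from A to B (1 x 2)
TAB = [[0.4], [0.4]]                                # from B to A (2 x 1)

AA, AB, BA, BB = union_blocks(PA, PB, TAB, TBA)
direct = mat_inv(one_minus(p))                      # reference result on the whole set
print(BB[0][0], direct[2][2])
```

Both routes give the same matrix, which is the property used when states are merged pairwise or added one at a time.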

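Defining also $\tilde{P}^{Ab} = \tilde{P}^{A} T^{AB}$ and $\tilde{P}^{Ba} = \tilde{P}^{B} T^{BA}$, we finally obtain the path average matrix $\tilde{P}^{C}$ from simple “differentiation” of the “partition function” $P^{C}$ (Eq. **6**)

$$\tilde{P}^{C} = \begin{pmatrix} \tilde{L}^{A} P^{A} + L^{A} \tilde{P}^{A} & \tilde{L}^{A} P^{Ab} P^{B} + L^{A} \tilde{P}^{Ab} P^{B} + L^{A} P^{Ab} \tilde{P}^{B} \\ \tilde{L}^{B} P^{Ba} P^{A} + L^{B} \tilde{P}^{Ba} P^{A} + L^{B} P^{Ba} \tilde{P}^{A} & \tilde{L}^{B} P^{B} + L^{B} \tilde{P}^{B} \end{pmatrix}, \tag{7}$$

where $\tilde{L}^{A} = L^{A} [\tilde{P}^{Ab} P^{Ba} + P^{Ab} \tilde{P}^{Ba}] L^{A}$ and $\tilde{L}^{B} = L^{B} [\tilde{P}^{Ba} P^{Ab} + P^{Ba} \tilde{P}^{Ab}] L^{B}$.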

Eqs. **6** and **7** are valid for any sizes *n* and *m* of *A* and *B*. Hence $P^{C}$ and $\tilde{P}^{C}$ can be calculated recursively starting from *N* isolated states and 2*N* 1 × 1 matrices $P^{i} = [1]$ and $\tilde{P}^{i}\{x\} = [x_{i}]$, with *i* = 1, …, *N*, where $x_{i}$ is the value of the feature of interest in state *i*. Clustering those states 2 by 2, then 4 by 4, etc., by using Eqs. **6** and **7** finally yields $P^{C}$ and $\tilde{P}^{C}$ in 𝒪(*N*^{3}) operations (i.e., by matrix inversions and multiplications). However, instead of recalculating everything recursively from scratch each time the set of reference states is modified, it turns out to be much more efficient to update it continuously each time a single state is added. Indeed, Eqs. **6** and **7** can be calculated in 𝒪(*n*^{2}) operations only, when *m* = 1 and *n* = *N* – 1, as we will show below. Naturally, a complete update also requires the removal of one “old” reference state each time a “new” one is added so as to keep a stationary number *n* of reference configurations. As we will see, this removal step can also be calculated in 𝒪(*n*^{2}) operations only.

The 𝒪(*n*^{2})-operation update of the reference set, which we now outline, relies on the fact that $T^{AB}$, $P^{Ab}$, and $\tilde{P}^{Ab}$ are *n* × 1 matrices and that $T^{BA}$, $P^{Ba}$, and $\tilde{P}^{Ba}$ are 1 × *n* matrices when *m* = 1 and *n* = *N* – 1 ($P^{B}$ and $L^{B}$ are simple 1 × 1 matrices for a single state *B*). Because we operate on vectors, the Sherman–Morrison formula (30) can then be used to calculate the *n* × *n* matrix $L^{A} = [I - P^{Ab} \otimes P^{Ba}]^{-1} = I + (P^{Ab} \otimes P^{Ba})/(1 - P^{Ba} P^{Ab})$. Hence, not only $L^{A}$ but also any matrix product $L^{A} M$, where *M* is an *n* × *n* matrix, can be evaluated in 𝒪(*n*^{2}) operations [by first calculating $P^{Ba} M$ followed by $P^{Ab} \otimes (P^{Ba} M)$]. Noticing that the same reasoning applies for the *n* × *n* matrices $\tilde{P}^{Ab} \otimes P^{Ba}$ and $P^{Ab} \otimes \tilde{P}^{Ba}$ provides a simple scheme to add a single reference state to *A* and obtain matrices $P^{C}$ and $\tilde{P}^{C}$ in 𝒪(*n*^{2}) operations by using Eqs. **6** and **7**.
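The rank-one trick can be sketched as follows. The function below applies $[I - u \otimes v]^{-1}$ to a matrix *M* through the Sherman–Morrison formula without ever forming or inverting an *n* × *n* matrix; `u` stands for the column vector $P^{Ab}$ and `v` for the row vector $P^{Ba}$. The function name and the numerical values are hypothetical.

```python
def apply_LA(u, v, M):
    """Return L M with L = [I - u v^T]^-1, using Sherman-Morrison in O(n^2) operations.

    L = I + u v^T / (1 - v.u), hence L M = M + u (v^T M) / (1 - v.u).
    """
    n = len(u)
    denom = 1.0 - sum(v[k] * u[k] for k in range(n))                  # scalar 1 - v.u
    vM = [sum(v[k] * M[k][c] for k in range(n)) for c in range(n)]    # row vector v^T M
    return [[M[r][c] + u[r] * vM[c] / denom for c in range(n)] for r in range(n)]

# Hypothetical vectors and matrix.
u = [0.2, 0.1]
v = [0.3, 0.4]
M = [[1.0, 2.0],
     [3.0, 4.0]]
X = apply_LA(u, v, M)

# Check: multiplying X by (I - u v^T) should give M back.
vX = [sum(v[k] * X[k][c] for k in range(2)) for c in range(2)]
back = [[X[r][c] - u[r] * vX[c] for c in range(2)] for r in range(2)]
print(back)
```

Both the row vector `vM` and the outer-product correction cost on the order of *n*^{2} operations, which is where the overall 𝒪(*n*^{2}) count per added state comes from.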

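To achieve the reverse modification consisting in removing one state *B* from the reference set *C*, it is useful to first imagine that the original $P^{C}$ and $\tilde{P}^{C}$ were obtained by the addition of the single state *B* to the *n*-configuration set *A*, as given by Eqs. **6** and **7**. Identifying in $P^{C}$ the row $Q^{BA}$, column $Q^{AB}$, and their intersection $Q^{BB}$ corresponding to the single state *B* (the remaining *n* × *n* blocks of $P^{C}$ and $\tilde{P}^{C}$ being $Q^{AA}$ and $\tilde{Q}^{AA}$) readily yields the vectors $P^{Ab} = Q^{AB}/Q^{BB}$, $P^{Ba} = T^{BA}$ (as $P^{B} = [1]$) and, hence, the *n* × *n* matrix $[L^{A}]^{-1} = I - P^{Ab} \otimes P^{Ba} = I - (Q^{AB} \otimes T^{BA})/Q^{BB}$. This gives the following relations between the known $L^{A}$, $T^{AB}$, $T^{BA}$, $Q^{AA}$, $Q^{BB}$, $Q^{BA}$, $Q^{AB}$, $\tilde{P}^{B}$, and $\tilde{Q}^{AA}$ and the unknown $P^{A}$ and $\tilde{P}^{A}$,

$$Q^{AA} = P^{A} [I + T^{AB} \otimes Q^{BA}], \qquad \tilde{Q}^{AA} = L^{A} \big( \tilde{P}^{A} [I + T^{AB} \otimes Q^{BA}] + P^{Ab} \tilde{P}^{B} Q^{BA} \big), \tag{8}$$

which eventually provides $P^{A}$ and $\tilde{P}^{A}$ by using the Sherman–Morrison formula (30) to invert $I + T^{AB} \otimes Q^{BA}$,

$$P^{A} = Q^{AA} [I + T^{AB} \otimes Q^{BA}]^{-1}, \tag{9}$$

$$\tilde{P}^{A} = \big( [L^{A}]^{-1} \tilde{Q}^{AA} - P^{Ab} \tilde{P}^{B} Q^{BA} \big) [I + T^{AB} \otimes Q^{BA}]^{-1}, \tag{10}$$

with $[I + T^{AB} \otimes Q^{BA}]^{-1} = I - (T^{AB} \otimes Q^{BA})/(1 + Q^{BA} T^{AB})$.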

Hence, the single state *B* can be removed from the reference set *C* in 𝒪(*n*^{2}) operations to yield the updated probability and path average matrices $P^{A}$ and $\tilde{P}^{A}$.

Note, however, that this continuous updating procedure, using alternatively Eqs. **6** and **7** and Eqs. **9** and **10** in succession, is expected to become numerically unstable after too many updates of the reference set. For 1 ≤ *n* ≤ 300, we have usually found that the small numerical drifts [as measured, e.g., by ...] can simply be reset every *n*th update by recalculating matrices $P^{A}$ and $\tilde{P}^{A}$ recursively from *n* isolated states in 𝒪(*n*^{3}) operations so as to keep the overall 𝒪(*n*^{2})-operation count per update of the reference set.

Another important issue is the choice of the state to be removed from the updated reference set. Although this choice, in principle, is arbitrary, the benefit of the algorithm strongly hinges on it (for instance, removing one of the most statistically visited reference states usually ruins the efficiency of the method). We have found that a “good choice” is often the state with the lowest “exit frequency” from the current state *i*, but other choices may sometimes prove more appropriate.

## Results

**Performance of the ECS Algorithm.** Before applying the ECS algorithm to investigate the prevalence of pseudoknots in RNA structures, we first focus on the efficacy of the approach by studying the net speed-up of the ECS algorithm with respect to the straightforward algorithm. As illustrated in Fig. 3 for a few natural and artificial sequences, there is an actual 10^{1}- to 10^{5}-fold increase of the ratio “simulated time over CPU time” between ECS and straightforward algorithms (Fig. 3, black lines) for RNA shorter than ≈150 nt. This improvement runs parallel to the expected speed-up (Fig. 3, gray lines) as predicted by the average pathway length (Eq. **3**), as long as the number *n* of reference states is not too large (typically *n* ≤ 50 here), so that the 𝒪(*n*^{2}) update routines do not significantly increase the operation count as compared with the straightforward algorithm.

Hence, the ECS algorithm is most efficient for small trapped systems (when the dynamics can be appropriately coarse-grained), although a several-fold speed-up can still be expected with somewhat larger systems such as the 394-nt-long group I intron pictured in Fig. 4*A*.

Alternatively, using this exact approach may also provide a controlled scheme to obtain approximate coarse-grained dynamics for larger systems. The C routines of the ECS algorithm are freely available on request.

**Pseudoknot Prediction and Prevalence in RNA Structures.** In the context of RNA-folding dynamics, the present approach can be used to evaluate time averages for a variety of physical features of interest, such as the free energy along the folding paths, the fraction of time particular helices are formed, the extension of an RNA molecule unfolding under mechanical force (32), the end-to-end distance of a nascent RNA molecule during transcription, etc. Here we report results on the prediction of pseudoknot prevalence in RNA structures. They have been obtained by performing several thousands of stochastic RNA-folding simulations including pseudoknots. As explained in *Theory and Methods*, the structural constraints between pseudoknot helices and unpaired connecting regions are modeled by using elementary polymer theory (Fig. 1*C*) (21) and added to the traditional base-pair stacking interactions and simple loops' contributions (7).

We found that many pseudoknots can be predicted effectively with such a coarse-grained kinetic approach probing seconds to minutes folding time scales. No optimum “final” structure is actually predicted, as such, in this folding kinetic approach. Instead, low free-energy structures are visited repeatedly as helices stochastically form and break. Fig. 4*A* represents the lowest free-energy secondary structure found for the 394-nt-long Tetrahymena group I intron, which shows 80% base-pair identity with the known 3D structure, including the two main pseudoknots, P3 and P13 (11, 12, 14–17). A number of smaller known structures with pseudoknots are also compared with the lowest free-energy structures found with similar stochastic RNA-folding simulations in ref. 21. In addition, to facilitate the study of folding dynamics for specific RNA sequences, we have set up an online RNA-folding server including pseudoknots (http://kinefold.u-strasbg.fr).

Beyond specific sequence predictions, we also investigated the general prevalence of pseudoknots by studying the “typical” proportion of pseudoknots in both random RNA sequences of increasing G+C content (Fig. 5) and in 150-nt-long mRNA fragments of the *Escherichia coli* and *Saccharomyces cerevisiae* genomes. The statistical analysis was done as follows: for each random and genomic sequence set, 100–1,000 sequences were sampled, and three independent folding trajectories were simulated for each of them by using the ECS algorithm. A minimum duration for each trajectory was determined so that >80–90% of sequences visit the same free-energy minimum structures along their three independent trajectories. The time average proportion of pseudoknots was then evaluated, considering this fraction of sequences had likely reached equilibrium (including the 10–20% of still unrelaxed sequences does not significantly affect global statistics). In practice, slow folding relaxation limits extensive folding statistics to sequences up to 150 bases and 75% G+C content, although individual folding pathways can still be studied for molecules up to 250–400 bases depending on their specific G+C contents.

Fig. 5. [...] 50-nt-long (*A*), 100-nt-long (*B*), and 150-nt-long (*C*) random sequences of increasing G+C content. Projected lines correspond to the average pseudoknot proportion in 50-nt-long (blue), 100-nt-long ...

The results for 50-nt-long (Fig. 5*A*), 100-nt-long (Fig. 5*B*), and 150-nt-long (Fig. 5*C*) random sequences show, first, a broad distribution in pseudoknot proportion from a few percent of base pairs to >30% for some G+C-rich random sequences. Such a range is compatible with the various pseudoknot contents observed in different known structures (e.g., see triangles and circles in Fig. 5*C*). Second, the average proportion of pseudoknots (Fig. 5*B* *Inset*) slowly increases with G+C content, because stronger (G+C-rich) helices are more likely to compensate for the additional entropic cost of forming pseudoknots. Third, and perhaps more surprisingly, this average proportion of pseudoknots seems roughly independent of sequence length except for very short sequences with low G+C content (Fig. 5*B* *Inset*), in contradiction to a naive combinatorial argument. Fourth, we found that the cooperativity of secondary structure rearrangements amplifies the structural consequences of pseudoknot formation; typically, a structure with 10 helices including 1 pseudoknot conserves not 9 but only 7–8 of its initial helices (whereas 2–3 new nested helices concomitantly form) if the single pseudoknot is excluded from the structure prediction. Thus, neglecting pseudoknots usually induces extended structural modifications beyond the sole pseudoknots themselves.

We compared these results with the folding of 150-nt-long sections of mRNAs from the genomes of *E. coli* (50% G+C content) and *S. cerevisiae* (yeast, 40% G+C content). These genomes exhibit similar broad distributions of pseudoknots despite small differences due to G+C content inhomogeneity and codon usage bias [pseudoknot proportions (mean ± SD): *E. coli*, 15.5 ± 6.5% (versus 16.5 ± 7.9% for 50% G+C-rich random sequences); yeast, 14 ± 6.6% (versus 15 ± 7.3% for 40% G+C-rich random sequences)]. Hence, genomic sequences seem to have maintained a large potential for modulating the presence or absence of pseudoknots in their 3D structures.

Overall, these results suggest that neglecting pseudoknots in RNA structure predictions is probably a stronger impediment than the small intrinsic inaccuracy of stacking energy parameters. In practice, combining simple structural models (Fig. 1*C*) and ECS simulations provides an effective approach to predict pseudoknots in RNA structures.

## Acknowledgments

We thank J. Baschenagel, D. Evers, D. Gautheret, R. Giegerich, W. Krauth, M. Mézard, R. Penner, E. Siggia, N. Socci, and E. Westhof for discussions and suggestions. H.I. acknowledges a stimulating 2-month visit at the Institute for Theoretical Physics (University of California, Santa Barbara), where the ideas for this work originated. This work was supported by Action Concertée Incitative Grants PC25-01 and 2029 from Ministère de la Recherche, France.

## Notes

Abbreviation: ECS, exactly clustered stochastic.

## References

**,**167–212.

**,**68–82.

**,**7826–7830.

**,**133–148.

**,**1105–1119.

**,**167–188.

**,**911–940.

**,**199–253.

**,**1717–1731.

**,**49–51.

**,**993–1009.

**,**432–438.

**,**567–574.

**,**1940–1943.

**,**1943–1946.

**,**955–965.

**,**367–370.

**,**167–185.

**,**609–617.

**,**2053–2068.

**,**6515–6520.

**,**953–962.

**,**10–18.

**,**127.

**,**R13985–R13988.

**,**4983–4987.

**,**381–386.

**,**L715.

**,**32–37.

**National Academy of Sciences**

Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations. Proceedings of the National Academy of Sciences of the United States of America. 2003 Dec 23; 100(26):15310.
