
# A Look-Ahead Model for the Elongation Dynamics of Transcription

^{†}Department of Mathematics, University of Michigan, Ann Arbor, Michigan

^{‡}Courant Institute of Mathematical Sciences, New York University, New York, New York

This document may be redistributed and reused, subject to certain conditions.

**This article has been corrected.** See Biophys J. 2010 February 17; 98(4): 741.

## Abstract

This article introduces a chemical kinetic model of the transcriptional elongation dynamics of RNA polymerase. The model's novel concept is a look-ahead feature, in which nucleotides bind reversibly to the DNA before being incorporated covalently into the nascent RNA chain. Analytical and computational methods for studying the behavior of the look-ahead model are introduced, and several approaches to parameter estimation are tested on synthetic and also on actual experimental data. Two types of experimental data are considered: 1), the mean velocity of RNA polymerase as a function of the ambient concentrations of the ribonucleoside triphosphates; and 2), the distribution of time intervals between the forward steps of RNA polymerase. By separately fitting the look-ahead model to these two types of data, we obtain estimates of the model parameters. The most difficult parameter to estimate is the width of the look-ahead window. Both types of data suggest a small window size, but the second type does a better job of distinguishing the different window sizes. These latter data rule out a window size of 1, and they strongly suggest a look-ahead window that is approximately four bases in width. Additional experiments to determine the window size are proposed.

## Introduction

RNA polymerase is the key enzyme of transcription, the step at which most regulation of gene expression occurs. Transcription consists of three distinct processes: initiation, elongation, and termination. Of these processes, elongation has been until recently the least studied, but this situation has fortunately changed with the advent and extensive use of single-molecule force microscopy (1–8).

From a modeling perspective, elongation is the transcriptional step most amenable to a quantitative description. The motion of RNA polymerase during transcription can be viewed as a stochastic process, more specifically as a random walk along the DNA. The goal of modeling is to characterize this random walk. Previous models of this kind (1–8) have all been mechanical in nature, i.e., they have considered, in one way or another, the elastic forces that arise within the RNA polymerase molecule during transcriptional elongation.

In this article (see also preliminary reports (9) and (10)), we introduce a formal chemical kinetic model for the dynamics of the movement of RNA polymerase along DNA. In our proposed model, we focus on the discrete events of reversible binding and unbinding of nucleotides to the DNA, and on the covalent linkage of nucleotides into the nascent RNA chain. In this sense, our model is formal, because it only considers the stepwise motion of the RNA polymerase, not the physics of how that motion is generated. The model proposed herein is most easily visualized in terms of the power-stroke mechanism for the forward motion of RNA polymerase (11,12), since we assume that covalent linkage of nucleotide to the nascent RNA chain is synchronous (at least on the timescales of interest) with forward translocation of the RNA polymerase by one basepair along the DNA. Our model could also be consistent with a Brownian ratchet mechanism (13) in which covalent linkage of nucleotide to the nascent RNA chain locks in diffusive forward motion of the RNA polymerase, provided that the overall time elapsed during a forward move is short in comparison to the time intervals between the chemical events of binding, unbinding, and covalent linkage. We are concerned here with a sequence of chemical events, not with the physical mechanism that propels the enzyme forward.

The emphasis of this article is on parameter estimation. We first describe a stochastic simulation method that can be used to generate synthetic data on which parameter estimation procedures can be tested, and then we discuss a master-equation analysis that yields noise-free predictions for comparison with experimental data during parameter fitting. Two sets of published experimental data are considered in this article as targets for parameter estimation, and additional experiments are proposed. The first set of published data is that of Adelman et al. (14). It involves measurements of the mean velocity of transcription as a function of the ambient concentrations of the four ribonucleoside triphosphates. Velocity histograms are also reported in this work. The second set of published data (15) employs fixed concentrations of the ribonucleoside triphosphates, which are chosen to be equally rate-limiting. These concentrations are also chosen to be much lower than the values that are typically used, thus slowing the process of transcription to the point that individual forward steps of RNA polymerase are easily resolved. Such an experiment reveals the statistical distribution of the time intervals between successive forward steps of RNA polymerase, and this is valuable information for parameter estimation.

The most difficult parameter to estimate turns out to be the size of the look-ahead window. This is an integer parameter, denoted *w*, which is equal to the number of sites within the transcription bubble at which ribonucleoside triphosphates may be reversibly bound to the DNA template strand, before their covalent linkage to the nascent RNA chain. Although *w* = 1 may be regarded as a special case of the look-ahead model (as we do in this article), it should be kept in mind that only when *w* > 1 does the look-ahead model deserve its name, since it is only if *w* > 1 that there is any parallel processing of the ribonucleoside triphosphates, with selection of the correct base being done at several DNA template-strand sites simultaneously.

Our approach to the determination of the integer parameter *w* is simply to try different values of *w* and for each such value to fit the model to the experimental data by adjusting the rate constants of the model. We then compare the quality of the fit that can be achieved for each of the different hypothesized values of *w*. This is a fair comparison, since the model is formulated in such a way that the total number of parameters is independent of *w*.

When this fitting procedure is applied to the experimental data of Adelman et al. (14), the best fit to the mean transcription velocity as a function of the ribonucleoside triphosphate concentrations seems to be obtained with *w* = 1 or with *w* = 2, and the fit seems to become gradually worse as the window size increases from there. One might hope that the velocity histograms would help to choose between *w* = 1 and *w* = 2, but in fact these two cases predict nearly identical velocity histograms, both of which underestimate the spread in the experimental velocity histogram by roughly a factor of two (although this may well be explained by experimental variability not taken into account by the theory).

The fit of the model to the statistical distribution of the time intervals (waiting times) between successive forward moves, as reported in Abbondanzieri et al. (15), is much more successful at resolving the window size. Here, it turns out that there is a qualitative distinction between the predictions of the model with *w* = 1 and corresponding predictions with *w* > 1. Specifically, the predicted waiting time distribution in the case *w* = 1 is nonmonotonic: it rises to a peak and then decays. The waiting time distributions for *w* > 1 are monotone decreasing, as are the experimental data. An excellent fit is obtained for *w* = 4. We regard this as evidence in favor of the look-ahead hypothesis.

Additional experiments specifically designed to determine the window size are proposed, and the procedures for extracting the window size from the proposed experiments are tested on synthetic data.

## The Model

During elongation, the double-stranded DNA is locally melted by the RNA polymerase over a distance of ~14–17 basepairs. This locally melted region is known as the transcription bubble. Within the transcription bubble, one strand of the DNA acts as a template, to which complementary ribonucleoside triphosphates (ATP, GTP, CTP, and UTP) can reversibly bind. It has been hypothesized, however, that only a part of the transcription bubble is actually used for transcription. The size of this window of activity within the transcription bubble formed by the RNA polymerase is an integer parameter of our model. The binding of ribonucleoside triphosphates within the window of activity is assumed to be reversible.

An irreversible reaction, however, is the incorporation of a nucleotide into the nascent RNA chain. This can occur only when that nucleotide is reversibly bound at the first site of the window of activity, i.e., the site at the 3′ end of the nascent RNA chain. When such incorporation of a nucleotide into the nascent RNA chain occurs, we assume that the RNA polymerase (and hence the transcription bubble and the window of activity) translocates forward one basepair. If the window of activity has a size of more than one basepair, it is quite likely that when the polymerase molecule, and hence the window, moves forward, it will already find the correct nucleotide bound at what has just become the site where that nucleotide can be incorporated into the growing RNA chain. This is the look-ahead feature of the model, a kind of parallel processing: placement of the correct ribonucleoside triphosphate at each site on the template strand of the DNA can occur before that site has been reached by the nascent RNA molecule.

The model is completely specified, then, by the following parameters:

- *w* is the length (in bases) of the look-ahead window.
- (*k*_{on})_{ij} is the rate constant for reversible binding of ribonucleoside triphosphate of type *i* (ATP, CTP, GTP, or UTP) to deoxyribonucleotide of type *j* (A, C, G, or T) in the template strand within the window of activity.
- (*k*_{off})_{ij} is the rate constant for unbinding of reversibly bound ribonucleoside triphosphate of type *i* from deoxyribonucleotide of type *j*.
- (*k*_{f})_{ij} is the rate constant for covalent incorporation of nucleotide of type *i* into the nascent RNA chain, provided that there is a ribonucleoside triphosphate of type *i* reversibly bound to a deoxyribonucleotide of type *j* at the first site of the window of activity.

Note that we consider not only correct Watson-Crick basepairings, but also the possibility of errors. The parameter (*k*_{on})_{ij} is, of course, much larger, and (*k*_{off})_{ij} much smaller, when (*i*,*j*) is a correct Watson-Crick basepair than otherwise. This mechanism protects against errors in transcription. Further error protection could be obtained by making (*k*_{f})_{ij} larger when (*i*,*j*) is a correct Watson-Crick basepair than when it is not. In our simulations, however, we have assumed that *k*_{f} is constant, independent of (*i*,*j*).

Fig. 1 shows the look-ahead window of RNA polymerase. Since the first site (*left end* of *box*, indicated by *vertical tick mark*) is unoccupied, the polymerase cannot move forward. Possible events are the unbinding of C, G, or U, or the binding of any ribonucleoside triphosphate (rNTP) to any of the five unoccupied sites. Fig. 2 (*top*) is the same as Fig. 1 except that the first site within the look-ahead window is also occupied. Possible events still include the unbinding of any of the reversibly bound rNTPs or the binding of any rNTPs (including incorrect Watson-Crick basepairing) to any of the unoccupied sites. In this case, however, there is an additional possible event because the first site is occupied, namely, the forward motion of RNA polymerase, as depicted by the arrow in the figure. Note, in particular, that after this motion the new first site in the window may again be occupied (as shown), leading to the possibility of another forward step as a subsequent event.

*Figure caption fragment:* when the first site (*vertical mark* in the figure) of the window of activity is occupied, the rNTP that is located there can be covalently and irreversibly linked to the nascent RNA …

### Simulation and analysis of the look-ahead model

#### A stochastic approach

One approach to studying the proposed model is to use stochastic computational methods. We model the movement of RNA polymerase along DNA using the Gillespie algorithm (16,17). For every possible transition, a suitable rate constant is assigned: for each unoccupied site within the window of activity, there are four binding rate constants, one for each of the ribonucleoside triphosphates that can possibly occupy that site. If a site within the window of activity is occupied, then there is a rate constant for the ribonucleoside triphosphate on that site to dissociate, and if the first site within the look-ahead window is occupied, then there is a rate constant for the RNA polymerase to translocate forward one basepair along the DNA, incorporating the rNTP at the first window site into the nascent RNA chain while so doing.

The Gillespie algorithm jumps from event to event. Let *K* = (*k*_{1} + … + *k*_{m}) be the sum of the individual reaction rates of those reactions that are possible given the current state, where each of the *k*_{n} values is selected from one of the (*k*_{on})_{ij}, (*k*_{off})_{ij}, and (if appropriate) (*k*_{f})_{ij}. Note that the number of possible reactions at any given time is given by *m* = 4*u* + (*w* − *u*) + *b*, where *w* is the window size, *u* is the number of unoccupied sites, and *b* = 1 if the first site is occupied and *b* = 0 otherwise. At each step, choose the time *T* to the next event from the probability density function

$$p(T)=K{e}^{-KT},\qquad T\ge 0,$$

and then, independently of the above, choose which event occurred so that event *j* is chosen with probability

$$\frac{{k}_{j}}{K}.$$
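The event loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `make_rates` and `gillespie`, the data layout, and every numerical rate value are assumptions chosen for the example. As in the simulations described earlier, *k*_{f} is taken to be independent of (*i*, *j*), so incorrectly paired nucleotides can also be incorporated.

```python
import random

# Illustrative values only; none of these numbers come from the article.
BASES = "ACGU"                                     # rNTP types, written as RNA bases
PAIR = {"A": "U", "C": "G", "G": "C", "T": "A"}    # template base -> correct rNTP

def make_rates(k_on_right=50.0, k_on_wrong=0.5, k_off_right=1.0,
               k_off_wrong=100.0, k_f=20.0):
    """Return (k_on, k_off, k_f); k_on[i][j] and k_off[i][j] are for rNTP i on base j."""
    k_on = {i: {j: (k_on_right if PAIR[j] == i else k_on_wrong) for j in PAIR}
            for i in BASES}
    k_off = {i: {j: (k_off_right if PAIR[j] == i else k_off_wrong) for j in PAIR}
             for i in BASES}
    return k_on, k_off, k_f

def gillespie(template, w, k_on, k_off, k_f, rng):
    """Simulate elongation along `template`; return the times of the forward steps."""
    pos, t = 0, 0.0
    window = [None] * w          # window[s] = rNTP bound at site pos + s, or None
    step_times = []
    while pos + w <= len(template):
        # List the m = 4u + (w - u) + b possible events as (rate, kind, site, rNTP).
        events = []
        for s in range(w):
            j = template[pos + s]
            if window[s] is None:
                events += [(k_on[i][j], "bind", s, i) for i in BASES]
            else:
                events.append((k_off[window[s]][j], "unbind", s, None))
        if window[0] is not None:
            events.append((k_f, "forward", 0, None))
        K = sum(e[0] for e in events)
        t += rng.expovariate(K)                 # waiting time with density K exp(-K T)
        r, acc = rng.random() * K, 0.0          # pick event j with probability k_j / K
        for rate, kind, s, i in events:
            acc += rate
            if r < acc:
                break
        if kind == "bind":
            window[s] = i
        elif kind == "unbind":
            window[s] = None
        else:                                   # incorporate and shift the window
            window = window[1:] + [None]
            pos += 1
            step_times.append(t)
    return step_times
```

Dividing the number of forward steps by the final time gives an estimate of the mean velocity for one run, and the differences between consecutive entries of the returned list are the waiting times between forward steps.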

#### A master-equation formulation

Another approach to studying the look-ahead model is to formulate and solve the master equation that describes the time evolution of the probabilities of the different possible states of the model. Although the master equation describes an underlying stochastic process, the evolution of probabilities that it describes is deterministic, since these probabilities refer to a large ensemble of similar systems. Thus, the master-equation solution is noise-free, even though the underlying dynamics of the look-ahead model are stochastic. The same parameters that were used above when introducing the look-ahead model also appear in the master-equation formulation. We simplify the problem, however, by considering only correct Watson-Crick basepairing. Another simplification made here is that the DNA sequence is generated by a random process in which the choice of base at each location is made independently for the different locations on the DNA. Thus, we assume that the DNA sequence is fully characterized by the four base frequencies, whose sum must be one.

A master equation is a first-order differential equation describing the time-evolution of the probability of a system to occupy each one of a discrete set of states,

$$\frac{dP(l)}{dt}=\sum _{k}\left[R(k,l)P(k)-R(l,k)P(l)\right],$$

where *P*(*k*), which is a function of time although we do not write that explicitly, is the probability that the system is in the state *k* at any particular time, and where *R*(*k*, *l*), which in our case will be independent of time, is the probability per unit time that the system in state *k* will make a transition to state *l*. Once the master equation has been formulated, we study its steady state by setting each of the time derivatives *dP*(*l*)/*dt* equal to zero, along with an additional constraint that the probabilities of all states add up to one.

The formulation of the master equation for the look-ahead model proceeds as follows.

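Let *w* = window size. Possible states of the window are

$$\left(\begin{array}{c}a\\ b\end{array}\right)=\left(\begin{array}{cccc}{a}_{1}& {a}_{2}& \cdots & {a}_{w}\\ {b}_{1}& {b}_{2}& \cdots & {b}_{w}\end{array}\right),$$

where *a*_{i} ∈ {1, 2, 3, 4} and *b*_{i} ∈ {0, 1}.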

Here *a*_{i} indicates which DNA base on the template strand is located at site *i* within the window, and *b*_{i} indicates whether a complementary RNA base is present (*b*_{i} = 1) or absent (*b*_{i} = 0).

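Possible reactions and corresponding rate constants are described below. For reversible binding events, we have the set of reactions

$$\left(\begin{array}{c}a\\ b\end{array}\right)\;\xrightarrow{\;{k}_{on}({a}_{i})\,({b}_{i}=0)\;}\;\left(\begin{array}{c}a\\ b+{\delta }^{i}\end{array}\right),\qquad (2)$$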

where *i* = 1,…,*w* and *k*_{on}(*a*_{i}) is the probability per unit time of binding an rNTP to site *i* of the window of activity when base *a*_{i} is present at the corresponding site on the DNA template strand, given that site *i* is currently empty, i.e., that it does not currently have an rNTP bound. The notation (*b*_{i} = 0) is a Boolean expression that evaluates to 1 when it is true and 0 when it is false, and similarly for other such expressions that appear below. Recall that the values of *b*_{i} are 1 or 0, depending on whether site *i* is occupied by an rNTP or not. The factor (*b*_{i} = 0) in the probability per unit time for filling site *i* therefore makes that probability per unit time equal to zero if site *i* is already filled. The notation *δ*^{i} represents a vector of length *w* with 1 in the *i*^{th} position and all other elements equal to zero, so that

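$$b+{\delta }^{i}=\left({b}_{1},\dots ,{b}_{i-1},\,{b}_{i}+1,\,{b}_{i+1},\dots ,{b}_{w}\right).\qquad (3)$$

Thus, if *b* denotes a state in which site *i* is empty, *b* + *δ*^{i} denotes a state in which all sites other than *i* are the same as in state *b*, but site *i* is filled.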

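For unbinding events, we have

$$\left(\begin{array}{c}a\\ b\end{array}\right)\;\xrightarrow{\;{k}_{off}({a}_{i})\,({b}_{i}=1)\;}\;\left(\begin{array}{c}a\\ b-{\delta }^{i}\end{array}\right),\qquad (4)$$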

where *k*_{off}(*a*_{i}) is the probability per unit time of the unbinding of an rNTP from site *i* of the window of activity, given that the base *a*_{i} is present at the corresponding site on the DNA template strand, and also that there is currently an rNTP (reversibly) bound at site *i*. The latter condition is enforced by the Boolean factor (*b*_{i} = 1) in the unbinding rate.

If the first site of the window of activity is occupied, then we must also allow for the incorporation reaction in which the RNA base located in position 1 of the window is covalently incorporated into the nascent RNA chain; the window then shifts forward by one basepair along the DNA. Recall that, in our model formulation, covalent linkage and forward motion are simultaneous.

When the window steps forward (to the right in our notation), all the *a*_{i} and *b*_{i} values shift one step to the left relative to the window. In this shift, the values that were originally stored as *a*_{1} and *b*_{1} are discarded, and we have to decide what values to put in *a*_{w} and *b*_{w}. Immediately after the shift (forward movement of the RNA polymerase) it is clear that we should set *b*_{w} = 0, since there has not been time for an rNTP to bind to the newly created last site that has just been introduced into the window of activity. It is also clear that *a*_{w} should be set equal to the value that represents the base on the DNA template strand that has just been drawn into the window of activity. Recall the assumption, stated above, which we make in this section, that the DNA sequence is random, with bases drawn independently from specified base frequencies for the DNA template strand. Let the probability of choosing base *j* for any particular position be *α*(*j*), where *j* = 1, 2, 3, 4, *α*(*j*) > 0, and ${\sum}_{j=1}^{4}\alpha \left(j\right)=1$. Then, immediately after the shift, we may set *a*_{w} = *j* with probability *α*(*j*).

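It is now clear that the possible reactions and corresponding probabilities per unit time associated with incorporation of a base into the nascent RNA chain, together with the associated forward movement of the RNA polymerase molecule, are

$$\left(\begin{array}{cccc}{a}_{1}& {a}_{2}& \cdots & {a}_{w}\\ {b}_{1}& {b}_{2}& \cdots & {b}_{w}\end{array}\right)\;\xrightarrow{\;{k}_{f}\,({b}_{1}=1)\,\alpha ({a}_{w}^{\prime })\;}\;\left(\begin{array}{cccc}{a}_{2}& \cdots & {a}_{w}& {a}_{w}^{\prime }\\ {b}_{2}& \cdots & {b}_{w}& 0\end{array}\right),\qquad (5)$$

where *a*′_{w} = 1, 2, 3, 4.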

For a given starting state (*a*, *b*), there are, at most, *w* possible binding reactions (Eq. 2 with *i* = 1, 2, …, *w*); at most, *w* possible unbinding reactions (Eq. 4 with *i* = 1, 2, …, *w*); and at most, four possible incorporation/forward-stepping reactions (Eq. 5 with *a*′_{w} = 1, 2, 3, 4). In all three cases, only some of these possible reactions have nonzero rates, as indicated by the Boolean factors (*b*_{i} = 0), (*b*_{i} = 1), and (*b*_{1} = 1) in their rate constants (probabilities per unit time).

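These reactions were written in terms of the state of origin. We also need to express them in terms of the destination state. In that case, the same reactions as above will appear but they, and their rates, will be expressed slightly differently. For reversible binding events, we have

$$\left(\begin{array}{c}a\\ b-{\delta }^{i}\end{array}\right)\;\xrightarrow{\;{k}_{on}({a}_{i})\,({b}_{i}=1)\;}\;\left(\begin{array}{c}a\\ b\end{array}\right).$$

Note that the rate constant now has the factor (*b*_{i} = 1), instead of (*b*_{i} = 0). The reason is that *b* now refers to the destination state. For unbinding events, we have

$$\left(\begin{array}{c}a\\ b+{\delta }^{i}\end{array}\right)\;\xrightarrow{\;{k}_{off}({a}_{i})\,({b}_{i}=0)\;}\;\left(\begin{array}{c}a\\ b\end{array}\right).$$

Finally, we have for the forward step of the RNA polymerase molecule,

$$\left(\begin{array}{cccc}{a}^{\prime }& {a}_{1}& \cdots & {a}_{w-1}\\ 1& {b}_{1}& \cdots & {b}_{w-1}\end{array}\right)\;\xrightarrow{\;{k}_{f}\,\alpha ({a}_{w})\,({b}_{w}=0)\;}\;\left(\begin{array}{cccc}{a}_{1}& {a}_{2}& \cdots & {a}_{w}\\ {b}_{1}& {b}_{2}& \cdots & {b}_{w}\end{array}\right),\qquad {a}^{\prime }=1,2,3,4.$$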

Note that the condition (*b*_{1} = 1) is no longer needed here, since that requirement is built into the origin state. It is replaced by (*b*_{w} = 0), since the destination state cannot have anything bound to the last site in the window immediately after the forward move of the RNA polymerase.

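The master equation may now be written as

$$\begin{array}{rl}\frac{d}{dt}P\left(\begin{array}{c}a\\ b\end{array}\right)=& \sum _{i=1}^{w}\left[{k}_{on}({a}_{i})({b}_{i}=1)\,P\left(\begin{array}{c}a\\ b-{\delta }^{i}\end{array}\right)+{k}_{off}({a}_{i})({b}_{i}=0)\,P\left(\begin{array}{c}a\\ b+{\delta }^{i}\end{array}\right)\right]\\ & +\,{k}_{f}\,\alpha ({a}_{w})({b}_{w}=0)\sum _{{a}^{\prime }=1}^{4}P\left(\begin{array}{cccc}{a}^{\prime }& {a}_{1}& \cdots & {a}_{w-1}\\ 1& {b}_{1}& \cdots & {b}_{w-1}\end{array}\right)\\ & -\left[\sum _{i=1}^{w}\left({k}_{on}({a}_{i})({b}_{i}=0)+{k}_{off}({a}_{i})({b}_{i}=1)\right)+{k}_{f}({b}_{1}=1)\right]P\left(\begin{array}{c}a\\ b\end{array}\right).\end{array}$$

Here *a*′ ranges over the template base that is discarded by the forward step, and the outflow term uses ${\sum}_{j=1}^{4}\alpha \left(j\right)=1$.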

There is one such equation for each of the 8^{w} choices of $\left(\begin{array}{c}a\\ b\end{array}\right)$. The steady-state equations are of course found by setting $\frac{d}{dt}P\left(\begin{array}{c}a\\ b\end{array}\right)=0$ and imposing the normalization

$$\sum _{a,b}P\left(\begin{array}{c}a\\ b\end{array}\right)=1.$$

Once the steady-state equations have been solved, the mean forward velocity of the RNA polymerase in basepairs per second may be evaluated as

$$\overline{v}={k}_{f}\sum _{a,b}({b}_{1}=1)\,P\left(\begin{array}{c}a\\ b\end{array}\right).$$

Note that $\overline{v}$ is just the product of *k*_{f} and the probability that *b*_{1} = 1.
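The steady-state calculation can be sketched as follows for small window sizes. This is an illustrative implementation under the assumptions of this section (correct Watson-Crick pairing only, random template with base frequencies *α*, constant *k*_{f}); the function names `solve` and `steady_state_velocity` are hypothetical, bases are indexed 0 to 3 rather than 1 to 4, and a plain Gaussian elimination is used only to keep the example free of external libraries.

```python
from itertools import product

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            if f:
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def steady_state_velocity(w, k_on, k_off, k_f, alpha):
    """Mean velocity (bases/s) from the steady state of the master equation.

    k_on[j], k_off[j]: rates for the correct rNTP on template base j = 0..3;
    alpha[j]: frequency of template base j (all assumed positive).
    """
    states = [(a, b) for a in product(range(4), repeat=w)
              for b in product((0, 1), repeat=w)]
    index = {s: n for n, s in enumerate(states)}
    n = len(states)                      # 8**w states
    # A[dst][src] accumulates transition rates, so that dP/dt = A P.
    A = [[0.0] * n for _ in range(n)]

    def add(src, dst, rate):
        A[index[dst]][index[src]] += rate      # inflow to the destination state
        A[index[src]][index[src]] -= rate      # outflow from the state of origin

    for a, b in states:
        for i in range(w):               # binding if site i is empty, else unbinding
            flipped = b[:i] + (1 - b[i],) + b[i + 1:]
            add((a, b), (a, flipped), k_on[a[i]] if b[i] == 0 else k_off[a[i]])
        if b[0] == 1:                    # incorporation and forward step
            for new in range(4):
                add((a, b), (a[1:] + (new,), b[1:] + (0,)), k_f * alpha[new])

    # One balance equation is redundant; replace it by the normalization sum(P) = 1.
    A[-1] = [1.0] * n
    P = solve(A, [0.0] * (n - 1) + [1.0])
    return k_f * sum(p for (a, b), p in zip(states, P) if b[0] == 1)
```

For *w* = 1 the result can be checked by hand: solving the balance equations gives a mean velocity of *k*_{f} / (1 + Σ_{j} *α*(*j*) (*k*_{off}(*j*) + *k*_{f}) / *k*_{on}(*j*)). The dense solve grows quickly with 8^{w}, so a sparse or iterative solver would be the natural choice beyond *w* = 2 or 3.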

### The master-equation and stochastic approaches are consistent

To verify that the stochastic simulation and the steady-state master-equation solution give the same mean velocity results, we consider a sequence of template DNA generated by the following simple stochastic process: each base is chosen independently with equal probabilities for the four possible outcomes. Note that the particular sequence chosen is only used in the stochastic simulation; the master equation only involves the base frequencies. We found that the only difference between the two results was the statistical error of the stochastic simulation, which can be reduced by increasing the length of the run. Such results are shown in Table 1.

If an actual DNA sequence is used in the stochastic simulation, then the best we can do to match it in the master-equation formulation is to input the four base frequencies from that DNA sequence. In this situation, we no longer expect perfect agreement in the computed mean velocities, even in the limit of infinitely long stochastic simulations, since the stochastic simulation result may depend on correlations in the given base sequence to which the master-equation formulation is blind. Our simulations found a small but persistent discrepancy between the mean velocity computed by the stochastic simulation when an actual DNA sequence was used and that predicted by the steady-state master-equation solution. Because the discrepancy is small, in practice we can justify using the master-equation formulation for real DNA sequences by supplying it with their base frequencies.

## Parameter Estimation

### Interpretation of experimental data

We first discuss the type of experimental data that are shown in Fig. 2 of Bai et al. (7). In the experiments reported there, a particular rNTP concentration was varied (with the other three rNTP concentrations held constant at 1000 *μ*M) to determine the influence of the varied rNTP concentration on the mean velocity of the RNA polymerase molecule. This was done for all four rNTP concentrations separately.

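In our interpretation of these experimental data, we assume that the reversible binding of an rNTP to its complementary base on the template DNA strand is governed by the law of mass action. Thus,

$${({k}_{on})}_{i}={({k}_{on}^{\prime })}_{i}\,\frac{{[\mathrm{rNTP}]}_{i}}{{[\mathrm{rNTP}]}_{0}},$$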

where *i* = 1, 2, 3, 4 specifies a particular ribonucleoside triphosphate and where [rNTP]_{i} is the ambient concentration of that rNTP. In the above equation, [rNTP]_{0} = 1 mM = 1000 *μ*M is an arbitrarily chosen reference concentration that is introduced so that the units of (*k*_{on})_{i} and (*k*′_{on})_{i} are the same, namely s^{−1}. The particular value chosen for [rNTP]_{0} has no significance at all.

It is important to note that the above mass action equation only holds for direct simple binding with no intervening binding events, such as the rNTP binding to another site in the RNA polymerase before binding to its complementary base on the DNA template strand. Note that (*k*′_{on})_{i} is, by the mass action hypothesis made above, independent of concentration and is the actual parameter that we wish to find by comparing the model's results with the experimental data. No matter how many different combinations of rNTP concentrations were used in the experiment, there are only four distinct values of (*k*′_{on})_{i}. This type of experimental data is useful because each additional combination of rNTP concentrations enriches the data set without increasing the number of model parameters, provided that the mass action assumption is made as described above.
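As a small worked illustration of the mass-action scaling, the following sketch converts concentration-independent constants into effective binding rates, assuming the linear form (*k*_{on})_{i} = (*k*′_{on})_{i} [rNTP]_{i} / [rNTP]_{0} with [rNTP]_{0} = 1000 *μ*M. The function name `binding_rates` and the numerical values in the usage example are hypothetical.

```python
REFERENCE_UM = 1000.0   # [rNTP]_0 = 1 mM, the reference concentration, in micromolar

def binding_rates(k_on_prime, conc_um, ref_um=REFERENCE_UM):
    """Scale each k'_on (1/s) by [rNTP]_i / [rNTP]_0 to obtain k_on (1/s)."""
    return {i: k_on_prime[i] * conc_um[i] / ref_um for i in k_on_prime}

# Hypothetical example: ATP lowered to 250 uM, the other three held at 1000 uM.
k_on_prime = {"A": 40.0, "C": 40.0, "G": 40.0, "U": 40.0}
conc_um = {"A": 250.0, "C": 1000.0, "G": 1000.0, "U": 1000.0}
k_on = binding_rates(k_on_prime, conc_um)
```

Only the binding rate of the varied rNTP changes between data points, which is why each additional concentration combination adds data without adding parameters.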

### Model calibration to noise-free synthetic data

Before considering actual experimental data (for which the true parameters are unknown), we test our approach to parameter estimation by generating synthetic data for an arbitrarily chosen set of parameter values, to see whether those parameters can be recovered by fitting the model to the synthetic data. The synthetic data that we generate will be of the type discussed above, i.e., they will describe the mean velocity of transcription as a function of the different rNTP concentrations.

There are two fundamentally different ways that such synthetic data can be generated. One is to use the master-equation formulation, which generates noise-free synthetic data, and the other is to use stochastic simulation, which generates noisy data with a noise level that can be adjusted (as in an actual experiment) by varying the amount of data that is collected. These two kinds of synthetic data will be used in this subsection and in the next, respectively. In both cases, though, regardless of which method was used to generate the synthetic data, we use the master-equation formulation in the parameter fitting process itself.

To make the parameter fitting procedure more robust by reducing the dimension of the parameter space, we make certain a priori assumptions that reduce the number of unknown parameters. In this article, we only do parameter fitting under the following simplifying assumptions: First, only correct Watson-Crick basepairing is considered. Next, we assume that all of the off-rates are negligible, and that the forward rate is independent of which nucleotide is being incorporated into the nascent RNA chain. Finally, we treat the base frequencies of the DNA template strand as known parameters, since these can be independently measured in any particular case. With these assumptions, we have six unknown parameters to consider: the window size *w*, and the five rate constants (*k*′_{on})_{A}, (*k*′_{on})_{C}, (*k*′_{on})_{G}, (*k*′_{on})_{U}, and *k*_{f}. Of course the window size is restricted to positive integer values (and in practice we only consider the values 1, 2, 3, or 4), and the rate constants are not allowed to be negative. There are no other constraints.

The objective function that we seek to minimize during parameter estimation is simply the sum of the squares of the differences between the computed mean velocities and the experimental mean velocities (which are synthetic data in this section and the next, but which then will be taken as the actual experimental data of (7)). The way that we deal with the discrete parameter *w* is simply exhaustive search, i.e., we do a separate minimization of the objective function for each value of *w* and see which gives the smallest value of the objective function (which will be called the residual in the following). For each fixed *w*, we use the nonlinear least-squares package of MATLAB (The MathWorks, Natick, MA) to do the minimization of the objective function with respect to the five rate constants listed above. To construct an initial guess we choose each of these rate constants randomly and independently from an exponential distribution.

Noise-free synthetic data were generated for window sizes 1, 2, 3, and 4, with rate constants chosen arbitrarily, and then the true parameters were forgotten, so to speak, and parameter fitting was done as described above to see whether the true parameters, including the window size, could be recovered. The results, shown in Table 2, indicate not only that a reasonable residual value can be returned, but also that the original set of parameters can indeed be reliably recovered.

### Model calibration to stochastic synthetic data

In the previous subsection, we studied parameter estimation of a noise-free model to noise-free synthetic data. Here, we study parameter estimation of the same noise-free model in the context of stochastic synthetic data. The reason for doing this, of course, is that stochastic synthetic data are more representative of the kind of data that would actually be available from a real experiment. Our approach to parameter estimation here is exactly the same as in the previous subsection; the only difference is that stochastic simulations are used to generate the synthetic data. This introduces an additional consideration, however, which is the amount of data that is collected in any particular simulated experiment. As in real experiments, we regard each synthetic experiment as being comprised of some number of runs. Each run involves the synthesis of an RNA chain containing ~1800 bases. Recall that the output of interest is the mean velocity of transcription, which is obtained by averaging over all of the runs. Clearly, this mean velocity will be increasingly noise-free as the number of runs increases, and this should facilitate recovery of the true parameters. What we seek to determine, then, is the number of runs that will be needed for successful parameter recovery.

The results of the parameter estimation of the look-ahead model to stochastic synthetic data for different numbers of runs and for window size *w* = 3 can be found in Table 3. We observe that as the number of runs increases, the residual values get smaller, for the correct window size case. The residuals for the wrong window sizes are much larger than the residuals for the correct window size (see Table 4). The story is essentially the same (data not shown) when the true window size is different from 3.

The conclusion of these studies with stochastic synthetic data is that 30 runs (at each set of rNTP concentrations) suffice for the reliable recovery of the true parameters. This is a feasible number of runs for an actual experiment (see (14)).

### Parameter estimation to experimental data

In the previous subsections, we calibrated our model to synthetic data; we concluded that the methodology outlined above for parameter estimation reasonably recovers the original (i.e., true) parameters. This was demonstrated both for noise-free synthetic data and also for noisy synthetic data generated by stochastic simulation. In the latter case, it was necessary to control the noise by doing sufficiently many runs (30 runs) to obtain each data point.

We now estimate parameters that give the best fit of the look-ahead model to the actual experimental data found in Bai et al. (7). As in the synthetic data case, the fit is based on mean velocity as a function of concentrations of the various rNTPs, and the master-equation formulation of the model is used in the parameter-estimation procedure. The results are summarized in Table 5. The magnitudes of the residuals indicate that the best window sizes are 1 and 2. These two best fits are visualized in Figs. 3 and 4.

As an additional check on the model, the estimated parameter values are used to generate velocity histograms, and these are compared to the corresponding velocity histograms that are found experimentally (see Fig. 5). One might hope that the velocity histograms would help distinguish between the window sizes 1 and 2, but this is not the case. Indeed the predicted velocity histograms for those two cases are virtually indistinguishable from each other, and have approximately half the width of the corresponding experimental histogram. Although this discrepancy may point to deficiencies in the look-ahead model (and in particular to the special case of the look-ahead model that was used in doing the parameter fitting), it is also possible that there are sources of noise in the experimental procedure and data collection that are not taken into account in our simulations.

### Waiting time distribution

A more detailed way to study the statistics of the motion of RNA polymerase is to analyze the distribution of the waiting times between successive base incorporations into the nascent RNA. A recent publication (15) describes an experiment that measures this waiting time distribution at very low rNTP concentrations which, besides being low, were chosen to be equally rate-limiting. (Note that the phrase “equally rate-limiting” is not intended to imply that the binding of rNTP is the rate-limiting step in the forward progress of RNA polymerase during transcription elongation. Instead, it refers to a condition in which the ambient concentrations of the different rNTPs have been adjusted so that the mean time required for each DNA base to be transcribed is the same for all four of the DNA bases.) We now compare the results of these published experiments to the predictions of a special case of the look-ahead model: 1), only correct Watson-Crick basepairing is allowed; 2), all four binding rates *k*_{on} are equal, and all of the unbinding rates *k*_{off} are zero; and 3), the forward (incorporation) rate, *k*_{f}, which is relevant only when the first site of the look-ahead window is occupied, is the same, regardless of which base is being incorporated.

Note in particular the assumption that all four of the binding rates *k*_{on} are equal. Within the framework of the look-ahead model with negligible off-rates and a single forward rate, this is the parameter choice that realizes the condition used in the experiment that all four of the rNTP concentrations are equally rate-limiting. This is an important simplification, since it reduces the number of parameters that need to be determined, and even more so since it, together with the assumption that the forward (incorporation) rate is independent of which base is being incorporated, makes the statistics of the motion of RNA polymerase completely independent of the DNA sequence, thus simplifying the analysis of the model.

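Under these simplifying assumptions, it is straightforward to show that the waiting time distribution of the look-ahead model is always of the form

$${\rho}_{T}\left(t\right)=\frac{1-\theta}{1+\theta}\,{k}_{\text{f}}{e}^{-{k}_{\text{f}}t}+\frac{2\theta}{1+\theta}\,\frac{{k}_{\text{on}}{k}_{\text{f}}}{{k}_{\text{on}}-{k}_{\text{f}}}\left({e}^{-{k}_{\text{f}}t}-{e}^{-{k}_{\text{on}}t}\right)\text{,}$$(12)

where *ρ*_{T}(*t*) is the probability density for the time *T* of the next forward move after the forward move that occurred at *t* = 0, and where *θ* is a parameter that depends on the window size, *w*, in a manner that is detailed below for the particular cases *w* = 1, 2, 3, 4.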

The explanation of this general formula for the waiting time distribution is very simple. Immediately after a forward move, the first site of the window of activity may be occupied or unoccupied. If it is occupied, then the time to wait until the next forward move is simply an exponentially distributed random variable with mean 1/*k*_{f}, as in the first term on the right-hand side of Eq. 12. In the opposite case, in which the first site is unoccupied immediately after a forward move, the next forward move cannot occur until that site fills, an event that has probability per unit time *k*_{on}. In these circumstances, the waiting time until the next forward move is the sum of two independent exponentially distributed random variables, the first with mean 1/*k*_{on} and the second with mean 1/*k*_{f}. The probability density for the sum has the form of a difference of exponentials, as in the second term on the right-hand side of Eq. 12. The factor (1 − *θ*)/(1 + *θ*) is the probability that the first site is occupied immediately after a forward move, and the factor 2*θ*/(1 + *θ*) is the probability that the first site is unoccupied immediately after a forward move. Note that these two factors add up to one. Different window sizes have different waiting time distributions only because these probabilities depend upon the window size. In particular, for window size 1 it is always the case that the first site is empty immediately after a forward move, so *θ* = 1 when *w* = 1. Clearly, with *k*_{on} and *k*_{f} held constant, increasing the window size can only decrease the probability that the first site is empty immediately after a forward move; thus we expect that *θ* will decrease as the window size increases.
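Since 2*θ*/(1 + *θ*) is the probability that the first site is empty immediately after a forward move, *θ* can be estimated for any window size by direct stochastic simulation. The following is a minimal sketch under the simplifying assumptions stated above (equal on-rates, zero off-rates, one forward rate); the function name and the numerical rates used in the example are hypothetical.

```python
import random

def estimate_theta(w, k_on, k_f, n_moves=20000, seed=0):
    """Estimate theta from the fraction of forward moves that leave site 1 empty.

    Only the order of events matters for this fraction, so waiting times
    are not drawn. The simulation starts from an empty window; the short
    initial transient is ignored.
    """
    rng = random.Random(seed)
    occupied = [False] * w          # occupied[0] is the first site of the window
    moves = 0
    empty_after_move = 0
    while moves < n_moves:
        empty = [i for i, occ in enumerate(occupied) if not occ]
        rate_bind = k_on * len(empty)
        rate_fwd = k_f if occupied[0] else 0.0
        if rng.random() * (rate_bind + rate_fwd) < rate_fwd:
            occupied = occupied[1:] + [False]   # forward move shifts the window
            moves += 1
            if not occupied[0]:
                empty_after_move += 1
        else:
            occupied[rng.choice(empty)] = True  # binding at a random empty site
    p_empty = empty_after_move / n_moves
    return p_empty / (2.0 - p_empty)            # invert p_empty = 2*theta/(1+theta)

if __name__ == "__main__":
    for w in (1, 2, 3, 4):
        print(w, estimate_theta(w, k_on=10.0, k_f=25.0))
```

For *w* = 1 the estimate is exactly 1, because a single-site window is always empty right after a forward move; for larger windows the estimate should fall below 1, in line with the argument above.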

The problem of determining the value of *θ* as a function of the ratio *γ* = *k*_{on}/*k*_{f} for any particular window size is a challenge for which the difficulty seems to grow rapidly with the window size. We have managed to solve this problem for *w* = 1, 2, 3, 4, and have verified the results by computer simulation. The formulae we have found are

where *γ* = *k*_{on}/*k*_{f}, and where

We have fit the above formula for the waiting time distribution to the experimental data reported in Abbondanzieri et al. (15). For each window size separately, we have found the parameters *k*_{on} and *k*_{f} that give the best fit of the model to the data, in a least-squares sense. The data are reported in Abbondanzieri et al. (15) on a semilogarithmic plot; that is, the logarithm of the probability density is plotted against the waiting time, and we have done the fit with the data in that format as well (see Fig. 6). Since the logarithmic scale emphasizes rare events, however, we have also replotted (but not refit) both the data and the best-fit theoretical curves on an ordinary linear plot for comparison (see Fig. 7).

The waiting time distribution clearly distinguishes the different window sizes. The most important result here is that the window size *w* = 1, in which there is no look-ahead at all, is clearly ruled out by the data. The theoretical probability density of the waiting time in that case has the form given by the second term only on the right-hand side of Eq. 12. This term, which is a difference of two exponentials, describes a curve that rises from zero to a peak value before decaying, unlike the data, which are monotone decreasing. The fact that the shortest waiting times have the highest probability densities in the data is strong qualitative evidence in favor of the look-ahead concept, since this observation implies that the first site of the window of activity is quite likely to be occupied immediately after a forward move, and this requires the kind of parallel processing that is implied by the look-ahead model.
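The qualitative difference between *w* = 1 and a look-ahead window can be checked numerically. The sketch below writes out the two-term density of Eq. 12 from its verbal description (an exponential term weighted by (1 − *θ*)/(1 + *θ*) plus a difference-of-exponentials term weighted by 2*θ*/(1 + *θ*)); the function names and the rate values are hypothetical and are not fitted values.

```python
import math

def waiting_time_density(t, theta, k_on, k_f):
    """Two-term waiting time density of Eq. 12; requires k_on != k_f."""
    # First site already occupied: a single exponential step at rate k_f.
    occupied = k_f * math.exp(-k_f * t)
    # First site empty: binding (rate k_on) followed by incorporation (rate k_f).
    empty = (k_on * k_f / (k_on - k_f)) * (math.exp(-k_f * t) - math.exp(-k_on * t))
    return ((1.0 - theta) * occupied + 2.0 * theta * empty) / (1.0 + theta)

def peak_time_no_lookahead(k_on, k_f):
    """Time at which the theta = 1 (w = 1) density reaches its maximum."""
    return math.log(k_on / k_f) / (k_on - k_f)

def is_monotone_decreasing(theta, k_on, k_f, t_max=1.0, n=1000):
    """Check on a uniform grid whether the density never increases."""
    values = [waiting_time_density(i * t_max / n, theta, k_on, k_f)
              for i in range(n + 1)]
    return all(a >= b for a, b in zip(values, values[1:]))

if __name__ == "__main__":
    k_on, k_f = 10.0, 25.0   # hypothetical rates, per second
    print(waiting_time_density(0.0, 1.0, k_on, k_f))   # density at t = 0 for w = 1
    print(peak_time_no_lookahead(k_on, k_f))
    print(is_monotone_decreasing(1.0, k_on, k_f),
          is_monotone_decreasing(0.1, k_on, k_f))
```

With the formula written this way, the slope at *t* = 0 is proportional to 2*θk*_{on} − (1 − *θ*)*k*_{f}, so the density can decrease from the start only when *θ* is small enough; with *θ* = 1 it necessarily starts at zero and rises.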

The fit of the model prediction to the data is particularly good for the window size *w* = 4, the largest window size for which we currently have a theoretical result available for comparison. Although the fit for this case on the logarithmic scale (Fig. 6) still shows some error for the longer waiting times (which occur only rarely in the data), that error becomes invisible when the data and the theoretical results are replotted on a linear scale, where the visual impression is of an essentially perfect fit (Fig. 7).

The conclusion that the non-lookahead case *w* = 1 has a waiting time distribution that rises from zero to a peak before decaying (contrary to the monotone decay of the experimental data) is very general and not dependent on specific modeling assumptions. Any transcription model with a single site for binding of rNTP has the feature that at least two steps are needed per forward step of the enzyme, namely the binding of rNTP and its covalent linkage to the nascent RNA chain. Any such model will therefore have a nonmonotone waiting time distribution qualitatively like that derived above for the case *w* = 1. The only escape from this conclusion is that the rising phase of the waiting time distribution may be so fast that it is not resolved by the experimental measurement. This would be the case, for example, if the binding/unbinding of rNTP were a process of rapid equilibrium. We consider this possibility below.

### Rapid equilibrium limit

A potential criticism of the parameter fitting procedures considered in this article is that all of the unbinding rates have arbitrarily been set equal to zero. In this section, therefore, we briefly discuss the opposite limit, in which the reversible binding/unbinding within the window of activity is regarded as a rapid equilibrium process. In this limit, the size of the window of activity makes no difference, so the look-ahead feature of our model becomes irrelevant, and we might as well consider only the case *w* = 1. This obviously implies that it is futile to try to determine the window size by parameter fitting if reversible binding is a rapid equilibrium process.

Let us consider the form of the waiting time distribution under the rapid equilibrium assumption with the rNTP concentrations chosen to be equally rate-limiting. What “equally rate-limiting” means in the context of rapid equilibrium is that the product *k*_{f}*p*_{occupied} is the same for all four of the rNTP, where *p*_{occupied} is the probability that a site which can bind that particular rNTP is occupied. Note that *k*_{f} and *p*_{occupied} may separately differ for the different rNTPs, provided that their product is the same for all four rNTPs. This can always be achieved by adjusting the ambient rNTP concentrations, since each of the *p*_{occupied} can be adjusted within the interval (0, 1) by varying the corresponding rNTP concentration.

Under the conditions described in the previous paragraph, it is easy to see that the waiting-time distribution is a simple exponential of the form *k* exp(− *kt*), where *k* = *k*_{f}*p*_{occupied}. This would give a straight line on a semilogarithmic plot and is inconsistent with the experimental data reported in Abbondanzieri et al. (15).
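The step from rapid equilibrium to a single exponential can be spelled out briefly. In this sketch, the survival function *S*(*t*) = Pr(*T* > *t*) is notation introduced here for illustration. Because binding equilibrates much faster than incorporation, the site is occupied with probability *p*_{occupied} at every instant, however long the enzyme has already waited, so the forward move has the constant rate *k* = *k*_{f}*p*_{occupied}, and under equally rate-limiting conditions this rate is the same for every base:

$$\frac{dS}{dt}=-kS,\quad S(0)=1\;\Rightarrow\;S(t)={e}^{-kt},\qquad {\rho}_{T}(t)=-\frac{dS}{dt}=k{e}^{-kt},\qquad \log {\rho}_{T}(t)=\log k-kt\text{.}$$

The last expression is linear in *t*, which is the straight line on a semilogarithmic plot.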

One can always argue, however, that the equally-rate-limiting condition as described above may not have been perfectly achieved in the experiment. In that case, the waiting time distribution under the assumption of rapid equilibrium would be a mixture of several exponentials, and it might indeed be possible to fit the experimental data with such a model. Further experimental work may be needed to clarify this issue. If the rapid equilibrium assumption is correct, then it should be possible to find a combination of ambient rNTP concentrations that fulfills the above conditions and makes the waiting time distribution into a single exponential. This would disprove (or at least make irrelevant) the look-ahead model, since rapid equilibrium makes the first site of the look-ahead window the only one that matters.

A further prediction of the rapid-equilibrium assumption, in common with all models that have *w* = 1, is that the waiting times for the individual forward moves of the RNA polymerase model should be statistically independent of each other. This will be discussed more fully below; see Proposed Experimental Test to Rule Out a Large Class of Models in which Look-Ahead Does Not Occur.

### Proposed experiments to determine the window size

The foregoing results leave some ambiguity about the size of the look-ahead window. Experimental data on mean velocity as a function of concentration (7) are best fit by the look-ahead model with *w* = 1 or *w* = 2. Velocity histograms obtained in those same experiments do not help to determine the window size (and indeed have widths that are approximately twice that predicted by the model, regardless of the window size), but experimental data on waiting time distributions obtained with low, equally rate-limiting ambient rNTP concentrations (15) are well fit by the look-ahead model with *w* = 4. To help resolve this ambiguity, we now propose two additional experiments that may help to determine the window size.

The experiments we propose are both of the type in which ambient rNTP concentrations are varied and the mean velocity of transcription is measured. In the first proposed experiment, we again exploit the notion of equally rate-limiting concentrations (15) to obtain what we call universal curves. There is one such curve for each window size, with no adjustable parameters. In the second case, we propose experiments with saturating rNTP concentrations for more direct determination of the parameter *k*_{f}, after which it should be straightforward to determine the window size.

Throughout this section, we employ the six-parameter version of the look-ahead model that was considered previously. Recall that the unknown parameters of this version of the model are the window size, *w*, the forward rate, *k*_{f}, and the four concentration-independent on-rates, (*k*′_{on})_{i}, from which the on-rates themselves, (*k*_{on})_{i}, can be determined once the rNTP concentrations are known. It is the ability to manipulate the on-rates by varying the concentrations that motivates the experimental protocols proposed here.

### Universal curves

This proposed method of determining the window size is based on the observation that the look-ahead model simplifies enormously in the special case that all four of the rates (*k*_{on})_{i} are equal. In that case, the DNA sequence becomes irrelevant, and the unknown parameters are reduced to three: *w*, *k*_{f}, and *k*_{on}. Let $\overline{v}$ be the mean velocity of the RNA polymerase along the DNA in basepairs per second. From dimensional considerations, it is clear that $\overline{v}/{k}_{\text{f}}$ is determined by *k*_{on}/*k*_{f} for any particular window size, *w*. It is intuitively clear that $\overline{v}/{k}_{\text{f}}$ is a monotonically increasing function of *k*_{on}/*k*_{f}. This function starts at zero when its argument is zero and asymptotically approaches one as its argument approaches infinity. There is one such function for each window size *w*. These functions involve dimensionless variables only and have no adjustable parameters. In that sense, they are universal, and their graphs are universal curves. The universal curves may be obtained by solving the master equation in the appropriate special cases and then plotting the results as $\overline{v}/{k}_{\text{f}}$ versus *k*_{on}/*k*_{f}.

Examples of the universal curves are plotted in Fig. 8. As *w* increases, the curves shift up and to the left, i.e., the velocity is an increasing function of *w* when the other parameters are held fixed. This is a reflection of the parallel-processing feature of the look-ahead model. The RNA polymerase moves faster when *w* is larger because of the opportunity to bind more rNTP in advance of their covalent incorporation into the growing RNA chain.
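A universal curve can also be approximated without solving the master equation. The sketch below substitutes a Gillespie-type stochastic simulation for the master-equation solution described above, working in time units where *k*_{f} = 1 so that the only input is *γ* = *k*_{on}/*k*_{f}; the function name and the sample values of *γ* are hypothetical. For *w* = 1 each forward move requires one binding step followed by one incorporation step, so the curve is *γ*/(1 + *γ*), which provides a check on the simulation.

```python
import random

def scaled_velocity(w, gamma, n_moves=20000, seed=0):
    """Estimate v/k_f at gamma = k_on/k_f (gamma > 0), in units where k_f = 1.

    Assumes equal on-rates, zero off-rates, and a single forward rate, so
    that the DNA sequence is irrelevant.
    """
    rng = random.Random(seed)
    occupied = [False] * w          # occupied[0] is the first site of the window
    t = 0.0
    moves = 0
    while moves < n_moves:
        empty = [i for i, occ in enumerate(occupied) if not occ]
        rate_bind = gamma * len(empty)
        rate_fwd = 1.0 if occupied[0] else 0.0
        total = rate_bind + rate_fwd
        t += rng.expovariate(total)             # time to the next event
        if rng.random() * total < rate_fwd:     # forward move shifts the window
            occupied = occupied[1:] + [False]
            moves += 1
        else:                                   # binding at a random empty site
            occupied[rng.choice(empty)] = True
    return moves / t

if __name__ == "__main__":
    for w in (1, 2, 3):
        print(w, [round(scaled_velocity(w, g), 3) for g in (0.25, 1.0, 4.0)])
```

Each row of the printed output samples one curve at three values of *γ*; larger windows should give larger values at the same *γ*.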

Figure 8. ... *k*_{on}/*k*_{f} when the rNTP concentrations have been adjusted so that all four of the on-rates are equal (and then varied in fixed proportions to vary the common value of *k*_{on}). Results computed by solving ...

For purposes of parameter estimation, however, the most important feature of the universal curves is that each of them is unique. If one could make a plot of experimental data of $\overline{v}/{k}_{\text{f}}$ as a function of *k*_{on}/*k*_{f}, that plot would presumably fall on one and only one of the universal curves. The one with which it agreed would reveal the correct value of *w*.

To make use of this idea, though, we have to make all of the on-rates equal, and also we then need to be able to vary that common on-rate and plot the results in terms of the dimensionless variables stated above, namely $\overline{v}/{k}_{\text{f}}$ as a function of *k*_{on}/*k*_{f}. It is not immediately obvious how we can do any of this, since we do not know any of the parameters of the model a priori.

To overcome this difficulty, we make use of the parameter-fitting procedure described above, in which the data take the form of plots of the mean velocity of transcription as a function of the concentrations of each of the rNTP, varied one at a time. Even though that parameter fitting procedure is not very effective for determining the value of *w*, it does determine the best-fit rate constants, *k*_{f} and (*k*′_{on})_{i}, for any hypothesized value of *w*. We can therefore check whether any particular guess for *w* is correct in the following way:

- Step 1. Given the experimental data on the mean velocity of transcription as a function of each of the rNTP concentrations, together with a hypothesized value of *w*, determine the best-fit rate constants *k*_{f} and (*k*′_{on})_{i}.
- Step 2. Use the fitted values of (*k*′_{on})_{i} to determine the rNTP concentrations that make all four of the (*k*_{on})_{i} = *k*_{on}, independent of *i*. Since (*k*_{on})_{i} = (*k*′_{on})_{i}[rNTP]_{i}/[rNTP]_{0}, the correct choice of [rNTP]_{i} to yield any particular common on-rate *k*_{on} is given by $${\left[\text{rNTP}\right]}_{\text{i}}={\left[\text{rNTP}\right]}_{0}\frac{{k}_{\text{on}}}{{\left({k}_{\text{on}}^{\prime}\right)}_{\text{i}}}\text{.}$$(14)
- Step 3. Now do an experiment with the rNTP concentrations set according to the above formula, and plot a single point with coordinates $\left({k}_{\text{on}}/{k}_{\text{f}},\overline{v}/{k}_{\text{f}}\right)$. In these ratios, the value of *k*_{f} that should be used is the one that was obtained during the parameter fitting for the hypothesized value of *w*.
- Step 4. Repeat this procedure for enough values of *k*_{on}/*k*_{f} to get a picture of the graph of $\overline{v}/{k}_{\text{f}}$ versus *k*_{on}/*k*_{f}.
- Step 5. Plot the data points of this graph on the same axes as the family of universal curves. If the result fits the universal curve for the hypothesized value of *w*, then that value of *w* is correct, or at least self-consistent.
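Step 2 reduces to a one-line calculation per nucleotide. The sketch below applies Eq. 14 to a set of fitted rates; the function name, the reference concentration, and the fitted values of (*k*′_{on})_{i} are all hypothetical and serve only to illustrate the arithmetic.

```python
def equalizing_concentrations(k_on_target, k_on_prime, ref_conc):
    """Apply Eq. 14: [rNTP]_i = [rNTP]_0 * k_on / (k'_on)_i for each nucleotide.

    k_on_target : desired common on-rate k_on (per second)
    k_on_prime  : dict mapping nucleotide to its fitted (k'_on)_i (per second)
    ref_conc    : reference concentration [rNTP]_0 (same units as the result)
    """
    return {base: ref_conc * k_on_target / kp for base, kp in k_on_prime.items()}

if __name__ == "__main__":
    # Hypothetical fitted on-rates (per second) and reference concentration (uM).
    fitted = {"A": 150.0, "C": 100.0, "G": 50.0, "U": 200.0}
    conc = equalizing_concentrations(10.0, fitted, ref_conc=1000.0)
    for base in sorted(conc):
        print(base, conc[base])
```

Nucleotides with a smaller fitted (*k*′_{on})_{i} need a proportionally higher concentration to reach the same common on-rate, and varying `k_on_target` scales all four concentrations in fixed proportions, as required in Step 4.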

What is less clear, perhaps, is what will happen when the hypothesized value of *w* is incorrect. In that case, the parameters obtained by best fit will be wrong, the concentrations used will not actually yield equal values of *k*_{on}, and the results will not, typically, fall on any of the universal curves.

Fortunately, we can test the proposed experiment by computer simulation using synthetic data. In this test, we consider only window sizes *w* = 1, 2, 3 for clarity of illustration, but the method can be extended without difficulty to larger window sizes. The results of this simulated experiment can be seen in Fig. 9. The correct window size can be inferred from the calculated curve that most closely matches one of the three universal curves.

Figure 9. ... *w* = 2, *k*_{f} = 25.0/s, (*k*′_{on})_{A} = 150.0/s, (*k*′ ...

### RNA polymerase velocity at high rNTP concentrations

When parameter fitting is done for a hypothesized window size, the best-fit value of *k*_{f} depends on the window size in a systematic way. This is again because of the parallel-processing feature of the look-ahead model as discussed above. Since a larger window size produces faster motion for any given set of rate constants, the fitting procedure necessarily adjusts the rate constants to compensate for the window size in an attempt to match the observed mean velocity. The result is that the best-fit value of *k*_{f} will be a decreasing function of the hypothesized window size. This means that if we have an independent way to measure *k*_{f}, we can use that independent measurement to determine the window size, simply by seeing which of the hypothesized window sizes led to the most accurate prediction of *k*_{f}.

Within the framework of the look-ahead model, the most obvious way to measure *k*_{f} is to employ saturating concentrations of all four rNTPs, so that the window is always fully occupied, and the RNA polymerase simply moves forward with probability per unit time equal to *k*_{f}. Indeed, experimentalists seem to be not too far from this condition when they set all of the rNTP concentrations equal to 1000 *μ*M. From Figs. 3 and 4, however, it is clear that this does not quite produce the limiting velocity of forward movement, and that higher concentrations would be needed for that purpose.

As before, we test this proposed experiment by computer simulation. Table 6 summarizes the results. The table shows that the velocities computed at saturating rNTP concentrations do indeed match the values of *k*_{f} that were obtained by parameter fitting with the correct window size.

### Proposed experimental test to rule out a large class of models in which look-ahead does not occur

The original formulation of the model proposed in this article is very general. Besides the window size *w*, it involves 3 × 4 × 4 parameters (*k*_{f})_{ij}, (*k*_{ON})_{ij}, and (*k*_{OFF})_{ij}, where *i* denotes one of the four possible DNA bases and *j* denotes one of the four possible rNTPs. In particular, this general formulation allows for the possibility of non-Watson-Crick basepairing and for errors in transcription. One can make the model even more general than this in the case *w* > 1 by allowing (*k*_{ON})_{ij} and (*k*_{OFF})_{ij} to depend not only on *i* and *j* but also on position within the window of activity. Also, one can generalize even further by including the limiting case of rapid equilibrium in which (*k*_{ON})_{ij} → ∞ and (*k*_{OFF})_{ij} → ∞, but in such a way that $\frac{{\left({k}_{\text{ON}}\right)}_{\text{ij}}}{{\left({k}_{\text{OFF}}\right)}_{\text{ij}}}$ has a finite limit.

Within the framework of this large class of models, we seek an experimental test that can potentially rule out all of the non-lookahead models. These are the models with *w* = 1, and also all of the rapid equilibrium models (regardless of *w*). The models with *w* = 1 have no look-ahead feature, and the rapid equilibrium models may as well have *w* = 1, since the activity at window sites other than the first (if any) has no effect on the dynamics of transcription in the case of rapid equilibrium.

The experiment that we propose involves the transcription of a random DNA sequence, more specifically one in which the bases at the different sites along the DNA are chosen independently according to prescribed base frequencies. The ambient concentrations of the various rNTPs should be chosen sufficiently low that the times of the individual forward moves of the RNA polymerase can be resolved, as in Abbondanzieri et al. (15). There is no requirement here that these concentrations should be equally rate-limiting, however.

The experiment that we have just described defines a stationary stochastic process of which the output is the sequence of times at which the RNA polymerase makes its forward moves. For all of the models that we have classified above as non-lookahead models, it is easy to see that the stochastic process in question is a renewal process, in which the time intervals between successive forward moves are independent random variables. We may regard this universal prediction of the non-lookahead case as a null hypothesis, and use randomization tests for serial correlation such as those discussed in Manly (18) to see whether the null hypothesis may be rejected. Rejection of the null hypothesis would not prove the validity of the look-ahead model, but it would rule out a large number of non-lookahead alternatives.
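One simple form such a test could take is a permutation test on the lag-1 serial correlation of the observed waiting times. The sketch below is an assumption about how the test might be set up, not a description of a specific procedure from Manly (18); the function names and the default number of shuffles are hypothetical.

```python
import random

def lag1_autocorrelation(x):
    """Sample lag-1 serial correlation of a sequence of waiting times."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return cov / var

def randomization_test(intervals, n_shuffles=2000, seed=0):
    """Two-sided randomization test of the renewal (independence) null hypothesis.

    Shuffling destroys any serial dependence while keeping the observed
    values, so the shuffled statistics sample the null distribution.
    Returns the observed statistic and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = lag1_autocorrelation(intervals)
    shuffled = list(intervals)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(lag1_autocorrelation(shuffled)) >= abs(observed):
            at_least_as_extreme += 1
    # Adding one to numerator and denominator counts the observed ordering itself.
    return observed, (at_least_as_extreme + 1) / (n_shuffles + 1)

if __name__ == "__main__":
    rng = random.Random(1)
    independent = [rng.expovariate(5.0) for _ in range(500)]   # renewal-like sample
    print(randomization_test(independent))
```

A small p-value for measured waiting times would argue against the renewal property and hence against the non-lookahead class; a large one would leave the null hypothesis standing. Correlations at longer lags could be tested in the same way.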

## Discussion and Conclusions

Because our chemical kinetic model assumes the simultaneous incorporation of nucleotides along with unidirectional forward translocation of the RNA polymerase, our model is most easily visualized in terms of powerstroke mechanisms such as those of Yin and Steitz (11) and Gong et al. (12). We emphasize, however, that our model is agnostic as to physical mechanism, and deals only with chemical kinetic events such as binding, unbinding, and covalent linkage of bases to the nascent RNA chain (which we regard as being synchronous with forward motion of the RNA polymerase enzyme).

We argue that backward translocation is uncommon for several reasons: 1), the breaking of a covalent bond of the nascent RNA chain is energetically unfavorable; 2), at certain sites, the folding of the nascent RNA chain into a hairpin provides a backstop that prevents the nascent RNA chain from moving backward; and 3), backward translocation occurs only under special circumstances, namely during transcriptional arrest, transcriptional termination, or a complete absence of rNTPs (19,20). Our proposed model is best supported by the experimental work of Gong et al. (12), which disputes backward translocation and supports the idea of presorting rNTPs on template DNA sites upstream of the active site.

The nature of pauses in the motion of RNA polymerase has been much debated. Pausing is important to understand because it enables synchronization of enzymatic events and regulates the overall speed of transcription. Recent single-molecule experiments on transcriptional elongation (14,19,21,22) have all reached different results and conclusions concerning the nature of pausing. Forde et al. (21) have hypothesized that elongation is a bipartite mechanism, in which the RNA polymerase backtracks followed by a conformational change of the polymerase complex, which results in an arrested molecule incapable of being rescued by an assisted mechanical force. Bai et al. (7,23) have hypothesized that pausing is the result of backward translocations along the DNA. Neuman et al. (19) and Shaevitz et al. (24) have hypothesized that a structural rearrangement within the RNA polymerase enzyme is the cause of short pausing. Based on the latter experiments (19,24), the majority of pausing has been shown to be short and ubiquitous, and is not the result of backtracking along the DNA; instead, it is thought that the polymerase enters an off-pathway state of pause (25). Longer pauses (those >20 s), on the other hand, occur much less frequently and are hypothesized to occur by an entirely different mechanism.

In the look-ahead model, the statistics of the motion of RNA polymerase may be described as follows. Consider the limit in which the forward rate constant is very fast. Then RNA polymerase moves forward every time that the first site within the look-ahead window becomes occupied. The distribution of the waiting time for this to occur will be exponential with a rate constant that may be sequence-dependent. Once a forward step does occur, it may be immediately followed by one or several additional forward steps, depending on how many adjacent sites within the look-ahead window happened to be filled at the moment when the first site is filled. Put another way, the RNA polymerase slides the length of the adjacently filled sites within the window of activity. Such sliding is consistent with the inchworm model (26) of transcriptional elongation that was popular during the 1980s. The inchworm model has never been formally ruled out (19).

An interesting property of the look-ahead model that we have not yet fully explored is the potential role of the look-ahead feature in preventing transcription errors. Assuming that there is a nonzero probability of incorporating an incorrect nucleotide covalently into the nascent RNA chain, it becomes important to reduce the probability of such an incorrect base being present at the site where it would be incorporated. This may be accomplished by having a high off-rate for incorrect basepairings, and by allowing sufficient time for this off-rate to be effective. The look-ahead model provides this possibility (in contrast to a model that only involves binding followed by a covalent linkage).

Using the master-equation formulation of our model, we performed parameter estimation on both synthetic and actual data. Our computational experiments involving parameter fitting to synthetic data show that the original parameters can be recovered, even when the synthetic data, generated using the Gillespie method, are noisy. The amount of noise that is introduced in this way decreases in inverse proportion to the square root of the number of runs that are used to generate the synthetic data. By varying the number of runs, we are able to assess the influence of this type of noise on the parameter estimation process. The scenario considered here, in which the synthetic experimental data are corrupted by noise, is more realistic than the noise-free case. Note, in particular, that we are not simply adding arbitrary noise to the data, but instead are considering a type of noise that is intrinsic to the physical process under consideration. Moreover, our computational experiments show that the number of individual runs necessary to recover the original parameters from noisy synthetic data is not prohibitive, but instead is a feasible number to do in an actual experiment.

We have also performed parameter estimation studies based on two different types of actual experimental data. The first kind of data that we employed concerns the mean velocity of RNA polymerase as a function of the ambient rNTP concentrations, varied one at a time (7). The best fits of the predictions of the look-ahead model to such data are achieved with the window sizes *w* = 1 and *w* = 2. The second kind of data that we used is the statistical distribution of the waiting times between forward moves of the RNA polymerase enzyme (15). These data were obtained with the ambient rNTP concentrations chosen to be equally rate-limiting, an important condition which simplifies the analysis of the look-ahead model. The fit of the predictions of the model to these data clearly rules out the window size *w* = 1 and is excellent for the window size *w* = 4. In this connection, it should be noted that Abbondanzieri et al. interpret their own data as being consistent with a secondary site for rNTP binding, a suggestion that seems to be in accord with the look-ahead concept.

All of the parameter fitting in this article has been done under the assumption that the unbinding rates from the sites within the look-ahead window are negligible. This assumption was made primarily to avoid the proliferation of parameters that would otherwise result. We have, however, briefly considered the opposite assumption, i.e., that the binding/unbinding of rNTP to sites within the look-ahead window is in rapid equilibrium. The rapid equilibrium assumption makes the size of the look-ahead window irrelevant, so one may as well consider *w* = 1, and theoretical results are relatively easy to derive. In particular, it is easy to predict the form of the waiting time distribution for comparison with the experimental data of Abbondanzieri et al. (15). We have done this for the special case in which the ambient rNTP concentrations have been adjusted to make *k*_{f}*p*_{occupied} the same for all of the different rNTPs. Note that these assumptions imply that each of the rNTP concentrations is equally rate-limiting, as in the experiment reported in Abbondanzieri et al. (15). In this special case of rapid equilibrium, the theoretical waiting time distribution is a simple exponential, which is inconsistent with the experimental data (15).

Because our parameter fitting results give different answers for the window size, we have proposed two additional experiments to help resolve this issue. In both cases, we have shown that the proposed method of determining the window size is effective when applied to synthetic data. Since these proposed experiments have not yet been done, actual data are not available.

The first of the proposed experiments is based on the observation that when the rNTP concentrations are manipulated in a specific way, all four of the on-rates become equal and the relationship between the mean velocity of RNA polymerase and the common on-rate can be expressed in terms of certain universal curves, a different one for each window size. These curves relate dimensionless variables and do not involve any adjustable parameters, so it should be possible to determine the window size by seeing which of the universal curves best fits the data.

In a second proposed experiment, we suggest using saturating concentrations of all four rNTPs so that the mean velocity of the RNA polymerase, expressed in basepairs per second, will be equal to the parameter *k*_{f} of the model. The reason this should determine the window size is that different hypothesized window sizes lead to different predictions of *k*_{f}, so an independent determination of *k*_{f} will tell which of these predictions is correct.
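
As a rough illustration of why saturating concentrations expose *k*_{f}, the following Python sketch simulates a simplified version of the special case used for parameter fitting (no unbinding, sequence-independent forward rate). It is not the code used in this article. The function name `mean_velocity`, the single common on-rate `k_on`, and the rule that a forward step shifts the window by one site and admits a new empty site are all assumptions made for the sketch.

```python
import random


def mean_velocity(k_f, k_on, w, n_steps=20000, seed=0):
    """Gillespie-style simulation of a simplified look-ahead model.

    Assumptions of this sketch (hypothetical, for illustration only):
      * every empty site in the window binds rNTP at the same rate k_on;
      * bound rNTP never unbinds;
      * a forward step occurs at rate k_f whenever site 0 is occupied;
      * a forward step shifts the window and admits one new empty site.
    Returns the number of forward steps per unit time.
    """
    rng = random.Random(seed)
    occupied = [False] * w  # occupied[0] is the next site to be incorporated
    t = 0.0
    steps = 0
    while steps < n_steps:
        rate_bind = k_on * occupied.count(False)
        rate_step = k_f if occupied[0] else 0.0
        total = rate_bind + rate_step  # always positive by construction
        t += rng.expovariate(total)    # exponential time to the next event
        if rng.random() * total < rate_step:
            # Forward step: drop site 0, append a fresh empty site.
            occupied = occupied[1:] + [False]
            steps += 1
        else:
            # Binding event at a uniformly chosen empty site.
            empties = [i for i, occ in enumerate(occupied) if not occ]
            occupied[rng.choice(empties)] = True
    return steps / t
```

Under these assumptions, choosing `k_on` much larger than `k_f` should give a simulated velocity close to `k_f` for any window size, whereas at low `k_on` the velocity depends on `w`.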

A limitation of the parameter fitting done in this article is that it has involved only a special case of the look-ahead model. This special case is characterized by the following additional assumptions, as well as those of the look-ahead model itself. 1), Only correct Watson-Crick basepairing is allowed. 2), The forward rate is assumed to be independent of which nucleotide is being incorporated into the growing RNA chain. 3), We assume that the off-rates can all be neglected.

Quite possibly, one or more of these limitations is responsible for the discrepancy that remains between model predictions and experimental results, even when we have made a best fit of the parameters of the model. Note, for example, that our velocity histograms computed with best-fit parameters are narrower than those obtained experimentally (see Fig. 5). This might not be the case if incorrect (non-Watson-Crick) basepairing were allowed. Such issues will be the subject of future research.

The above-described limitations are to some extent overcome, however, by our proposed experiment to rule out a large class of non-lookahead models. The discussion of this proposed experiment is based upon the full model of this article, without the simplifying assumptions that were made to facilitate parameter fitting, and also without relying on the use of equally rate-limiting rNTP concentrations. Within this large class of models, we identify a subset that we refer to as non-lookahead models, and we note how all of these can potentially be ruled out by a statistical test involving rejection of the null hypothesis that the time intervals between successive forward moves of RNA polymerase are independent random variables.
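
One simple way such a test could be set up is sketched below in Python. This is an illustrative stand-in, not the specific test of this article: it checks only the lag-1 serial correlation of the waiting times (a consequence of independence, not the whole of it) and uses a permutation null, which additionally assumes that the intervals are exchangeable under the null hypothesis. The function names `lag1_correlation` and `permutation_p_value` are hypothetical.

```python
import random


def lag1_correlation(x):
    """Sample lag-1 autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return cov / var


def permutation_p_value(waits, n_perm=2000, seed=0):
    """Two-sided permutation p-value for zero lag-1 correlation.

    Shuffling destroys any serial dependence while preserving the
    marginal distribution, so the shuffled statistics approximate the
    null distribution under exchangeability (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = abs(lag1_correlation(waits))
    shuffled = list(waits)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(lag1_correlation(shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A small p-value would count as evidence against independence of successive waiting times, and hence against the non-lookahead class; a large p-value would not by itself establish independence.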

An important question not considered in this article is the structural basis for our proposed look-ahead model. As discussed in Vassylyev et al. (27), there is structural evidence for a preinsertion site that is distinct from the catalytic site of RNA polymerase. Noncovalent binding and selection of the correct rNTP occur at the preinsertion site, and hydrolysis and linkage to the nascent RNA chain occur at the catalytic site. This hypothesis, described in Vassylyev et al. (27), is similar, but not identical, to the look-ahead model with a window size *w* = 2. One difference is that in the look-ahead model of this article it is possible for an rNTP to bind directly to the catalytic site (if that site should happen to be empty) as well as to the preinsertion site. Another difference, perhaps consistent with Vassylyev et al. (27) but not discussed there, is the possibility of parallel processing that exists in the look-ahead model: the preinsertion site can fill while the catalytic site is occupied. Strong qualitative evidence in favor of such parallel processing comes from the experimental fact that the measured probability density of the waiting time for a forward move is monotone decreasing (15), so that the most likely waiting time is zero. This cannot be the case if two (or more) kinetic steps must occur serially for each forward step of the RNA polymerase molecule. As reported herein, our best fit to the data in Abbodanzieri et al. (15) occurs with a look-ahead model whose window size is 4. We are not aware of any structural data that would support a window size >2, however, so this leaves a discrepancy between kinetic and structural evidence that needs to be resolved.
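
The serial-step argument above can be made explicit with a standard calculation. Suppose, purely for illustration, that each forward step required two serial, irreversible kinetic steps with rates $k_1$ and $k_2$ (symbols introduced here, not parameters of the look-ahead model). The waiting time is then the sum of two independent exponential times, with density

$$
f(t) = \frac{k_1 k_2}{k_2 - k_1}\left(e^{-k_1 t} - e^{-k_2 t}\right) \quad (k_1 \ne k_2),
\qquad
f(t) = k^2\, t\, e^{-k t} \quad (k_1 = k_2 = k).
$$

In both cases $f(0) = 0$, and the density rises to a maximum at a strictly positive time, $t^* = \ln(k_2/k_1)/(k_2 - k_1)$ (or $t^* = 1/k$ when the rates are equal), so it cannot be monotone decreasing.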

Finally, it is important to keep in mind that the data to which our proposed model predictions are compared in this article come from experiments on prokaryotic RNA polymerase. There is no reason why the look-ahead model should not be applicable to eukaryotic RNA polymerases; indeed, one might reasonably expect larger window sizes in the eukaryotic case. It is therefore exciting to note that single-molecule force microscopy has recently been applied to the study of transcriptional elongation by eukaryotic RNA polymerase (28). This opens up the possibility that the model described here will have a new domain of applicability. Indeed, by fitting the model both to the prokaryotic and to the eukaryotic RNA polymerases, one should be able to learn more about the differences between these two classes of related enzymes.

## Acknowledgments

We thank the Center for Applied Mathematics at Cornell University for the generous use of computing resources during this project. We thank Arthur LaPorta, Lu Bai, and Dan S. Johnson at LASSP/Cornell University for information concerning their single-molecule experiments. In addition, we thank Udo Wehmeier for preliminary work on the model of this article, and Darren J. Wilkinson for discussion with Y.R.Y. about parameter estimation approaches. Andrew Matteson was instrumental in pointing out an error in an earlier version of Eq. 13, which is now correct, thanks to his vigilance. We thank the two anonymous reviewers of this article for their insightful comments and suggestions, especially for calling our attention to the waiting time distributions reported in Abbodanzieri et al. (15) that seem to be the best available data for determining the window size of the look-ahead model. Finally, we thank Daniel B. Forger for his advice and support.

Y.R.Y. was supported by a National Science Foundation Integrative Graduate Education and Research Traineeship grant No. DGE-033366, and C.S.P. was supported in part by National Institutes of Health grant No. 1P50GM071558-01A2 to the Systems Biology Center in New York.

Soli deo gloria.

## References

*In* Pausing, Arrest, and Termination: Structure and Mechanism: Lectures at the KITP 2003 Program on Bio-Molecular Networks. 2003. http://online.itp.ucsb.edu/online/bionet03/ebright/.

**The Biophysical Society**

A Look-Ahead Model for the Elongation Dynamics of Transcription. Biophysical Journal. Apr 22, 2009; 96(8): 3015.
