We have estimated the maximal size for an RNA motif recoverable from selection-amplification for new RNA activities, under conditions that span those in present laboratory use. The number of sequence pieces from which an active site is folded (the modularity) is a crucial variable. Routine laboratory experiments might isolate RNAs of modularity 4 containing ≤33 specified nucleotides. The probability of recovering shorter motifs increases rapidly, but the likely maximal motif size declines 1.66 nucleotides per 10-fold decrease in experimental scale. In such experiments, randomized tracts of 80–120 nucleotides extract most of the benefit of longer initially randomized pools. The same methods also permit extrapolation to conditions more plausible during the initiation of an RNA world. Under these conditions, active RNAs were likely highly modular, even more so than in modern experiments. Strikingly, several lines of evidence converge on the conclusion that 15 to 35-mer active sites would be the working material for an early RNA world. If initiation of an RNA world is synonymous with emergence of active structures from randomized sequences (the Axiom of Origin), populations containing only zeptomoles of RNA (hundreds to hundreds of thousands of molecules) might yield RNAs at the lower end of this size range. This makes the RNA world much more accessible than previously suspected.

## Introduction

It is rare that a new technique makes possible a type of experiment not feasible before, but this is true of selection-amplification or SELEX.^{4,20,22} This procedure consists of cycles of alternating selection (biochemical fractionation) and amplification (replication), applied to RNA or DNA containing randomized tracts of nucleotides. Because nucleic acids are uniquely able to replicate, any usable fractionation can be applied to a starting population, then repeatedly re-applied to the replicated output from the fractionation. The population of molecules increases in purity. When repetitive selection for an initially rare molecule yields sufficient purity, the population is cloned (and active molecules thereby purified to homogeneity).

On one hand, such cyclic fractionation-replication is well suited for specific questions like “What is the sequence of the nucleic acid bound by protein A?” Protein A is repetitively used to sequester molecules for which it has affinity. After multiple selection-amplification cycles, a substantial plurality of molecules have a protein A binding site, revealed as conserved sequences among independently derived clones. As for other specific types of questions, selection-amplification supplies an alternative to (for example) cloning and sequencing natural protein A binding sites, and clarifies such an experiment by reducing bias due to effects other than binding.

However, this chapter is primarily about the non-specific, open-ended use of selection-amplification. In such an experiment, one asks “Is an RNA (DNA) with property X possible?” As long as a selection (fractionation) exists that specifically concentrates molecules with property X, one can determine their existence and study them in pure form. No prior information is required about the potentially interesting molecules, and no prior example ever need have been observed. Such questions are pressing because of the prediction that many practical RNA activities are presently extinct. This unused RNA potential, imputed relic of an RNA world,^{26} can usually be demonstrated in no other way than by open-ended selection-amplification.

There are notable findings using this approach. The four chemical subreactions required for translation have been shown to be within RNA capabilities.^{29} Three of these (amino acid activation,^{19} aminoacyl-RNA synthesis,^{10} and direct coding interactions^{27}) are not extant RNA capabilities. RNA should have once replicated itself, and a pure RNA-RNA replicase that uses free primed RNAs as template has been selected.^{15} Finally, the existence of an RNA-mediated metabolism has been supported. RNAs can synthesize^{9} and utilize the enzymatic cofactor CoA,^{13} long thought, because of its structure, to be a molecular remnant of an RNA world.^{26}

However, selection-amplification is not infinitely capable. It isolates only molecules that meet its constraints, and addresses the existence of a new RNA activity only within limits. We understand selection's outcome only if we know the scope of selection.

In order to define what sort of molecules are within reach, below we estimate how large a nucleotide motif can be derived from a randomized RNA sequence of specified size, under conditions that span those in current use. This discussion concerns only what might be present in the initial randomized pool, but models of the subsequent selective processes are also available.^{23}

## Calculations

What follows is only counting, though obscured by notation (details have been placed in an Appendix). The need for counting is fundamentally simple, and can be appreciated from a rough example. Suppose we pick single nucleotides blindly from a hat containing a large number of the standard four. In order to reduce the probability of missing one of A, G, C or U to some small value, we must pick several times 4 of them. In order to be similarly certain we don't miss a dinucleotide, we must pick about 4 times more, since there are 16 kinds rather than 4 kinds. Note the “about”: actually, the statistics for these small numbers are a bit different from picking huge numbers of nucleotide sequences (see the Appendix), but close enough for now. We usually wish to reason conversely; that is, for any particular number of sequences from my hat (or in my experiment), motifs of a particular size are present with high probability. What is that size?
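The hat example can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 1% miss probability are assumptions chosen for the example, not values from the text. It computes how many blind picks are needed before the chance of missing one particular mononucleotide, or one particular dinucleotide, falls below the chosen threshold.

```python
import math

def prob_missed(q: int, k: int) -> float:
    """Probability that one particular kind, out of q equally likely kinds,
    is still absent after k independent blind picks."""
    return (1.0 - 1.0 / q) ** k

def picks_needed(q: int, p_miss: float) -> int:
    """Smallest number of picks k for which prob_missed(q, k) <= p_miss."""
    return math.ceil(math.log(p_miss) / math.log(1.0 - 1.0 / q))

if __name__ == "__main__":
    p_miss = 0.01  # illustrative threshold (an assumption, not from the text)
    mono = picks_needed(4, p_miss)    # A, G, C or U
    di = picks_needed(16, p_miss)     # the 16 dinucleotides
    print(f"mononucleotides: {mono} picks; dinucleotides: {di} picks")
    print(f"ratio: {di / mono:.2f}")
```

The ratio comes out near 4 rather than exactly 4, which is the "about" in the paragraph above: for such small numbers of alternatives the exact statistics differ slightly from the large-number limit.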

This crucial question can be refit for finding motifs of *l* nucleotides within a pool of RNAs randomized at *n* positions. How many RNA folds involving *l* nucleotides (nt) divided into *m* indivisible sequence modules (the modularity) are present in a randomized sequence *n* nucleotides long (*l* ≤ *n*)? Each fold is a sample of *l*-mer sequences. If we multiply folds by the number of randomized molecules used, we know the effective number of sequences we have tested for some new function in a selection-amplification. As in the paragraph just above, we can then estimate the size of the *l*-mer we might recover. In the discussion below, *l*, the likely motif size, is our index for the capability of selected RNAs. The implicit assumption is—the larger the motif potentially selected, the more capable the selected molecule is likely to be.

## Results

### The Importance of Being Modular

Figure 1 shows the total number of motifs containing 20 fixed nucleotides (*l* = 20) in random regions of different lengths. The results vary widely. One can construct only a single occurrence (*O* = 1) for a motif that just fits: 1 module of 20 nucleotides in a randomized region 20 nucleotides long (*n* = 20). But in a second, equally realistic situation, about 10^{10} folds exist for motifs broken into 4 pieces (*m*=4) and allowed to find as many positions as they can in random regions 150 long (*n* = 150).

Figure 1 therefore strikingly emphasizes the importance of the modularity, *m*. The number of folds available for selection increases two or three orders of magnitude with each increase in *m*, the number of sequence pieces allowed. Therefore, in the absence of special considerations, selection-amplifications isolate molecules whose active sites are folded from as many separated sequence tracts as possible. This has a profound effect on the nature and analysis of selected molecules, to which we return below.
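One simple way to count folds reproduces both numbers quoted for Figure 1. The sketch below is a deliberately naive reckoning, not necessarily the Appendix's own. It assumes (these are assumptions of the example) that modules keep their order along the sequence, do not overlap, and may be separated by spacers of any length including zero. The function names are hypothetical.

```python
from math import comb

def configurations(l: int, m: int) -> int:
    """Ways to divide l motif nucleotides into m ordered, non-empty modules."""
    return comb(l - 1, m - 1)

def placements(l: int, m: int, n: int) -> int:
    """Ways to place m ordered, non-overlapping modules of total length l
    within n positions: the n - l unused positions are distributed among
    the m + 1 gaps before, between and after the modules."""
    return comb(n - l + m, m)

def folds(l: int, m: int, n: int) -> int:
    """Total count under the assumptions stated above."""
    return configurations(l, m) * placements(l, m, n)

if __name__ == "__main__":
    print(folds(20, 1, 20))            # a motif that just fits
    for m in range(1, 5):              # l = 20 within n = 150
        print(m, f"{folds(20, m, 150):.2e}")
```

Under these assumptions a single 20-nucleotide module in a 20-nucleotide region has exactly one fold, and 4 modules in a 150-nucleotide region have roughly 10^{10}, with each added module contributing two to three orders of magnitude.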

### The Importance of Random Region Size, *n*

What is the most effective size for a randomized region? Figure 2 is directed at this fundamental experimentalist's question, encountered by everyone who has performed a selection-amplification.

In the following, the “presence” of a motif (the probability of occurrence of an *l*-mer) is discussed. It seems at the outset that one might equally reasonably count a motif present if the probability of its occurrence is 0.5 or with similar justification, 0.99. Throughout the text below, motifs are said to be present when the probability of an *l*-mer is 0.5, that is, when 50% of all *l*-mers are present. This is not arbitrary, but chosen to maximize the accuracy of the calculations (see Appendix for details).

The scope of a practical selection is limited by the mass of RNA present. Some number of RNA molecules is always convenient; more is hard. The maximum number of molecules used may be limited in various ways: by economics, by the capacity of PCR machines, by the solubility of macromolecular RNAs in the presence of divalent ions, or by some combination of considerations. In Figure 2 the effect of total initial pool absorbance is shown using different strategies for the size of the randomized region, *n*, and seeking structures with a fixed, realistic modularity, *m* = 4.

It clearly pays to increase n, making longer random regions on fewer initial molecules. However, this strategy becomes less effective as randomized regions get longer. One adds about 3.6 fixed nucleotides to the likely motif by increasing the randomized region from 40 to 80, but going from 80 to 120 adds only 1.2 more. Adding another 40 randomized positions (to *n* = 160) potentially enlarges the accessible motif by only 0.7 nucleotides. There are three reasons for this behavior. One has fewer molecules as the mass of each one increases. In addition, Figure 1 shows that the rapidity of the increase in the number of folds coming from each molecule decreases as the randomized tract is lengthened. In addition (Appendix) sampling of sequences becomes less effective (more non-ideal) as a randomized region is lengthened, further decreasing effectiveness.

Therefore, if an experiment requires an RNA of unconditionally maximized capability, you should select using randomized regions of maximum attainable size. On the other hand, if ease of analysis, or unbiased replication, or higher stability of the molecules used is of value, most of the benefit of increased *n* is accessible by using molecules with 80–120 randomized positions. This is particularly true since multiple molecules can sometimes collaborate to form a site,^{25} potentially conferring some of the benefits of longer randomized regions via shorter molecules. A numerical point is that experiments of moderate size, conducted with moderate random regions, can contain substantial RNAs. From a reference experiment in Figure 2, 1 mL of A_{260} = 1, we may find an active region containing 34 specified (or 68 half-specified) nucleotides within a randomized region 120 long (*m* = 4).

### The Importance of Experimental Scale

As we have argued above, the size of experiments is limited. Therefore it is useful to think about the yield from experiments conducted on differing scales.

Figures 3A and 3B depict a set of calculated outcomes. Figure 3A shows the size of the explicit motif present in a population of 1 mL of *n* = 80-mer randomized RNA at starting A_{260}, varied over six orders of magnitude. The vicinity of “typical” experiments (1 mL at an absorbance of 1) is marked with a vertical dotted reference line. Figure 3B rephrases this same relationship in terms of the initial number of RNA molecules, each of unique sequence. The vertical reference line marks another attainable vicinity for experiments (1 nmol of randomized molecules, close to 1 A_{260}).
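The correspondence between the two reference lines (1 mL at A_{260} = 1, and about 1 nmol of molecules) can be checked from values given in the Appendix: an extinction of 8500 per mol phosphate and 50 constant nucleotides flanking the *n* randomized ones. The sketch below additionally assumes a 1 cm path length and extinction units of M^{-1} cm^{-1}, which the text does not state. The function names are illustrative.

```python
AVOGADRO = 6.022e23          # molecules per mole
EXTINCTION_PER_PHOSPHATE = 8500.0  # from the Appendix; units assumed M^-1 cm^-1
CONSTANT_NT = 50             # constant nucleotides per molecule (Appendix)

def pool_moles(v_ml: float, a260: float, n: int) -> float:
    """Moles of (n + 50)-mer RNA in v_ml millilitres at absorbance a260,
    assuming a 1 cm path length."""
    phosphate_molar = a260 / EXTINCTION_PER_PHOSPHATE    # mol phosphate per litre
    molecule_molar = phosphate_molar / (n + CONSTANT_NT)  # mol RNA per litre
    return molecule_molar * v_ml / 1000.0

def pool_molecules(v_ml: float, a260: float, n: int) -> float:
    """Number of RNA molecules in the pool."""
    return pool_moles(v_ml, a260, n) * AVOGADRO

if __name__ == "__main__":
    moles = pool_moles(1.0, 1.0, 80)
    print(f"{moles * 1e9:.2f} nmol, {pool_molecules(1.0, 1.0, 80):.2e} molecules")
```

For *n* = 80 this gives a little under 1 nmol, on the order of 5 × 10^{14} molecules, consistent with the two reference lines marking nearly the same experiment.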

An experiment at laboratory scale might be conducted in 1000 μL. What is the reward for starting on a millionfold expanded scale, where the initial RNA solution will have to be mustered in many bathtubs, and heated and cooled with some exotic technology? As the Figures show, the answer is largely independent of modularity (at *n* = 80) and uniformly equal to addition of 10 nt, or 1.66 specified nucleotides per ten-fold increment in the experiment. Thus a “typical” experiment, which contained motifs of 33 nucleotides (*m* = 4, *n* = 80) in 1 mL of A_{260} = 1 would potentially yield sites of up to 43 specific nucleotides, if the logistics of experimentation in 1000 L could be surmounted. At a more practical level where real experimental decisions are usually confined, there would usually seem to be limited rationale for a ten-fold increase in scale. This analysis is similar from side to side in Figures 3, spanning six orders of magnitude in starting material.

This calculated effect of scale may seem counterintuitively small, but it is the unavoidable consequence of a simple notion in the first paragraph of CALCULATIONS above. That is to say, we need about 4-fold as much material to recover sequences one nucleotide longer. For 10-fold increases, 4^{x} = 10 and x = 1.66 nucleotides per order of magnitude, the factor that recurs throughout the calculations in Figure 3. One might say that the complexities of the present calculation (see Appendix) serve mainly to show that consideration of folding or sampling nonidealities, for example, does not significantly alter this outcome. And if a 1.66 nucleotide return for a tenfold increase in magnitude still seems small, then consider the implications for selection-amplification conducted on peptides.^{18} The effects of scale arise again below.
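The 1.66 figure follows directly from solving 4^{x} = 10. A short sketch of the arithmetic is below. The peptide line assumes an alphabet of the 20 standard amino acids, which is an assumption of this example rather than a value given in the text.

```python
import math

def monomers_per_decade(alphabet_size: int) -> float:
    """Solve alphabet_size**x = 10 for x: the number of extra specified
    monomers gained per 10-fold increase in the number of sequences sampled."""
    return 1.0 / math.log10(alphabet_size)

if __name__ == "__main__":
    rna = monomers_per_decade(4)
    print(f"RNA: {rna:.2f} nt per order of magnitude")
    print(f"RNA over six orders of magnitude: {6 * rna:.1f} nt")
    # Assumed alphabet of 20 amino acids (not stated in the text)
    print(f"peptides: {monomers_per_decade(20):.2f} residues per order of magnitude")
```

With a larger alphabet each added position multiplies the sequence space by more, so the return per tenfold increase in scale is smaller still.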

### The Importance of Motif Size, *l*

For some purposes we need to know the content of the population as a function of motif size, *l*. There are C × 4^{l} *l*-mer motifs in all (C is the number of ways to divide *l* nucleotides into *m* modules; see Appendix). Figure 4, showing the number of an average *l*-mer present versus *l*, is directed at this question. The horizontal reference line in the plot marks our standard for calculations. When *l*-mer motifs are present with probability = 0.5, there are 0.693 of the average *l*-mer amongst the total group of RNAs. For modularity *m* = 4, note that the reference line intersects the plot just below *l* = 33 specified nt, as also shown in Figure 3A.
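The value 0.693 is ln 2, as follows from the Poisson sampling used in the Appendix: if the average *l*-mer is present in λ copies, the probability that a given *l*-mer is present is 1 − e^{−λ}. A minimal sketch is below. The function names are illustrative, and the comparison with a 0.99 criterion is added only as an example of how the choice of threshold shifts the result.

```python
import math

def mean_copies(p_present: float) -> float:
    """Average copy number giving presence probability p_present (Poisson)."""
    return -math.log(1.0 - p_present)

def prob_present(mean: float) -> float:
    """Probability that a given motif occurs at least once (Poisson)."""
    return 1.0 - math.exp(-mean)

def nt_given_up(p_low: float, p_high: float) -> float:
    """Nucleotides by which the motif must shrink to raise the presence
    probability from p_low to p_high, at about 4-fold more copies per
    omitted nucleotide."""
    return math.log(mean_copies(p_high) / mean_copies(p_low)) / math.log(4.0)

if __name__ == "__main__":
    print(f"copies at P = 0.5:  {mean_copies(0.5):.3f}")
    print(f"copies at P = 0.99: {mean_copies(0.99):.3f}")
    print(f"nt given up, 0.5 -> 0.99: {nt_given_up(0.5, 0.99):.2f}")
```

Moving from a 0.5 to a 0.99 criterion therefore costs between one and two specified nucleotides, which is small compared with the motif sizes discussed here.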

However, the point of Figure 4 is in the unbroken slope above our previous reference. Shorter motifs are exponentially more abundant in the population. As we might hope intuitively, the slope of the lines for various modularities is similar, again approximately 1.6 nt/order of magnitude. Therefore a fixed population of randomized RNA sequences contains about an order of magnitude more copies of the average *l*-mer for each 1.6 nt decrease in motif size, *l*.

## Summing Up

As the first order of business in summing up, we reflect on our approximations. In particular, most errors tend to increase the apparent size of the accessible motif, *l*. A reckoning has been used (see the Appendix) in which RNA folds are treated as linear abstractions, rather than as real structures in which only certain interactions and certain covalent continuities will be allowed. The number of real structures (and the real *l*) will be smaller. In addition, in real experiments there is cryptic damage to synthetic DNA that prevents transcription;^{4} thus we may often overestimate the number of unique sequences in a selection. Furthermore, the addition of randomized nucleotides to active structures inactivates some or most of them.^{21} Thus some motifs counted as being present will be difficult to recover. This is particularly true of less stable (and therefore usually smaller) ones (O. Kovalchuke & M. Yarus, unpublished), which are more easily poisoned by alternative foldings with added sequences. In addition, motifs that exist close to the lines in Figures 2 and 3 exist as one or a few copies in a large population (Fig. 4). Since no real biochemical procedure can be carried out with 100% recovery, these motifs can be lost in stochastic accidents. However, as one backs down from the calculated lines, the number of predicted copies of a motif increases rapidly, about 4-fold per omitted nucleotide (Fig. 4). Even if the calculated motif is not present, one slightly smaller is likely to be found. If the motif is made slightly smaller yet, its presence is virtually assured. To say the same thing in another way, in eqn 1 when we talk of the size of motifs, we implicitly take the logarithm of the numbers that have these errors. Thus *l*, the size of the motif, is more resistant to error than (for example) the number of folds. Combining all these considerations, calculated lines should be taken as the upper boundary of a region of feasibility. Within this region of feasibility the likelihood of finding a motif increases rapidly as one heads downward in the Figures, and the number of fixed nucleotides in the motif, *l*, decreases.

The present results will now be applied to modern selections, and in addition, to the nature of the first RNAs in an RNA world. The latter extension requires a virtually ubiquitous assumption, therefore worth explicit statement. We assume that the probability of an RNA world is synonymous with the probability of the emergence of active RNA structures from mostly nonfunctional RNA populations having highly varied sequences, and call this the Axiom of Origin.

### Modularity

Modularity has large effects. When the fold is composed of a greater number of pieces, this greatly increases the number of folds of a given size, *l* (Fig. 1). Thus the predominant RNAs meeting a selection will quite likely form an active site by folding together separated pieces of its primary sequence. This means that it is likely that there will be spacers with no specific function in a newly selected molecule. It will be difficult to detect and eliminate these, because they will frequently be internal, between active sequences. No good molecular biological method exists for making a selected RNA smaller by random deletion, though deletion during chemical DNA synthesis should be possible.^{7} In fact, the facile creation of randomized deletions would usher in selections for the smallest functional units. This would add a useful new dimension to selection-amplification for activities of all kinds (see below).

One might wonder if intercalated sequences between the *m* essential modules would be recognizable in some way. For example, might spacers be less structured? However, even random sequences make about 40–60% base pairs.^{6} In addition, intermodule sequences in a selected molecule are not truly random, but have instead been repetitively chosen to allow the active part of the molecule to fold into its functional configuration. That is, even sequences whose only role is as spacers between functional modules will likely fold to give a purposeful, specific structure, difficult to distinguish from an active site by inspection. In fact, sometimes this tendency can be detected. An initially selected self-aminoacylating 95-mer with a modularity of 4 was reduced to a 29-mer with undiminished activity^{10} primarily by internal deletion of what seemed a uniformly structured parent.

Biological selection should resemble selection-amplification in these respects. That is, RNA structures will usually be created with intercalated dispensable regions. These virtually ubiquitous sequences are available raw material for further evolution; for example, for development of other intrinsic functions and interaction with other RNAs. Furthermore, conditions that alter the stability of ribonucleotide folds and therefore the practical modularity are probably crucial to early molecular evolution, though not usually discussed in this context.

### Size

Overall confidence in these calculations is somewhat increased by the observation that most selected RNAs are in fact composed of a number of fixed nucleotides smaller than suggested by Figures 2 and 3. However, modularity's effects imply that total apparent size will likely be larger than the real size, making it difficult to know precisely how closely real selections approach calculated boundaries.

Above we played with the notion of bathtubs filled with concentrated RNA solutions. However, the evolutionary significance of these calculations lies in the other direction. At the left in Figure 3B, we read that even in experiments 6 orders of magnitude smaller than typical laboratory regimes, when we have only 10^{9} molecules to select among (*n* = 80, *m* = 4), active structures containing up to 23 specified nucleotides would be plausible. Since relatively capable ribozymes, for example, the hammerhead^{24} and even a selfaminoacylating ribozyme,^{10} are significantly below this range, proficient RNA structures can appear in populations containing only femtomoles of RNA. It seems quite likely that this ability to derive structures of substantial size, even from small molecular populations, was crucial to the initiation of an ancient biology based on oligoribonucleotides. For the beginnings of an RNA world, the Axiom of Origin suggests that we need to estimate the minimal size for a productive RNA population, and we return to this topic below.

The effects of modularity reinforce the above conclusion about population size. Modularity becomes more important as the size of the oligoribonucleotide population shrinks. This is readily visible in Figures 3. Because of the form of these results (1.66 motif nucleotides per size factor of 10), modularity adds a similar absolute increase in motif size in populations of every size. Therefore the proportionate impact of modularity grows as populations get smaller. For our index population from a modern selection-amplification (1 nmol RNA), modularity increase from m=1 to 4 adds about 19% to the size of the accessible motif. For a hypothetical ancient RNA population evolving toward biological function (1 fmol RNA), Figure 3B shows that the same change in modularity likely adds about 31% to the motif, and correspondingly to the capability of RNAs that might appear. If only 1 attomole RNA were available (1 amol = 6 × 10^{5} molecules), this same modularity increase would add 42% to the number of nucleotides specified in the accessible motif.

Thus we urge two conclusions—firstly, the ancient RNAs that initiated an RNA world were probably yet more modular in structure than those selected in modern experiments. Secondly, even with the equivalent of only an attomole (6 × 10^{5}) of 80-mer RNAs on hand, substantial modular structures (*m*=4) containing about 18 completely specified nucleotides are near the apparent upper limit of complexity.

### A Zeptomole World

How many ribonucleotides are required to specify useful catalytic sites? A convincing general answer to this question is presently beyond us, so we refer to observed structures. Simple activities can be seen in very small molecules. For example, RNA can abet its own hydrolytic instability by binding divalent metals to 7 specific nucleotides, GAAA/UUU.^{17} However, a more relevant reference may be the 29-nucleotide self-aminoacylating RNA.^{10} This RNA forms a Michaelis complex with a small molecule substrate and accelerates a reaction not frequent in the natural ribozyme repertoire, involving carbonyl chemistry. Substitution and truncation experiments^{11,28} suggest that, among 29 total nucleotides, activity requires ≥ 11 specific nucleotides (this is the minimal number required to create the active structure) but ≤ 19 nucleotides (this is maximal, conserving every nucleotide required in the active structure). Thus we need to calculate the size of an RNA pool that contains motifs of 11–19 nucleotides.

If Figure 3B is extrapolated at 1.66 nucleotides/order, structures of this complexity could be expected from zeptomoles of randomized 80-mer RNA molecules (1 zeptomole = 10^{−21} mol = 602 molecules, 33 attograms). Thus, surprisingly, catalytic RNAs might appear in unexpectedly tiny RNA populations, beginning at about one thousand times less than that in a modern bacterium. Accordingly, we conclude that sub-attomole RNA populations (that is, only zeptomoles of RNA) may have been sufficient for emergence of an RNA world. As a result, an RNA world is more probable, perhaps much more probable, than usually considered. However, we cannot be certain this ribozyme example can be generalized; as pointed out above, selection experiments will not easily find minimal capable RNAs. Further, we need to consider the robustness of the numerical argument that led to this conclusion. There are arguments both for and against its accuracy.
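The extrapolation can be sketched numerically. The example below anchors on the reference point quoted earlier in the text (about 33 specified nucleotides for *m* = 4, *n* = 80, at roughly 1 nmol of RNA) and applies the slope of one nucleotide per 4-fold change in population. Treating the slope as exactly constant over twelve orders of magnitude is an assumption of this sketch, and the function names are illustrative.

```python
import math

AVOGADRO = 6.022e23
ANCHOR_MOLES = 1e-9       # reference experiment: about 1 nmol of RNA
ANCHOR_MOTIF_NT = 33.0    # motif size quoted in the text for m = 4, n = 80
NT_PER_DECADE = 1.0 / math.log10(4)   # about 1.66 nt per order of magnitude

def accessible_motif(moles: float) -> float:
    """Extrapolated motif size (specified nt) for a pool of the given size,
    assuming a constant slope away from the anchor point."""
    return ANCHOR_MOTIF_NT + NT_PER_DECADE * math.log10(moles / ANCHOR_MOLES)

def molecules(moles: float) -> float:
    """Number of molecules in the given number of moles."""
    return moles * AVOGADRO

if __name__ == "__main__":
    for name, moles in [("nmol", 1e-9), ("fmol", 1e-15),
                        ("amol", 1e-18), ("zmol", 1e-21)]:
        print(f"1 {name}: {molecules(moles):.3g} molecules, "
              f"motif up to about {accessible_motif(moles):.0f} nt")
```

On this extrapolation a femtomole gives about 23 nt and an attomole about 18 nt, matching the values read from Figure 3B above, and a single zeptomole lands near 13 nt, inside the 11–19 nucleotide window of the self-aminoacylating RNA.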

### An Argument For

The zeptomole world at first may seem intimately tied to the method and example used for this analysis (CALCULATION, above, and Appendix), and therefore subject to dramatic later revision. Instead, this argument is more independent of numerical details than first appears. It requires only the notion that the accessible motif diminishes by 1.66 nt/order of magnitude in the population size. As pointed out above, this result is deducible from elementary sampling considerations. The CALCULATION arguably only shows that the imposition of other kinds of size-dependence (in the form of folding and sampling) does not submerge the fundamental relationship.

We therefore take the selection-amplification of larger functional RNAs, like the class 1 ligase, as an alternative anchor for the numerical argument. This ligase ribozyme was isolated from a pool with 220 randomized nucleotides.^{1} Consistent with the above discussion, it was later reduced to a 112-mer with a 93-nucleotide catalytic region,^{3} and also an active 97-mer.^{15} This may be a highly modular structure, and it is not certain how many nucleotides are essential in the present sense. However, this number is surely a substantial fraction of 93.^{3} If we correct for the 10^{13}-fold difference in the magnitude of these experiments and the zeptomole range (by subtracting 13 × 1.66 ≈ 22 nt), we reproduce the original conclusion. Pools containing zeptomoles of RNA should yield active molecules with tens of essential nucleotides. Correction for the large randomized region leaves the conclusion intact. Accordingly, one needs only the slope in Figures 3 to make a zeptomole world plausible.

### An Argument Against

The satisfaction of the Axiom of Origin by zeptomoles of 80-mers depends on realistic estimation of the number of sub-folds from a randomized region. A skeptical view is that a zeptomole world presses our calculation of the number of folds beyond its limits (see the Appendix). We acknowledge above that the present calculations overestimate real folds, and therefore overestimate the versatility of zeptomoles of RNA. Said another way, taking these calculations as upper boundaries for *l* will at some point cease to be useful as the boundary *l*-mer becomes smaller. Perhaps even the 1.66 nt/order of magnitude rule is inaccurate in the extreme limiting cases we need (thereby invalidating the second argument above), though we have confirmed many of the concepts required by computation using small RNA populations (Appendix). The best remedy for these uncertainties seems to be an experimental measurement of the effective number of *l*-mer folds arising from *n* ribonucleotides, and this work is underway.

### Comparison with other Size Data

Nevertheless, RNAs in the size range calculated here are likely to be the principal agents of very early evolution. There is a remarkable convergence of independent quantitative evolutionary arguments on molecules of similar size. These present calculations show that structures spanning a few tens of specific nucleotides can arise from small amounts of RNA. This statement acceptably summarizes the results even if the initial number of molecules is stretched to the upper limits of conceivability. In addition, base pairing is error-prone, and this is necessarily reflected in recently selected RNA catalyzed, RNA-templated replication,^{15} which has a substantial error rate. Ancient replicators would presumably have begun as unsophisticated non-proofreading catalysts, inescapably limited in this way. Likely error rates limit the plausible size of the RNA accurately reproduced by such replicators to a few tens of nucleotides.^{16} Finally, if unordered polymerization of pre-activated nucleotides is carried out on mineral surfaces, RNAs of a few times 10 nucleotides in size can appear.^{5} On these several grounds, a primordial RNA cell (a ribocyte) would almost surely begin with RNAs of this size, though it might evolve the capabilities required to maintain larger molecules. Therefore, questions about the origin and initial stages of an RNA world can be focused—is life, even in an initial crude form, possible for an assemblage of 15 to 35-mer active sites? If the answer is “yes”, then the Axiom of Origin and the above calculations suggest that the RNA world did not necessarily require protracted gestation, but could have appeared quickly and inevitably.

## Appendix

When sampling equally likely sequences for an initial pool for selection-amplification, the Poisson distribution describes the probability that a sequence will be missed. To show this, we first show that the Poisson is appropriate to choice between the relevant numbers of alternative sequences. We then show how we count folds and estimate the largest *l*-mer that is likely to occur. Finally, we correct Poisson statistics for sampling errors in populations of real sequences.

### Appropriateness

The calculation of the likelihood of missing a sequence is different for few and many sequences. Thus, we first calculate this probability by a simple magnitude- and distribution-independent method, then show that this is equivalent to the Poisson for relevant cases.

Take k sequences at random from a total of q equally likely choices (q = 4^{l}, where l is sequence length): the probability that one choice yields any given sequence is 1/q. The probability that one choice misses any sequence is [1−1/q]. The probability that we have still missed any sequence even after k tries is [1−1/q]^{k}. Call this quantity Prk:

$$\mathrm{Prk}(q,k) = \left[1 - \frac{1}{q}\right]^{k}$$

For Poisson sampling, for the same process we would predict:

$$\mathrm{Prk}_{\mathrm{Poisson}}(q,k) = e^{-k/q}$$

Prk can be expanded so its limiting behavior is visible:

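$$\mathrm{Prk} = 1 + k\left(-\frac{1}{q} - \frac{1}{2q^{2}} - \frac{1}{3q^{3}} \ldots\right) + \frac{k^{2}}{2}\left(\frac{1}{q^{2}} + \frac{1}{q^{3}} + \frac{11}{12q^{4}} \ldots\right) + \frac{k^{3}}{6}\left(-\frac{1}{q^{3}} - \frac{3}{2q^{4}} - \frac{7}{4q^{5}} \ldots\right) + \ldots$$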

As q (the number of things chosen from) gets larger,

$$\mathrm{Prk} \rightarrow 1 - \frac{k}{q} + \frac{k^{2}}{2q^{2}} - \frac{k^{3}}{6q^{3}} + \ldots$$

This is the standard expansion for e^{−k/q}; that is, Prk and Prk_{Poisson} will converge as q gets larger. The equality is good even at moderate q: Figure 5 shows that, even if we only choose between 64 sequences (trinucleotides; we are usually choosing among 10^{14} sequences), the Poisson is an excellent approximation. In fact, in cases of real interest the two lines in Figure 5 are indistinguishable over tens of orders of magnitude. This result corresponds to expectation for Poisson sampling: that is, it will apply when the events concerned have small probabilities that are constant over the domain of interest. In fact, for real sequences below we will need to correct for departures from equal probabilities.
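The convergence of the exact expression [1 − 1/q]^{k} on the Poisson form e^{−k/q} is easy to check numerically. The sketch below compares the two for trinucleotides (q = 64) and for a larger, arbitrarily chosen sequence space (q = 4^{10}). The function names and the range of k scanned are choices made for this example.

```python
import math

def prk_exact(q: int, k: int) -> float:
    """Probability that a given sequence is missed in k picks from q kinds."""
    return (1.0 - 1.0 / q) ** k

def prk_poisson(q: int, k: int) -> float:
    """Poisson approximation to the same probability."""
    return math.exp(-k / q)

def max_abs_difference(q: int, k_values) -> float:
    """Largest absolute gap between the two expressions over the given k."""
    return max(abs(prk_exact(q, k) - prk_poisson(q, k)) for k in k_values)

if __name__ == "__main__":
    q_small = 4 ** 3    # trinucleotides
    q_large = 4 ** 10   # an arbitrary larger example
    print(f"q = {q_small}: max gap "
          f"{max_abs_difference(q_small, range(0, 10 * q_small)):.4f}")
    ks = [q_large * i // 10 for i in range(0, 101)]  # k from 0 to 10q
    print(f"q = {q_large}: max gap {max_abs_difference(q_large, ks):.2e}")
```

Even at q = 64 the two probabilities differ by well under 0.01 for every k, and the gap shrinks roughly in proportion to 1/q as the sequence space grows.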

### The Accessible *l*-mer

For Poisson sampling, we follow Ciesiolka, et al^{2} to show that motifs of *l* nucleotides are present within tracts of *n* randomized nucleotides as follows:

where P is the probability that any motif of l nucleotides will be present, *T(l, m, n)* is the total number of folds containing *l* nucleotides in *m* modules within nmer randomized regions of all molecules in the experiment. *C(l, m)* is the number of configurations of the motif (defined below). While *l* appears in eqn. 1 as an implicit transcendental function, such equations are easily and rapidly solved by mathematical software. Mathcad v7 (Mathsoft, Inc) was used for all results in the text above.

We need

where u is the number of unique molecules and O is the number of ways of getting an *l*-mer fold within the randomized *n*-mer in every molecule. Because every molecule in typical selections has a unique sequence, u is also the total number of RNA molecules used at the outset. So, in “bench units”:

where *v* ml of a solution of (*n*+50)-mer RNA at absorbance A_{260} are used for selection. The RNA (extinction per mol phosphate = 8500) is assumed to have 50 constant nucleotides in addition to its *n* randomized nt. Variations from 50 have little effect on the outcome. The crux of the calculation is the effective number of folds,^{21} *O*:

$$O = \frac{R\,F\,C}{S} \qquad (4)$$

where *R* is the redundancy (the number of variations of an *l*-mer that will satisfy the selection), *F* is the number of folds (ways of disposing *l* nucleotides divided into *m* modules within *n* total nt) and *C* is configurations (possible ways of dividing the *l* motif nucleotides into *m* modules). *S* is a sampling correction that corrects departures from the Poisson expectation to give an ‘effective number’ of folds (see last section).
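As a rough check on the “bench units”, the sketch below converts an absorbance and a volume into a count of molecules. It assumes a 1 cm path length, the extinction of 8500 per mol phosphate quoted above, and Avogadro's number; the function name and the example inputs (A_{260} = 1, 1 ml, *n* = 100) are illustrative only.

```python
AVOGADRO = 6.022e23                 # molecules per mole
EXTINCTION_PER_PHOSPHATE = 8500.0   # per molar phosphate; 1 cm path length assumed
CONSTANT_NT = 50                    # constant nucleotides flanking the randomized tract

def unique_molecules(a260: float, volume_ml: float, n: int) -> float:
    """Estimate u, the number of RNA molecules in the starting pool."""
    molar_nucleotide = a260 / EXTINCTION_PER_PHOSPHATE      # mol nucleotide per litre
    mol_nucleotide = molar_nucleotide * volume_ml * 1e-3    # mol nucleotide in the sample
    mol_rna = mol_nucleotide / (n + CONSTANT_NT)            # mol of (n + 50)-mer chains
    return mol_rna * AVOGADRO

if __name__ == "__main__":
    print(f"{unique_molecules(1.0, 1.0, 100):.3e} molecules")
```

With these illustrative inputs the estimate is roughly 4.7 × 10^{14} molecules.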

*R*, the redundancy, corrects for positions in the motif that tolerate more than one nucleotide. If a required position may be occupied by any of the natural 4 nucleotides, *R* is increased 4-fold (satisfactory sequences are 4 times as frequent), compared to a value of 1 for a unique nucleotide at that position. If only a purine will serve at that position, *R* is increased 2-fold, and so on. With many positions free to vary somewhat, *R* is a multi-term product that can become a large number: Sabeti et al^{21} estimate *R* for the hammerhead ribozyme as 5 × 10^{12}. In addition to nucleotides that are free to vary, there are smaller factors because functional sequences can sometimes be permuted on the linear sequence. Assume that 3 indivisible sequence segments a, b and c come together to make the motif (a structure with modularity *m* = 3). *R* would be multiplied by a permutation factor of 3 if __abc__, __cab__ and __bca__ are all active molecules.
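The bookkeeping for *R* can be sketched as a product over positions. The example below is illustrative only: it describes a motif with IUPAC ambiguity codes (an assumption of this sketch, not a convention used in the text), multiplies the number of nucleotides tolerated at each position, and applies an optional permutation factor.

```python
from math import prod

# Number of natural nucleotides that each IUPAC code stands for
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "U": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def redundancy(motif: str, permutations: int = 1) -> int:
    """R = (product of per-position degeneracies) x (number of active permutations)."""
    return permutations * prod(IUPAC_DEGENERACY[base] for base in motif.upper())

if __name__ == "__main__":
    print(redundancy("ACGU"))                  # every position fixed
    print(redundancy("ARN"))                   # one purine-only and one free position
    print(redundancy("ARN", permutations=3))   # plus three active circular permutations
```

A motif of fixed nucleotides gives *R* = 1, while the hypothetical motif `ARN` gives 1 × 2 × 4 = 8, or 24 if three permuted forms are all active.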

Having troubled to define *R*, we now set it equal to one for all calculations. This convenience frees discussion from details of a motif's tolerance for substitution. We do not know such details in general. Therefore, to make progress we choose to speak in terms of structures composed of nucleotides all of whose identities are unique. As a reminder of this decision, in the text we write of RNAs “composed of fixed nucleotides”, or similar language. The smaller factor due to permutation is quite real, because some motifs can be selected in permuted form.^{14} However, if a selected RNA makes use of the unique reactivity or unique structure of its 3' or 5' ends or both,^{11} permutation becomes implausible. Overall, permutation has been infrequently observed in real selections, and in that respect we can set *R* = 1 with a good conscience.

*F* is the number of folds. A liberal attitude toward *F* is apt, because the number of folds with a ligand __present__ is the quantity of interest. Therefore, unseen bonds (with a ligand or substrate) connect the *m* sequences in the active site, stabilizing structures not otherwise stable.^{8} Consider the molecule as a sequence of nucleotides in the motif and nucleotides in the spacer(s). We multiply the ways of dividing the spacers by the (independent) ways of dividing the modules. First count the number of ways to dispose *m* modules containing a total of *l* nucleotides on a sequence of *n* nucleotides. In order to avoid modules with no nucleotides between them (which would count multiple times those cases where a larger module is the sum of the separate module sizes), we maintain ≥1 nucleotide in each spacer. There are three cases.

First: modules may be internal, with spacer sequences at both ends: •—••• ——••———•• where the diagram shows 3 modules (dashes) placed among 8 spacer nucleotides (dots). This yields a molecule with *m*+1 spacer regions; there are *l* nucleotides in modules and (*n−l*) nucleotides in spacers. This contributes a term to *F*:

$$\frac{(n - l - 1)!}{m!\,(n - l - m - 1)!} \qquad (5)$$

which corresponds to choosing *m* places for modules from the *n−l*−1 spaces that will still leave at least one nucleotide in each spacer.

Second: one module may be at an end: —•• ——• ———••••• thereby dividing the molecule with *m* spacers. This term is multiplied by 2 because for every fold, there will be versions with a module at the 5' end: —•• ——• ———••• •• and the 3' end: ••• ••———• ——•• —.

$$\frac{2\,(n - l - 1)!}{(m - 1)!\,(n - l - m)!} \qquad (6)$$

Third: there are folds with modules at both ends: —••• ——••• ••——— containing *m*−1 spacers and being composed by placing *m*−2 modules in *n−l*−1 places:

$$\frac{(n - l - 1)!}{(m - 2)!\,(n - l - m + 1)!} \qquad (7)$$

This last term is only relevant when *m* ≥ 2. The full equation for *F* contains the sum of the terms in eqn 5, 6 and 7.
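The three cases can be checked against an explicit enumeration for small molecules. This is a minimal sketch with illustrative function names: it reads eqns 5 to 7 as binomial coefficients on the *n−l*−1 junctions between spacer nucleotides, and the brute-force counter fixes the module sizes (one configuration) and requires at least one spacer nucleotide between neighbouring modules.

```python
from itertools import combinations
from math import comb

def binom(a: int, k: int) -> int:
    """Binomial coefficient that is zero outside 0 <= k <= a."""
    return comb(a, k) if 0 <= k <= a else 0

def folds(n: int, l: int, m: int) -> int:
    """F: placements of m modules (l nt in total, fixed sizes) within n nt."""
    junctions = n - l - 1                   # junctions between the n - l spacer nucleotides
    internal = binom(junctions, m)          # eqn 5: spacers at both ends
    one_end = 2 * binom(junctions, m - 1)   # eqn 6: a module at the 5' or the 3' end
    both_ends = binom(junctions, m - 2)     # eqn 7: modules at both ends (zero if m < 2)
    return internal + one_end + both_ends

def folds_brute_force(n: int, sizes: tuple) -> int:
    """Count the same placements by enumerating module start positions."""
    count = 0
    for starts in combinations(range(n), len(sizes)):
        ends = [start + size for start, size in zip(starts, sizes)]
        fits = ends[-1] <= n
        spaced = all(starts[i + 1] > ends[i] for i in range(len(sizes) - 1))
        if fits and spaced:
            count += 1
    return count

if __name__ == "__main__":
    print(folds(8, 4, 2), folds_brute_force(8, (2, 2)))
```

By Pascal's rule the sum of the three terms collapses to the single binomial coefficient comb(*n − l* + 1, *m*), which is convenient for quick estimates.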

However, this does not yet enumerate all possibilities. The term C, the configurations (eqn 4), accounts for the fact that there are many folds with the same *n, l* and *m*, because there are multiple ways to divide *l* nucleotides into *m* modules even with fixed spacers. For example, a motif containing *l* = 20 nucleotides in *m* = 3 modules can occur as rather different RNAs constructed by folding together sequence pieces of size 12nt : 3nt : 5nt, or 4nt : 13nt : 3nt, or in yet other ways.

$$C = \frac{(l - 1)!}{(m - 1)!\,(l - m)!} \qquad (8)$$

*C* counts ways of choosing (*m*−1) places to interrupt the (*l*−1) linkages between *l* nt, thereby creating *m* modules. *C* must therefore multiply *F* to account for the different distributions into modules: *F* × *C* = (eqn 5 + eqn 6 + eqn 7) × (eqn 8). Note that in calculations of the mean number of *l*-mers or their probability *P*, *C* cancels (compare eqn 1) because it is in both *O* and the total number of *l*-mers, *C* × 4^{l}. For actual calculations, the Gamma function (the continuous equivalent of the factorial) was used in order to calculate outcomes with non-integral *n* and *l*.
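The count in eqn 8 can be illustrated by listing the configurations directly. The sketch below uses illustrative names: it evaluates *C* as a binomial coefficient, enumerates the ordered module sizes it counts, and shows one way (an assumption about implementation, via `math.lgamma`) to replace factorials with Gamma functions for non-integral arguments.

```python
from math import comb, exp, lgamma

def configurations(l: int, m: int) -> int:
    """C (eqn 8): ways to cut l motif nucleotides into m non-empty modules."""
    return comb(l - 1, m - 1)

def module_sizes(l: int, m: int):
    """Yield every ordered tuple of m positive module sizes that sums to l."""
    if m == 1:
        yield (l,)
        return
    for first in range(1, l - m + 2):
        for rest in module_sizes(l - first, m - 1):
            yield (first,) + rest

def configurations_gamma(l: float, m: float) -> float:
    """Same quantity with each factorial x! written as Gamma(x + 1)."""
    return exp(lgamma(l) - lgamma(m) - lgamma(l - m + 1))

if __name__ == "__main__":
    sizes = list(module_sizes(20, 3))
    print(configurations(20, 3), len(sizes))
    print((12, 3, 5) in sizes, (4, 13, 3) in sizes)
    print(configurations_gamma(20.5, 3))
```

For *l* = 20 and *m* = 3 this gives 171 configurations, among them the 12:3:5 and 4:13:3 divisions used as examples above.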

### Relation to Previous Work

Ciesiolka et al^{2} did not include modularity greater than 1, and therefore needed neither combinatorics nor sampling correction. Our results are similar where they overlap. Sabeti et al^{21} use a simpler form for *F* and do not include calculation of *C* or sampling corrections. Their number of folds (that is, folds without configurations) is always larger than the comparable result here, because it includes multiply counted adjacent modules with no spacer nucleotides intervening. Therefore our results diverge from theirs most when there are many modules, and when the total nucleotides in modules, *l*, approaches the size, *n*, of the randomized region. These conditions maximize module interfaces. Conversely, our results tend to the same limits as those of Sabeti et al^{21} at small modularity and large randomized regions, because these conditions minimize folds with module-module interfaces.

### Sampling Factors and Folds, the Determination of *S*

We now consider departures from Poisson sampling in real sequence populations. We have studied these effects using purpose-written software implemented in Perl v5.6, running under Irix on an SGI Octane with a 300 MHz MIPS R12000 processor and 1 GB memory. We explicitly generated random nucleotide sequences using the Math::Random module, made all the combinatorial folds implied by the equations just above, and classified the *l*-mer motifs that actually occurred. The real probability of motif occurrence was compared to the Poisson expectation. The major finding is that it appears possible to make useful calculations with moderate corrections.

Surprisingly, it is easy to get some motifs, and much more difficult to get others. Not all motifs of size *l* and modularity *m* are sampled at the same rate. For example, consider 10-mer *m* = 2 folds. Only pentamers must be found to complete a 5:5 nt configuration, while nonamers must be found to complete a 9:1 configuration. This means that the probability of the absence of decamers as a whole will not decline exponentially with trials (as for the Poisson). Instead it declines as a sum of exponentials, one for each configuration. With a greater number of sequences sampled, the curve becomes asymptotic to the (possibly much lower) slope for the rarest fold(s). We have used a definition of representation that requires the probability of a fold's absence to be 0.5 (*P* = 0.5). This limits calculations to a range in which we can accurately use an exponential approximation (Fig. 6A).

A second difficulty arises because we repetitively sample existing nucleotide sequences. Combinatorial counting of folds, as above, reuses the same subsection of the sequence many times. Pieces are recombined with other parts of the same sequence to make varied folds, each potentially counted as a separate attempt to find a motif of *l* nucleotides. Certain nucleotide sequences therefore recur. Accordingly, the probability of obtaining all sequences is not equal. Instead, because particular sequences recur, more than the expected Poisson number of sequences is needed to raise the probability that __any__ *l*-mer motif can be observed. Said another way, the accessible *l*-mer is somewhat smaller than for ideal Poisson sampling.

With calculation limited to *P* = 0.5, then over the accessible range of *n, l* and *m*, the probability of absence of an *l*-mer, as the number of sequences sampled increases, can be described with an exponential (Fig. 6A). Therefore, we can define a factor *S* (the ratio of the Poisson slope to that actually observed) by which the number of sequences must be increased in order to reach the Poisson probability that an *l*-mer will be absent. Over the range of our present calculations, the correction *S* implies that 1 to 6.4-fold as many sequences are required as for ideal Poisson sampling.
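A toy version of this comparison can be run exhaustively for a very small case. The sketch below is not the Perl program described above: it assumes a single target motif in a single configuration, replaces random sampling with an exact enumeration of all 4^{n} sequences, and defines the correction as the ratio of the Poisson log-slope (folds per molecule divided by 4^{l}) to the observed log-slope per molecule. All names are illustrative, and the fold count uses the closed form comb(*n − l* + 1, *m*) for the sum of eqns 5 to 7.

```python
from itertools import product
from math import comb, log

def contains_fold(seq: str, pieces: tuple) -> bool:
    """True if the pieces occur in order with at least one spacer nt between them."""
    start = 0
    for piece in pieces:
        pos = seq.find(piece, start)          # leftmost match leaves the most room
        if pos < 0:
            return False
        start = pos + len(piece) + 1          # skip at least one spacer nucleotide
    return True

def sampling_correction(n: int, pieces: tuple) -> float:
    """Ratio of the Poisson slope to the slope found by exact enumeration."""
    l = sum(len(piece) for piece in pieces)
    m = len(pieces)
    poisson_slope = comb(n - l + 1, m) / 4 ** l       # mean folds matching the target
    hits = sum(contains_fold("".join(s), pieces) for s in product("ACGU", repeat=n))
    present = hits / 4 ** n                           # true chance per molecule
    observed_slope = -log(1.0 - present)
    return poisson_slope / observed_slope

if __name__ == "__main__":
    print(round(sampling_correction(8, ("AC", "GU")), 3))
```

Even in this tiny case the ratio is slightly above 1, because overlapping folds reuse the same nucleotides and so are not independent trials.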

The modularity *m* and the size of the randomized region *n* are the most influential variables. Departure from the Poisson worsens as modularity increases because the combinatorial use of high modularities re-samples sequences more intensively. Longer randomized regions allow more folds (Fig. 1) so nonideality also increases with *n*. However, Figure 6B shows that these effects appear regular, and therefore readily predicted. The required correction factor grows linearly with increasing size for the randomized region. The data of Figure 6B were extrapolated, e.g., to *n* = 80, *m* = 4, when needed for calculations.

The departure from Poisson sampling is also worst when *l* is small, because this also tends to maximize reuse of sequences (Fig. 6C). However, as the Figure shows, when *l* is an appreciable fraction of *n* this correction is both small and slowly varying. Therefore we have used the same correction for all *l* (that implicit in the calculation for *n* and *m*, as above). The resulting variation in *l* over the range of larger motifs produces a variation in the estimated size of the motif of < 3%, insignificant by comparison with other approximations.

Explicit sampling corrections for larger randomized regions and modularities were beyond the limits of available computer storage and computational speed (compare Fig. 1). As an example related to storage, there are about 10^{10} 20-mer folds of modularity 4 in one randomized 150-mer. To explicitly store the motifs from one randomized molecule and their addresses would therefore require about 340 gigabytes. For these reasons, our corrections were calculated by extrapolation from computable cases, as above. We hope the regularities observed in computation (Fig. 6) will encourage an analytical solution to this interesting sampling problem.
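The fold count quoted in this example can be reproduced from the combinatorial terms. The short sketch below assumes *R* = 1 and uses comb(*n − l* + 1, *m*) as the closed form of eqn 5 + eqn 6 + eqn 7; the bytes-per-fold figure is simply what the 340 gigabyte estimate implies, not a number given in the text.

```python
from math import comb

n, l, m = 150, 20, 4
folds = comb(n - l + 1, m)              # F, closed form of eqn 5 + eqn 6 + eqn 7
configurations = comb(l - 1, m - 1)     # C, eqn 8
folds_per_molecule = folds * configurations
bytes_per_fold = 340e9 / folds_per_molecule   # implied by the 340 gigabyte estimate

print(f"{folds_per_molecule:.3e} folds, about {bytes_per_fold:.0f} bytes each")
```

This gives about 1.1 × 10^{10} folds per molecule, consistent with the figure above, and roughly 30 bytes per stored motif and address.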

For calculations, *O* in eqn. 4 is divided by sampling correction factors, *S*, to give an effective number of folds. With the addition of sampling corrections, all terms in eqn 1 above have an explicit form. Initial RNA populations can now be varied in size and design, and the effects on the presence of given motifs in a starting RNA pool can be estimated (main text above).
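The whole calculation can be sketched in a few lines. This is a minimal illustration, not the Mathcad worksheet used for the results: it assumes the Poisson form *P* = 1 − exp(−*T*/(*C* × 4^{l})) for eqn 1 with *T* = *uO* and *O* = *RFC*/*S*, evaluates *F* with Gamma functions so that *l* can be non-integral, and solves for *l* by bisection. Function names and the example inputs are hypothetical.

```python
from math import exp, lgamma

def gbinom(x: float, k: int) -> float:
    """Binomial coefficient for real x and integer k, via the Gamma function."""
    if k < 0:
        return 0.0
    return exp(lgamma(x + 1) - lgamma(k + 1) - lgamma(x - k + 1))

def folds(n: float, l: float, m: int) -> float:
    """F as the sum of eqns 5, 6 and 7, continuous in n and l."""
    a = n - l - 1
    return gbinom(a, m) + 2 * gbinom(a, m - 1) + gbinom(a, m - 2)

def prob_present(l, m, n, u, s=1.0, r=1.0):
    """Assumed Poisson form of eqn 1; C cancels between O and C * 4**l."""
    mean_copies = u * r * folds(n, l, m) / (s * 4.0 ** l)
    return 1.0 - exp(-mean_copies)

def accessible_l(m, n, u, s=1.0, p=0.5, r=1.0):
    """Bisect for the motif size l at which the presence probability equals p."""
    lo, hi = float(m), n - m - 1.0      # keeps every Gamma argument positive
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if prob_present(mid, m, n, u, s, r) > p:
            lo = mid                    # motif still likely present: try a larger l
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Hypothetical inputs: 1e15 molecules, n = 100, m = 4, sampling correction S = 6
    print(round(accessible_l(4, 100, 1e15, s=6.0), 2))
```

Because the presence probability falls monotonically as *l* grows, bisection is sufficient; a general root finder would serve equally well.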

## References

1. Bartel DP, Szostak JW. Isolation of new ribozymes from a large pool of random sequences. Science. 1993;261:1411–1418. [PubMed: 7690155]
2. Ciesiolka J, Illangasekare M, Majerfeld I, et al. Affinity selection-amplification from randomized ribooligonucleotide pools. Methods Enzymol. 1996;267:315–335. [PubMed: 8743325]
3. Ekland EH, Szostak JW, Bartel DP. Structurally complex and highly active RNA ligase derived from random RNA sequences. Science. 1995;269:364–370. [PubMed: 7618102]
4. Ellington A, Szostak JW. In vitro selection of RNAs that bind specific ligands. Nature. 1990;346:818–822. [PubMed: 1697402]
5. Ferris JP, Hill AR, Liu R, et al. Synthesis of long prebiotic oligomers on mineral surfaces. Nature. 1996;381:59–61. [PubMed: 8609988]
6. Fresco JR, Alberts BM, Doty P. Some molecular details of the secondary structure of ribonucleic acids. Nature. 1960;188:98–101. [PubMed: 13701785]
7. Hecker KH, Rill RL. Error analysis of chemically synthesized polynucleotides. Biotechniques. 1998;24:256–260. [PubMed: 9494726]
8. Hermann T, Patel DJ. Adaptive recognition by nucleic acid aptamers. Science. 2000;287:820–825. [PubMed: 10657289]
9. Huang F, Bugg CW, Yarus M. RNA-catalyzed CoA, NAD and FAD synthesis from phosphopantetheine, NMN and FMN. Biochemistry. 2000;39:15548–15555. [PubMed: 11112541]
10. Illangasekare M, Yarus M. A tiny RNA that catalyzes both aminoacyl-RNA and peptidyl-RNA synthesis. RNA. 1999;5:1482–1489. [PMC free article: PMC1369869] [PubMed: 10580476]
11. Illangasekare M, Kovalchuke O, Yarus M. Essential structures of a self-aminoacylating RNA. J Mol Biol. 1997;274:519–529. [PubMed: 9417932]
12. Illangasekare M, Sanchez G, Nickles T, et al. Aminoacyl-RNA synthesis catalyzed by an RNA. Science. 1995;267:643–647. [PubMed: 7530860]
13. Jadhav VR, Yarus M. Acyl-CoAs from coenzyme ribozymes. Biochemistry. 2002;41:723–729. [PubMed: 11790093]
14. Jenison RD, Gill SC, Pardi A, et al. High-resolution molecular discrimination by RNA. Science. 1994;263:1425–1429. [PubMed: 7510417]
15. Johnston WK, Unrau PJ, Lawrence MS, et al. RNA-catalyzed RNA polymerization: accurate and general RNA-templated primer extension. Science. 2001;292:1319–1325. [PubMed: 11358999]
16. Joyce GF, Orgel LE. Prospects for understanding the origin of the RNA world. In: Gesteland RF, Atkins JF, eds. The RNA World. Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 1993:1–25.
17. Kazakov S, Altman S. A trinucleotide can promote metal ion-dependent specific cleavage of RNA. Proc Natl Acad Sci USA. 1992;89:7939–7943. [PMC free article: PMC49830] [PubMed: 1518817]
18. Keefe AD, Szostak JW. Functional proteins from a random-sequence library. Nature. 2001;410:715–718. [PMC free article: PMC4476321] [PubMed: 11287961]
19. Kumar RK, Yarus M. RNA-catalyzed amino acid activation. Biochemistry. 2001;40:6998–7004. [PubMed: 11401543]
20. Robertson DL, Joyce GF. Selection in vitro of an RNA enzyme that specifically cleaves single-stranded DNA. Nature. 1990;344:467–468. [PubMed: 1690861]
21. Sabeti PC, Unrau PJ, Bartel DP. Accessing rare activities from random RNA sequences: the importance of the length of molecules in the starting pool. Chem Biol. 1997;4:767–774. [PubMed: 9375255]
22. Tuerk C, Gold L. Systematic evolution of ligands by exponential enrichment. Science. 1990;249:505–510. [PubMed: 2200121]
23. Vant-Hull B, Payano-Baez A, Davis RH, et al. The mathematics of SELEX against complex targets. J Mol Biol. 1998;278:579–597. [PubMed: 9600840]
24. Verma S, Vaish NK, Eckstein F. Structure-function studies of the hammerhead ribozyme. Curr Opin Chem Biol. 1997;1:532–536. [PubMed: 9667883]
25. Vlassov A, Khvorova A, Yarus M. Binding and disruption of phospholipid bilayers by supramolecular RNA complexes. Proc Natl Acad Sci USA. 2001;98:7706–7711. [PMC free article: PMC35406] [PubMed: 11427715]
26. White III HB. Coenzymes as fossils of an earlier metabolic state. J Mol Evol. 1976;7:101–104. [PubMed: 1263263]
27. Yarus M. RNA-ligand chemistry: A testable source for the genetic code. RNA. 2000;6:475–484. [PMC free article: PMC1369929] [PubMed: 10786839]
28. Yarus M, Illangasekare M. Aminoacyl-tRNA synthetases and self-acylating ribozymes. In: Gesteland RF, Cech TR, Atkins JF, eds. The RNA World, 2^{nd} Edition. Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 1999:183–196.
29.

## Publication Details

### Author Information

Michael Yarus and Rob Knight.

### Publisher

Landes Bioscience, Austin (TX)

### NLM Citation

Yarus M, Knight R. The Scope of Selection. In: Madame Curie Bioscience Database [Internet]. Austin (TX): Landes Bioscience; 2000-2013.