- Journal List
- HHS Author Manuscripts
- PMC2782888

# Automated Design of Synthetic Ribosome Binding Sites to Precisely Control Protein Expression

^{1}Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158

^{2}Graduate Group in Biophysics, University of California San Francisco, San Francisco, CA, 94158

Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

## Abstract

Microbial engineering often requires fine control over protein expression; for example, to connect genetic circuits ^{1}^{-}^{7} or control flux through a metabolic pathway ^{8}^{-}^{13}. We have developed a predictive design method for synthetic ribosome binding sites that enables the rational control of a protein's production rate on a proportional scale. Experimental validation of over 100 predictions in *Escherichia coli* shows that the method is accurate to within a factor of 2.3 over a range of 100,000-fold. The design method also correctly predicts that reusing a ribosome binding site sequence in different genetic contexts can result in different protein expression levels. We demonstrate the method's utility by rationally optimizing a protein's expression level to connect a genetic sensor to a synthetic circuit. The proposed forward engineering approach will accelerate the construction and systematic optimization of large genetic systems.

**Keywords:**synthetic biology, translation, optimization, metabolic engineering, genetic circuit, RNA secondary structure

## Introduction

Microbial engineering is a time-consuming procedure that often requires multiple rounds of trial-and-error genetic mutation. As it becomes possible to construct larger pieces of synthetic DNA ^{14}, including whole genomes ^{15}, automated methods for genetic circuit assembly and metabolic pathway optimization will be critically important. As genetic systems grow in size and complexity, the application of a trial-and-error approach to optimizing these systems is more difficult.

A genetic system's function is optimized by varying the sequences of its regulatory elements to control the expression levels of its protein coding sequences. Each rate-limiting step in gene expression offers the opportunity for rationally modulating the protein expression level. In bacteria, ribosome binding sites (RBSs) and other regulatory RNA sequences are effective control elements for translation initiation ^{16}^{-}^{19}. As a consequence, they are commonly mutated to optimize genetic circuits, metabolic pathways, and the expression of recombinant proteins.

Previous studies have generated libraries of RBS sequences with the goal of optimizing the function of a genetic system ^{1}^{, }^{7}^{, }^{18}. Generation and selection of a sequence library can become impractical as the number of participating proteins increases, especially if measuring the function requires a low-throughput assay or screen ^{6}. For example, randomly mutating 4 nucleotides of an RBS generates a library of 256 sequences. The library size increases combinatorially with the number of proteins in the engineered system (16.7 million sequences for 3 proteins, 2.8×10^{14} sequences for 6 proteins).

A biophysical model of translation initiation would aid the optimization process by enabling the design of an RBS sequence to obtain a desired translation initiation rate. Using thermodynamics, the free energies of key molecular interactions involved in translation initiation have been characterized ^{20}^{, }^{21}. Thermodynamic models are made possible by measuring the sequence-dependent energetic changes during RNA folding and hybridization ^{22}^{-}^{26}. These methods have enumerated and characterized the attributes of a RBS sequence that affect its translation initiation rate, but a predictive model that combines all of the interactions together has not been created and tested.

Bacterial translation consists of four phases: initiation, elongation, termination, and ribosome turnover (Figure 1A) ^{27}. In most cases, translation initiation is the rate-limiting step. The translation initiation rate is determined by the summary effect of multiple molecular interactions, including the hybridization of the 16S rRNA to the RBS sequence, the binding of tRNA^{fMET} to the start codon, the distance between the 16S rRNA binding site and the start codon, and the presence of RNA secondary structures that occlude either the 16S rRNA binding site or the standby site ^{20}^{, }^{21}^{, }^{28}^{-}^{31}.

**...**

We have developed an equilibrium statistical thermodynamic model to quantify the strengths of the molecular interactions between the 30S complex and an mRNA transcript and to predict the resulting translation initiation rate. The thermodynamic model describes the system as having two states separated by a reversible transition (Figure 1B). The initial state is the folded mRNA transcript and the free 30S complex. The final state is the assembled 30S pre-initiation complex on an mRNA transcript. The difference in Gibbs free energy between these two states is quantified by the Gibbs free energy change ΔG_{tot}. The ΔG_{tot} depends on the mRNA sequence surrounding a specified start codon and will become more negative when attractive interactions are present and more positive when mutually exclusive secondary structures are present.

The translation initiation rate *r* is related to the ΔG_{tot} according to

where *β* is the Boltzmann factor for the system. The derivation of Equation 1 is presented in the Supplementary Methods. Importantly, Equation 1 describes the differences in translation initiation rate that result from differences in mRNA sequence. The amount of expressed protein *E* is proportional to the translation initiation rate where the proportionality factor *K* accounts for any ribosome-mRNA molecular interactions that are independent of mRNA sequence and any translation-independent parameters, such as the DNA copy number, the promoter's transcription rate, the mRNA stability, and the protein dilution rate (Supplementary Figure 1).

Given a specific mRNA sequence surrounding a start codon, called the subsequence, the ΔG_{tot} is predicted according to the energy model:

where the reference state is a fully unfolded subsequence with ΔG_{ref} = 0.

The ΔG_{mRNA:rRNA} term is the energy released when the last 9 nucleotides (nt) of the *E. coli* 16S rRNA – 3′-AUUCCUCCA-5′ – hybridizes and co-folds to the mRNA subsequence (ΔG_{mRNA:rRNA} < 0). Intra-molecular folding within the mRNA is allowed. All possible hybridizations between the mRNA and 16S rRNA are considered to find the highest affinity 16S rRNA binding site. The binding site minimizes the sum of the hybridization free energy ΔG_{mRNA:rRNA} and the penalty for non-optimal spacing ΔG_{spacing}. Thus, the algorithm can identify the 16S rRNA binding site regardless of its similarity to the consensus Shine-Dalgarno sequence.

The ΔG_{start} term is the energy released when the start codon and the initiating tRNA anti-codon loop – 3′-UAC-5′ – hybridize together. The ΔG_{spacing} is the free energy penalty caused by a non-optimal physical distance between the 16S rRNA binding site and the start codon (ΔG_{spacing} > 0). When this distance is increased or decreased from an optimum of 5 nt (or ~17 Å) ^{29}, the 30S complex becomes distorted, resulting in a decreased translation initiation rate.

The ΔG_{mRNA} is the work required to unfold the mRNA subsequence when it folds to its most stable secondary structure, called the minimum free energy structure (ΔG_{mRNA} < 0). The ΔG_{standby} is the work required to unfold any secondary structures sequestering the standby site (ΔG_{standby} < 0) after 30S complex assembly. We define the standby site as the 4 nucleotides upstream of the 16S rRNA-binding site, which is its location in a previously studied mRNA ^{28}.

To calculate the ΔG_{mRNA:rRNA}, ΔG_{start}, ΔG_{mRNA}, and ΔG_{standby} free energies, we use the NUPACK suite of algorithms, developed by Pierce and coworkers ^{32}, with the Mfold 3.0 RNA energy parameters ^{22}^{, }^{23}. These free energy calculations do not have any additional fitting or training parameters and explicitly depend on the mRNA sequence. In addition, the free energy terms are not orthogonal; changing a single nucleotide can potentially affect multiple energy terms.

We designed a series of experiments to quantify the relationship between the aligned spacing *s* and the free energy penalty ΔG_{spacing}. Thirteen synthetic RBSs are created where the aligned spacing is varied from 0 to 15 nucleotides while verifying that the ΔG_{mRNA:rRNA}, ΔG_{mRNA}, ΔG_{start}, and ΔG_{standby} free energies remain constant (Supplementary Table I). The translation initiation rates of RBS sequences are measured using a fluorescent protein measurement system (**Methods**). Steady-state fluorescence measurements are performed on *E. coli* cultures over a 24 hour period. Under these conditions, the average fluorescence measurement is expected to be proportional to the translation initiation rate *r*.

The quantitative relationship between the aligned spacing and ΔG_{spacing} is obtained from the fluorescence measurements (**Methods**). According to the data, it is conceptually useful to treat the 30S complex as a model barbell connected by a rigid spring, where either stretching or compressive forces cause a reduction in entropy and an increase in the ΔG_{spacing} penalty. We empirically fit these measured ΔG_{spacing} values to either a quadratic (s > 5 nt) or a sigmoidal function (s < 5 nt). Following this parameterization, we tested the accuracy of these equations on an additional set of synthetic RBS sequences (Supplementary Figure 2).

For an arbitrary mRNA transcript, the thermodynamic model (Equation 2) is evaluated for each AUG or GUG start codon. The algorithm considers only a subsequence of the mRNA transcript, consisting of 35 nucleotides before and after the start codon. This subsequence includes the RBS and part of the protein coding sequence. The model predictions do not improve when longer subsequences are considered (Supplementary Figure 3).

The development of the thermodynamic model makes certain assumptions. Contributions related to the ribosomal S1 protein's potential preference for pyrimidine-rich sequences are omitted from the free energy model^{33}. The model also assumes that the reversible transition between the initial and final state of 30S complex assembly reaches chemical equilibrium on a physiologically relevant timescale and without any long-lived intermediate states. The presence of overlapping or neighboring start codons, overlapping RBS and protein coding sequences, regulatory RNA binding sites, or RNAse binding sites also pose a challenge to the predictive accuracy of the thermodynamic model. The presence of multiple in-frame start codons, each with significant translation initiation, may distort its predictive accuracy. A genetic system can be designed to avoid many of these complications.

The thermodynamic model can be used in two ways. First, it can predict the relative translation initiation rate of an existing RBS sequence for a particular protein coding sequence on an mRNA transcript. We refer to this as “reverse engineering” because the RBS sequence already exists. Second, it can be used in conjunction with an optimization algorithm to identify a synthetic RBS sequence that is predicted to translate a given protein coding sequence at a user-selected rate. We refer to this mode as “forward engineering” because it generates a *de novo* sequence according to a user's specifications.

We use the thermodynamic model to predict the translation initiation rates of 28 existing RBS sequences (Figure 2A) that were obtained from a natural genome or taken from a list of commonly used sequences (Supplementary Table I). The lengths of these sequences, as measured by the distance from the transcriptional start site to the fluorescent protein's start codon, vary from 24 to 42 nucleotides. The steady-state protein fluorescences from the sequences are then assayed in the measurement system (**Methods**). The growth rates of the cell cultures did not correlate with protein fluorescence (Supplementary Figure 4). According to the theory (Equation 1), we expect a linear relationship between the predicted ΔG_{tot} and the log protein fluorescence. Using linear regression, the squared correlation coefficient R^{2} is 0.54 with Boltzmann factor *β* = 0.45 ± 0.05 mol/kcal (Figure 2B). The average error is |ΔΔG| = 2.1 kcal/mol (Figure 2C).

_{tot}from the input sequence. According

**...**

While these commonly used RBS sequences vary the protein expression by 1500 fold, the thermodynamic model predicts that both stronger and weaker RBSs are possible. For example, one of these RBS sequences contains a strong 16S rRNA binding site (ΔG_{mRNA:rRNA} = −15.2 kcal/mol), but did not yield a high protein expression level due to a strong mRNA secondary structure and non-optimal spacing (ΔG_{mRNA} = −11.4, ΔG_{spacing} = 1.73 kcal/mol). By optimizing the RBS sequence towards a selected ΔG_{tot}, we gain the ability to rationally control the translation initiation rate over a wide range with a proportional effect on the protein expression level.

Using the thermodynamic model, we developed an optimization algorithm that automatically designs an RBS sequence to obtain a desired relative protein expression level. The user inputs a specific protein coding sequence and a desired translation initiation rate. The rate can be varied over five orders of magnitude on a proportional scale. Equation 1 and the experimentally measured *β* = 0.45 mol/kcal is used to convert the user-selected translation initiation rate into the target ΔG_{tot}. The method then generates a synthetic RBS sequence according to the desired specifications.

The design method combines the thermodynamic model of translation initiation with a simulated annealing optimization algorithm to design an RBS sequence that is predicted to have a target ΔG_{tot} (Figure 2D). The RBS sequence is initialized as a random mRNA sequence upstream of the protein coding sequence. The method then creates new mRNA sequences by inserting, deleting, or replacing random nucleotides. For each new sequence, the ΔG_{tot} is calculated and compared to the target ΔG_{tot}. The sequences are then accepted or rejected according to the Metropolis criteria and three additional sequence constraints that are based on the model's assumptions (**Methods**). The procedure continues until the synthetic sequence has a predicted ΔG_{tot} to within 0.25 kcal/mol of the target. For a given target ΔG_{tot}, multiple solutions are possible, creating an ensemble of degenerate RBS sequences. The characterization of these ensembles is described in the Supplementary Discussion.

The forward design method is tested by generating 29 synthetic RBS sequences (Supplementary Table I) and comparing their predicted ΔG_{tot} values to the measured protein fluorescences. The coding sequence for a red fluorescent protein is specified and the ΔG_{tot} target is varied from −7.1 to 16.0 kcal/mol. The design method then generates a synthetic RBS sequence for each target ΔG_{tot}. These RBS sequences vary in length from 16 to 35 nucleotides and were highly dissimilar. The steady-state protein fluorescence for each sequence is measured (**Methods**). The growth rates of the cell cultures did not significantly vary across sequences (Supplementary Figure 4). As expected from the theory (Equation 1), we obtain a linear relationship between the log protein fluorescence and the predicted ΔG_{tot} with β = 0.45 ± 0.01 (R^{2} = 0.84) (Figure 2E). The average error is |ΔΔG| = 1.82 kcal/mol, corresponding to a 2.3-fold error in the protein expression level. The probability distribution of the ΔΔG for a synthetic RBS is well fit by a Gaussian distribution (Figure 2F).

We next tested the ability of the design method to control the translation initiation rates of different proteins. Two chimeric proteins are constructed that fused the first 27 nucleotides from commonly used transcription factors to a red fluorescent protein (TetR_{27}-RFP and AraC_{27}-RFP). The design method is then used to generate 23 synthetic RBSs with ΔG_{tot} targets ranging from −8.5 to 10.5 kcal/mol (Supplementary Table I). The thermodynamic model correctly predicts the translation initiation rates of the TetR_{27}-RFP (R^{2} = 0.54) and AraC_{27}-RFP (R^{2} = 0.95) chimeric protein coding sequences (Figure 3A). Notably, the linear relationship between the predicted ΔG_{tot} and the log protein fluorescence yields a similar slope *β* = 0.45 ± 0.05 mol/kcal.

**...**

A common practice is to reuse the same well-characterized RBS sequence for the expression of different proteins. Interestingly, the thermodynamic model predicts that this can yield dramatically different translation initiation rates. This absence of modularity will occur when the RNA sequence, containing the RBS, forms strong secondary structures with one protein coding sequence, but not another ^{30}.

We designed experiments to test the model's ability to predict the impact of changing the protein coding sequence on the translation initiation rate. We use the design method to generate 14 synthetic RBS sequences; these sequences are then placed upstream of two different protein coding sequences: the fluorescent protein (RFP) and a chimeric fluorescent protein (TF-RFP: LacI_{27}-RFP, TetR_{27}-RFP, or AraC_{27}-RFP). The optimization procedure for these synthetic RBSs was modified to maximize the objective function |ΔG_{RFP} − ΔG_{TF-RFP}|, where ΔG_{RFP} and ΔG_{TF-RFP} are the predicted ΔG_{tot}'s when the RBS sequence is placed upstream of either the RFP or TF-RFP protein coding sequences, respectively. As predicted by the model, the translation initiation rates of these synthetic RBS sequences greatly change when they are reused with different protein coding sequences (Figure 3B); for example, replacing the fluorescent protein with the TetR_{27}-RFP chimera resulted in a 530-fold increase in expression level.

The thermodynamic model can accurately predict these differences in translation initiation rate when the correct protein coding sequence is specified (R^{2} = 0.62 and 0.51, Figure 3C). When the incorrect protein coding sequence is used, the translation initiation rate is not accurately predicted (R^{2} = 0.04, 0.02). Consequently, when designing a RBS sequence, the beginning of the protein coding sequence must be included in the thermodynamic calculations.

Altogether, 119 predictions of the design method were tested, revealing that the translation initiation rate can be controlled over at least a 100,000-fold range. The thermodynamic model is most accurate when all free energy terms are included in the ΔG_{tot} calculation (Supplementary Figure 5). By themselves, each free energy term is a poor predictor of the translation initiation rate (Supplementary Figure 6) and excluding one free energy term from the ΔG_{tot} calculation results in a poorer prediction (Supplementary Figure 7). According to the distribution of the method's error (Figure 2F), an optimized RBS sequence has a 47% probability of expressing a protein to within 2-fold of the target. The probability increases to 72%, 85%, or 92% by generating two, three, or four optimized RBS sequences with identical target translation initiation rates (Supplementary Discussion).

We now demonstrate how combining the design method with a quantitative model of a genetic system enables the efficient optimization of its RBS sequences towards a targeted system behavior. Here, our objective is to optimize the connection between the arabinose-sensing *P*_{BAD} promoter and an AND gate genetic circuit^{7}. The AND gate genetic circuit is regulated by the expression levels of two input promoters (*P*_{BAD} and *P*_{sal}) and controls the expression level of an output gene, which is selected to be a *gfp* reporter (Figure 4A). The desired AND logic requires that the output gene is only expressed when both input promoters are active. The digital accuracy of the AND logic is highest when the maximum expression level from the *P*_{BAD} promoter is an optimal value between underexpression and overexpression. When the promoter is underexpressed, the *gfp* expression is never turned on; when overexpressed, transcriptional leakiness causes *gfp* expression to turn on even in the input's absence.

*gfp*reporter output when both the

*P*

_{BAD}and

*P*

_{sal}promoter inputs are sufficiently induced by arabinose and salicylate,

**...**

The quantitative model relates the RBS sequence downstream of the *P*_{BAD} promoter to the accuracy of the AND gate genetic circuit's function (Figure 4B). We use previously characterized transfer functions^{7} to relate the arabinose and salicylate concentrations to the expression levels of the *P*_{BAD} and *P*_{sal} promoters (*I*_{1} and *I*_{2}) (Supplementary Figure 8). The *P*_{BAD} promoter has a maximum protein expression level of *g*_{ref} = 590 au at full induction (*x* = 1.3 mM arabinose) and when using an RBS sequence with a predicted ΔG_{tot} value of ΔG_{ref} = −1.05 kcal/mol. We then substitute *I*_{1} and *I*_{2} into the AND gate genetic circuit's transfer function to determine the output gene's expression level, which is in turn substituted into the fitness function *F* that quantifies the ability of the genetic system to carry out AND logic (Supplementary Methods).

We can interconvert between the maximum protein expression level of the *P*_{BAD} promoter and the predicted ΔG_{tot} of its RBS sequence according to the equation,

where *g* is called the gain. The experimentally measured *β* = 0.45 mol/kcal is utilized. Consequently, we create a quantitative curve *F*(ΔG_{tot}) that relates the predicted ΔG_{tot} of the *P*_{BAD} promoter's RBS sequence to the fitness of the genetic system. The fitness curve identifies an optimal region at ΔG_{tot} = −1.17 ± 2 kcal/mol where the genetic system will exhibit the best AND logic with respect to the *P*_{BAD} promoter's RBS sequence (Figure 4B).

Using the forward engineering mode of the design method, we then generate 2 synthetic RBS sequences targeted to the optimum region of the genetic system's function (predicted ΔG_{tot} = −1.48 and −1.15 kcal/mol). We also design 7 additional synthetic RBSs to test the accuracy of the *F*(ΔG_{tot}) fitness curve, where the ΔG_{tot} ranged from 0.60 to 17.2 kcal/mol. Each RBS sequence (32 to 35 nt) is inserted downstream of the *P*_{BAD} promoter and the resulting genetic circuit's response to varying inducer concentrations is assayed (Figure 4C and **Methods**). The fitness values of these rationally mutagenized genetic systems are then compared to the predictions of the model and design method (Figure 4B). The insertion of two stronger RBS sequences (ΔG_{tot} = −2.5 and −3.0 kcal/mol) cause the genetic system to exhibit a fatal growth defect.

Both optimally designed synthetic RBS sequences result in a successful connection between the arabinose-sensing *P*_{BAD} promoter and the AND gate genetic circuit (mean fitness > 0.85, Figure 4B). The experimentally determined optimum in the *F*(ΔG_{tot}) curve is nearby ΔG_{tot} = 0.60 kcal/mol, which is only a 1.8 kcal/mol deviation from the model's prediction. The quantitative model and design method also correctly predict how the fitness of the genetic system deteriorates with an increasing ΔG_{tot}. Thus, our approach enabled us to rationally connect two synthetic genetic circuits together to obtain a target behavior while performing only a few mutations and assays (additional design calculations are located in the Supplementary Discussion).

A central goal of synthetic biology is to program cells to carry out valuable functions. As we construct larger and more complicated genetic systems, models and optimization techniques will be required to efficiently combine genetic parts to achieve a target behavior. To accomplish this, biophysical models that link the DNA sequence of a part to its function will be necessary. As engineered genetic systems scale to the size of genomes, the integration of multiple design methods will enable the design of synthetic genomes on a computer to control cellular behavior.

## Materials and Methods

### Software Implementation

A software implementation of the design method has been named the RBS Calculator and is available at http://voigtlab.ucsf.edu/software. Visitors may use the RBS Calculator in two ways: first, to predict the translation initiation rate of each start codon on an mRNA sequence (reverse engineering); second, to optimize the sequence of a ribosome binding site to rationally control the translation initiation rate with a proportional effect on the protein expression level (forward engineering). The translation initiation rate is gauged on a proportional scale with a suggested range of 0.1 to 100000, although a larger range is potentially feasible. In reverse engineering mode, the software will warn visitors when ribosome binding sites fail to satisfy the sequence constraints or contain additional sequence complications.

### A thermodynamic model of translation initiation

The mRNA subsequence *S*_{1} consists of the max(1, *n*_{start} − 35) to *n*_{start} nucleotides and the subsequence *S*_{2} consists of the max(1, *n*_{start} − 35) to *n*_{start} + 35 nucleotides, where *n*_{start} is the position of a start codon. The ΔG_{start} is −1.19 and −0.075 kcal/mol for AUG and GUG start codons, respectively ^{22}.

Using the NuPACK ‘subopt’ algorithm ^{32} with Mfold 3.0 parameters at 37°C ^{22}^{, }^{23}, base pair configurations of the folded 16S rRNA and sequence *S*_{1} are enumerated, starting with the minimum free energy (mfe) configuration and continuing with suboptimal configurations, each with a corresponding ΔG_{mRNA:rRNA}. For each configuration, the aligned spacing between the 16S rRNA binding site and start codon is calculated according to *s* = *n*_{start} − *n*_{1} − *n*_{2}, where *n*_{1} and *n*_{2} are the rRNA and mRNA nucleotide positions in the farthest 3′ base pair in the 16S rRNA binding site. When the 30S complex is stretched (s > 5 nt), the ΔG_{spacing} is calculated according to the quadratic equation,

where *s*_{opt} = 5 nt, c_{1} = 0.048 kcal/mol/nt^{2}, and c_{2} = 0.24 kcal/mol/nt. When the 30S complex is compressed (s < 5 nt), the ΔG_{spacing} is calculated according to the sigmoidal function,

where c_{1} = 12.2 kcal/mol and c_{2} = 2.5 nt ^{−1}. The above parameter values are determined by minimizing the difference between the ΔG_{spacing} values calculated from the experimental measurements (Supplementary Figure 2) and the evaluation of Equation 4 or 5. For each configuration, the ΔG_{spacing} is added to the ΔG_{mRNA:rRNA}. The configuration in the list with the lowest free energy is then identified as containing the predicted 16S rRNA binding site with a corresponding ΔG_{mRNA:rRNA}. The protein coding sequence is excluded from *S*_{1} because ribosome binding excludes the formation of downstream secondary structures.

Using the NuPACK ‘mfe’ algorithm and Mfold parameters, the mfe configuration of sequence *S*_{2} is calculated and its free energy is designated ΔG_{mRNA}. The standby site is the 4 nt region upstream of the 16S rRNA binding site. The energy required to unfold the standby site is determining by calculating the mfe of sequence *S*_{2} with and without preventing the standby site from forming base pairs. The difference between these mfe's is designated ΔG_{standby}. To calculate the mfe of sequence *S*_{2} with a standby site that is constrained to be single-stranded, the sequence is first split into two subsequences, their mfes are each calculated, and then summed together. The two subsequences are the nucleotides *n*_{start} − 35 to *n*_{3} − 4 and *n*_{3} to *n*_{start} + 35, where *n*_{3} is the most 5′ base pair in the 16S rRNA binding site and 4 is the standby site length.

The five energy terms are summed together to calculate the ΔG_{tot}. Notably, selecting an alternate reference energy state simply adds a sequence-independent constant to the predicted ΔG_{tot}, which becomes indistinguishable from the proportionality factor *K*.

### The simulated annealing optimization algorithm

An initial RBS sequence is randomly generated and inserted in between a pre-sequence and protein coding sequence to create a sequence *S*. The ΔG_{tot} of the sequence *S* is calculated and the objective function *O*_{old} = |ΔG_{tot} − ΔG_{target}| is evaluated. In an iterative procedure, the simulated annealing optimization algorithm randomly deletes, inserts, or replaces a nucleotide in the RBS sequence. The ΔG_{tot} and objective function *O*_{new} are then recalculated. If the ΔG_{tot} calculation of *S* invalidates the sequence constraints, then the mutation is immediately rejected. Otherwise, the mutation is accepted with probability max(1, exp([*O*_{old} − *O*_{new}]/T_{SA})), where T_{SA} is the simulated annealing temperature. The T_{SA} is continually adjusted to maintain a 5% to 20% acceptance rate.

There are three sequence constraints that prevent the optimization algorithm from generating a synthetic RBS sequence that may invalidate one of the thermodynamic model's assumptions. The first constraint calculates the energy required to unfold the 16S rRNA binding site on the mRNA sequence and rejects the ones that require more than 6 kcal/mol to unfold. The second constraint quantifies the presence of long-range nucleotide interactions. According to a growth model for random RNA sequences ^{34}, the equilibrium probability of nucleotides *i* and *j* forming a base pair in solution is proportional to *p* = |*i* − *j*|^{−1.44}. For each base pair in sequence *S*, we calculate *p*. If the minimum *p* is less than 6×10^{-3} then the sequence is rejected. Finally, the creation of new AUG or GUG start codons within the RBS sequence is disallowed.

### Strains, media, and plasmid construction

The Luria-Bertani (LB) media (10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl) is obtained from Fisher Scientific (Pittsburgh, PA). The supplemented minimal media contains M9 minimal salts (6.8 g/L Na_{2}PO_{4}, 3 g/L KH_{2}PO_{4}, 0.5 g/L NaCl, 1 g/L NH_{4}Cl) from Sigma, 2 mM MgSO4 (Fischer Scientific), 100 μM CaCl_{2} (Fischer Scientific), 0.4% glucose (Sigma), 0.05 g/L leucine (Acros Organics, Belgium), 5 μg/mL chloramphenicol (Acros Organics), and an adjusted pH of 7.4. The expression system is a ColE1 vector with chloramphenicol resistance (derived from pProTet, Clontech). The expression cassette contains a σ^{70} constitutive promoter (BioBrick J23100), the RBS sequence, followed by the mRFP1 fluorescent protein reporter. XbaI and SacI restriction sites are located before the RBS and after the start codon. An RBS with a desired sequence is inserted into the expression vector using standard cloning techniques. Pairs of complementary oligonucleotides are designed with XbaI and SacI overhangs and the vector is digested with XbaI and SacI restriction enzymes (NEB, Ipswich, MA). Ligation of the annealed oligonucleotides with cut vector results in a nicked plasmid, which is transformed into *E. coli* DH10B cells. Sequencing is used to verify a correct clone.

The AND gate genetic circuit is composed of three plasmids: pBACr-AraT7940, pBR939b, and pAC-SalSer914 with kanamycin, ampicillin, and chloramphenicol resistance markers, respectively. The *P*_{BAD} promoter maximum expression level was modified by inserting designed synthetic RBSs on plasmid pBACr-AraT7940. Plasmid pBACr-AraT7940 was digested with BamHI and ApaLI enzymes and pairs of oligonucleotides were designed to contain the desired RBS sequence and corresponding overhangs. Ligation, transformation, selection, and sequencing proceeded as described above.

### Growth and fluorescence measurements

The fluorescent protein measurement system is composed of a constitutive promoter, a sequence containing a RBS, and the mRFP1 fluorescent protein reporter (Supplementary Figure 9). An annotated DNA sequence of the system (Genbank format) is available in the Supplementary Data.

Growth and fluorescence measurements are performed in 96-well high throughput format. A 96-well plate containing 200 μl LB and 50 μg/ml chloramphenicol is inoculated, from single colonies, with up to 30 different DH10B *E. coli* cultures in an alternating, staggered pattern that excludes the outer wells. Cultures are incubated overnight at 37°C with 250 RPM orbital shaking. A fresh 96-well plate containing 200μl supplemented minimal media is inoculated by overnight cultures using a 1:100 dilution. This plate is then incubated at 37°C in a Safire^{2} plate spectrophotometer (Tecan) with high orbital shaking. OD_{600} measurements are recorded every 3 minutes. Once a culture reaches an OD_{600} of 0.15 to 0.20 (4 to 6 hours), a sample of each culture is transferred to a new plate containing 200 μl PBS and 2 mg/ml kanamycin (Acros Organics) for flow cytometry measurements. This media replacement strategy is repeated twice more using fresh, pre-warmed plates containing supplementary minimal media (the first with a 1:10 dilution requiring 8 to 10 hours of growth and the second with a 1:7 dilution requiring 13 to 15 hours of growth). At least three samples are taken for each culture. The fluorescence distribution of each sample is measured with a LSRII flow cytometer (BD Biosciences). We use an ellipse in forward and side scatter space to gate at least 30 000 flow cytometer events. All distributions are unimodal. The autofluorescence distribution of DH10B cells is also measured. The arithmetic mean of each distribution is taken and the mean autofluorescence is substracted.

From single colonies, RBS variants of each AND gate genetic circuit are grown overnight in LB and antibiotics (50 μg/ml ampicillin, 25 μg/ml chloramphenicol, and 25 μg/ml kanamycin). A 96-well plate containing 200 μl LB, antibiotics, and sixteen different inducer concentrations (combinations of 0.0, 1.3×10^{-3}, 8.3×10^{-2}, and 1.3 mM arabinose with 0.0, 6.1×10^{-4}, 3.9×10^{-2}, and 0.62 mM sodium salicylate) are inoculated by overnight cultures using a 1:100 dilution. Plates are grown in a Safire^{2} plate spectrophotometer (Tecan) with high orbital shaking. OD_{600} and *gfp* fluorescence measurements are recorded every 10 minutes for 14 hours. Background autofluorescence is subtracted from each fluorescence measurement. This procedure is repeated twice for each variant. For each variant, the average and standard deviation of the fluorescence per OD_{600} for each inducer concentration at the final time point are then calculated.

### Data analysis

The ΔG_{spacing} is inferred from the fluorescent protein expression data *E* in the following way. The RNA sequences used to parameterize the model of ΔG_{spacing} are predicted to have identical ΔG_{mRNA}, ΔG_{mRNA:rRNA}, ΔG_{standby}, and ΔG_{start} free energies. According to Equation 1, dividing the expression of a sequence with spacing *s*_{1} over another with spacing *s*_{2} and rearranging then yields the relation: ΔG_{spacing}(*s*_{1}) − ΔG_{spacing}(*s*_{2}) = −*β*^{−1}log(*E*_{1}/*E*_{2}). The fluorescent protein expression at *s* = 5 nt was considered maximal and ΔG_{spacing}(*s* = 5) is accordingly set to zero. Using an experimentally measured value of *β* = 0.45 mol/kcal, the model of ΔG_{spacing} for each *s* is then determined.

Linear regression is used to determine the accuracy of the theory, which hypothesizes a linear relationship between the log average protein fluorescence *E* and the predicted ΔG_{tot} data. The squared correlation coefficient R^{2} and slope −*β* are calculated according to −*β* = (NΣ(x_{i}y_{i}) − Σx_{i}Σy_{i}) / (NΣ(x_{i}^{2}) − (Σx_{i})^{2}) and R^{2} = (NΣ(x_{i}y_{i}) − Σx_{i}Σy_{i})^{2} / [(NΣ(x_{i}^{2}) − (Σx_{i})^{2})(NΣ(y_{i}^{2}) − (Σy_{i})^{2})], where N is the number of average expression levels recorded, y is log *E*, and x is ΔG_{tot}. The standard deviation of *β* is calculated by substituting the log *E* data with the log(*E*+δ*E*) and log(*E*−δ*E*) data (δ*E* : standard deviation of *E*) and calculating the average difference.

## Supplementary Material

#### 1

^{(1.6M, doc)}

#### 2

^{(90K, xls)}

#### 3

^{(7.6K, txt)}

## Acknowledgments

We are grateful to all members of the Voigt lab for technical advice and continued support. This work is supported by the Pew and Packard Foundations, Office of Naval Research, NIH EY016546, NIH AI067699, NSF BES-0547637, NSF TeraGrid TG-MCB080126T, and a Sandler Family Opportunity Award. C.A.V, H.S, and E.M. are part of the NSF SynBERC ERC (www.synberc.org). E.M is supported by an NSF Graduate Research Fellowship and an ASEE National Defense Science and Engineering Graduate Fellowship.

## Footnotes

**Author Contributions:** HMS and CAV designed the study and wrote the manuscript. HMS developed the method. HMS and EAM performed the experiments.

## References

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (1.5M) |
- Citation

- Optimizing scaleup yield for protein production: Computationally Optimized DNA Assembly (CODA) and Translation Engineering.[Biotechnol Annu Rev. 2007]
*Hatfield GW, Roth DA.**Biotechnol Annu Rev. 2007; 13:27-42.* - Construction of a general vector for efficient expression of mammalian proteins in bacteria: use of a synthetic ribosome binding site.[Proc Natl Acad Sci U S A. 1981]
*Jay G, Khoury G, Seth AK, Jay E.**Proc Natl Acad Sci U S A. 1981 Sep; 78(9):5543-8.* - Sequences within ribosome binding site affecting messenger RNA translatability and method to direct ribosomes to single messenger RNA species.[Methods Enzymol. 1990]
*de Boer HA, Hui AS.**Methods Enzymol. 1990; 185:103-14.* - Visualization of evolutionary stability dynamics and competitive fitness of Escherichia coli engineered with randomized multigene circuits.[ACS Synth Biol. 2013]
*Sleight SC, Sauro HM.**ACS Synth Biol. 2013 Sep 20; 2(9):519-28. Epub 2013 Sep 4.* - Strategies for optimizing foreign gene expression in Escherichia coli.[Horiz Biochem Biophys. 1983]
*de Boer HA, Shepard HM.**Horiz Biochem Biophys. 1983; 7:205-48.*

- Genome engineering for improved recombinant protein expression in Escherichia coli[Microbial Cell Factories. ]
*Mahalik S, Sharma AK, Mukherjee KJ.**Microbial Cell Factories. 13(1)177* - Quantitative analysis of mammalian translation initiation sites by FACS-seq[Molecular Systems Biology. ]
*Noderer WL, Flockhart RJ, Bhaduri A, Diaz de Arce AJ, Zhang J, Khavari PA, Wang CL.**Molecular Systems Biology. 10(8)748* - Bridging the Gap: A Roadmap to Breaking the Biological Design Barrier[Frontiers in Bioengineering and Biotechnolo...]
*Beal J.**Frontiers in Bioengineering and Biotechnology. 287* - Synthetic Biology: A Bridge between Artificial and Natural Cells[Life. ]
*Ding Y, Wu F, Tan C.**Life. 4(4)1092-1116* - Evolution-guided optimization of biosynthetic pathways[Proceedings of the National Academy of Scie...]
*Raman S, Rogers JK, Taylor ND, Church GM.**Proceedings of the National Academy of Sciences of the United States of America. 2014 Dec 16; 111(50)17803-17808*

- Cited in BooksCited in BooksPubMed Central articles cited in books
- MedGenMedGenRelated information in MedGen
- NucleotideNucleotidePublished Nucleotide sequences
- ProteinProteinPublished protein sequences
- PubMedPubMedPubMed citations for these articles
- SubstanceSubstancePubChem Substance links

- Automated Design of Synthetic Ribosome Binding Sites to Precisely Control Protei...Automated Design of Synthetic Ribosome Binding Sites to Precisely Control Protein ExpressionNIHPA Author Manuscripts. 2009 Oct; 27(10)946

Your browsing activity is empty.

Activity recording is turned off.

See more...