
# Physical constraints and functional characteristics of transcription factor–DNA interaction

^{†}U.G. and J.D.M. contributed equally to this work.

^{‡}To whom reprint requests should be addressed. E-mail: gerland@physics.ucsd.edu.

^{§}Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.

## Abstract

We study theoretical “design principles” for transcription factor (TF)–DNA interaction in bacteria, focusing particularly on the statistical interaction of the TFs with the genomic background (i.e., the genome without the target sites). We introduce and motivate the concept of programmability, i.e., the ability to set the threshold concentration for TF binding over a wide range merely by mutating the binding sequence of a target site. This functional demand, together with physical constraints arising from the thermodynamics and kinetics of TF–DNA interaction, leads us to a narrow range of “optimal” interaction parameters. We find that this parameter set agrees well with experimental data for the interaction parameters of a few exemplary prokaryotic TFs, which indicates that TF–DNA interaction is indeed programmable. We suggest further experiments to test whether this is a general feature for a large class of TFs.

With rapid advances in the sequencing and annotation of entire genomes, the task of understanding the associated regulatory networks becomes increasingly prominent. Currently, many experimental and computational efforts are devoted to deciphering the genetic wiring diagram of a cell (1–3). Most of these efforts are focused on locating the functional DNA-binding sites of transcription factors (TFs). This knowledge, together with the genomic sequences, will provide a qualitative picture of which gene products may directly affect the expression of which genes. While obtaining such wiring diagrams is tremendously important for the eventual understanding of gene regulation at the system level, this knowledge in itself is not sufficient for the quantitative understanding of system-level effects. This has been shown dramatically in a detailed experimental study of the regulation of the *endo16* gene in sea urchin development (4), which revealed an intricate regulatory function where a dozen or so TFs control the expression of a single gene. It would have been impossible to infer even the gross qualitative features of the transcriptional control from the knowledge of the binding sites alone.

A major obstacle to progress is the lack of a quantitative understanding of the physical interaction between the TFs. However, even the simpler interaction between TFs and DNA sequences is not so well understood quantitatively: It is common to classify a potential TF-binding DNA sequence in a “digital” manner—either the sequence is designated for TF binding, or it is not. In this view of TF–DNA interaction, differences between the TF-binding sequences are only nuisances that impede straightforward bioinformatic methods of target-sequence discovery. On the other hand, there are plenty of examples where differences between target sequences are known to be *functionally important* (5). In many cases, the binding of a TF to one site occurs only in the presence of some other TF, while the binding of the same TF to a different site does not require other TFs. This flexibility in function often is accomplished by differences in the binding sequences and is believed to be the basis for combinatorial control and signal integration in gene regulation (6). Also, different binding sites of the same TF can be “tuned” to bind at different TF concentrations, as suggested by a recent study of the *Escherichia coli* flagella assembly system (7). If further experimental studies confirm that tuning of binding thresholds indeed is used genome-wide to establish desired gene-regulatory functions, then TF–DNA binding should be regarded more in an analog instead of a digital manner.

In this work, we report our theoretical study on the “design” of TF–DNA interaction, assuming the analog scheme of operation. Specifically, we impose the functional requirement that the threshold concentration for TF binding to a site can be controlled over a wide range by the choice of the sequence alone; we refer to this as the “programmability” of TF–DNA binding. Taken together with thermodynamic and kinetic constraints, this functional requirement leads to a narrow range of “optimal” TF–DNA interaction parameters. We then compare our result to experimentally known parameters for exemplary TFs to determine whether the design of these TFs indeed would allow the analog scheme of operation.

To focus our discussion, we limit ourselves exclusively to the case of bacterial TFs, which are the best characterized experimentally. We study both the equilibrium occupancy of a target sequence and the dynamics of locating the target. Von Hippel, Berg, and Winter have already discussed many aspects of these issues in a series of seminal articles (8–12). Our study is built firmly on their work but includes a number of additional issues: (*i*) the effect of sequence-specific binding to the genomic background (nontarget sequences) on the equilibrium occupation of a target sequence, (*ii*) kinetic traps arising statistically from the genomic background, and (*iii*) the desired programmability of TF–DNA binding. We adopt the model developed by von Hippel and Berg (11) and allow both the sequence-specific and nonspecific modes of TF–DNA binding. Sequence-specific binding occurs if the binding sequence is sufficiently close to the best binding sequence and is governed quantitatively by a specificity parameter. For typical bacterial TFs with binding sequences that are no more than 15 bases long, we find that our physical and functional requirements are best satisfied within a narrow regime of *intermediate specificity*, amounting to the loss of ≈2 *k*_{B}*T* for each additional base mismatch from the best binding sequence. Furthermore, the kinetic constraint favors a low threshold to nonspecific binding, while the programmability requirement pushes the threshold to larger values. The optimal tradeoff value only depends on the genome size and lies ≈16 *k*_{B}*T* above the energy of the best binding sequence for a genome of 10^{7} bases. These values correspond well with the interaction parameters of a number of well characterized TFs, which suggests that programmability of TF–DNA binding is compatible with the reality of protein–DNA interaction and may be used by the organism to accomplish biological functions. 
We hope to stimulate further experiments determining the interaction parameters for a wider range of TFs (see *Discussion*). These experiments could either strengthen or falsify the programmability concept depending on whether the interaction parameters are generally in agreement with our prediction.

## Model of TF–DNA Interaction

Much of our knowledge on the details of TF–DNA interaction is derived from extensive biochemical experiments on a few exemplary systems dating back to pioneering work in the late 1970s (8–10, 13–15) and continuing through recent years (16–20). Furthermore, detailed structural information is available for many TFs from various structural families (21). Based on this knowledge, quantitative models of TF–DNA interaction have been established (8, 11, 12, 17). Together with the recent availability of genomic sequences, these models can be used to characterize the thermodynamics as well as the dynamics of TFs with genomic DNA in a cell. We briefly review the primary model of TF–DNA interaction in this section, which serves to introduce our notation and formulate the problem.

Biochemical and structural experiments, e.g., using *lac* repressor (9, 14, 20), have established firmly that (*i*) TFs bind closely to the DNA with a free energy Δ*G*_{ns} (with respect to the cytoplasm) regardless of its sequence due to electrostatic interaction alone, and (*ii*) additional sequence-specific binding energy can be gained (via hydrogen bonds) if the binding sequence is close to the recognition sequence of the TF. Let the total binding (free) energy of a TF to a sequence *s→* = {*s*_{1}, *s*_{2}, …, *s*_{L}} of *L* nucleotides *s*_{i} ∈ {A,C,G,T} be Δ*G*[*s→*] (with respect to the cytoplasm), and let *s→** be the best binding sequence. Δ*G*[*s→*] becomes sequence-independent, Δ*G*[*s→*] ≈ Δ*G*_{ns}, if *s→* is far from *s→**. This is believed to occur via a change in the conformation of the TF from one that allows more hydrogen-bond formation to another that brings the positive charges of the TF closer to the negatively charged DNA backbone (10).

For this study, it will be convenient to measure all energies with respect to that of the best binder, Δ*G*[*s→**]. Let us define *E*[*s→*] ≡ Δ*G*[*s→*] − Δ*G*[*s→**]. Furthermore, we will introduce the threshold energy *E*_{ns} ≡ Δ*G*_{ns} − Δ*G*[*s→**], where TF–DNA binding switches from the specific to the nonspecific mode (for *lac* repressor, *E*_{ns} ≈ 10 kcal/mol). Then given the above model of TF–DNA interaction and assuming that the TF is bound to the DNA essentially all the time,^{¶} all thermodynamic quantities regarding this TF can be computed from the partition function

$$Z = \sum_{j=1}^{N} \left[ e^{-\beta E[\vec{s}_j]} + e^{-\beta E_{\mathrm{ns}}} \right], \tag{1}$$

where β^{−1} = *k*_{B}*T* ≈ 0.6 kcal/mol and *s→*_{j} denotes the subsequence of the genomic sequence {*s*_{1}, *s*_{2}, …, *s*_{N}} from position *j* to *j* + *L* − 1. The binding length of a typical bacterial TF is *L* = 10–20 bp. The length of the genomic sequence, *N*, is typically several million bp.

The form of the binding energy *E*[*s→*] has been studied experimentally for several TFs (16–19). In particular, recent experiments on the TF Mnt from bacteriophage P22 (16) support the earlier model (11) that the contribution of each nucleotide in the binding sequence to the total binding energy is approximately independent and additive, i.e.,

$$E[\vec{s}] = \sum_{i=1}^{L} \varepsilon_i(s_i). \tag{2}$$

For the TFs Mnt, Cro, and λ repressor, the parameters of the “energy matrix” ε_{i}(*s*_{i}) have actually been determined experimentally by *in vitro* measurements of the equilibrium binding constants *K*[*s→*] ∝ *e*^{−βE[s→]} for every *single-nucleotide* mutant of the best binding sequence *s→** (16, 18, 19). Due to our definition of the energy scale, ε_{i}(*s*_{i}) = 0 for *s*_{i} = *s**_{i} and ε_{i}(*s*_{i}) > 0 for *s*_{i} ≠ *s**_{i}; the latter will be referred to as “mismatch energies.” While the simple form of the binding energy (Eq. 2) certainly will not hold for all TFs, and di- and trinucleotide correlation effects are likely to be important in many cases [e.g., to some extent for *lac* repressor (20)], the key results of our study are not sensitive to such correlations as long as there is a wide range of binding energies for different binding sequences. Thus we will adopt the simple form (Eq. 2) for this study. For the three well studied TFs, the mismatch energies are typically in the range of 1–3 *k*_{B}*T*. While the threshold energies *E*_{ns} have not been measured carefully for these TFs, it is believed that nonspecific binding does not occur until the binding sequences are at least 4–5 mismatches away from *s→** (G. Stormo, private communication).
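The model of this section can be sketched in a few lines of Python. The sketch assumes the additive form of the binding energy and a partition function that sums, over all windows of a linear sequence, one specific and one nonspecific Boltzmann weight. All energies are in units of *k*_{B}*T*; the three-position matrix, the toy sequence, and the function names are hypothetical illustrations, not measured data.

```python
import math

def binding_energy(site, matrix):
    """Additive binding energy: sum of per-position energies (units of k_B*T)."""
    return sum(matrix[i][base] for i, base in enumerate(site))

def partition_function(genome, matrix, e_ns):
    """Sum over all windows of exp(-E) + exp(-E_ns); linear genome assumed."""
    length = len(matrix)
    z = 0.0
    for j in range(len(genome) - length + 1):
        e = binding_energy(genome[j:j + length], matrix)
        z += math.exp(-e) + math.exp(-e_ns)
    return z

# Hypothetical matrix: best binder "ACG", every mismatch costs 2 k_B*T.
best = "ACG"
matrix = [{b: (0.0 if b == best[i] else 2.0) for b in "ACGT"}
          for i in range(len(best))]

print(binding_energy("ACG", matrix))   # energy of the best binder
print(binding_energy("ACT", matrix))   # energy with one mismatch
print(partition_function("TTACGTT", matrix, e_ns=5.0))
```

A real application would replace the toy matrix with a measured energy matrix and the toy sequence with the genome, scanning both strands if appropriate.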

## Genomic Background and Target Recognition

### Thermodynamics.

Let us first consider the binding of a *single* TF to its target sequence, denoted by *s→*_{t}. We will assume that thermal equilibrium can be reached within the relevant cellular time scale and discuss the important kinetics issue afterward. The effectiveness of the binding of the TF to its target is then described by the equilibrium binding probability *P*_{t}, which depends not only on the binding energy *E*_{t} ≡ *E*[*s→*_{t}] but also on the interaction with the rest of the genomic sequence. Let the contribution of this genomic background to the partition function be *Z*_{b}; then the binding probability to the target is given by

$$P_t = \frac{1}{1 + e^{\beta (E_t - F_{\mathrm{b}})}}, \tag{3}$$

where *F*_{b} = −*k*_{B}*T* ln *Z*_{b} is the effective binding energy (or free energy) of the entire genomic background. Eq. 3 is a sigmoidal function of *E*_{t} with a (soft) threshold at *F*_{b}, i.e., a TF binds (with probability *P*_{t} > 0.5) if *E*_{t} < *F*_{b}. Since *E*_{t} ≥ 0 by definition, we must have

$$F_{\mathrm{b}} \gtrsim 0 \tag{4}$$

in order for a target sequence to be recognized by a single TF (we consider multiple TFs below). The background contribution can be computed for any given TF and genome according to Eq. 1 if the binding-energy matrix, the threshold energy *E*_{ns}, and the genomic sequence are known. We will instead seek a description that is independent of the specifics of the genomic sequences and energy matrices. To accomplish this, we observe first that for the few well studied TFs, the interaction of the TF with the genomic background can be well approximated by the interaction of the TF with *random* nucleotide sequences of the same length and single-nucleotide frequencies *p*(*s*). This is illustrated in Fig. 1*A*, where the histogram of binding energies obtained by using the binding-energy matrix ε_{i}(*s*) for the TF Cro on the *E. coli* genome (solid line) coincides well with the histogram of the same energy matrix applied to random nucleotide sequences (circles). Moreover, there appears to be hardly any positional correlation in the binding energies along the genome, as shown by the “energy landscape” in Fig. 1*B* (see legend for details). In the following, we will therefore describe the effect of the genomic background by treating it as a random nucleotide sequence for a generic TF. In particular, we will describe the genomic background partition function by *Z*_{b} = *Z*_{sp} + *N* *e*^{−β*E*_{ns}}, where the contribution due to sequence-specific binding is

$$Z_{\mathrm{sp}} = \sum_{\vec{s} \in S(N)} e^{-\beta E[\vec{s}]}, \tag{5}$$

with *S*(*N*) denoting a given collection of *N* random nucleotide sequences of length *L* drawn according to the frequency *p*(*s*) for each nucleotide *s*.

Fig. 1. (*A*) Histogram of the specific binding energies for Cro (solid line) on the *E. coli* genome together with the average histogram (circles) ...

Even with the random sequence approximation (Eq. 5), computation of the background energy *F*_{b} = −*k*_{B}*T* ln *Z*_{b} is nontrivial in principle: From its definition, it is clear that *F*_{b} is a random variable, and its precise value will depend on the actual collection of sequences *S*(*N*). We are interested in the *typical* value of *F*_{b}, a reasonable approximation of which is its statistical average, −*k*_{B}*T* $\overline{\ln Z_{\mathrm{b}}}$. [We use an overbar to denote averages over an ensemble of different sequence collections *S*(*N*).] Computing the average $\overline{\ln Z_{\mathrm{b}}}$, however, is difficult to do for an arbitrary energy matrix ε_{i}(*s*) short of performing numerical simulations. An alternative is to compute the ensemble average of *Z*_{b}, i.e., $\overline{Z}_{\mathrm{b}} = \overline{Z}_{\mathrm{sp}} + N e^{-\beta E_{\mathrm{ns}}}$, where

$$\overline{Z}_{\mathrm{sp}} = N \prod_{i=1}^{L} \left[ \sum_{s} p(s)\, e^{-\beta \varepsilon_i(s)} \right] \tag{6}$$

with the single-nucleotide frequencies *p*(*s*), and assume that

$$F_{\mathrm{b}} \approx -k_{\mathrm{B}} T \ln \overline{Z}_{\mathrm{b}}. \tag{7}$$

This is, for example, the approach taken by Stormo and Fields (17) in their analysis of the TF Mnt.^{**} We note in passing that $\overline{Z}_{\mathrm{b}}$ can be written more compactly in terms of the density of states Ω_{sp}(*E*) for specific binding (the normalized version of the histogram in Fig. 1*A*), i.e.,

$$\overline{Z}_{\mathrm{b}} = N \sum_{E} \Omega_{\mathrm{sp}}(E) \left[ e^{-\beta E} + e^{-\beta E_{\mathrm{ns}}} \right]. \tag{8}$$

Eq. 7 is based on the so-called annealed approximation $\overline{\ln Z} \approx \ln \overline{Z}$, which is valid for the genomic sequence length *N* → ∞ but not always appropriate for finite *N*, e.g., if the partition function is dominated by a few low-energy terms. Much is known from statistical physics about systems of the type defined by the partition function *Z*_{sp} in Eq. 5, generically known as the random-energy model or REM,^{††} introduced by Derrida (22). It turns out that the annealed approximation is valid as long as the system's entropy is significantly larger than zero, reflecting the contribution of many terms in the partition sum. We will see further below that proper function of the TFs requires the system to be in a regime where the annealed approximation is safely applicable. We thus will take the validity of Eq. 7 for granted. In this case, the condition in Eq. 4 for the recognition of the target sequence by a single TF becomes

$$\overline{Z}_{\mathrm{b}} = \overline{Z}_{\mathrm{sp}} + N e^{-\beta E_{\mathrm{ns}}} \lesssim 1. \tag{9}$$
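The quality of the annealed approximation can be checked numerically on a small example. The sketch below is an illustration only: it assumes a toy energy function in which every mismatch to the best binder costs the same energy `eps` (in units of *k*_{B}*T*) and all four bases are equally frequent, and it uses deliberately small, hypothetical values of the number of sites and the site length so that it runs quickly. It compares the ensemble average of ln *Z*_{sp} with the logarithm of the ensemble average of *Z*_{sp}.

```python
import math
import random

def sample_z_sp(n_sites, length, eps, rng):
    """Z_sp for one collection of random sites (equal base frequencies).

    Each position mismatches the best binder with probability 3/4.
    """
    z = 0.0
    for _ in range(n_sites):
        mismatches = sum(rng.random() < 0.75 for _ in range(length))
        z += math.exp(-eps * mismatches)
    return z

def annealed_vs_quenched(n_sites, length, eps, n_collections, seed=0):
    rng = random.Random(seed)
    zs = [sample_z_sp(n_sites, length, eps, rng) for _ in range(n_collections)]
    quenched = sum(math.log(z) for z in zs) / n_collections  # average of ln Z
    annealed = math.log(sum(zs) / n_collections)             # ln of average Z
    # Analytic ensemble average: N * [(1 + 3*exp(-eps)) / 4]**L
    exact = math.log(n_sites) + length * math.log((1 + 3 * math.exp(-eps)) / 4)
    return quenched, annealed, exact

q, a, exact = annealed_vs_quenched(n_sites=2000, length=6, eps=2.0,
                                   n_collections=40)
print(q, a, exact)
```

By Jensen's inequality the first number can never exceed the second; the two stay close whenever many sites contribute to the sum, which is the regime the text relies on.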

### Search Dynamics.

To carry out their function properly, TFs not only need to have a high equilibrium binding probability to their targets but also must be able to locate them in a reasonably short time (e.g., less than a few minutes) after they have been activated by an inducer or freshly produced by a ribosome. This constitutes a constraint on the “search dynamics” of TFs.

In their nonspecific binding mode, TFs are still strongly associated with the DNA but are able to diffuse (i.e., slide) randomly along the genome (8–10). However, pure 1D diffusion would be an inefficient search process, because it is very redundant (e.g., a 1D random walker always returns to its starting point). For instance, assuming generously a 1D diffusion constant of *D*_{1} ≈ 1 μm^{2}/sec (10), one finds a time *T*_{1D} ~ *N*^{2}/*D*_{1} ~ 10^{6} sec for a single TF to diffuse around a bacterial genome of length *N* ≈ 5 × 10^{6} bp (≈1 mm). Thus, to find a target within a few minutes via 1D diffusion, one would need at least 100 TFs per cell to search in parallel (so that the search length *N* is reduced by a factor of 100). On the other hand, there are well documented examples where regulation is accomplished effectively by only a few TFs in a cell (e.g., ≈10 for *lac* repressor in *E. coli*; ref. 24).

As studied in detail by Winter, Berg, and von Hippel (8–10), the search dynamics of TFs involves instead a combination of sliding along the DNA at short length scales and *hopping* between different segments of DNA (either over the dissociation barrier through the cytoplasm or by direct intersegment transfer; see Fig. 2*A*). This search mode is much faster (given the high DNA concentration inside the cell), because the dynamics is essentially 3D diffusion beyond the hopping scale, and 3D diffusion is much less redundant than 1D diffusion. For example, if the TFs were not bound to the DNA at all, a single TF of linear dimension *a* of a few nanometers would locate its target in a cell volume *V*_{cell} of several μm^{3} in the average first passage time of *T*_{3D} = *V*_{cell}/(4π*D*_{3}*a*) ~ 10 sec, given a 3D diffusion constant on the order of *D*_{3} ~ 10 μm^{2}/sec (25). The search time *T*_{3D/1D} for the combined 1D/3D diffusion under *in vivo* conditions can be estimated to be comparable to *T*_{3D} (10). Hence, the search time is short enough to comfortably allow even a single TF to locate its target within the physiological time scale.
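The order-of-magnitude estimates above are easy to reproduce. The sketch below assumes the scaling *T*_{1D} ~ *N*^{2}/*D*_{1} for sliding and a diffusion-limited first-passage time of the form *V*_{cell}/(4π*D*_{3}*a*) for free 3D search, where the target size `target_size_um` is our label for the linear dimension of the TF. The numerical inputs (3 μm^{3}, 3 nm) are illustrative choices within the ranges quoted in the text, not measured values.

```python
import math

def t_1d(length_um, d1_um2_per_s):
    """Time scale to cover a contour length by pure 1D diffusion (seconds)."""
    return length_um ** 2 / d1_um2_per_s

def t_3d(volume_um3, d3_um2_per_s, target_size_um):
    """Diffusion-limited first-passage time to a small target (seconds)."""
    return volume_um3 / (4 * math.pi * d3_um2_per_s * target_size_um)

genome_um = 1000.0  # about 1 mm of DNA
print(t_1d(genome_um, 1.0))         # single TF sliding over the whole genome
print(t_1d(genome_um / 100, 1.0))   # 100 TFs sharing the search
print(t_3d(3.0, 10.0, 0.003))       # free 3D search, 3 nm target
```

With 100 TFs each covering 1/100 of the genome, the sliding time per TF drops by a factor of 10^{4}, which is what brings it into the range of minutes.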

Fig. 2. (*A*) Schematic illustration of the search dynamics: a TF (represented by a solid ellipse) moves among genomic DNA (lines) via a combination of 1D (along the genome) and 3D (hopping between nearby segments) diffusion as illustrated by the arrows. The open ...

In the study of the search dynamics reviewed above, binding of the TF to the genomic background was assumed to occur at a single energy value, namely, the nonspecific energy Δ*G*_{ns} (8). On the other hand, the energy landscape of Fig. 1*B* clearly shows that the random genomic background contains many isolated sites with binding energies far below Δ*G*_{ns}. These sites constitute kinetic traps that, in principle, can impede the local search process drastically if the energy difference to their surroundings is sufficiently large.^{‡‡} Thus to understand the search dynamics fully, we need to characterize the effect of kinetic traps in the genomic background: What is the constraint on the design of TF–DNA interaction imposed by requiring that the effect of kinetic traps be negligible?

At each binding sequence *s→*_{j} with energy *E*_{j} ≡ *E*[*s→*_{j}] < *E*_{ns}, the TF typically spends a time τ_{j} = τ_{0}*e*^{β(*E*_{ns} − *E*_{j})}, where τ_{0} is the average “waiting time” of the TF at a nonspecific binding site. Along the search path of the TF, the average waiting time per binding site then is given simply by

$$\overline{\tau} = \tau_0 \sum_{E} \Omega_{\mathrm{sp}}(E) \left[ e^{\beta (E_{\mathrm{ns}} - E)}\, \theta(E_{\mathrm{ns}} - E) + \theta(E - E_{\mathrm{ns}}) \right]. \tag{10}$$

Here we assumed as before that the genomic sequence is random such that the sequence-specific binding energy *E* can be treated as a random variable drawn from the distribution Ω_{sp}(*E*). The second term, with the help of the unit step function θ(*x*), is used to express the fact that there is no kinetic trap for the (majority of) sites with *E* > *E*_{ns}.

A comparison of Eqs. 10 and 8 for the average partition function immediately yields the important relation^{§§}

$$\overline{\tau}/\tau_0 \approx \frac{1}{N}\, \overline{Z}_{\mathrm{b}}\, e^{\beta E_{\mathrm{ns}}}, \tag{11}$$

since in Eq. 8, the second term dominates for *E* > *E*_{ns}. As expected, the kinetic trap factor $\overline{\tau}/\tau_0$ grows exponentially with *E*_{ns}, the threshold to nonspecific binding. On the other hand, we note from $\overline{Z}_{\mathrm{b}} = \overline{Z}_{\mathrm{sp}} + N e^{-\beta E_{\mathrm{ns}}}$ (see *Thermodynamics*) that the trap factor can be made to be of order 1 such that the dynamical analysis of refs. 8–10 remains qualitatively valid if $\overline{Z}_{\mathrm{sp}} \le N e^{-\beta E_{\mathrm{ns}}}$. The physical meaning of this condition is that the average effect of the kinetic traps can be rendered small if the sum of the waiting times does not exceed the order of the plain diffusion time. As we will see, this can be accomplished by choosing the binding-energy matrix ε_{i}(*s*) and *E*_{ns} appropriately. Combining this kinetic constraint with Eq. 9, we obtain the condition

$$\overline{Z}_{\mathrm{sp}} \lesssim N e^{-\beta E_{\mathrm{ns}}} \lesssim 1 \tag{12}$$

for the rapid recognition of a target sequence by a single TF.
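To see how the kinetic trap factor behaves, the following sketch evaluates it for a toy energy function. It assumes that every mismatch costs the same energy `eps` (in units of *k*_{B}*T*), that bases are equally frequent (so the mismatch count is binomial with probability 3/4), and that a site with energy *E* below the threshold is a trap with waiting time τ_{0} exp(*E*_{ns} − *E*). The function names are ours. The second function is the simple estimate 1 + ζ^{L} exp(*E*_{ns}), with ζ = (1 + 3*e*^{−eps})/4, obtained by adding the specific and nonspecific Boltzmann weights instead of switching between them.

```python
import math

def trap_factor(length, eps, e_ns):
    """Mean waiting time per site divided by tau_0 (energies in k_B*T)."""
    total = 0.0
    for m in range(length + 1):
        prob = math.comb(length, m) * 0.75 ** m * 0.25 ** (length - m)
        energy = eps * m
        # Sites below the threshold are traps; all others wait tau_0.
        total += prob * (math.exp(e_ns - energy) if energy < e_ns else 1.0)
    return total

def trap_factor_estimate(length, eps, e_ns):
    """Estimate 1 + zeta**L * exp(E_ns), accurate to within a factor of 2."""
    zeta = (1 + 3 * math.exp(-eps)) / 4
    return 1.0 + zeta ** length * math.exp(e_ns)

for e_ns in (12.0, 16.0, 20.0):
    print(e_ns, trap_factor(15, 2.0, e_ns), trap_factor_estimate(15, 2.0, e_ns))
```

The estimate always lies between the exact sum and twice the exact sum, and both grow roughly exponentially once the threshold energy exceeds ln *N* in these units.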

## Programmability of Binding Threshold

### Multiple TFs.

There are of course typically multiple copies of the same TF in the cell, and the regulatory function is accomplished if any one of these TFs binds to the target sequence. If the cell contains *n* copies of a given TF, then the occupation probability for the target sequence, Eq. 3, is replaced by the Fermi distribution (or “Arrhenius function”) *P*_{t} = 1/[1 + *e*^{β(*E*_{t} − μ)}], since each binding sequence can be occupied at most by one TF. The chemical potential μ(*n*) is determined implicitly from the condition^{¶¶}

$$n = N \sum_{E} \frac{\left[ \Omega_{\mathrm{sp}}(E) + \delta_{E, E_{\mathrm{ns}}} \right]}{1 + e^{\beta (E - \mu)}}, \tag{13}$$

where the quantity in brackets represents the total density of states. In the simplest scenario, where steric exclusion between TFs bound to the nontarget sequences is negligible, one has (11)

$$\mu(n) \approx F_{\mathrm{b}} + k_{\mathrm{B}} T \ln n. \tag{14}$$

This is empirically found to be a good approximation for those TFs with known binding-energy matrices as shown in Fig. 2*B*. We will adopt the form of Eq. 14 for the chemical potential of a generic TF in this study; a general argument will be given later to justify this choice even for the case where multiple target sequences are present in the same genome.

Using Eq. 14, the occupation probability can be written more succinctly, *P*_{t} = 1/[1 + *ñ*_{t}/*n*], where

$$\tilde{n}_t = e^{\beta (E_t - F_{\mathrm{b}})} \tag{15}$$

denotes the (soft) *threshold concentration* of the TF for occupation of the target sequence.
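A short numerical sketch makes the role of the threshold concentration concrete. It assumes the dilute-limit chemical potential μ(*n*) = *F*_{b} + *k*_{B}*T* ln *n* and the Fermi-type occupation probability described above, with all energies in units of *k*_{B}*T*; the function names and the example target energy are ours.

```python
import math

def chemical_potential(n, f_b):
    """mu(n) = F_b + ln(n), dilute limit without steric exclusion."""
    return f_b + math.log(n)

def occupancy(n, e_t, f_b):
    """Occupation probability of a target of energy e_t for n TF copies."""
    return 1.0 / (1.0 + math.exp(e_t - chemical_potential(n, f_b)))

def threshold_concentration(e_t, f_b):
    """TF copy number at which the target is occupied half of the time."""
    return math.exp(e_t - f_b)

e_t, f_b = 4.0, 0.0  # hypothetical target: two mismatches of 2 k_B*T each
n_half = threshold_concentration(e_t, f_b)
for n in (1, 10, n_half, 100, 1000):
    print(round(n, 1), round(occupancy(n, e_t, f_b), 3))
```

The occupancy crosses 0.5 exactly at the threshold concentration, so shifting the target energy by a fixed amount rescales the copy number needed to switch the site on.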

### Programmability.

The allowed values of the background free energy *F*_{b} for the binding of the target sequence obviously depend on the TF concentration *n*. For example, we have the condition in Eq. 4 for *n* = 1, while smaller values are allowed for *n* > 1. It thus appears that the allowed *F*_{b} values are different for the different TFs, because they would typically be present in the cell with different concentrations. On the other hand, even for a given TF species, the *desired* binding threshold may not be at a *single* concentration for different target sites but can *vary* depending on functional demands. For example, it can be desirable to turn on different genes/operons at different TF concentrations to maintain a *temporal order* in the expression of different operons as the concentration of the controlling TF gradually changes over time. This effect was observed recently for the *E. coli* flagella assembly (7) and SOS response systems (U. Alon, private communication).

As another example, consider the case where a particular TF *A* is involved in the regulation of two operons, *X* and *Y*. Suppose it is desired that *A* activates the transcription of operon *X* on its own at a concentration *n*_{A}, while operon *Y* should be activated *only* if *A* is present (at the same concentration *n*_{A}) *together* with another TF *B* that can bind cooperatively with *A*. It is desirable then to have a strong binding site for *A* in the regulatory region of operon *X* such that its threshold *ñ*_{A,X} < *n*_{A}, and a weak binding site in the regulatory region of operon *Y*, with a threshold *ñ*_{A,Y} > *n*_{A}. The latter ensures that the operon *Y* will not be activated accidentally by fluctuations in *n*_{A} alone, and only when the TF *B* is present would the attractive interaction between *A* and *B* induce the two to bind to their targets.

The above examples show that it is functionally desirable to have the ability to *set* the binding threshold *ñ*_{t} of a given TF to each of its target sequences *s→*_{t} individually. As is clear from the defining expression (Eq. 15), this can be done only through the choice of the target sequence *s→*_{t}, which affects *E*_{t}, because the other variable, *F*_{b}, is fixed for a given TF. We refer to the ability to control the binding threshold *ñ*_{t} through the choice of the target sequence *s→*_{t} alone as programmability of the binding threshold. Assuming that programmability is a desirable feature of TF–DNA interaction (since sequence changes can be accomplished easily by point mutation if the functional need arises), we seek to determine the specifics of the TF–DNA interaction, e.g., the binding matrix ε_{i}(*s*), the length of the binding sequence *L*, and the threshold energy *E*_{ns}, which allow the targets to be *maximally programmable*.

### Two-State Model and Parameter Selection.

Specifically, let us require programmability of the binding threshold over the entire range *ñ* = 1…10^{3}, since typical cellular TF concentrations range from a few to a few hundred per cell. The lower bound *ñ* ≈ 1 immediately imposes the condition in Eq. 4 on *F*_{b}, or, taking also the kinetic constraint into account, the condition in Eq. 12. Furthermore, to tune *ñ* throughout the desired range with a reasonable resolution, it is necessary to have the ability to change *E*_{t} from 0 to *k*_{B}*T* ln 10^{3} ≈ 7 *k*_{B}*T* in small increments. This requires the nonzero entries of the binding-energy matrix ε_{i}(*s*) to take on small values. Which choices for the TF–DNA interaction parameters [ε_{i}(*s*), *L*, *E*_{ns}] can simultaneously satisfy the latter requirement and condition (Eq. 12)?

The combined effect of these physical constraints and functional demands is understood best by simplifying the energy matrix such that we retain the essential and generic aspect of sequence-specific binding while eliminating all TF-specific details. Toward this end, we adopt the two-state model originally introduced by von Hippel and Berg (11), characterizing all of the nonzero entries of the *significant positions* in the energy matrix by a *single* value, i.e.,

$$\varepsilon_i(s) = \begin{cases} 0 & \text{if } s = s_i^{*}, \\ \varepsilon\, k_{\mathrm{B}} T & \text{if } s \neq s_i^{*}, \end{cases} \tag{16}$$

where ε is a dimensionless “discrimination energy” (in units of *k*_{B}*T*). It describes the energetic *preference* of the TF for the optimal binding sequence *s→** and is a crucial parameter controlling the specificity of the TF. Within the two-state model, the binding energy to the target *s→*_{t} is simply ε times the total number of mismatches between the target and the best binder *s→**, i.e., *E*[*s→*_{t}] = ε|*s→*_{t} − *s→**| (in units of *k*_{B}*T*), where |…| denotes the Hamming distance between two sequences. Clearly, programmability is best satisfied with a small ε, which enhances the resolution of the programmable binding threshold.

The two-state model (Eq. 16) also allows an explicit evaluation of the condition in Eq. 12 via the formula Eq. 6 for $\overline{Z}_{\mathrm{sp}}$. Assuming for simplicity equal single-nucleotide frequencies in the background [i.e., *p*(*s*) = 1/4], the quantity in the bracket of Eq. 6 is evaluated easily. We have $\overline{Z}_{\mathrm{sp}}$(ε, *L*) = *N*ζ^{L}(ε), where ζ ≡ Σ_{s}*e*^{−βε(s)}*p*(*s*) = (1 + 3*e*^{−ε})/4. Note that ζ^{−1} is in the range between 1 and 4 and can be regarded as the effective size of the nucleotide “alphabet” as “seen” by the TF in the specific binding mode. The maximum value ζ^{−1} = 4 is attained if the energy matrix has infinite discrimination, ε → ∞, while no discrimination can be achieved at ε = 0 where ζ^{−1} = 1. In Fig. 3*A*, we indicate the allowed region $\overline{Z}_{\mathrm{sp}}$(ε, *L*) ≤ 1 in the parameter space of (ε, *L*) with the boundary *L**(ε) = ln *N*/ln ζ^{−1}(ε) defined by $\overline{Z}_{\mathrm{sp}}$(*L**, ε) = 1. From Fig. 3, it is clear that the desire for small ε pushes the system to the boundary at $\overline{Z}_{\mathrm{sp}}$ = 1. Along the boundary, the smallest ε is given by the largest allowable binding length *L*. For typical bacterial TFs with binding sequences that are no longer than ≈15 bp (usually dimers), we find ε ≈ 2.
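The boundary of the allowed region and the resulting discrimination energy can be reproduced numerically. The sketch below assumes the two-state expressions ζ(ε) = (1 + 3*e*^{−ε})/4 and *L**(ε) = ln *N*/ln ζ^{−1}(ε) with equal base frequencies; the bisection helper and its name are ours and simply solve *L**(ε) = *L* for ε.

```python
import math

def zeta(eps):
    """Per-position Boltzmann sum for equal base frequencies."""
    return (1 + 3 * math.exp(-eps)) / 4

def boundary_length(n_sites, eps):
    """L*(eps) = ln N / ln(1/zeta): site length at which Z_sp equals 1."""
    return math.log(n_sites) / -math.log(zeta(eps))

def min_discrimination(n_sites, max_length, lo=1e-6, hi=50.0):
    """Smallest eps with L*(eps) <= max_length (L* decreases with eps)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if boundary_length(n_sites, mid) > max_length:
            lo = mid
        else:
            hi = mid
    return hi

print(boundary_length(1e7, 2.0))   # site length on the boundary at eps = 2
print(min_discrimination(1e7, 15)) # smallest eps compatible with L = 15
```

For a genome of 10^{7} bases the boundary cannot drop below ln *N*/ln 4 ≈ 11.6 positions, which is why a solution exists for any maximal length above that value.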

Fig. 3. (*A*) Plot of the region where $\overline{Z}_{\mathrm{sp}}$(ε, *L*) ≤ 1. The boundary *L**(ε) for *N* = 10^{7} is indicated by the solid line (see text). The dashed line ln(*N*)/[ln ζ^{−1}(ε) − ε/(1 + *e*^{ε}/3)] ...

Although the result on ε is somewhat specific to the two-state model, the need for $\overline{Z}_{\mathrm{sp}}$ → 1 imposed by the programmability consideration forces the threshold energy to take on the value

$$E_{\mathrm{ns}} \approx k_{\mathrm{B}} T \ln N \approx 16\, k_{\mathrm{B}} T \tag{17}$$

(for *N* ~ 10^{7}) according to the condition in Eq. 12 independent of the specifics of the binding-energy matrix ε_{i}(*s*). It also follows that

$$F_{\mathrm{b}} \approx 0, \tag{18}$$

such that the binding threshold is simply given by

$$\tilde{n}_t \approx e^{\beta E_t}. \tag{19}$$

The dependence of *ñ* on the number of mismatches for the two-state model is shown in Fig. 3*B*. We see that at the optimal parameter choice of (ε = 2, *L* = 15), each mismatch increases the binding threshold *ñ* by nearly 10-fold. In principle, further fine-tuning can be accomplished by using small variations in the mismatch energies.
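The numbers quoted here follow from two one-line formulas, sketched below under the assumptions that the threshold energy sits at ln *N* (in units of *k*_{B}*T*), that the background free energy is zero, and that each mismatch costs the same discrimination energy `eps`. The function names are ours.

```python
import math

def optimal_e_ns(n_sites):
    """Threshold energy k_B*T * ln N, returned in units of k_B*T."""
    return math.log(n_sites)

def thresholds_by_mismatch(eps, max_mismatches, f_b=0.0):
    """Threshold concentration exp(eps*m - F_b) for m = 0..max_mismatches."""
    return [math.exp(eps * m - f_b) for m in range(max_mismatches + 1)]

print(optimal_e_ns(1e7))
for m, n_tilde in enumerate(thresholds_by_mismatch(2.0, 4)):
    print(m, round(n_tilde, 1))
```

With `eps = 2` each additional mismatch multiplies the threshold by *e*^{2} ≈ 7.4, so three to four mismatches span the range from one to about 10^{3} TF copies per cell.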

## Discussion

The key results of this study, that maximal programmability of the binding threshold *ñ* requires the TF–DNA interaction to satisfy the conditions in Eqs. 17 and 18, can be conveniently summarized graphically using the density of states Ω_{sp}(*E*). In Fig. 4, the density of states is plotted with the normalization that max_{E}Ω_{sp}(*E*) = *N*, as indicated by the horizontal dotted line. The background free energy *F*_{b} can be obtained using the Legendre construction: One draws a straight line of slope β (the dashed line in the semilog plot of Fig. 4) such that it just touches Ω_{sp}(*E*). *F*_{b} then can be read off as the intercept of the dashed line on the *E* axis, which should be in the vicinity of the origin according to Eq. 18. Similarly, *E*_{ns} (as given by Eq. 17) can be read off as the *E* coordinate where the dashed line intersects the horizontal dotted line.

The point where the dashed line is tangent to Ω_{sp}(*E*) also is physically meaningful: The *E* coordinate of the tangent point gives the ensemble-averaged binding energy *E*_{0} ≡ Σ_{E} *E*Ω_{sp}(*E*)*e*^{−βE}/*Z*_{sp}. The vertical coordinate *N*_{0} of the tangent point is given by the relation *F*_{b} = *E*_{0} − *k*_{B}*T* ln *N*_{0}, which expresses the fact that the dominant contribution to the background free energy stems from the *N*_{0} sequences of energy ≈*E*_{0} in the collection of *N* random sequences: The Boltzmann weight of those sequences with *E* > *E*_{0} is too small to contribute to the partition sum, while for *E* < *E*_{0}, there are too few sequences.

The value of *N*_{0} is an important characteristic of the system. *S* = ln *N*_{0} is known as the “entropy” of this system, and *H* = ln(*N*/*N*_{0}) is known as the “relative entropy”; the latter has been used to characterize the specificity of the TF–DNA interaction (17). As mentioned before, the annealed approximation is valid only if many terms contribute to the partition sum, i.e., if *N*_{0} ≫ 1. For the two-state model (Eq. 16), the values of ε and *L* corresponding to the line *N*_{0} = 1 are far from the line *L**(ε) selected by the maximal programmability criterion; this justifies the use of the annealed approximation. At the optimal parameter of ε = 2 and *L* = 15, we have *N*_{0} ≈ 10^{3} ≫ 1. The corresponding relative entropy is *H* ≈ 7 (≈10 bits).

The large value of *N*_{0} also provides us with an intuitive understanding of the simple dependence (Eq. 14) of the chemical potential μ on the cellular TF concentration *n* (see Fig. 2*B*). As mentioned already, the expression (Eq. 14) is obtained if multiple occupancy of the background sequences is negligible at the TF concentration *n*. Since there is a large number (i.e., *N*_{0}) of binding sequences that contribute significantly to the net effect of background binding, multiple occupancy of these sequences is indeed unlikely if *n* < *N*_{0}. Thus for *N*_{0} ~ *O*(10^{3}), the expression (Eq. 14) can be taken as a good approximation of the chemical potential over the typical range of cellular TF concentration *n* = 1…10^{3}, as shown in Fig. 2*B* for the three known TFs. We expect this result to hold even if there are multiple target sequences, say *m*_{t}, the binding energy *E*_{t} of which is much lower than *E*_{0}, as long as *E*_{t} > *k*_{B}*T* ln *m*_{t}, such that *F*_{b} is not affected by the addition of these target sequences to the density of states. Having μ(*n*) independent of the number of targets is a desirable functional robustness property from a system perspective, because one would not want to perturb the recognition between the TFs and their existing targets by the addition of a few new targets. It will be interesting to see to what extent this feature is preserved by studying the energetics of TFs with a large number of target sites, e.g., the catabolic repressor protein CRP in *E. coli* (5).
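The robustness condition on the target energies can be made concrete by computing how much the background free energy moves when targets are added to the partition sum. This is a minimal sketch under stated assumptions: energies are in units of *k*_{B}*T*, the background partition sum is *Z*_{b} = *e*^{−F_b}, and *F*_{b} = 0 is used as an illustrative value because the text places *F*_{b} near the origin. The function name `free_energy_shift` and the sample energies are hypothetical.

```python
import math

def free_energy_shift(m_t, E_t, F_b=0.0):
    """Change of the background free energy (in units of k_B*T) when m_t
    target sites of binding energy E_t are added to the partition sum:

        Z_b -> Z_b + m_t * exp(-E_t),   with Z_b = exp(-F_b).

    F_b = 0 is an illustrative default, not a value taken from the text.
    """
    Z_b = math.exp(-F_b)
    return -math.log1p(m_t * math.exp(-E_t) / Z_b)

# Hypothetical target energies (k_B*T) for m_t = 100 added targets.
for E_t in (2.0, math.log(100), 7.0, 9.0):
    print(f"E_t = {E_t:5.2f}  shift of F_b = {free_energy_shift(100, E_t):8.4f}")
```

At the boundary *E*_{t} = *k*_{B}*T* ln *m*_{t} the shift equals −*k*_{B}*T* ln 2, and it falls off exponentially as *E*_{t} increases beyond that value, so targets well above the boundary leave *F*_{b} essentially unchanged.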

Finally, we compare the values of the optimal interaction parameters according to our theory to those of well-studied TFs. From the values listed in Table 1, we see that all the available data are in the neighborhood of the expectation based on the maximal programmability criterion. We do not suggest here that programmability was necessarily the selective driving force that constrained the TF–DNA interaction to its observed form (there could be other reasons, e.g., biochemical restrictions, for the interaction to be of this form). However, the rough correspondence between theory and observation does indicate that it is *possible* (and perhaps even very likely) that TFs generally have the required energetics for their binding threshold to be programmable over a wide range.

[Table 1. Comparison of *F*_{b}, relative entropy *H*, and the threshold to nonspecific binding *E*_{ns} to the known values of these parameters for *Mnt*, *Cro*, the λ repressor *cI*, and the *lac* repressor *LacR*.]

One obvious shortcoming of the above comparison is that the three TFs for which the interaction parameters are known are all from bacteriophages and may not represent typical prokaryotic TFs. It therefore will be very important to experimentally determine the interaction parameters for a variety of different TFs. The results of a sufficient number of such studies will inform us whether programmability is a generic feature of TF–DNA interaction. Knowledge of this kind can be very helpful in developing appropriate coarse-grained models of gene regulation at the system level. In particular, quantitative relations of the type suggested by Eq. 19 will be necessary for an eventual quantitative description of gene-regulatory networks. Also, this knowledge would have important implications for the evolution of gene regulation (26, 27).

## Acknowledgments

We acknowledge useful discussions with G. Stormo, P. von Hippel, and K. Sneppen on many aspects of TF–DNA interaction. We are also grateful for the hospitality of the Institute for Theoretical Physics in Santa Barbara, where some of the work was carried out. This research is supported in part by National Science Foundation Grant DMR-9971456. U.G. was supported in part by a German fellowship from the Deutscher Akademischer Austauschdienst, and T.H. was supported in part by a Burroughs Wellcome functional genomics award.

## Abbreviations

- TF, transcription factor

## Notes

This paper was submitted directly (Track II) to the PNAS office.

^{¶}*In vivo* measurements for the case of *lac* repressor found that less than 10% of the TFs were unbound (15). This agrees well with an estimate based on a typical prokaryotic cell volume of 3 μm^{3}, a genome length of 5 × 10^{6} bases, and a nonspecific binding constant on the order of 10^{4} M^{−1} under physiological conditions (13), which yields a fraction of unbound TFs at the few-percent level.
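The estimate in this note can be reproduced with a simple equilibrium calculation. The sketch below assumes that every base starts one nonspecific site, that sites are in large excess over TFs, and that binding follows a single-site equilibrium, so that the unbound fraction is 1/(1 + *K*[sites]). The function name `unbound_fraction` is illustrative.

```python
AVOGADRO = 6.02214076e23  # particles per mole

def unbound_fraction(volume_um3, genome_bases, K_ns_per_M):
    """Fraction of TFs free in the cytoplasm at equilibrium.

    Assumes one nonspecific site per base, sites in large excess over TFs,
    and simple single-site binding: free fraction = 1 / (1 + K * [sites]).
    """
    volume_L = volume_um3 * 1e-15                      # 1 um^3 = 1e-15 L
    site_conc_M = genome_bases / (AVOGADRO * volume_L)  # molar site density
    return 1.0 / (1.0 + K_ns_per_M * site_conc_M)

# Values quoted in the note: 3 um^3, 5e6 bases, K_ns ~ 1e4 per molar.
print(unbound_fraction(3.0, 5e6, 1e4))
```

With the quoted numbers the site concentration is a few millimolar and the unbound fraction is between 3% and 4%, in line with the few-percent level stated above.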

^{‖}One also should include the reverse complement of the genomic sequence in the evaluation of the partition function *Z*. In order not to make the notation too complicated, we extend the definition of “genomic sequence” to include its complement.

^{**}In ref. 17, the nonspecific binding was not included so that = and the energy scale was shifted such that *Z*_{b} = *N.*

^{††}In many applications, including protein folding (23), the REM was introduced to *approximate* the random background interaction. The TF–DNA interaction as defined by Eq. 5 represents one of the few systems for which the REM description is *directly applicable*.

^{‡‡}Note that the additional sequence-specific binding energy to a “spurious site” in the background equally increases the kinetic barrier for sliding to a neighboring site as well as for dissociation into the cytoplasm.

^{§§}Note that this relation is actually independent of the additive form of the binding energy (Eq. 2).

^{¶¶}Here, the exclusion between overlapping binding sites can be neglected, because *n* ≪ *N*. Also, we have not included the (unimportant) exclusion between the specific and unspecific binding mode at a given site.

^{‖‖}Note that the energy matrices for most TFs contain a number of (fixed) positions that have no strong preference for any of the nucleotides. We will not consider these positions in the ensuing discussion of the two-state model and will use *L* to refer to the total number of significant positions.

## References

*et al.* (2002) Science 295, 1669-1677.

**,** 757-762.

**,** 167-171.

**,** 617-629.

**,** 2080-2083.

**,** 6929-6948.

**,** 6948-6960.

**,** 6961-6977.

**,** 1608-1612.

**,** 723-750.

**,** 4777-4783.

**,** 4791-4796.

**,** 4228-4232.

**,** 178-194.

**,** 109-113.

**,** 6513-6517.

**,** 439-443.

**,** 1186-1206.

**,** 1053-1095.

**,** 2613-2626.

**,** 7524-7528.

**,** 179-183.

**,** 197-203.

**,** 2072-2077.

**,** 386-400.

**,** 1115-1122.

**National Academy of Sciences**


Physical constraints and functional characteristics of transcription factor–DNA interaction. Proceedings of the National Academy of Sciences of the United States of America. Sep 17, 2002; 99(19):12015.
