Copyright © The Author 2005. Published by Oxford University Press. All rights reserved.

Protein–DNA binding specificity predictions with structural models

Center for Studies in Physics and Biology, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
1Department of Biochemistry, University of Washington, Box 357350, Seattle, WA 98195-7350, USA
*To whom correspondence should be addressed. Tel: +1 212 327 8139; Fax: +1 212 327 8544; Email: morozov/at/edsb.rockefeller.edu

Received July 13, 2005; Revised September 13, 2005; Accepted September 13, 2005.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oxfordjournals.org

Abstract

Protein–DNA interactions play a central role in transcriptional regulation and other biological processes. Investigating the mechanism of binding affinity and specificity in protein–DNA complexes is thus an important goal. Here we develop a simple physical energy function, which uses electrostatics, solvation, hydrogen bonds and atom-packing terms to model direct readout and sequence-specific DNA conformational energy to model indirect readout of DNA sequence by the bound protein. The predictive capability of the model is tested against another model based only on the knowledge of the consensus sequence and the number of contacts between amino acids and DNA bases.
Both models are used to carry out predictions of protein–DNA binding affinities, which are then compared with experimental measurements. The nearly additive nature of protein–DNA interaction energies in our model allows us to construct position-specific weight matrices by computing base pair probabilities independently for each position in the binding site. Our approach is less data intensive than knowledge-based models of protein–DNA interactions, and is not limited to any specific family of transcription factors. However, native structures of protein–DNA complexes or their close homologs are required as input to the model. Use of homology modeling can significantly extend the range of applicability of our approach, making it a useful tool for studying regulatory pathways in many organisms and cell types.

INTRODUCTION

Gene regulation is mediated in part by protein transcription factors (TFs) binding to cis-regulatory regions of the genome. Accurate genomewide characterization of TF binding sites is thus a necessary prerequisite to deciphering complex gene expression patterns. Probabilistic models of TF binding profiles, often called position-specific weight matrices (PWMs), are typically used as input to such predictions (1–3). With the weight matrix representation of TF binding sites, the probability P(S|p) that sequence S is a binding site for the TF represented by p is given by

$$P(S \mid p) = \prod_{i=1}^{L} p_i^{S_i}$$

where L is the length of the binding site and $p_i^{S_i}$ is the weight matrix probability of finding base $S_i$ at position i.
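As a rough illustration of the weight matrix representation, the following Python sketch evaluates the probability of a site as a product of independent per-position base probabilities. The function name `site_probability` and the three-column matrix `toy_pwm` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
BASES = "ACGT"

def site_probability(pwm, site):
    """Return P(S|p) as the product of per-position base probabilities.

    pwm  : list of dicts, one per binding site position, mapping base -> probability
    site : DNA sequence with the same length as the PWM
    """
    if len(pwm) != len(site):
        raise ValueError("site length must match the number of PWM columns")
    prob = 1.0
    for column, base in zip(pwm, site):
        prob *= column[base]
    return prob

# Hypothetical 3-column weight matrix (illustration only); each column sums to 1.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

p_best = site_probability(toy_pwm, "AGC")
```

Because every column is normalized, the probabilities of all 4^L possible sites sum to one, which is a convenient sanity check on any predicted matrix.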
In principle, one should be able to predict the TF binding site profile from a structure of the protein–DNA complex or its close homolog. At first it was anticipated that structural studies would reveal a universal protein–DNA recognition code, which could be used for predicting TF binding sites based on amino acid identities at the protein–DNA interface (4,5). It became apparent, however, when more protein–DNA structures were solved and classified that despite some predominantly occurring interactions, such as Lys-G, the energetics of amino acid–base contacts depends on their structural context and, in particular, on the structural family of the DNA-binding protein (6–10). Many amino acids were observed to form favorable contacts with different bases, making it necessary to generalize a deterministic recognition code to a probabilistic binding profile based, for example, on maximizing the likelihood of observed protein–DNA contacts (11–13). Probabilistic recognition codes are more accurate when developed for a specific structural family, thereby implicitly taking protein–DNA structural context into account. Indeed, binding site profiles based on the classification of TFs into families were found to be useful in bioinformatics pattern detection algorithms (14). However, data availability has so far limited knowledge-based PWM predictions to the C2H2 zinc finger family (15,16). An alternative approach to specificity and binding affinity predictions is based on all-atom modeling of protein–DNA complexes (17–19). Starting from a known structure of the protein bound to its consensus DNA sequence, an ensemble of models is created by threading novel DNA sequences onto the binding site. Protein–DNA binding energies, ΔG, are then evaluated for each member of the structural ensemble. ΔG predictions can be either used to directly infer high-affinity binding sites in genomic sequence or converted into PWM probabilities using the Boltzmann formula. 
In the latter case, it is only necessary to compute ΔG values for all one-point mutations from the consensus-binding site. The main limiting factor of the structural approach to TF binding site specificity predictions is the availability of experimentally determined structures of protein–DNA complexes. The range of applicability of structural methods will significantly increase if the DNA-binding proteins can be modeled by homology. Homology modeling involves threading a protein amino acid sequence onto a suitable structural template chosen on the basis of its sequence similarity to the protein of interest. The threading procedure creates a new protein–DNA binding interface, for which ΔG and PWM probability calculations are then carried out as in native structures. Here we present a computational model for predicting protein–DNA binding affinities and specificities. The model can be applied to a wide variety of DNA-binding proteins for which there is either a native protein–DNA structure or a sufficiently close homolog. The model is based on a simple free energy function, which consists of the protein–DNA interaction energy and the DNA conformational energy. The protein–DNA interaction energy is used to describe direct readout of the DNA sequence by the protein, whereas the DNA conformational energy takes into account distortion of B-DNA shape caused by protein binding. We carried out a series of tests of our PWM and binding energy predictions. First, we checked the ability of the model to reproduce experimental binding free energy measurements. We also assessed the accuracy of the pairwise additivity approximation in our analysis. Second, we checked the ability of our algorithm to discriminate experimentally known TF binding sites from random ensembles of sequences. Third, we carried out PWM predictions for a number of TFs and compared them with experimental PWMs. For all these predictions native protein–DNA complexes were used as structural templates. 
Finally, the extent of applicability of homology modeling to protein–DNA binding affinity predictions was explored with several representative PWM calculations. The relative accuracy and computational efficiency of our approach allowed us to carry out numerous predictions of TF binding affinities and specificities, facilitating future experimental and computational studies of transcriptional regulation in different organisms and biological systems.

METHODS

Binding stability and weight matrix predictions

Free energy model of protein–DNA interactions

We extended the Rosetta protein–nucleic acid model developed in Ref. (20) by adding a sequence-specific DNA conformational energy. The free energy function employed in protein–DNA binding affinity calculations consists of the protein–DNA interaction component describing intermolecular readout of the DNA sequence by the protein, and of the DNA deformation component describing intramolecular readout of the binding site sequence:
The DNA sequence-dependent conformational energy model is based on the effective harmonic potential developed in Ref. (25). The DNA conformational energy is computed as a sum over all base pair and base step energies:
All θ distributions are self-consistently trimmed by removing all data points for which at least one of the geometric parameters is more than three standard deviations away from the average, followed by updating all averages and standard deviations. This procedure is repeated until convergence, which usually requires 2–3 iterations and removes just a few percent of the data points (28). A non-homologous, manually curated set of 101 structures used in Ref. (20) was employed to derive the force constants for the DNA conformational potential.

Given free energies of the protein–DNA complex and its unbound partners, the binding free energy is computed as follows:

$$\Delta G = G_{\mathrm{complex}} - G_{\mathrm{protein}} - G_{\mathrm{DNA}}$$
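The self-consistent trimming of the θ distributions described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each data point is a tuple of geometric parameters, the population standard deviation is used, and the names `trim_outliers`, `n_sigma` and `max_iter` are hypothetical rather than taken from the original implementation.

```python
import statistics

def column_stats(rows):
    """Per-parameter means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return means, sds

def trim_outliers(rows, n_sigma=3.0, max_iter=50):
    """Iteratively drop points with any parameter beyond n_sigma standard deviations.

    After each pass the averages and standard deviations are recomputed from the
    surviving points; iteration stops when a pass removes nothing.
    """
    rows = [tuple(r) for r in rows]
    for _ in range(max_iter):
        means, sds = column_stats(rows)
        kept = [
            r for r in rows
            if all(abs(x - m) <= n_sigma * s for x, m, s in zip(r, means, sds))
        ]
        if len(kept) == len(rows):
            break  # converged: no point was removed in this pass
        rows = kept
    means, sds = column_stats(rows)
    return rows, means, sds

# Hypothetical data: 30 well-behaved two-parameter points plus one gross outlier.
data = [(float(i % 5), float(i % 3)) for i in range(30)] + [(100.0, 1.0)]
kept, means, sds = trim_outliers(data)
```

A point is discarded as a whole if any one of its parameters is an outlier, matching the "at least one of the geometric parameters" criterion in the text.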
Weight matrix predictions based on binding free energies Computation of weight matrix probabilities
With the pairwise additivity assumption, a set of 4L predicted energies
Experimental datasets

Binding free energies

Table 1 collects structural data for protein–DNA complexes with binding free energy measurements available from the ProNIT database (http://dna01.bse.kyutech.ac.jp/jouhou/pronit/pronit.html) and from the literature. Each dataset in Table 1 consists of a structure of the protein–DNA complex and a series of binding free energy measurements ΔG for wild-type and mutant DNA sequences. In several cases, association or dissociation constants reported by the authors were converted into binding free energies using

$$\Delta G = RT \ln K_d = -RT \ln K_a$$
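The conversion of measured constants into free energies can be sketched in Python as below. The sketch assumes a temperature of 298.15 K and a 1 M standard state, neither of which is specified in this section; the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def dg_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return R_KCAL * temperature_k * math.log(kd_molar)

def dg_from_ka(ka_per_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from an association constant in L/mol."""
    return -R_KCAL * temperature_k * math.log(ka_per_molar)

def ddg_from_kd_ratio(kd_mutant, kd_wild_type, temperature_k=298.15):
    """Change in binding free energy (kcal/mol) between mutant and wild-type sites."""
    return R_KCAL * temperature_k * math.log(kd_mutant / kd_wild_type)

# Assumed example values: a 1 nM wild-type site and a mutant bound 5-fold more weakly.
dg_wt = dg_from_kd(1e-9)
ddg = ddg_from_kd_ratio(5e-9, 1e-9)
```

With these assumptions a 5-fold loss of affinity corresponds to a little under 1 kcal/mol, in line with the threshold used later for classifying predictions.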
Binding sites and weight matrices Table 2 collects protein–DNA structures for which binding site data and experimental weight matrix data are available. The Zif268 weight matrix was taken from a selection experiment (31). The experiment generated the G-C-G-T/g-G/A-G-G-C/a/t-G-G/T consensus sequence, which we used to create five binding site variants (in addition to the consensus sequence from the Zif268 structure). Binding sites and weight matrices for seven Escherichia coli TFs were obtained from the DPInteract database (32). For TrpR, 4 sites from DPInteract were augmented with 10 additional sites from RegulonDB (33). For Ndt80 in Saccharomyces cerevisiae we used the weight matrix from Ref. (34); Ndt80 binding sites were collected by searching promoters of the genes known to be regulated by Ndt80 with the YGNCACAAAA consensus sequence. A collection of 17 naturally occurring MAT a1/α2 binding sites plus the synthetic consensus sequence were obtained from Ref. (30). Eight Gcn4p sites were assembled from Ref. (35) and the TRANSFAC database (36). Finally, binding sites and weight matrices for the Prd homeodomain homodimer were taken from Ref. (37), whereas binding data for the rest of Drosophila melanogaster TFs were collected by E. D. Siggia and E. Emberly (unpublished data).
Prediction testing Protein–DNA interaction model based on the number of interface atomic contacts In order to test our ΔΔG and PWM predictions we developed a simple null model that exploits the structure of the protein–DNA complex but does not require any detailed predictions of protein–DNA energetics. This so-called ‘contact’ model constructs a weight matrix from the consensus DNA sequence and the number of atomic contacts N between protein side chains and DNA base pairs. In particular, we assume that the three non-consensus bases occur with equal probabilities, whereas the consensus base is favored over any non-consensus base by (N/Nmax) if N ≤ Nmax. If N > Nmax, the consensus base becomes absolutely conserved:
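As an illustration of the contact model, the Python sketch below builds one weight matrix column from the consensus base and the contact count N. The linear interpolation used here is an assumption: it satisfies the constraints stated in the text (the three non-consensus bases are equiprobable and the consensus base is absolutely conserved once N reaches Nmax) and additionally assumes a uniform column at N = 0, but it may differ in detail from Equation 12. The names `contact_column` and `contact_pwm` are hypothetical.

```python
BASES = "ACGT"

def contact_column(consensus_base, n_contacts, n_max):
    """One PWM column from the consensus base and its number of atomic contacts.

    Assumed form (illustrative only): with x = min(N / Nmax, 1),
      p(consensus)     = (1 + 3x) / 4
      p(non-consensus) = (1 - x) / 4
    """
    if consensus_base not in BASES:
        raise ValueError("consensus base must be one of A, C, G, T")
    if n_max <= 0:
        raise ValueError("n_max must be positive")
    x = min(max(n_contacts, 0) / n_max, 1.0)
    column = {base: 0.25 * (1.0 - x) for base in BASES}
    column[consensus_base] = 0.25 * (1.0 + 3.0 * x)
    return column

def contact_pwm(consensus, contacts, n_max):
    """Full PWM from a consensus sequence and per-position contact counts."""
    if len(consensus) != len(contacts):
        raise ValueError("need one contact count per consensus position")
    return [contact_column(b, n, n_max) for b, n in zip(consensus, contacts)]

# Hypothetical contact counts for a 4 bp consensus, with Nmax = 20.
pwm = contact_pwm("GATC", [0, 10, 20, 25], n_max=20)
```

Positions with no base contacts come out completely non-specific in this sketch, while heavily contacted positions approach a deterministic consensus.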
Probabilities defined by Equation 12 are converted into energies using:
Significance test of PWM predictions Statistical significance of PWM predictions is estimated using the ψ-test, which is a generalization of the well-known χ2-test (38):
000 alignments of random weight matrices ψ (prandom, q) ]. Each column in the random weight matrix is obtained by uniformly sampling four numbers in the (0,1) interval and enforcing normalization afterwards. The difference between ψ (prandom,q) and ψ (p, q) can be viewed as a measure of success of our predictions.

RESULTS AND DISCUSSION

Binding free energy predictions

All-atom free energy models

DNA-binding proteins employ two complementary mechanisms of binding site recognition. The intermolecular readout mechanism is based on direct interactions of protein side chains with DNA bases, whereas the intramolecular readout mechanism involves sequence-specific deformation of the DNA site by the bound protein. We developed a free energy function that takes both these mechanisms into account. Interactions of protein side chains with DNA are modeled using an all-atom representation of both protein and DNA, including all the hydrogen atoms. The protein–DNA interaction energy is a weighted sum of terms describing shape complementarity and packing at the interface, polar interactions (electrostatics and hydrogen bonds), van der Waals forces and solvation energies (Equation 3). The DNA conformational energy is calculated using a reduced geometric representation in which DNA bases and base pairs are represented by rigid bodies and their mutual orientation serves as a measure of deviation from the B-form DNA. The DNA conformational energy is a weighted sum of two terms describing base pairing and stacking of consecutive base pairs.

Using this free energy function, we developed the following approach to predicting protein–DNA binding affinities. First, a suitable protein–DNA complex is identified as a structural template for computational modeling. Second, each novel DNA sequence (i.e. from a set for which experimental binding affinity measurements are available) is threaded onto the DNA phosphate backbone with fixed DNA torsional angles.
The result of this procedure is a set of initial structural models with novel DNA sequences. Third, binding free energies, ΔG, for each member of the set are computed in either of two different ways. One approach, which we shall call the static model, does not allow any side chain or DNA conformational rearrangements in the protein–DNA complex. The free energy is computed once for each initial model, and the difference in binding affinity between mutant and wild-type DNA sequences is calculated as

$$\Delta\Delta G = \Delta G_{\mathrm{mut}} - \Delta G_{\mathrm{wt}}$$

In the other approach, called the dynamic model, we minimize the total free energy of the protein–DNA complex starting from the initial model. The protein backbone stays fixed during minimization, whereas the torsional angles of DNA and interface side chains are allowed to relax (the protein–DNA interface is defined based on amino acid-dependent distance cutoffs). The conformational search used in Gprot–dna minimization consists of 10 two-step iterations: (i) simulated annealing of amino acid side chains at the protein–DNA interface, with side chains represented as discrete backbone-dependent rotamers (39) on a fixed protein backbone and with a frozen DNA conformation; and (ii) continuous minimization of amino acid side chains at the protein–DNA interface together with simultaneous conformational relaxation of DNA, in which amino acid side chains are no longer represented by rotamers.

Experimental binding affinity data available to us are insufficient to reliably fit the weights by iterations to self-consistency when conformational rearrangement is allowed. Instead, we obtain the weights for components of the protein–DNA free energy function by maximizing the recovery of native amino acid side chains at all interface positions in a non-homologous set of protein–DNA complexes.
In other words, we adopt a strategy used in protein sequence design in which rotamer conformations for all amino acids are substituted at all interface positions, and the probability of native amino acids is maximized by varying the weights (20,40). Similar to the static model, the ratio of the protein–DNA energy to the DNA conformational energy is expected to be protein family dependent and was estimated on average by requiring that the typical fluctuations from the equilibrium shape observed in the database of protein–DNA complexes be on the order of RT: Measures of prediction success and the contact model We use three alternative measures to assess the quality of ΔG predictions: a linear correlation coefficient r, an average unsigned error ε between predicted and experimental binding free energies, and a fraction of correct predictions F. Although the first two measures are computed using standard formulae, the fraction of correct predictions is based on a binary function: a prediction is considered to be correct if both ΔΔGcomp and ΔΔGexp are <1.0 kcal/mol, or >1.0 kcal/mol, or else separated by <0.3 kcal/mol. The threshold value of 1.0 kcal/mol corresponds to a ~5-fold reduction in binding affinity at room temperature. ΔG predictions are labeled as correct if they are successfully classified to be favorable or destabilizing, even if the absolute magnitudes of binding energies are not perfectly reproduced. We compare calculations of binding energies described above with predictions based on a simple contact model. Instead of modeling detailed energetics of protein–DNA complexes, the contact model uses only the consensus sequence and the number of protein–DNA base atomic contacts at the binding interface (see Methods). The energy penalty for each mutation from the consensus sequence is given by Equation 13; it is a function of Nmax (the minimum number of contacts at which a given consensus base is assumed to be absolutely conserved, cf. 
Equation 12) and of Emax (the maximum energy penalty for a mutation of the consensus base). Nmax and Emax are adjusted to simultaneously maximize the fraction of correct predictions F and minimize the average error ε over the ΔΔG dataset. The latter requirement is necessary since many Nmax/Emax pairs result in similar values of F. The minimum value of ε = 1.73 kcal/mol is obtained for Nmax = 15 and Emax = 3.0 kcal/mol. We find that the contact model provides a stringent test of more complicated models because, as demonstrated below, it is fairly successful in binding affinity and weight matrix predictions. ΔΔG predictions and weight fitting The extent to which the static model reproduces experimental ΔΔG measurements from the 196 point dataset used for static model weight fitting is shown in Figure 1A
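The three measures of prediction success (r, ε and F) can be sketched in Python as follows. The classification rule follows the definition given in the text (both values below 1.0 kcal/mol, both above it, or within 0.3 kcal/mol of each other); the function names and the four-point example dataset are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_unsigned_error(predicted, experimental):
    """Average unsigned error between predicted and experimental energies."""
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)

def is_correct(ddg_comp, ddg_exp, threshold=1.0, tolerance=0.3):
    """Binary success criterion for a single prediction (energies in kcal/mol)."""
    both_below = ddg_comp < threshold and ddg_exp < threshold
    both_above = ddg_comp > threshold and ddg_exp > threshold
    return both_below or both_above or abs(ddg_comp - ddg_exp) < tolerance

def fraction_correct(predicted, experimental):
    """Fraction of predictions that satisfy the binary criterion."""
    hits = sum(is_correct(p, e) for p, e in zip(predicted, experimental))
    return hits / len(predicted)

# Hypothetical predicted and measured energy changes in kcal/mol.
ddg_pred = [0.2, 1.5, 0.9, 2.0]
ddg_meas = [0.5, 2.5, 1.1, 0.4]
f_value = fraction_correct(ddg_pred, ddg_meas)
```

In this toy dataset the third pair straddles the threshold but counts as correct because the two values differ by less than 0.3 kcal/mol, whereas the fourth pair is a genuine misclassification.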
The EcoRI endonuclease example clearly demonstrates why the fraction of correct predictions is a relevant measure of success. The linear correlation between prediction and experiment in a series of destabilizing mutations of the EcoRI endonuclease (41) is only 0.19, but all of the mutations are correctly predicted to be destabilizing (Figure 1A).

[Figure 1B]

[Figure 2]

The second prediction is for the MAT a1/α2 homeodomain heterodimer. a1 and α2 TFs bind cooperatively to repress transcriptional activity of haploid-specific genes in diploid a/α cells (44). Jin et al. (30) investigated effects of mutations in the a1–α2 binding site on in vivo repression of a heterologous promoter assayed for β-galactosidase activity in wild-type diploid a/α cells. The presence of the a1–α2 binding site in the promoter causes MAT a1/α2 dependent repression of lacZ expression. Repression ratios relative to wild type are converted into energies using the Boltzmann formula (see Methods) and compared with the computational predictions. For the static model, the correlation coefficient is 0.45, the average error is 0.62 kcal/mol, and 44 out of 54 measurements are predicted correctly according to our definition. All but one of the incorrect predictions result in energies that are lower than corresponding experimental energies (Figure 2).

For our third and fourth predictions the only available structural templates were solved by NMR rather than X-ray crystallography. AtERF1 is the ERF DNA-binding domain from Arabidopsis, which mediates gene regulation by the plant hormone ethylene (45). c-Myb is a product of the mouse protooncogene c-myb essential in proliferation and differentiation of hematopoietic cells (46). In both cases the binding affinity data were obtained by using EMSA (45,47).
Surprisingly, the static model is capable of reasonable quality predictions (Figure 2).

DNA conformational energy and sequence specificity

Finally, we studied how well the DNA base step energy captures sequence specificity owing to indirect readout in BamHI endonuclease (49) and the PU.1 ETS domain (50). Engler et al. (49) mutated 3 bp sequences flanking the GGATCC core BamHI recognition site and measured resulting binding free energy changes. Inspection of the protein–DNA structure reveals that outside of the core binding site sequence specificity is mostly imparted by protein-phosphate backbone contacts and thus should be ‘recorded’ in the DNA shape. Using the 2.2 Å crystal structure of the BamHI/DNA complex as a modeling template we were able to reproduce experimental binding affinities measured for flanking sequence mutations with a correlation coefficient of 0.63 (Figure 3).
Since most of the binding specificity for base pairs dominated by indirect readout can be captured by the DNA conformational energy, it is interesting to know what fraction of the base pairs in protein–DNA complexes has mostly phosphate backbone contacts. For these base pairs, the DNA conformational energy could be more important than the energies of direct interactions between protein side chains and DNA bases. A survey of protein–DNA contacts in structures from Figure 1 Additivity in protein–DNA interactions The assumption of independence of DNA base pair probabilities at each position in the binding site forms a basis of the weight matrix description of binding specificity (1,2). Independence of DNA base pair probabilities implies that the binding energy of a given DNA sequence is a sum of energies associated with each base pair (Equation 8). Although there is direct experimental evidence indicating that pairwise additivity is a reasonable approximation for λ, Cro and Mnt repressors (51–53), there are also reports in the literature emphasizing the importance of dinucleotide and higher order correlations (54,55). Nonetheless, it is generally believed that pairwise additive energies do provide a reasonable approximation to true protein–DNA interaction energies (1,56). In protein–DNA energy, non-additivity can only arise in the dynamic model because all atomic potentials are pairwise. When conformational rearrangement is allowed, the degree of additivity will depend on the range of protein–DNA interactions. For example, long-range electrostatic interactions may cause more deviations from pairwise additivity than relatively short-range van der Waals and hydrogen bonding interactions. The degree of conformational change at the protein–DNA interface in turn depends on the quality of experimental structures and the number of water molecules at the interface (as they are not explicitly modeled in our approach). 
In DNA conformational energy the base stacking term is non-additive by construction. If, however, the total binding energy predicted by our model turns out to be approximately pairwise additive, the search for binding sites in genomic sequence can be considerably simplified by constructing a weight matrix or a table of energies for all one-point mutations from the consensus sequence, instead of independently computing ΔG values for each putative binding site. In Figure 4
Binding site discrimination In order to carry out successful predictions of genomewide transcriptional regulation, computational models of protein–DNA binding free energy should be able to discriminate TF binding sites from random ensembles of sequences. Since we demonstrated that the static model works best for ΔΔG predictions, we assess its discriminatory power here by computing binding free energies of 16 TFs for which multiple DNA-binding sites are available (Table 2). Because pairwise additivity is nearly exact for the static model, we can compute binding energies of sites with arbitrary sequences using only binding energies for one-point mutations from the consensus sequence as input (Equation 15). The degree of discrimination of TF binding sites from random sites is given by the Z-score:
$$Z = \frac{\Delta\Delta G - \langle \Delta\Delta G \rangle}{\sigma}$$

where ⟨ΔΔG⟩ is the average binding energy for the ensemble of all possible 4^L sequences (L is the length of the binding site) and σ is the standard deviation for the same ensemble.

In Table 4 we show Z-scores for the binding site from the protein–DNA complex (ZPDB) and Z-scores averaged over all sites from Table 2 (⟨Zsite⟩). In addition, binding energies with sites from protein–DNA complexes are ranked for all TFs with L < 15. The overall quality of predictions is quite good, consistent with our previous ΔΔG predictions. Interestingly, the energy of the binding site from the protein–DNA structure is more favorable than the average energy of all binding sites in all cases except Trl (1yui; Table 4), showing that most experimentally characterized binding sites have lower affinities than the consensus site. Nonetheless, almost all of the inspected sites have highly favorable binding energies and thus low Z-scores. One notable exception is the integration host factor protein (Ihf) for which the average Z-score for 27 sites from Table 2 is only −1.12. The large discrepancy between the average Z-score and the PDB Z-score might be explained in this case by the major role of indirect readout in Ihf binding: the crystal structure of the Ihf–DNA complex shows that DNA is bent almost 180° and has relatively few direct contacts between side chains and DNA bases, especially in the 5′ region of the binding site. The experimental PWM is relatively non-specific, and different sites or groups of sites might utilize significantly different binding modes that are not captured well by our approach.
Ranking binding sites from the protein–DNA structure versus ensembles of random sites provides further illustration of the accuracy of our predictions: for example, the native binding site of Ndt80 (1mnn) is 14th out of 16 777 216 sequences, whereas in Gcn4p (1ysa) the native binding site is the lowest in energy among 16 384 sequences (Table 4). Similar to ΔΔG predictions, binding site discrimination from random sequences strongly depends on the quality of the structural template: the Trl (1yui) native site is ranked only 91st out of 16 384, most probably because the structure of the protein–DNA complex was determined by NMR.

PWM predictions

Results from the previous section show a reasonable degree of additivity in protein–DNA binding free energy predictions, most probably because of the limited role of long-range interactions in our model. Therefore, we can convert binding energies into weight matrices without significant loss of information and test our PWM predictions against experimental data. Similar to ΔΔG predictions, we compare all-atom static and dynamic models with the simpler contact model, which uses the number of atomic contacts between protein side chains and DNA base pairs as a measure of binding specificity (Equation 12).

Experimental PWMs and prediction testing

We constructed experimental PWMs for 20 TFs using several alternative approaches. PWMs for λR (1lmb), CroR (6cro), c-Myb (1mse) and AtERF1 (1gcc) were created by converting experimental binding free energy measurements for all one-point mutations of the binding site (Table 1) into probabilities using the Boltzmann formula at room temperature. The resulting PWMs are thus constructed in a way directly comparable to computational predictions. Another and more commonly used method for constructing PWMs is based on the alignment of binding sites obtained from either a SELEX experiment or a set of promoter sequences of genes regulated by the TF (2).
The quality of weight matrices obtained from such alignments depends on the number of sequences used in the alignment. In our 20 PWM dataset shown in Table 2, 4 PWMs are created from ΔΔG measurements, 14 PWMs are based on genomic sites available from the literature, the Zif268 PWM is from a SELEX experiment (31), and the EcR/Usp PWM comes from a combination of SELEX (57) and genomic sites. PWM predictions are analyzed using the ψ-test (Equation 14). ψ(p, q) is a non-negative measure of the ‘goodness of fit’ between computational probabilities and experimental frequencies (38). It is a monotonic function of prediction quality. Free parameters of PWM models The contact model replaces detailed calculations of protein–DNA energetics with an assumption that the probability of each base in the consensus sequence is directly proportional to the number of contacts made between the base pair and all protein side chains (Equation 12). As in the binding free energy contact model, protein–DNA contacts are defined as protein–DNA base atomic pairs within Rmax = 4.5 Å; contacts to the phosphate backbone are ignored. As the number of contacts N increases, the consensus base becomes more and more specific, with the rest of the probability evenly divided between the other three bases. The probability of the consensus base becomes 1 for N = Nmax, and all other probabilities become 0. Given Rmax, Nmax is a free parameter to be adjusted by minimizing ψ(p, q) averaged over a subset of TFs from Table 2. PWMs constructed from fewer than 10 binding site sequences [including Gcn4p (1ysa), Trl (1yui) and DnaA (1j1v)] are removed from the Nmax fit. The average value of ψ is at a minimum for Nmax = 20. Using the Boltzmann formula to convert energies into probabilities involves an inverse temperature factor β = 1/RT, which can also be viewed as an adjustable scaling factor. 
The specificity of PWM predictions depends on β: at low temperatures only a few lowest energy binding sites contribute to weight matrix probabilities, whereas at high temperatures a broader spectrum of sites is included. Therefore, incorrect predictions of low-energy sites will result in higher fitted temperatures. The static model fit with the 17 TF dataset used for adjusting Nmax in the contact model gives β = 2.25 (kcal/mol)−1. For the dynamic model fit, we additionally exclude all NMR structures (1gcc and 1mse), crystal structures with >2.5 Å resolution (6cro, 2drp, 1run and 2puc) and the Ihf–DNA complex (1ihf) because its DNA conformation cannot be reasonably expected to be modeled by relaxation with the quadratic DNA potential. The dynamic model fit over 10 remaining TFs results in β = 0.75 (kcal/mol)−1. PWM predictions and comparison with experiment The fitted values of Nmax and β are used to make PWM predictions for all 20 TFs (16 for the dynamic model as 3 NMR structures and Ihf are excluded). In Table 5 we show values of ψ computed using the contact, static and dynamic models. We compare model predictions with the average value of ψ for an ensemble of randomly generated weight matrices ( ψrandom column in Table 5). ψ is lower than the corresponding random value if a prediction is successful. For the contact model the average value of ψ over all TFs is 0.19, significantly better than the random value of 0.70. For the static and dynamic models the average over all TFs increases to 0.23 and 0.35, respectively. Note, however, that even for the least successful predictions (1b8i for the contact model, 1mj2 for the static model and 2puc for the dynamic model) ψ is much smaller than the corresponding random ψ (Table 5). Surprisingly, it is the contact model that is the most successful on average: the static model has a lower ψ in only 6 cases out of 20. 
This finding demonstrates that PWM predictions may not require detailed models of protein–DNA energetics if native protein–DNA complexes are available. Furthermore, allowing conformational change generally makes predictions worse, consistent with our earlier observations regarding ΔΔG predictions.
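The role of the scaling factor β in converting energies into weight matrix probabilities can be illustrated with a short Python sketch. It assumes the standard Boltzmann form, p proportional to exp(−βΔΔG) within each column, and uses the two fitted β values quoted above; the energy column and function names are hypothetical.

```python
import math

def boltzmann_column(energies, beta):
    """Convert one column of energies (kcal/mol) into base probabilities.

    energies : dict mapping base -> energy relative to any common reference
    beta     : inverse temperature factor in (kcal/mol)^-1
    """
    e_min = min(energies.values())  # shift by the minimum for numerical stability
    weights = {b: math.exp(-beta * (e - e_min)) for b, e in energies.items()}
    z = sum(weights.values())
    return {b: w / z for b, w in weights.items()}

def energies_to_pwm(table, beta):
    """Apply the Boltzmann conversion independently to every position."""
    return [boltzmann_column(column, beta) for column in table]

# Hypothetical column: consensus A, with a 1 kcal/mol penalty for any other base.
column = {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0}
p_static = boltzmann_column(column, beta=2.25)   # beta fitted for the static model
p_dynamic = boltzmann_column(column, beta=0.75)  # beta fitted for the dynamic model
```

For the same energy penalties, the larger β yields a sharper, more specific column, whereas the smaller β spreads probability more evenly over the four bases.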
The ψ-test provides only an average measure of success and does not necessarily reflect all relevant details of probability distributions in specific columns. Hence it is useful to analyze several PWM predictions in more detail.

[Figure 5]

For Ndt80, ψ is 0.14 for the contact model, 0.12 for the static model and 0.20 for the dynamic model (Table 5).

[Figure 5A]

In summary, reliable PWM predictions can be carried out if the native structure of the protein–DNA complex is used as a starting point for computational modeling. The contact model which assigns base pair specificity based on the number of atomic contacts between protein amino acids and DNA bases is surprisingly successful. In some cases, the static model provides the best results: although the agreement with the experimental data is somewhat worse on average compared with the contact model, core motif probabilities are often better reproduced (Figure 5).

Homology modeling

Binding affinity and specificity predictions described in the previous sections require native structures of protein–DNA complexes. However, even in well-studied organisms such as S.cerevisiae and D.melanogaster there are only 10–15 suitable structures in the database. Furthermore, in most cases experimentally available protein–DNA complexes are not focused on any specific biological pathway (such as regulation of the Drosophila segmentation gene network), being instead distributed across a range of regulatory pathways and cell types. Hence the ability to model protein–DNA interactions by homology is crucial to future practical applications of our approach. A suitable modeling template is initially identified by sequence similarity to protein–DNA complexes in the structural database using Ginzu (60). Besides sequence similarity, the quality of experimental structures (such as X-ray resolution or missing atoms) is taken into account. Amino acid substitutions at the DNA-binding interface are identified by a sequence–structure alignment with the template using K*Sync (60) or ClustalW (61).
The approximate pairwise additivity of base pair energies and the relatively short range nature of the free energy function make it possible to ignore amino acid substitutions elsewhere because they are less likely to mediate protein–DNA interactions. An obvious consequence of this assumption is that binding specificity does not change if all amino acids at the protein–DNA interface are conserved. To give a specific example, essentially all DNA contacting amino acids are conserved in the Rel homology region family (8) and, thus, binding sites for the embryonic polarity protein dorsal in fly bear a strong resemblance to nuclear factor κB sites from mouse and human [e.g. a dorsal weight matrix from Ref. (62) gives the T-G/T-G-G-A/T-T-T-T-T/C-C-C consensus sequence, very close to the T-G-G-G-A-A-T-T-C-C-C binding site from the structure of the mouse nuclear factor κB p50 homodimer bound to DNA]. A classification of protein–DNA complexes into families and subfamilies has been carried out on the basis of identical DNA contacting amino acids (8). The definition of interface amino acids is somewhat arbitrary: typically, amino acids are considered to be at the interface based on distance cutoffs and/or visual identification of hydrogen bonds, van der Waals contacts and favorable electrostatic interactions with DNA atoms. We adopt a simple interface definition based on the 4.5 Å cutoff between protein side chain atoms and DNA base or phosphate backbone atoms.

[Figure 6]
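The distance-based interface definition and the resulting check for interface substitutions can be sketched as follows. This is a minimal illustration, not the ROSETTA implementation: the function names, the flat coordinate lists and the toy residue numbers are all assumptions introduced here, and only the 4.5 Å cutoff comes from the text.

```python
import math

CUTOFF = 4.5  # Å, the interface cutoff quoted in the text

def interface_residues(side_chain_atoms, dna_atoms, cutoff=CUTOFF):
    """Return the residue ids that have a side chain atom within `cutoff` of any DNA atom.

    side_chain_atoms: list of (residue_id, (x, y, z)) tuples (hypothetical input format).
    dna_atoms: list of (x, y, z) tuples for DNA base and phosphate backbone atoms.
    """
    found = set()
    for res_id, xyz in side_chain_atoms:
        if res_id in found:
            continue  # this residue already qualifies
        if any(math.dist(xyz, d) <= cutoff for d in dna_atoms):
            found.add(res_id)
    return found

def interface_substitutions(template_seq, target_seq, interface):
    """List interface positions where the aligned target differs from the template.

    Both sequences are dicts mapping template residue ids to one-letter amino acids.
    """
    return {r: (template_seq[r], target_seq[r])
            for r in sorted(interface)
            if template_seq[r] != target_seq[r]}

# Toy coordinates and sequences (illustrative only, not taken from any structure).
side_chain_atoms = [(47, (0.0, 0.0, 0.0)), (50, (3.0, 0.0, 0.0)), (51, (10.0, 0.0, 0.0))]
dna_atoms = [(0.0, 0.0, 4.0), (3.0, 4.0, 0.0)]
template_seq = {47: "I", 50: "Q", 51: "N"}
target_seq = {47: "I", 50: "K", 51: "R"}

interface = interface_residues(side_chain_atoms, dna_atoms)
substitutions = interface_substitutions(template_seq, target_seq, interface)
print("interface residues:", sorted(interface))
print("interface substitutions:", substitutions)
```

Substitutions outside the interface set are ignored by construction, mirroring the assumption that only interface amino acids need to be replaced when building a homology model. An empty substitution list corresponds to the conserved-interface case, in which the template specificity is expected to carry over unchanged.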
Using the Q50K Engrailed mutant as a structural template for Bcd makes homology modeling relatively easy: all amino acids are conserved at the DNA-binding interface, even though there are 28/55 amino acid substitutions and a 2 residue gap in the alignment. Because Q50K Engrailed and Bcd DNA-binding interfaces are virtually identical, the experimental PWM for Bcd is reasonably well reproduced by the contact model. Prediction of the motif 3′ of the TAAT core is further improved with the static model, but the TAAT motif becomes less specific (Figure 6A).

Owing to the differences between Bcd- and En-binding specificities, the contact model is not very successful in predicting the bicoid PWM starting from the wild-type En–DNA complex. The dynamic model reproduces PWM columns 8 and 9 significantly better, but in column 7 adenine is favored over cytosine, and in column 4 adenine is mixed with guanine to an extent not corroborated by experiment (Figure 6A).

Giant is a TF from the leucine zipper family. It binds DNA as a homodimer, with the TTAC consensus motif at positions 3–6 and its inverted complement GTAA at positions 7–10 (Figure 6B).

The Bcd and Gt examples described above are representative of the future applications of our approach. Binding site specificity predictions based on structural modeling can be used in conjunction with existing bioinformatic algorithms to study regulatory gene networks in many species. Even though the requirement of having homologous structures of protein–DNA complexes with a limited number of mutations at the protein–DNA binding interface is much less restrictive than the requirement of having native structures, it is still likely to be the main limitation of our approach. PWM prediction by homology becomes less tractable if the number of mutations at the interface is >2 or 3, probably because the current implementation of our model does not allow protein backbone degrees of freedom to relax.
Modeling TF binding specificities using distant homologs may require including rigid body motion of the TF and sampling over multiple docking conformations.

CONCLUSIONS

We developed a computational all-atom approach for predicting protein–DNA binding affinities and TF weight matrices. Protein–DNA energetics is described with the empirical free energy function that accounts for protein–DNA interactions (including electrostatics, solvation, hydrogen bonding, van der Waals interactions and packing) and distortion of the DNA shape caused by protein binding. Each term in the free energy function is multiplied by a weight which is adjusted to optimize the performance of the model on an experimental dataset. Free energy minimization and conformational rearrangement at the protein–DNA binding interface are either not employed at all (static model), or limited to repacking interface side chains and DNA minimization (dynamic model). Protein–DNA docking orientation and protein backbone conformation are kept fixed during energy minimization. Our approach is computationally efficient and can be applied on the genomewide scale. We demonstrated its utility by carrying out a number of ΔΔG and PWM predictions using native protein–DNA complexes as structural templates.

Proteins bind DNA in a sequence-specific manner by utilizing two distinct interaction mechanisms. The mechanism of direct readout is mediated by protein side chains directly contacting DNA base atoms. Favorable protein–DNA base contacts result in base pair preferences at corresponding positions in the binding site. The mechanism of indirect readout is mediated by side chain contacts with the DNA phosphate backbone. These contacts are typically as numerous as direct protein–DNA base contacts and can exploit DNA flexibility by twisting and bending it into the shape that fits best with the binding interface presented by the protein.
Since some DNA sequences are more flexible than others, DNA conformational change confers additional sequence specificity to the binding site. In cases where indirect readout predominates, our model predicts a major contribution of the DNA conformational energy to the overall binding specificity (Figure 3).

None of the terms in the DNA base pair energy depend on neighboring base pairs in the absence of conformational rearrangement (except the base stacking energy) and, thus, DNA base pair energies are nearly independent in the static model and only weakly coupled in the dynamic model (Figure 4).

The number of protein–DNA complexes currently available in the structural database is insufficient for modeling transcriptional regulation on a large scale. Therefore, the range of applicability of our approach depends on its accuracy in modeling TF binding specificities starting from homologous structures. Owing to the relatively short-range nature of our free energy function, it is sufficient to substitute amino acids only at the DNA-binding interface when creating protein–DNA homology models. Homology modeling should be easiest when there are no dissimilar amino acid substitutions at the interface, because in many instances TFs with conserved interfaces have identical binding specificities. Our model makes accurate predictions in such cases, but changes in binding specificity resulting from amino acid mutations are often predicted less accurately (Figure 6).

In summary, the computational algorithm developed here is useful for binding affinity and weight matrix predictions if either a native structure of the protein–DNA complex or its sufficiently close homolog is available. Unlike previously reported knowledge-based approaches (15,16), our algorithm is not limited to any specific TF family and is not as data intensive.
However, its accuracy strongly depends on the quality of the experimental structure used as the modeling template, and the number of amino acid substitutions at the DNA-binding interface. In the future, we intend to combine structurally predicted PWMs with motif detection algorithms in order to identify TF binding sites on the genomic scale.

SOFTWARE AND DATA AVAILABILITY

The protein–nucleic acid interaction module is implemented in C++ in the ROSETTA software package (http://www.bakerlab.org). The ROSETTA software package is freely available to academic users. We hope to foster further development of computational algorithms for protein–DNA binding specificity predictions by providing all experimental datasets used in this study (including ΔG measurements, PWMs and TF binding sites) as Supplementary Data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
Acknowledgments

The authors would like to thank Jonathan Widom for careful reading of the manuscript and many useful suggestions, and Eldon Emberly for providing experimental binding data for several TFs. A.V.M. would like to acknowledge a fellowship from the Leukemia and Lymphoma Society. J.J.H. is a fellow of the Jane Coffin Childs Memorial Fund for Medical Research. Funding to pay the Open Access publication charges for this article was provided by National Institutes of Health (NIH).

Conflict of interest statement. None declared.

REFERENCES

1. Bulyk M. Computational prediction of transcription-factor binding site locations. Genome Biol. 2003;5:201.
2. Stormo G.D. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
3. Siggia E.D. Computational methods for transcriptional regulation. Curr. Opin. Genet. Dev. 2005;15:214–221.
4. Seeman N.C., Rosenberg J.M., Rich A. Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl Acad. Sci. USA. 1976;73:804–808.
5. Suzuki M., Yagi N. DNA recognition code of transcription factors in the helix–turn–helix, probe helix, hormone receptor, and zinc finger families. Proc. Natl Acad. Sci. USA. 1994;91:12357–12361.
6. Matthews B.W. Protein–DNA interaction. No code for recognition. Nature. 1988;335:294–295.
7. Luscombe N.M., Laskowski R.A., Thornton J.M. Amino acid–base interactions: a three-dimensional analysis of protein–DNA interactions at an atomic level. Nucleic Acids Res. 2001;29:2860–2874.
8. Luscombe N.M., Thornton J.M. Protein–DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol. 2002;320:991–1009.
9. Pabo C.O., Sauer R.T. Transcription factors: structural families and principles of DNA recognition. Annu. Rev. Biochem. 1992;61:1053–1095.
10. Pabo C.O., Nekludova L. Geometric analysis and comparison of protein–DNA interfaces: Why is there no simple code for recognition? J. Mol. Biol. 2000;301:597–624.
11. Benos P.V., Lapedes A.S., Stormo G.D. Is there a code for protein–DNA recognition? Probab(ilistical)ly. Bioessays. 2002;24:466–475.
12. Mandel-Gutfreund Y., Margalit H. Quantitative parameters for amino acid–base interaction: implications for prediction of protein–DNA binding sites. Nucleic Acids Res. 1998;26:2306–2312.
13. Kono H., Sarai A. Structure-based prediction of DNA target sites by regulatory proteins. Proteins. 1999;35:114–131.
14. Sandelin A., Wasserman W.W. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J. Mol. Biol. 2004;338:207–215.
15. Benos P.V., Lapedes A.S., Stormo G.D. Probabilistic code for DNA recognition by proteins of the EGR family. J. Mol. Biol. 2002;323:701–727.
16. Kaplan T., Friedman N., Margalit H. Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput. Biol. 2005;1:e1.
17. Endres R.G., Schulthess T.C., Wingreen N.S. Toward an atomistic model for predicting transcription-factor binding sites. Proteins. 2004;57:262–268.
18. Paillard G., Lavery R. Analyzing protein–DNA recognition mechanisms. Structure. 2004;12:113–122.
19. Paillard G., Deremble C., Lavery R. Looking into DNA recognition: zinc finger binding specificity. Nucleic Acids Res. 2004;32:6673–6682.
20. Havranek J.J., Duarte C.M., Baker D. A simple physical model for the prediction and design of protein–DNA interactions. J. Mol. Biol. 2004;344:59–70.
21. Kortemme T., Morozov A.V., Baker D. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. J. Mol. Biol. 2003;326:1239–1259.
22. Onufriev A., Bashford S.D., Case D.A. Exploring protein native states and large-scale conformational changes with a modified generalized Born model. Proteins. 2004;55:383–394.
23. Lazaridis T., Karplus M. Effective energy function for proteins in solution. Proteins. 1999;35:133–152.
24. Wang J., Cieplak P., Kollman P.A. How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comput. Chem. 2000;21:1049–1074.
25. Olson W.K., Gorin A.A., Lu X., Hock L.M., Zhurkin V.B. DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA. 1998;95:11163–11168.
26. Lu X., Olson W.K. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structure. Nucleic Acids Res. 2003;31:5108–5121.
27. Lu X., El Hassan M.A., Hunter C.A. Structure and conformation of helical nucleic acids: analysis program (SCHNAaP). J. Mol. Biol. 1997;273:668–680.
28. Gromiha M.M., Siebers J.G., Selvaraj S., Kono H., Sarai A. Intermolecular and intramolecular readout mechanisms in protein–DNA recognition. J. Mol. Biol. 2004;337:285–294.
29. Berg O.G., von Hippel P.H. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987;193:723–750.
30. Jin Y., Zhong H., Vershon A.K. The yeast a1 and α2 homeodomain proteins do not contribute equally to heterodimeric DNA binding. Mol. Cell. Biol. 1999;19:585–593.
31. Swirnoff A.H., Milbrandt J. DNA-binding specificity of NGFI-A and related zinc finger transcription factors. Mol. Cell. Biol. 1995;15:2275–2287.
32. Robison K., McGuire A., Church G. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol. 1998;284:241–254.
33. Salgado H., Gama-Castro S., Martínez-Antonio A., Díaz-Peredo E., Sánchez-Solano F., Peralta-Gil M., Garcia-Alonso D., Jimenez-Jacinto V., Santos-Zavaleta A., Bonavides-Martinez C., et al. RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 2004;32:D303–D306.
34. Pierce M., Benjamin K.R., Montano S.P., Georgiadis M.M., Winter E., Vershon A.K. Sum1 and Ndt80 proteins compete for binding to middle sporulation element sequences that control meiotic gene expression. Mol. Cell. Biol. 2003;23:4814–4825.
35. Natarajan K., Meyer M.R., Jackson B.M., Slade D., Roberts C., Hinnebusch A.G., Marton M.J. Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol. Cell. Biol. 2001;21:4347–4368.
36. Wingender E., Chen X., Fricke E., Geffers R., Hehl R., Liebich I., Krull M., Matys V., Michael H., Ohnhauser R., et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001;29:281–283.
37. Wilson D.S., Guenther B., Desplan C., Kuriyan J. High resolution crystal structure of a paired (Pax) class cooperative homeodomain dimer on DNA. Cell. 1995;82:709–719.
38. Jaynes E.T. Probability Theory: The Logic of Science. Cambridge, UK: Cambridge University Press; 2003.
39. Dunbrack R.L., Jr, Cohen F.E. Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 1997;6:1661–1681.
40. Kuhlman B., Baker D. Native protein sequences are close to optimal for their structures. Proc. Natl Acad. Sci. USA. 2000;97:10383–10388.
41. Lesser D.R., Kurpiewski M.R., Jen-Jacobson L. The energetic basis of specificity in the EcoRI endonuclease–DNA interaction. Science. 1990;250:776–786.
42. Hamilton T.B., Borel F., Romaniuk P.J. Comparison of the DNA binding characteristics of the related zinc finger proteins WT1 and EGR1. Biochemistry. 1998;37:2051–2058.
43. Chu S., DeRisi J., Eisen M., Mulholland J., Botstein D., Brown P.O., Herskowitz I. The transcriptional program of sporulation in budding yeast. Science. 1998;282:699–705.
44. Dranginis A.M. Binding of yeast a1 and α2 as a heterodimer to the operator DNA of a haploid-specific gene. Nature. 1990;347:682–685.
45. Hao D., Ohme-Takagi M., Sarai A. Unique mode of GCC box recognition by the DNA-binding domain of ethylene-responsive element-binding factor (ERF domain) in plant. J. Biol. Chem. 1998;273:26857–26861.
46. Lipsick J.S., Wang D.-M. Transformation by v-Myb. Oncogene. 1999;18:3047–3055.
47. Tanikawa J., Yasukawa T., Enari M., Ogata K., Nishimura Y., Ishii S., Sarai A. Recognition of specific DNA sequences by the c-myb protooncogene product: role of three repeat units in the DNA-binding domain. Proc. Natl Acad. Sci. USA. 1993;90:9320–9324.
48. Lee M.R., Kollman P.A. Free-energy calculations highlight differences in accuracy between X-ray and NMR structures and add value to protein structure prediction. Structure. 2001;9:905–916.
49. Engler L.E., Sapienza P., Dorner L.F., Kucera R., Schildkraut I., Jen-Jacobson L. The energetics of the interaction of BamHI endonuclease with its recognition site GGATCC. J. Mol. Biol. 2001;307:619–636.
50. Poon G.M.K., Macgregor R.B., Jr. Base coupling in sequence-specific site recognition by the ETS domain of murine PU.1. J. Mol. Biol. 2003;328:805–819.
51. Sarai A., Takeda Y. λ Repressor recognizes the ~2-fold symmetric half-operator sequences asymmetrically. Proc. Natl Acad. Sci. USA. 1989;86:6513–6517.
52. Takeda Y., Sarai A., Rivera V.M. Analysis of the sequence-specific interactions between Cro repressor and operator DNA by systematic base substitution experiments. Proc. Natl Acad. Sci. USA. 1989;86:439–443.
53. Fields D.N., He Y., Al-Uzri A.Y., Stormo G.D. Quantitative specificity of the Mnt repressor. J. Mol. Biol. 1997;271:178–194.
54. Man T.-K., Stormo G.D. Non-independence of Mnt repressor–operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 2001;29:2471–2478.
55. Bulyk M.L., Johnson P., Church G. Nucleotides of transcription factor binding sites exert inter-dependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002;30:1255–1261.
56. Benos P.V., Bulyk M.L., Stormo G.D. Additivity in protein–DNA interactions: how good an approximation is it? Nucleic Acids Res. 2002;30:4442–4451.
57. Vogtli M., Elke C., Imhof M.O., Lezzi M. High level transactivation by the ecdysone receptor complex at the core recognition motif. Nucleic Acids Res. 1998;26:2407–2414.
58. Schneider T.D., Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100.
59. Kortemme T., Baker D. A simple physical model for binding energy hot spots in protein–protein complexes. Proc. Natl Acad. Sci. USA. 2002;99:14116–14121.
60. Kim D.E., Chivian D., Baker D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 2004;32:W526–W531.
61. Thompson J.D., Higgins D.G., Gibson T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.
62. Rajewsky N., Vergassola M., Gaul U., Siggia E.D. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. 2002;3:30.
63. Schroeder D., Pearce M., Fak J., Fan H., Unnerstall U., Emberly E., Rajewsky N., Siggia E.D., Gaul U. Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol. 2004;2:1396–1410.
64. Fraenkel E., Rould M.A., Chambers K.A., Pabo C.O. Engrailed homeodomain–DNA complex at 2.2 Å resolution: a detailed view of the interface and comparison with other Engrailed structures. J. Mol. Biol. 1998;284:351–361.
65. Tucker-Kellogg L., Rould M.A., Chambers K.A., Ades S.E., Sauer R.T., Pabo C.O. Engrailed (Gln50->Lys) homeodomain–DNA complex at 1.9 Å resolution: structural basis for enhanced affinity and altered specificity. Structure. 1997;5:1047–1054.
66. Nekludova L., Pabo C.O. Distinctive DNA conformation with enlarged major groove is found in Zn-finger-DNA and other protein–DNA complexes. Proc. Natl Acad. Sci. USA. 1994;91:6948–6952.
67. Miller J.C., Pabo C.O. Rearrangement of side-chains in a Zif268 mutant highlights the complexities of zinc finger-DNA recognition. J. Mol. Biol. 2001;313:309–315.
68. Coskun-Ari F.F., Hill T.M. Sequence-specific interactions in the Tus–Ter complex and the effect of base pair substitutions on arrest of DNA replication in Escherichia coli. J. Biol. Chem. 1997;272:26448–26456.
69. Frank D.E., Saecker R.M., Bond J.P., Capp M.W., Tsodikov O.V., Melcher S.E., Levandoski M.M., Record M.T., Jr. Thermodynamics of the interactions of Lac repressor with variants of the symmetric Lac operator: effects of converting a consensus site to a non-specific site. J. Mol. Biol. 1997;267:1186–1206.
70. Grillo A.O., Brown M.P., Royer C.A. Probing the physical basis for trp repressor-operator recognition. J. Mol. Biol. 1999;287:539–554.
71. Boyer M., Poujol N., Margeat E., Royer C.A. Quantitative characterization of the interaction between purified human estrogen receptor α and DNA using fluorescence anisotropy. Nucleic Acids Res. 2000;28:2494–2502.
72. Gunasekera A., Ebright Y.W., Ebright R.H. DNA sequence determinants for binding of the Escherichia coli catabolite gene activator protein. J. Biol. Chem. 1992;267:14713–14720.