Copyright Gao, Skolnick. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original author and
source are credited.

From Nonspecific DNA–Protein Encounter Complexes to the Prediction of DNA–Protein Interactions

Center for the Study of Systems Biology, School of Biology, Georgia
Institute of Technology, Atlanta, Georgia, United States of America

Ilya Vakser, Editor
University of Kansas, United States of America

* E-mail: skolnick/at/gatech.edu

Conceived and designed the experiments: MG JS. Performed the experiments: MG.
Analyzed the data: MG. Contributed reagents/materials/analysis tools: MG.
Wrote the paper: MG JS.

Received December 9, 2008; Accepted February 26, 2009.

Abstract

DNA–protein interactions are involved in many essential biological
activities. Because there is no simple mapping code between DNA base pairs and
protein amino acids, the prediction of DNA–protein interactions is a
challenging problem. Here, we present a novel computational approach for
predicting DNA-binding protein residues and DNA–protein interaction
modes without knowing the protein's specific DNA target sequence. Given the structure of a
DNA-binding protein, the method first generates an ensemble of complex
structures obtained by rigid-body docking with a nonspecific canonical B-DNA.
Representative models are subsequently selected through clustering and ranking
by their DNA–protein interfacial energy. Analysis of these encounter
complex models suggests that the recognition sites for specific DNA binding are
usually favorable interaction sites for the nonspecific DNA probe and that
nonspecific DNA–protein interaction modes exhibit some similarity to
specific DNA–protein binding modes. Although the method requires as
input the knowledge that the protein binds DNA, in benchmark tests, it achieves
better performance in identifying DNA-binding sites than three previously
established methods, which are based on sophisticated machine-learning
techniques. We further apply our method to protein structures predicted through
modeling and demonstrate that our method performs satisfactorily on protein
models whose Cα root-mean-square deviation from their native structures
is up to 5 Å. This study provides valuable
structural insights into how a specific DNA-binding protein interacts with a
nonspecific DNA sequence. The similarity between the specific
DNA–protein interaction mode and nonspecific interaction modes may
reflect an important sampling step in search of its specific DNA targets by a
DNA-binding protein.

Author Summary

Many essential biological activities require interactions between DNA and
proteins. These proteins usually use certain amino acids, called DNA-binding
sites, to recognize their specific DNA targets. To facilitate the search for its
specific DNA targets, a DNA-binding protein often associates with nonspecific
DNA and then diffuses along the DNA. Due to the weak interactions between
nonspecific DNA and the protein, structural characterization of nonspecific
DNA–protein complexes is experimentally challenging. This paper
describes a computational modeling study on nonspecific DNA–protein
complexes and comparative analysis with respect to specific
DNA–protein complexes. The study found that the specific DNA-binding
sites on a protein are typically favorable for nonspecific DNA and that
nonspecific and specific DNA–protein interaction modes are quite
similar. This similarity may reflect an important sampling step in the search
for the specific DNA target sequence by a DNA-binding protein. On the basis of
these observations, a novel method was proposed for predicting DNA-binding sites
and binding modes of a DNA-binding protein without knowing its specific DNA
target sequence. Ultimately, the combination of this method and protein
structure prediction may lead the way to high throughput modeling of
DNA–protein interactions.

Introduction

DNA-binding proteins play an essential role in many fundamental biological
activities, including DNA transcription, replication, packaging, repair and
rearrangement. Interactions relevant to these activities typically involve specific
binding sites on both proteins and DNA. Over the past several decades, many efforts
have been made in order to understand basic principles that determine the specific
DNA-protein interactions. It is well-known that there does not exist a simple
recognition code between protein amino acids and DNA base pairs [1]–[4]. This poses a great
challenge for the prediction of DNA-protein interactions.

The daunting task of elucidating DNA-protein interactions can be addressed with the
assistance of computational modeling. Methods for docking the complex from separated
protein/DNA structures have been developed [5]–[7]. As an
early example, the Monte Carlo program MONTY has been applied to sample
configurations of a single DNA-protein complex in the vicinity of its native state
[6].
The development of an efficient geometric recognition algorithm [8], which allows a global search for optimal surface
complementarity through rigid body rotation and translation, greatly advanced the
molecular docking field. An implementation of the algorithm, FTDOCK, was applied to
DNA-protein docking [5], with encouraging benchmark results reported on
modeling eight DNA/repressor complexes starting from unbound protein structures and
canonical B-DNA. A more recent approach, HADDOCK, starts with a similar rigid body
docking procedure, followed by semi-flexible refinement [7]. Excellent docking
models were obtained for three examples by HADDOCK.

The docking methods assume the availability of both protein and DNA structures. Given
only the structure of a DNA-binding protein, it is of interest to determine the
DNA-binding protein residues without the knowledge of the associated specific DNA
sequence and structure with which the protein interacts. In the last few years,
several methods have been developed to address this problem [9]–[15]. Most
focus on analyzing characteristic patterns of DNA-binding residues from the solved
structures of complexes. Standard machine-learning techniques, such as Support
Vector Machine [10],[13] and neural networks
[9],[14], have been adopted to differentiate DNA-binding
residues from non-DNA-binding residues, using features like sequence composition,
evolutionary profile, solvent accessibility, and electrostatic potential. Recently,
a knowledge-based method DBD-Hunter that combines structural comparison and
evaluation of a statistical pair potential was proposed for predicting DNA-binding
proteins and associated binding residues [11]. The method yields an
accuracy of 87% on DNA-binding site prediction in comprehensive
benchmarks. However, the method is limited by the availability of appropriate
DNA-protein complex structures to be used as templates.

In this study, we present a novel approach for predicting the protein residues that
bind DNA and DNA-protein interaction modes, given the structure of a DNA-binding
protein as the input. We systematically docked 44 specific DNA-binding proteins in
both holo (DNA-bound) and apo (DNA-free) forms to a nonspecific canonical B-DNA
molecule. Using energy evaluation and model clustering, we obtained representative
complex models that provide structural insights into how DNA-binding proteins
interact with a nonspecific DNA sequence. For about 80% of the proteins,
the sites for specific DNA recognition are among the favorable interaction sites for
nonspecific DNA binding. Furthermore, the interaction modes observed in the top
ranked, nonspecific DNA-protein encounter complexes bear a certain similarity to the
specific DNA-protein binding mode in the experimental structure. The biological
implications of this similarity are discussed. Moreover, we demonstrate that our
approach achieves better performance than three established methods based on
machine-learning techniques. In addition to experimental structures, we show that
our method can be applied to predicted protein models, generated by the
state-of-the-art modeling program TASSER [16]. Satisfactory results
were obtained for protein models with a Cα root-mean-square deviation (RMSD) ≤5
Å from their native holo-structures. We also
show that our method can be further improved by considering conformational changes
of DNA.

Results

DNA-Binding Site

The apo- and holo-structures of 44 non-redundant specific DNA-binding proteins
(Table S1) are docked separately to a nonspecific B-DNA composed of 16
dA·dT base pairs, following the modeling procedure illustrated in
Figure 1A = 1.0) and a random model
(MCC = 0.0). As a representative example, Figure 1B and 1C
Analysis of docking solutions suggests that specific DNA-binding sites on
proteins are typically among the energetically favorable sites for sampling the
nonspecific DNA. As shown in Figure
2
One can utilize this observation to predict specific DNA-binding sites on protein
through analyzing nonspecific DNA-protein docking solutions. Figure 3A and 3B
To further improve model selection, we introduced a clustering procedure and
compared various model selection schemes shown in Figure 3C and 3D
Our method can readily take advantage of known information about DNA-binding
sites, such as data collected from mutagenesis studies, NMR experiments, or
sequence conservation analysis. The information can be used to derive contact
restraints for model filtration [5],[7]. To illustrate this
point, we randomly picked native DNA-binding protein residues and filtered out all
models in which these residues do not contact DNA. When applying more than one
such restraint, we obtained significantly better top one models (Figure 4

DNA–Protein Interaction Mode

Next, we compare interaction modes between representative nonspecific DNA-protein
encounter complexes and the native (experimental) specific DNA-protein
complexes. For this comparison, we need a mapping between the nonspecific DNA
and the specific DNA complexed with the protein in the native structure. The
mapping was obtained by gaplessly threading the nonspecific DNA along the native
DNA with a scoring function that maximizes the overlap of the DNA-protein
residue contacts. Then, the native DNA-protein contacts observed in the model
were counted, and the RMSD of native interfacial residues relative to their
positions in the model was calculated by optimally superposing these interfacial
residues. For each protein, the best result among the top five clustering models is
shown in Figure 5A
From the prediction perspective, we may define a DNA-protein complex model as
acceptable if the model satisfies one of the following two conditions: (i)
Fnat≥30%, or (ii) Fnat≥10% and
RMSDint≤4 Å, where Fnat is the fraction of native contacts and RMSDint is the
interfacial RMSD. These criteria are adopted from the Critical
Assessment of PRedicted Interactions (CAPRI) [19]. Using these
criteria, the predicted DNA-binding modes for 71%/86% of
APO/HOLO proteins can be classified as acceptable, resulting in a mean
RMSDint of 3.9/3.1 Å and a mean Fnat of
37%/44%.

Three examples of predicted nonspecific DNA-protein complex models based on
apo-structures are compared with the corresponding native specific DNA-protein
complex structures in Figure 5B–D

The second example from Saccharomyces cerevisiae Ndt80 is a
DNA-binding domain belonging to the immunoglobulin-fold family of transcription
factors [22],[23] (Figure 5C

The third example is a type II restriction endonuclease, EcoRV (Figure 5D

Application to Predicted Protein Models

Our approach was further validated on predicted protein models. First, the
sequences of these 44 DNA-binding proteins were input into the threading
algorithm PROSPECTOR_3.0 [26]. Depending on the confidence levels of the
structural templates identified, proteins were classified into two groups: 30
Easy targets, which typically have good quality templates, and 14 Hard targets,
which usually do not have a reliable template hit. Note that we excluded from
the template library any structure that shares >30% global
sequence identity with a given target. The best template, ranked by the TM-score
structural similarity metric [27], has a mean RMSD of 7.9 Å with
respect to the native holo-structure over about 92% alignment
coverage, and the mean sequence identity of these templates is 19%.
After TASSER runs for model assembly and refinement [16], the mean RMSDs of
the top TASSER model and of the best of top five models were improved to 6.9
Å and 6.4 Å over the regions aligned with the templates.
Overall, the mean TM-scores of the top and the best of five top models compared
against the native holo structure are 0.61 and 0.63; the latter is
~9% higher than the average TM-score of the best threading
templates. Systematic model improvement over the best templates is evident, as
an improved structural model was obtained in 37 of 44 cases.

For each protein, the top TASSER model was employed for docking and subsequent
analysis. The number of proteins whose top TASSER model has an RMSD ≤5.0
Å from the native holo-structures is 24 (55%); all but one
are from the easy set (Figure
6A
One example, the DNA-binding domain from an E. coli group IV
σ factor, is illustrated in Figure 6C

Comparison with Other DNA–Protein Pair Potentials

In addition to the DNA-protein energy function described above, we also tested
the performance of three other statistical pair potentials proposed previously:
the residue-level quasichemical potential QCRes [5], the all-atom quasichemical
potential QCAA [29], and the all-atom potential RAPDF [30] (see Methods). While the residue-level quasichemical potential uses a single
distance cutoff of 4.5 Å, the all-atom potentials are distance
dependent up to 10 Å. Since in previous studies, the potentials were
derived from relatively small data sets, we re-parameterized these three
potentials with the same set of 179 crystal complex structures used for our
functional-group level quasichemical potential derivation [11]. Then, for each target
from the APO/HOLO sets, the top 2500 docking solutions described above were
re-ranked according to the energies calculated with the new potentials. Table 2 shows the results of
binding site and mode predictions for the best of the top five models. On
average, our energy function outperforms these three potentials. The mean MCC
for the binding site prediction is 0.59/0.51 for the HOLO/APO sets using our
energy function without clustering, compared with 0.55/0.47 from both the
residue and all-atom quasichemical potentials, and 0.40/0.24 from the
conditional probability scoring function RAPDF. Correspondingly, our energy
function selected acceptable binding complex models in
77%/59% of the cases, whereas the residue-based and the
two all-atom potentials selected acceptable models in
71%/50%, 71%/55%, and
40%/32% of the cases, respectively. These results suggest
that detailed all-atom representations do not necessarily have an advantage over
simplified residue or functional-group level potentials when applied to rank
docking solutions from a non-specific DNA sequence. We also note that the
clustering models, which have a mean MCC of 0.62/0.54, are significantly better
than models selected by the three potentials (Wilcoxon signed-rank tests,
P<0.04).
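
To make the contrast between plain energy ranking and cluster-based selection concrete, the following Python sketch implements both schemes on toy data. It follows the clustering steps given in Methods (seed at the lowest-energy unassigned model, a 6 Å cutoff on the COM of the DNA-binding protein residues, clusters ranked by mean energy, lowest-energy member as representative). The `Model` container, the function names, and the toy coordinates are hypothetical and are not taken from the authors' software, and COM is assumed here to mean center of mass; this is an illustrative sketch under those assumptions, not the published implementation.

```python
from dataclasses import dataclass
from math import dist
from statistics import mean


@dataclass(frozen=True)
class Model:
    """One docking solution (hypothetical container, for illustration only)."""
    name: str
    energy: float  # interfacial energy; lower is more favorable
    com: tuple     # (x, y, z) COM of the DNA-contacting protein residues, in Å


def top_by_energy(models, n=5):
    """Selection without clustering: the n lowest-energy models."""
    return sorted(models, key=lambda m: m.energy)[:n]


def cluster_representatives(models, radius=6.0, n=5):
    """Greedy seed-based clustering on COM distance, then rank clusters by mean energy."""
    remaining = sorted(models, key=lambda m: m.energy)
    clusters = []
    while remaining:
        seed = remaining[0]  # lowest-energy model not yet assigned
        members = [m for m in remaining if dist(m.com, seed.com) <= radius]
        clusters.append(members)
        # Assigned models are removed from subsequent clustering.
        remaining = [m for m in remaining if dist(m.com, seed.com) > radius]
    clusters.sort(key=lambda c: mean(m.energy for m in c))
    # The representative of each cluster is its lowest-energy member.
    return [min(c, key=lambda m: m.energy) for c in clusters[:n]]


# Toy data: three surface sites, two of them sampled twice.
models = [
    Model("a", -10.0, (0.0, 0.0, 0.0)),
    Model("b", -9.0, (1.0, 0.0, 0.0)),
    Model("c", -8.0, (20.0, 0.0, 0.0)),
    Model("d", -1.0, (21.0, 0.0, 0.0)),
    Model("e", -7.5, (40.0, 0.0, 0.0)),
]
print([m.name for m in top_by_energy(models, n=3)])
print([m.name for m in cluster_representatives(models, n=3)])
```

In this toy example the two lowest-energy models occupy the same site, so energy ranking alone returns two near-duplicates, whereas the cluster representatives cover three distinct sites.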

Comparison with Other DNA-Binding Site Prediction Methods

Our approach was compared with three established methods [9],[13],[14] that
predict DNA-binding sites based on protein structures. Note that none of these
three methods is capable of predicting the DNA-protein interaction mode. For the
purpose of comparison, all calculations were carried out on the same set, AS62
[9], composed of DNA-binding protein structures in
their holo-forms. As shown in Table 3, the top model from our approach already yields better
results than previous methods on average. The mean MCC of our top model is 0.53,
compared to 0.49 obtained independently by the Kuznetsov group [13]
and by Tjong and Zhou's method named DISPLAR [14]. Moreover, the best
of our top five models significantly improves the DNA-binding site prediction
with a mean MCC of 0.62 and a mean accuracy of 87%, leading the
results from the Kuznetsov method or DISPLAR by about one standard deviation
unit. The latter two methods perform better than that proposed by Ahmad et al. [9].
This difference can be partially attributed to the fact that Ahmad et al. did not
use position-specific sequence profiles in their method.
We further compared the performance of our method on apo structures with DISPLAR.
DNA-binding sites were predicted for the 44 protein structures in the APO set
using the DISPLAR webserver. The average MCC/accuracy of
DISPLAR is 0.39/82.5%, slightly lower than the
0.40/82.7% obtained with the first-ranked model of our method.
The difference is not statistically significant. However, the performance of the
best of the top five models by our method, 0.54/86.7%, is
significantly better than that of DISPLAR (Wilcoxon signed-rank test,
P<0.001). In practice, the multiple (but limited number
of) models generated by our method can be filtered through incorporation of
existing experimental studies on binding-sites, thereby further improving the
prediction.

Effects of Conformational Changes

The difference between the predicted docking model and the native complex
structure may be explained by two main reasons: First, nonspecific instead of
specific DNA was used for docking. Second, rigid-body docking does not consider
the conformational changes of either the DNA or the protein. The effects of
conformational changes in the protein are clear, as holo-structures consistently
produce models closer to the native state than apo-structures do. In
principle, by also taking DNA conformational changes into account, one should be
able to obtain improved models.

The flexibility problem can be partially addressed by docking the protein to
a library of DNA in various conformations [7]. To explore this
idea, we constructed a DNA library composed of three poly dA·dT B-DNA
structures, whose backbone RMSDs range from 1 to 3 Å with respect to
the canonical B-DNA used above, and the canonical B-DNA itself (see Table S2).
For convenience, we name the canonical B-DNA as D0, and the DNA library as Dlib.
Using Dlib, we obtained complex models generated by docking the protein to each
DNA in the library. For each of the four protein-DNA combinations, the same
docking procedure described above was followed, and the top five clustering
models were selected and pooled together. From this pool of twenty clustering
models, we selected the top five models according to their interfacial energy. As
shown in Figure 7
One can further estimate the upper limit of such improvement by docking holo
protein structures to nonspecific DNA that adopts the native specific-DNA
conformation, though in general one cannot assume that the nonspecific DNA
associates with the protein in exactly the same conformation as the specific
DNA. In this estimation, we took the native DNA structures from the 44 complex
structures and mutated all base pairs into dA·dT with the program
3DNA [31].
We name this set of DNA structures Dnat. Each protein structure from the HOLO
set was then docked to the corresponding DNA structure in Dnat. The resulting
average MCC for binding site prediction from the best of top five clustering
models is 0.71 (Figure 7

Discussion

How a DNA-binding protein locates its specific DNA target sequence is a fundamental,
unsolved problem in biology. It has been proposed that association with nonspecific
DNA sequences and subsequent travel along the sequence facilitates the search for
the specific DNA target sequence [17],[18]. In this regard, it
has been shown that specific DNA-binding proteins, such as transcription factors and
restriction endonucleases, can locate target sites at rates several orders of
magnitude faster than that estimated by random three-dimensional diffusion, through
mechanisms known collectively as facilitated diffusion [17],[18]. A crucial step of
the facilitated diffusion processes involves the association of the protein with a
nonspecific DNA sequence; this is followed by one-dimensional sliding along the DNA
or hopping over short distances to accelerate the search for a specific DNA target
sequence. Despite recent advances that provide visualizations of protein sliding
along DNA [33], the structural details of how a DNA-binding protein
associates with a nonspecific DNA remain elusive, primarily due to weak interactions
between nonspecific DNA and the protein. Indeed, because the
interactions are nonspecific, only a few atomic structures have been solved for
nonspecific DNA-protein complexes [24],[34]. Our study provides
useful structural insights into how a specific DNA-binding protein interacts with a
nonspecific DNA sequence during the facilitated diffusion process. The similarity
between the specific DNA-protein interaction mode and nonspecific interaction modes
may reflect an important sampling step in search of its specific DNA targets by a
DNA-binding protein.

By systematically studying encounter complexes of 44 specific DNA-binding proteins
with a nonspecific DNA molecule, we found that the vast majority of these
DNA-binding proteins favorably interact with nonspecific DNA at the same binding
sites for their specific DNA targets. Using APO/HOLO-structures for docking and a
pair potential for energy ranking, we obtained at least one near-native model among
the top ten models for 77%/84% of APO/HOLO proteins. In these
models, protein residues that contact the nonspecific DNA coincide with those that
contact the specific DNA with an MCC>0.5. By introducing a clustering
procedure, the most native-like model among the top five cluster representatives has
an average MCC of 0.54/0.62 when APO/HOLO structures are used. Moreover, the
DNA-protein interaction modes observed in these models resemble the corresponding
native binding modes with specific DNA. The average interfacial RMSD is 4.6/3.4
Å, and the fraction of native contacts observed is
33%/41% for APO/HOLO proteins, respectively.

Our results therefore suggest that a DNA-binding protein frequently samples
nonspecific DNA using the same binding sites as used for specific DNA recognition.
The results are consistent with a recent Langevin dynamics study on the diffusion of
three DNA-binding proteins along nonspecific DNA [35], and are also
consistent with the few available atomic structures of DNA-binding proteins in
complex with both specific and nonspecific DNA [24],[34]. One interesting
example is the endonuclease EcoRV, which locates a specific cleavage site through a
combination of 1D sliding along nonspecific sequence and 3D jumping [36],[37]. The
nonspecific DNA recognition observed in our top model and in a crystal structure of
the nonspecific DNA-EcoRV complex involves the same set of protein residues which
also participate in specific DNA recognition [24]. However, the majority
of native contacts formed in the cognate DNA-protein complex structure are lost in
our model, largely due to the absence of the dramatic bending exhibited by the
cognate DNA.

The overlap of nonspecific and specific DNA interaction sites on the protein surface
allows us to predict DNA-binding residues. The best of top five models generated
with holo-structures have an average MCC of 0.62, which is 15% higher
than the average MCC of 0.54 obtained with apo-structures. Despite the notable
difference, the performance of our method is satisfactory for apo-structures. This
validation on apo-structures has important practical applications. Going beyond the
DNA-binding site prediction, our method also provides models for the DNA-protein
interaction modes. For 86%/71% of HOLO/APO structures, at
least one of the top five models exhibits an interaction mode somewhat similar to
the native binding mode, with a mean RMSDint of 3.1/3.9 Å and a mean
Fnat of 44%/37%. These complex models are acceptable using
CAPRI criteria [19].

The performance of our method in DNA-binding site prediction has been compared with
three machine-learning based methods. We note that the top model by our method
already performs better than the other methods in terms of MCC and overall accuracy.
While machine learning based methods typically provide only one model for
assessment, our method generates a limited number of representative models for
selection. This can be a great advantage for practical application, since
incorporation of existing experimental studies on binding-sites may greatly improve
model selection. On average, the best of the top five models from our method achieves
an MCC of 0.62 and an accuracy of 87%, which is significantly better than the
MCC of 0.49 and accuracy of 81% of DISPLAR [14], the best among the other
methods. In addition, our method has the advantage of predicting the binding mode,
an ability that the machine-learning methods lack. A downside of our method,
however, is that it is computationally more demanding than machine-learning methods,
typically requiring hours versus minutes of computation time for
one target. Nevertheless, given the widespread availability of computational
resources, this is not a significant limitation.

Despite these successes, the method is not designed for predicting the specific DNA
sequence recognized by a DNA-binding protein; this is a related, yet very
challenging problem. Knowledge-based distance-dependent contact potentials at the
residue [4]
or the all-atom level [29],[30],[38], and
physics-based all-atom potentials [39],[40], have been applied to predict DNA specificity.
While these studies have reported success on a few cases, they are limited to known
atomic complex structures or models from closely related complex structures with
almost identical DNA-binding interface. Nevertheless, they suggest that a successful
approach must address structural flexibility and cooperativity among partners that
form a DNA-protein complex.

Another interesting question is whether one can use the current approach to determine
DNA-binding function given a protein structure. To explore this issue, we applied
the method to ~3,000 non-DNA-binding proteins collected previously [11].
Unfortunately, we were not able to derive a practical interfacial energy threshold
to differentiate DNA-binding proteins from non-DNA-binding proteins, despite the
notable difference of average interfacial energy. For DNA-binding function
prediction, the knowledge based approach DBD-Hunter [11], which requires that the
structure of a target protein be related to that of a known DNA binding protein,
seems more appropriate. Future efforts may involve expanding the template library
for DBD-Hunter by adding complex structure models obtained from the current
approach.

In the post-genomic era, the rapid progress of structural genomics projects has
greatly advanced our knowledge of structural biology. Each year, thousands of new
protein structures are determined and deposited in the PDB. In principle, the
accumulation of protein structures enables a practical solution to the folding
problem through template based modeling [16]. Using the
well-established modeling method TASSER, we obtained a top ranked protein
model within 5 Å of its native structure for over half of the 44
DNA-binding proteins. These models were constructed and refined from
homologous/analogous templates with less than 30% sequence identity. We
have demonstrated that one can satisfactorily predict DNA-binding sites using these
good models. The average MCC and accuracy are 0.51 and 84% for the best
of top five complex models. This is roughly comparable to the performance when
experimentally solved apo-structures are used. Ultimately, the combination of
modeling and DNA-protein docking may lead the way to the high throughput prediction
of DNA-protein interactions.

Methods

Data Sets

APO/HOLO sets

A total of 44 pairs of DNA-binding protein structures determined both in the
DNA-bound (HOLO) and unbound (APO) forms were selected from a previous study
[11] using the following criteria: (i) the holo-
and apo-structures share >90% global sequence identity;
(ii) the protein is bound to a specific DNA molecule in the holo-form; (iii)
the protein chain length is less than 400 residues; and (iv) the DNA bound
to the protein has more than 7 and fewer than 40 base pairs. These proteins
include 29 transcription factors, 12 enzymes, and 3 other types of
DNA-binding proteins (Table S1). All
share <35% global sequence identity with each other.

Protein–DNA Complex Modeling

A flowchart of the modeling protocol is provided in Figure 1
is a statistical pair potential at the functional group level
[11],
and is a surface burial term given by −0.02
kT/Å² × Buried Surface Area (BSA). BSA was
calculated with the program NACCESS [41]. The statistical
pair potential was developed from an analysis of 179 DNA-protein complex
structures [11]. For each target, we derived a corresponding
potential by excluding any homologous protein with >35%
sequence identity from the 179-complex set and repeating the analysis. The top 2500
energy-ranked models were retained for clustering, which uses the coordinates of
the center of mass (COM) of the DNA-binding protein residues. The clustering procedure starts by
selecting the top energy-ranked model as a clustering seed. All models within a
COM distance of 6 Å from the seed are assigned to this cluster and
removed from subsequent clustering. We then repeat this procedure until no model
is left. Finally, the clusters are ranked using the average energy of all
members in each cluster. From each cluster, we select the lowest energy model as
the representative model.

Model Assessment

A protein residue is assigned to be DNA-binding (or DNA-interacting) if at least
one heavy atom from the protein residue is within 4.5 Å of at least
one heavy atom from the DNA. Using this definition, about 18% of
protein residues can be classified as true DNA-binding in the analysis of the HOLO
set. Given the imbalanced nature of the DNA-binding residues and non-DNA-binding
residues, the Matthews correlation coefficient is a suitable metric for
assessing overlap or prediction of DNA-binding residues between an encounter
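complex and the native complex. The MCC is defined by [42]

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative residue assignments, respectively.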
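
As an illustration of the residue-level assessment, the Python sketch below applies the 4.5 Å heavy-atom cutoff stated above to label DNA-binding residues and then evaluates the standard Matthews correlation coefficient between a predicted and a native set of binding residues. The function names, the input layout (residue identifiers paired with coordinates), and the convention of returning 0 when the denominator vanishes are assumptions made for this example rather than details given in the paper.

```python
from math import dist, sqrt


def binding_residues(protein_atoms, dna_atoms, cutoff=4.5):
    """Return ids of residues with a heavy atom within `cutoff` Å of any DNA heavy atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) for protein heavy atoms.
    dna_atoms: iterable of (x, y, z) for DNA heavy atoms.
    """
    dna = list(dna_atoms)
    return {res for res, xyz in protein_atoms
            if any(dist(xyz, d) <= cutoff for d in dna)}


def mcc(predicted, native, all_residues):
    """Matthews correlation coefficient between predicted and native binding sets."""
    predicted, native, universe = set(predicted), set(native), set(all_residues)
    tp = len(predicted & native)              # binding in both
    fp = len(predicted - native)              # predicted only
    fn = len(native - predicted)              # native only
    tn = len(universe - predicted - native)   # binding in neither
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy usage: residue 1 has an atom 4 Å from the DNA atom, residue 2 is 6 Å away.
atoms = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
print(binding_residues(atoms, [(4.0, 0.0, 0.0)]))
print(mcc({0, 1, 4}, {0, 1, 2, 3}, range(10)))
```

Because only about 18% of residues are DNA-binding, a predictor that labels every residue as non-binding would score a high accuracy but an MCC of 0 under this convention, which is why the MCC is the more informative measure here.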
In the DNA-binding mode analysis, we mapped the nonspecific DNA to the specific
DNA by maximizing DNA-protein contact overlap. A DNA-protein contact is defined
at the residue level. The RMSD between two structures was calculated using the
coordinates of backbone Cα and/or DNA C1′ atoms. The
interfacial RMSD was calculated for interfacial protein/DNA residues observed in
the native specific-DNA-protein complex structure.

Protein Structure Modeling

The structures of the 44 proteins from the APO/HOLO sets were predicted following
the TASSER methodology [16]. Briefly, a target sequence was threaded
against a non-redundant protein structure library by the program PROSPECTOR_3
[26], and the resulting structure templates are used
for subsequent model assembly and refinement by the program TASSER, which uses a
Monte Carlo replica exchange algorithm for sampling. Note that we excluded any
template that shares >30% global sequence identity with the
target. The replica trajectories were clustered, and representative models
were generated from these clusters. We built all-atom protein models from the
reduced-atom TASSER models with the program PULCHRA [43]. In this study,
the top ranked TASSER model is employed for DNA-docking.

Statistical Pair Potentials

Four knowledge-based statistical DNA-protein pair potentials were developed from
an analysis of 179 non-redundant DNA-protein complex crystal structures [11]. These
include three quasichemical potentials at the residue [5], functional-group [11],
and all-atom [29] levels, and another all-atom potential
(termed RAPDF, residue-specific all-atom conditional probability discriminatory
function) using a different reference state [30]. RAPDF was
originally derived using the Bayesian probability formalism [30],[44]; it can be
expressed equivalently under the Boltzmann distribution formalism. Here, we
introduce all these potentials using the Boltzmann formalism, which assumes that
the frequencies of observed pair interaction states follow a Boltzmann
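distribution [45]. Consequently, the pair interaction energy
E can be deduced from the inverse of Boltzmann's law

$$E_{\alpha\beta}(d) = -k_{B}T \ln \frac{f^{\mathrm{obs}}_{\alpha\beta}(d)}{f^{\mathrm{exp}}_{\alpha\beta}(d)}$$

where $k_{B}T$ is the thermal energy, and $f^{\mathrm{obs}}_{\alpha\beta}(d)$ and $f^{\mathrm{exp}}_{\alpha\beta}(d)$ are the observed and expected frequencies of the
αβ pair at the distance d, respectively. For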
residue and functional-group level potentials, the distance d
is defined as the minimum distance between a pair of heavy atoms from the
corresponding αβ pair, and a single distance cutoff of 4.5
Å was used. Multiple distance bins from 3 Å to 10
Å with a bin width of 1 Å were employed for the two all-atom
potentials. The observed frequency $f^{\mathrm{obs}}_{\alpha\beta}(d)$ is obtained by normalizing
$N^{\mathrm{obs}}_{\alpha\beta}(d)$, the number of observed αβ contact
pairs at the distance d. For quasichemical potentials, the
expected frequency is given in terms of
$\chi_{\alpha}$ and $\chi_{\beta}$, the mole fractions of type α and β. The
mole fraction for each type is the overall mole fraction in the entire template
library, following a scheme known as the composition-independent scale [46].
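
As an illustration of the inverse-Boltzmann scheme for a single-cutoff quasichemical potential, the following Python sketch converts contact counts and library-wide mole fractions into pair energies. It is a sketch under stated assumptions rather than the authors' implementation: the function name `quasichemical_potential`, the normalization of the observed frequency by the total contact count, the expected frequency taken as the product of the two mole fractions, and all numbers in the toy example are hypothetical.

```python
import math

def quasichemical_potential(contact_counts, chi_protein, chi_dna, kT=1.0):
    """Pair energies E = -kT * ln(f_obs / f_exp) for a single distance cutoff.

    contact_counts: {(protein_type, dna_type): number of observed contacts}
    chi_protein, chi_dna: mole fractions of each type in the whole library
    (composition-independent scale). Energies are in units of kT.
    """
    total = sum(contact_counts.values())
    energies = {}
    for (a, b), n in contact_counts.items():
        if n == 0:
            continue  # no observations: the logarithm is undefined, so skip the pair
        f_obs = n / total                    # assumed normalization of the counts
        f_exp = chi_protein[a] * chi_dna[b]  # assumed random-mixing expectation
        energies[(a, b)] = -kT * math.log(f_obs / f_exp)
    return energies

# Toy library with two residue types and two nucleotide types (made-up counts)
counts = {("ARG", "G"): 60, ("ASP", "G"): 10, ("ARG", "A"): 20, ("ASP", "A"): 10}
chi_protein = {"ARG": 0.5, "ASP": 0.5}
chi_dna = {"G": 0.5, "A": 0.5}
E = quasichemical_potential(counts, chi_protein, chi_dna)
for pair, energy in sorted(E.items()):
    print(pair, round(energy, 3))
```

In this toy case the ARG-G pair is observed more often than expected and receives a negative (favorable) energy, whereas the ASP-G pair is under-represented and receives a positive one.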
For RAPDF, the expected frequency is estimated from the contact counts pooled over all pair types, following its original reference state [30],[44].
For a DNA-protein complex structure, the corresponding DNA-protein interfacial
energy is the summation of all observed pair interactions in the structure.

The RAPDF parameterization was performed using the program implemented previously
[30]. In a benchmark test on the DNA-protein docking
decoy set compiled by Robertson and Varani [30], our new set of
RAPDF parameters yields an average Z-score of −11.0 for the native
complex structures, slightly better than the previous average Z-score of
−9.6 obtained by parameters determined on a smaller set composed of 52
DNA-protein complex structures.

Availability

A web-server implementation of the method described here is available at
http://cssb.biology.gatech.edu/skolnick/webservice/DP-dock/.

Table S1. List of the DNA-binding proteins in the APO/HOLO sets (0.10 MB DOC)

Table S2. List of four B-DNA structures used in the DNA library (0.06 MB DOC)

Acknowledgments

We thank Dr. Gabriele Varani for providing us his scoring program.

Footnotes

The authors have declared that no competing interests exist.

This work was supported by the National Institutes of Health (Grant No.
GM-37408). The funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.

References

1. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol. 2000;1:REVIEWS001.
2. Matthews BW. Protein-DNA interaction - no code for recognition. Nature. 1988;335:294–295.
3. Pabo CO, Nekludova L. Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? J Mol Biol. 2000;301:597–624.
4. Sarai A, Kono H. Protein-DNA recognition patterns and predictions. Annu Rev Biophys Biomol Struct. 2005;34:379–398.
5. Aloy P, Moont G, Gabb HA, Querol E, Aviles FX, et al. Modelling repressor proteins docking to DNA. Proteins. 1998;33:535–549.
6. Knegtel RMA, Antoon J, Rullmann C, Boelens R, Kaptein R. MONTY - A Monte-Carlo approach to protein-DNA recognition. J Mol Biol. 1994;235:318–324.
7. van Dijk M, van Dijk ADJ, Hsu V, Boelens R, Bonvin A. Information-driven protein-DNA docking using HADDOCK: it is a matter of flexibility. Nucleic Acids Res. 2006;34:3317–3325.
8. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, et al. Molecular-surface recognition - Determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci U S A. 1992;89:2195–2199.
9. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486.
10. Bhardwaj N, Lu H. Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Letters. 2007;581:1058–1066.
11. Gao M, Skolnick J. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res. 2008;36:3978–3992.
12. Jones S, Shanahan HP, Berman HM, Thornton JM. Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Res. 2003;31:7189–7198.
13. Kuznetsov IB, Gou ZK, Li R, Hwang SW. Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins. 2006;64:19–27.
14. Tjong H, Zhou HX. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Res. 2007;35:1465–1477.
15. Yan CH, Terribilini M, Wu FH, Jernigan RL, Dobbs D, et al. Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006;7:262.
16. Zhang Y, Skolnick J. Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci U S A. 2004;101:7594–7599.
17. Halford SE, Marko JF. How do site-specific DNA-binding proteins find their targets? Nucleic Acids Res. 2004;32:3040–3052.
18. von Hippel PH, Berg OG. Facilitated target location in biological-systems. J Biol Chem. 1989;264:675–678.
19. Mendez R, Leplae R, Lensink MF, Wodak SJ. Assessment of CAPRI predictions in rounds 3–5 shows progress in docking procedures. Proteins. 2005;60:150–169.
20. Billeter M, Qian Y, Otting G, Muller M, Gehring WJ, et al. Determination of the 3-dimensional structure of the Antennapedia homeodomain from Drosophila in solution by H-1 nuclear-magnetic-resonance spectroscopy. J Mol Biol. 1990;214:183–197.
21. Fraenkel E, Pabo CO. Comparison of X-ray and NMR structures for the Antennapedia homeodomain-DNA complex. Nat Struct Biol. 1998;5:692–697.
22. Lamoureux JS, Glover JNM. Principles of protein-DNA recognition revealed in the structural analysis of Ndt80-MSE DNA complexes. Structure. 2006;14:555–565.
23. Lamoureux JS, Stuart D, Tsang R, Wu C, Glover JNM. Structure of the sporulation-specific transcription factor Ndt80 bound to DNA. EMBO J. 2002;21:5721–5732.
24. Winkler FK, Banner DW, Oefner C, Tsernoglou D, Brown RS, et al. The crystal-structure of EcoRV endonuclease and of its complexes with cognate and non-cognate DNA fragments. EMBO J. 1993;12:1781–1795.
25. Horton NC, Perona JJ. Crystallographic snapshots along a protein-induced DNA-bending pathway. Proc Natl Acad Sci U S A. 2000;97:5729–5734.
26. Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins. 2004;56:502–518.
27. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–2309.
28. Lane WJ, Darst SA. The structural basis for promoter-35 element recognition by the group IV sigma factors. PLoS Biol. 2006;4:e269. doi:10.1371/journal.pbio.0040269.
29. Donald JE, Chen WW, Shakhnovich EI. Energetics of protein-DNA interactions. Nucleic Acids Res. 2007;35:1039–1047.
30. Robertson TA, Varani G. An all-atom, distance-dependent scoring function for the prediction of protein-DNA interactions from structure. Proteins. 2007;66:359–374.
31. Lu XJ, Olson WK. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res. 2003;31:5108–5121.
32. Havranek JJ, Duarte CM, Baker D. A simple physical model for the prediction and design of protein-DNA interactions. J Mol Biol. 2004;344:59–70.
33. Gorman J, Greene EC. Visualizing one-dimensional diffusion of proteins along DNA. Nat Struct Mol Biol. 2008;15:768–774.
34. Kalodimos CG, Biris N, Bonvin A, Levandoski MM, Guennuegues M, et al. Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science. 2004;305:386–389.
35. Givaty O, Levy Y. Protein sliding along DNA: dynamics and structural characterization. J Mol Biol. 2009;385:1087–1097.
36. Bonnet I, Biebricher A, Porte PL, Loverdo C, Benichou O, et al. Sliding and jumping of single EcoRV restriction enzymes on non-cognate DNA. Nucleic Acids Res. 2008;36:4118–4127.
37. Stanford NP, Szczelkun MD, Marko JF, Halford SE. One- and three-dimensional pathways for proteins to reach specific DNA sites. EMBO J. 2000;19:6546–6557.
38. Liu ZJ, Mao FL, Guo JT, Yan B, Wang P, et al. Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential. Nucleic Acids Res. 2005;33:546–558.
39. Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33:5781–5798.
40. Siggers TW, Honig B. Structure-based prediction of C2H2 zinc-finger binding specificity: sensitivity to docking geometry. Nucleic Acids Res. 2007;35:1085–1097.
41. Hubbard SJ, Thornton JM. ‘NACCESS’, Computer Program, Department of Biochemistry and Molecular Biology. University College London; 1993.
42. Matthews BW. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405:442–451.
43. Rotkiewicz P, Skolnick J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem. 2008;29:1460–1465.
44. Samudrala R, Moult J. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol. 1998;275:895–916.
45. Sippl MJ. Knowledge-based potentials for proteins. Curr Opin Struct Biol. 1995;5:229–235.
46. Skolnick J, Kolinski A, Ortiz A. Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins. 2000;38:3–16.
47. Humphrey W, Dalke A, Schulten K. VMD: visual molecular dynamics. J Mol Graph. 1996;14:33–38.