Bioinformatics. 2010 Aug 1; 26(15): 1857–1863.
Published online 2010 Jun 4. doi:  10.1093/bioinformatics/btq295
PMCID: PMC2905551

Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function

Abstract

Motivation: Template-based prediction of DNA binding proteins requires not only structural similarity between target and template structures but also prediction of binding affinity between the target and DNA to ensure binding. Here, we propose to predict protein–DNA binding affinity by introducing a new volume-fraction correction to a statistical energy function based on a distance-scaled, finite, ideal-gas reference (DFIRE) state.

Results: We show that this energy function, together with the structural alignment program TM-align, achieves a Matthews correlation coefficient (MCC) of 0.76 with an accuracy of 98%, a precision of 93% and a sensitivity of 64% for predicting DNA binding proteins in a benchmark of 179 DNA binding proteins and 3797 non-binding proteins. This MCC is substantially higher than the best value of 0.69 given by previous methods. Application of this method to 2235 structural genomics targets identified 37 as DNA binding proteins: 27 (73%) are annotated as putative DNA binding proteins, only 1 has annotated functions that do not include DNA binding, and the remaining proteins have unknown function. The method provides a highly accurate and sensitive technique for structure-based prediction of DNA binding proteins.

Availability: The method is implemented as a part of the Structure-based function-Prediction On-line Tools (SPOT) package available at http://sparks.informatics.iupui.edu/spot

Contact: yqzhou@iupui.edu

1 INTRODUCTION

DNA binding proteins are proteins that bind specifically to either single- or double-stranded DNA. They play an essential role in transcription regulation, replication, packaging, repair and rearrangement. With the completion of many genome projects and many more in progress, a growing number of proteins with unknown function are being discovered (Jaroszewski et al., 2009). The structures of some of these proteins have been solved by structural genomics projects (Burley, 2000). Functional annotation of these proteins is particularly challenging because the goal of structural genomics is to cover the sequence space of proteins so that homology modeling becomes a reliable tool for structure prediction of any protein; as a result, many structural genomics targets have low sequence identity to proteins with known function. Therefore, it is necessary to develop computational tools that utilize not only sequence but also structural information for function prediction (Ahmad et al., 2004; Gao and Skolnick, 2008; Lee et al., 2007a; Punta and Ofran, 2008; Sadowski and Jones, 2009; Watson et al., 2005).

Many methods have been developed for structure-based prediction of DNA binding proteins. These include function prediction through homology and structural comparisons (Bhardwaj et al., 2005; Ferrer-Costa et al., 2005a, b; Lee et al., 2007b; Pazos and Sternberg, 2004; Shanahan et al., 2004). Others explore sequence and structural features of DNA binding and non-binding proteins with sophisticated machine learning methods such as neural network (Ahmad et al., 2004; Mahony et al., 2006; Stawiski et al., 2003; Tjong and Zhou, 2007), logistic regression (Lee et al., 2006) and support vector machines (Bhardwaj et al., 2005; Cai and Lin, 2003; Kumar et al., 2008; Langlois et al., 2007; Tjong and Zhou, 2007).

Recently, Gao and Skolnick (2008) proposed a new two-step approach, called DBD-Hunter, for structure-based prediction of DNA binding proteins. In DBD-Hunter, the structure of a target protein is first structurally aligned to known protein–DNA complexes, and the aligned complex structures are used to build models of the complex between DNA and the target protein. The predicted complex structures are then used to decide whether the target binds DNA, based on structural similarity scores (TM-score) and predicted protein–DNA binding affinities. TM-align (Zhang et al., 2005) and a contact-based statistical energy function are employed in the first and second steps of DBD-Hunter, respectively. DBD-Hunter was found to improve substantially over methods based on sequence comparison only (PSI-BLAST), structural alignment only (TM-align) and a logistic regression technique (Szilagyi and Skolnick, 2006).

In this study, we investigate whether one can further improve the prediction of DNA binding proteins by employing a different statistical energy function for predicting binding affinity. Our knowledge-based energy function is distance-dependent and built on a distance-scaled, finite, ideal-gas reference (DFIRE) state originally developed for proteins (Yang and Zhou, 2008a, b; Zhou and Zhou, 2002) and extended to protein–DNA interactions (Xu et al., 2009; Zhang et al., 2005). Here, we introduce a new volume-fraction correction for the DFIRE energy function in extracting a protein–DNA statistical energy function from protein–DNA complex structures. This volume-fraction correction term, unlike the one introduced previously (Xu et al., 2009), is atom-type dependent to better account for the fact that protein and DNA atom types do not mix and occupy physically separate volumes. In addition to introducing a new energy function, we further optimize protein–DNA binding affinity by performing DNA mutation. These two techniques lead to a highly accurate and sensitive tool for structure-based prediction of DNA binding proteins.

2 METHODS

2.1 Datasets

We employed the datasets compiled by Gao and Skolnick (2008). The positive and negative training datasets consist of 179 DNA binding proteins (DB179) and 3797 non-DNA binding proteins (NB3797), respectively. These structures were selected with a 35% sequence identity cutoff, a resolution of 3 Å or better, a minimum length of 40 residues for proteins and 6 bp for DNA, and at least 5 residues interacting with DNA (within 4.5 Å of the DNA molecule). As in Gao and Skolnick (2008), we use a significantly larger number of non-DNA binding proteins in order to reduce the false positive rate, because DNA binding proteins are only a small fraction of all proteins. The APO and HOLO testing datasets are made up of 104 DNA binding proteins whose structures were determined in the absence and presence of DNA, respectively. A maximum of 35% sequence identity was also employed in selecting these 104 proteins. For the APO/HOLO datasets, 93 APO–DB179 pairs and 92 HOLO–DB179 pairs have sequence identity >35%. These pairs are excluded from target–template pairs during testing. An additional test set of 1697 proteins (the SG1697 set) was compiled by Gao and Skolnick (2008) from structural genomics targets in the January 2008 PDB release with a sequence identity cutoff of 90%. We further updated this set using the November 2009 release and obtained 2235 chains (the SG2235 set). This was done by querying the PDB for the words ‘structural genomic’, resulting in 2447 PDB entries. These PDB entries were divided into protein chains and clustered by CD-HIT (Li and Godzik, 2006). For clusters that contain a protein chain in SG1697, we chose that chain as the representative. For other clusters, we randomly chose one protein chain. This yields 538 additional proteins and a total of 2235 protein chains.

To provide an additional test set and examine the effect of a larger database of DNA binding proteins, we have also updated DNA binding proteins from DB179 to DB250. This updated dataset of DNA binding proteins is selected from PDB released on December 2009 based on the same criteria that produced DB179. After removing the chains with high sequence identity (>35%) with any chain contained in DB179 and with each other, we obtained 71 additional protein–DNA complexes. This leads to an additional test dataset DB71 and an expanded training set DB250 (DB179 + DB71).

2.2 Knowledge-based energy function

We employ a knowledge-based energy function to predict the binding affinity of a protein–DNA complex. We have developed a knowledge-based energy function for proteins based on the DFIRE that satisfies the following equation (Zhou and Zhou, 2002):

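$$
\bar{u}(i,j,r) =
\begin{cases}
-RT \ln \dfrac{N_{\mathrm{obs}}(i,j,r)}{\left(\dfrac{r}{r_{\mathrm{cut}}}\right)^{\alpha} \dfrac{\Delta r}{\Delta r_{\mathrm{cut}}}\, N_{\mathrm{obs}}(i,j,r_{\mathrm{cut}})}, & r < r_{\mathrm{cut}} \\
0, & r \ge r_{\mathrm{cut}}
\end{cases}
\qquad (1)
$$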

where R is the gas constant, T = 300 K, α = 1.61, N_obs(i, j, r) is the number of ij pairs within the spherical shell at distance r observed in a given structure database, r_cut is the cutoff distance and Δr_cut is the bin width at r_cut. The value of α (1.61) was determined by the best fit of r^α to the actual distance-dependent number of ideal-gas points in finite protein-size spheres.
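As a rough illustration of Equation (1), the sketch below converts a table of observed pair counts for a single atom-type pair into a distance-dependent energy. It assumes the standard DFIRE form of Zhou and Zhou (2002), u(r) = −RT ln[N_obs(r) / ((r/r_cut)^α (Δr/Δr_cut) N_obs(r_cut))] for r < r_cut. The function name, the uniform bins, the use of bin midpoints, the toy counts and the choice to assign zero energy to empty bins are all assumptions made for this example rather than details taken from the method.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)
T = 300.0           # temperature in K, as stated in the text
ALPHA = 1.61        # DFIRE exponent, as stated in the text


def dfire_energy(counts, bin_width=0.5, alpha=ALPHA):
    """Return one energy per distance bin for a single atom-type pair.

    counts[k] is the observed number of pairs in the k-th distance bin.
    The last bin is treated as the reference shell at r_cut (assumption).
    """
    n_bins = len(counts)
    r_cut = (n_bins - 0.5) * bin_width      # midpoint of the last bin
    n_cut = counts[-1]
    energies = []
    for k, n_obs in enumerate(counts):
        r = (k + 0.5) * bin_width           # midpoint of bin k
        if n_obs == 0 or n_cut == 0:
            energies.append(0.0)            # assumption: no statistics, no energy
            continue
        # Uniform bins, so the factor (dr / dr_cut) equals 1.
        expected = (r / r_cut) ** alpha * n_cut
        energies.append(-R_KCAL * T * math.log(n_obs / expected))
    return energies


# Hypothetical counts for five 0.5 Å bins.
toy_energies = dfire_energy([0, 2, 10, 40, 30])
```

Bins that are populated more than the ideal-gas reference predicts receive a negative (favorable) energy, and depleted bins receive a positive one.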

Equation (1) for proteins was initially applied to protein–DNA interactions unmodified with 19 atom types for both proteins and DNA (DDNA; Zhang et al., 2005). In DDNA2 (Xu et al., 2009), a low count correction is made to Nobs(i, j, r)

equation image
(2)

In addition, we employed residue/base-specific atom types with a distance-dependent volume-fraction correction defined as [inline equation image: btq295i1.jpg]. This volume fraction correction was made to take into account the fact that DNA and protein atoms with residue/base-specific atom types do not mix with each other. However, we found that DDNA2 is unable to go beyond existing techniques for predicting DNA binding proteins. To further improve DDNA2, we introduce atom-type-dependent volume fractions: [inline equation image: btq295i2.jpg]. Our final equation for the statistical energy function is

equation image
(3)

where we have introduced a parameter β. Physically, β should be around 1/2 so that volume fraction is counted once. We will employ it as an adjustable parameter here for the same reason that makes α < 2: proteins are finite in size. As in DDNA2, we will use residue/base-specific atom types (167 atom types for proteins and 82 for DNA) and rcut = 15 Å, Δr = 0.5 Å. We also set the factor η arbitrarily to 0.01 to control the magnitude of the energy score. For convenience, we shall label the volume-fraction corrected DFIRE as DDNA3.

2.3 Training of the method for predicting DNA binding proteins

DB179 is used to generate the DDNA3 statistical energy function [Equation (3)]. To avoid overfitting, we employed a leave-one-out scheme to train the DDNA3 statistical energy function. A target protein is chosen from DB179/NB3797. The TM-align program is employed to structurally align this target protein to each protein in DB179 (except itself if it is in DB179). If the alignment score (TM-score) is greater than a threshold, a model of the complex between the target protein and DNA is obtained by replacing the template protein in its protein–DNA complex structure with the aligned target protein. The binding affinity between DNA and the target protein is evaluated by the DDNA3 energy function [Equation (3)]. Instead of using the template DNA sequences, we perform exhaustive mutations of DNA base pairs to search for the highest binding affinity. DNA bases are paired by the X3DNA software package (Lu and Olson, 2003). Unlike mutations in proteins, DNA mutation is relatively easy because the dihedral angles of bases are unchanged. The conformations of mutated bases are built using the default bond length, bond angle and dihedral angle parameters defined in the AMBER98 force field (Cheatham et al., 1999). A DNA base that does not have a corresponding pairing base is not mutated. If the highest binding affinity is greater than an optimized threshold (that is, the lowest energy is below the energy threshold), the target protein is considered a DNA binding protein. The method described above has two important differences from DBD-Hunter: the use of our distance-dependent energy function and the search for the strongest binding DNA fragment.
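The two-step decision described above can be sketched as a simple filter over template hits. This is a minimal sketch, not the authors' implementation: the `Candidate` record and the function name are hypothetical, and it assumes that a stronger binding affinity corresponds to a lower (more negative) energy. The first candidate and the thresholds use values quoted in this paper (Fig. 2 and Section 3.1); the other two candidates are invented for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    template_id: str
    tm_score: float     # structural similarity between target and template
    best_energy: float  # lowest energy found over the DNA mutations tried


def predict_dna_binding(candidates, tm_threshold, energy_threshold):
    """Return the strongest-binding candidate that passes both thresholds.

    Returns None when no candidate passes, i.e. the target is predicted
    to be non-binding.
    """
    passing = [
        c for c in candidates
        if c.tm_score > tm_threshold and c.best_energy < energy_threshold
    ]
    if not passing:
        return None
    return min(passing, key=lambda c: c.best_energy)


candidates = [
    Candidate("1ea4A", 0.79, -20.9),        # values from the Fig. 2 caption
    Candidate("hypothetical_1", 0.45, -30.0),  # strong energy, poor alignment
    Candidate("hypothetical_2", 0.70, -5.0),   # good alignment, weak energy
]
hit = predict_dna_binding(candidates, tm_threshold=0.60, energy_threshold=-11.6)
miss = predict_dna_binding(candidates[1:], tm_threshold=0.60, energy_threshold=-11.6)
```

A target is called DNA binding only when at least one template satisfies both conditions at once, which is why neither of the two hypothetical candidates qualifies on its own.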

2.4 Evaluation of the method for predicting DNA binding proteins

The measures of the method performance are: sensitivity [SN = TP/(TP + FN)]; specificity [SP = TN/(TN + FP)]; accuracy [AC = (TP + TN)/(TP + FN + TN + FP)]; and precision [PR = TP/(TP + FP)]. In addition, we employed a Matthews correlation coefficient (MCC)

$$
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (4)
$$

Here, TP, TN, FP and FN refer to true positives, true negatives, false positives and false negatives, respectively.
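The measures above can be computed directly from the four confusion-matrix counts. The sketch below does so for the DDNA3O training result reported in Section 3.2, under the assumption that the 114 predicted complexes are all true positives among the 179 binders and that 8 of the 3797 non-binders are false positives. The function name and dictionary keys are illustrative.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, precision and MCC."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "AC": (tp + tn) / (tp + tn + fp + fn),
        "PR": tp / (tp + fp),
        # MCC is set to 0.0 when the denominator vanishes (a common convention).
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Assumed counts: 114 of 179 binders found, 8 of 3797 non-binders misclassified.
metrics = classification_metrics(tp=114, tn=3797 - 8, fp=8, fn=179 - 114)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the rounded values agree with the 64% sensitivity, 93% precision, 98% accuracy and MCC of 0.76 quoted for DDNA3O.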

3 RESULTS

3.1 Training based on DB179/NB3797 (DDNA3)

We have optimized the volume-fraction exponent β and the TM-score and binding affinity thresholds to achieve the highest MCC value. Optimization is performed by a grid-based search. The grid spacings for β and TM-score are 0.02 and 0.01, respectively. For the binding affinity threshold, the lowest energy of each aligned complex under each TM-score threshold is calculated, and these energy values are tried sequentially as the energy threshold. We found that the highest MCC is 0.73 for β = 0.4, a structural similarity threshold of 0.60 and an energy threshold of −11.6. The corresponding accuracy, precision and sensitivity are 98%, 91% and 60%, respectively. The effect of the knowledge-based energy function can be revealed by replacing DDNA3 with DDNA2. The optimized MCC value for DDNA2 (structural similarity threshold of 0.53 and energy threshold of −4.2) is 0.61. (Note that there is no β parameter in DDNA2.) The corresponding accuracy, precision and sensitivity are 97%, 85% and 55%, respectively. It is clear that the reference state of a statistical energy function has a significant impact on the performance in predicting DNA binding proteins. The largest improvement is a 6 percentage point gain in precision, the fraction of positive predictions that are correct. The overall performance of DDNA3 also improves significantly over that of DBD-Hunter, which has an MCC of 0.64, 98% accuracy, 84% precision and 55% sensitivity.
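The threshold search described above can be sketched as a scan over a TM-score grid in which every observed lowest energy is tried as the energy threshold. This is only an illustration under stated assumptions: the data layout, function names and toy alignments are hypothetical, β is held fixed (so only two of the three parameters are scanned), and a target is called binding when its lowest energy is less than or equal to the threshold.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def optimize_thresholds(targets, tm_grid):
    """targets: list of (is_binder, [(tm_score, energy), ...]) per target.

    Returns (best MCC, TM-score threshold, energy threshold).
    """
    best = (-2.0, None, None)
    for tm_thr in tm_grid:
        # Lowest energy per target among alignments above the TM-score threshold.
        lowest = []
        for is_binder, hits in targets:
            energies = [e for tm, e in hits if tm > tm_thr]
            lowest.append((is_binder, min(energies) if energies else None))
        # Each observed lowest energy is tried in turn as the energy threshold.
        for e_thr in sorted({e for _, e in lowest if e is not None}):
            tp = tn = fp = fn = 0
            for is_binder, e in lowest:
                predicted = e is not None and e <= e_thr
                if predicted and is_binder:
                    tp += 1
                elif predicted:
                    fp += 1
                elif is_binder:
                    fn += 1
                else:
                    tn += 1
            score = mcc(tp, tn, fp, fn)
            if score > best[0]:
                best = (score, tm_thr, e_thr)
    return best


# Hypothetical alignments: three binders and three non-binders.
toy_targets = [
    (True, [(0.80, -20.0)]),
    (True, [(0.65, -15.0)]),
    (True, [(0.62, -25.0)]),
    (False, [(0.70, -5.0)]),
    (False, [(0.50, -18.0)]),
    (False, []),
]
tm_grid = [round(0.40 + 0.01 * k, 2) for k in range(41)]  # 0.40 to 0.80
best_mcc, best_tm, best_energy = optimize_thresholds(toy_targets, tm_grid)
```

In the toy data the non-binder with a strong energy but a TM-score of 0.50 is removed only once the TM-score threshold reaches 0.50, which is the kind of interplay between the two thresholds that the grid search is meant to resolve.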

Figure 1 shows sensitivity as a function of false positive rate. Our results were obtained by fixing structural similarity threshold and varying the energy threshold. It is clear that DDNA3 yields a substantially higher sensitivity than either DDNA2 or DBD-Hunter for a given false positive rate.

Fig. 1.
Sensitivity versus false positive rate, given by DDNA3 (filled black circles) and DDNA2 (open red circles) reveals the importance of an appropriate reference state for method performance in predicting DNA binding proteins. The results of other methods ...

The predicted binding complexes can be employed to examine predicted DNA binding residues. An amino acid residue is considered a DNA binding residue if any heavy atom of that residue is <4.5 Å away from any heavy atom of a DNA base. Predicted binding residues from template-based modeling can be compared to actual binding residues. For the training set (179 DB and 3797 NB proteins), there are 108 predicted DB proteins with 11 false positives. For these 108 predicted complexes, the specificity, accuracy, precision, sensitivity and MCC of predicting DNA binding residues are 94%, 89%, 74%, 68% and 0.64, respectively. For comparison, DDNA2 predicted 99 DB proteins, and the corresponding performance measures for predicting DNA binding residues are 93%, 88%, 75%, 67% and 0.63, respectively. These performances are similar to the specificity of 93%, accuracy of 90%, precision of 71% and sensitivity of 72% achieved by DBD-Hunter. The similar performance in predicting DNA binding residues is due to the same structural alignment method (TM-align) being used in the first step by all three methods. The slight differences in binding residue prediction have two causes: the change in the number of predicted DNA binding proteins and possibly different templates recognized by different methods.
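The binding-residue definition above amounts to a distance test between heavy atoms. The sketch below applies the 4.5 Å cutoff to plain coordinate lists; the input layout, residue labels and coordinates are hypothetical, and reading atoms from a structure file is deliberately left out.

```python
import math

CUTOFF = 4.5  # Å, as in the definition above


def binding_residues(protein_atoms, dna_base_atoms, cutoff=CUTOFF):
    """Return sorted residue IDs with any heavy atom within cutoff of DNA.

    protein_atoms: list of (residue_id, (x, y, z)) for protein heavy atoms.
    dna_base_atoms: list of (x, y, z) for heavy atoms of DNA bases.
    """
    found = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in found:
            continue  # one close atom is enough for the whole residue
        if any(math.dist(xyz, d) < cutoff for d in dna_base_atoms):
            found.add(residue_id)
    return sorted(found)


# Hypothetical coordinates in Å.
protein_atoms = [
    ("ARG10", (0.0, 0.0, 0.0)),
    ("ARG10", (1.0, 0.0, 0.0)),
    ("GLY11", (10.0, 0.0, 0.0)),
    ("LYS12", (0.0, 3.0, 3.0)),
]
dna_base_atoms = [(0.0, 0.0, 3.0)]
contacts = binding_residues(protein_atoms, dna_base_atoms)
```

Comparing such a set for a predicted complex with the set from the native complex gives the per-residue counts from which the quoted specificity, accuracy, precision and sensitivity follow.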

3.2 TM-score-dependent energy threshold (DDNA3O)

A single threshold for energy and a single threshold for structural similarity (TM-score) are too simple to capture the complex relation between structure and binding affinity. For example, one expects that the binding-energy requirement should be stricter for less similar structures and looser for highly similar structures between template and query. This led Gao and Skolnick (2008) to develop TM-score-dependent energy thresholds (nine energy thresholds for nine TM-score bins ranging from 0.40 to 1.0, chosen to maximize the MCC value in each bin), and they finally set a minimum TM-score cutoff of 0.55 to achieve the maximum MCC value. Following their method, we found the same minimum TM-score cutoff of 0.55, with a maximum MCC value of 0.76. We revise this method slightly here. Instead of optimizing the MCC for each TM-score bin separately, we search for the energy threshold of a given TM-score bin by optimizing the MCC value for that bin plus all bins with higher TM-scores. The results are shown in Table 1. This revised method leads to the same maximum MCC value of 0.76 but with a minimum TM-score cutoff of 0.52 and slightly increased sensitivity (two additional true positives) without an increase in false positives. To distinguish this further optimized method, we label it DDNA3O. DDNA3O yields a sensitivity of 64% and a specificity of 99.8%. By comparison, the corresponding optimized DBD-Hunter with the same dataset has an MCC value of 0.69 with a sensitivity of 58% and a specificity of 99.5%, while DDNA3 has an MCC value of 0.73 with a sensitivity of 60% and a specificity of 99.7%. Thus, the main improvement from DDNA3 to DDNA3O is an increase in sensitivity (from 60% to 64%) together with a slight reduction in the rate of false positives (from 11/3797 to 8/3797).

Table 1.
Optimized TM-score-dependent energy thresholds based on DB179 and NB3797 (DDNA3O)
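One possible reading of the cumulative optimization behind DDNA3O is sketched below. It processes TM-score bins from highest to lowest and, for each bin, picks the energy threshold that maximizes the MCC over that bin plus all higher bins, keeping the thresholds already chosen for higher bins fixed. The processing order, the single best hit per target, the option of rejecting a whole bin, the bin edges and the toy data are all assumptions of this sketch and are not taken from Table 1.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fit_bin_thresholds(hits, bin_edges):
    """hits: list of (is_binder, tm_score, energy); bin_edges: ascending lower edges.

    Returns {lower_edge: energy_threshold}. A hit is predicted binding when
    its energy is <= the threshold of the bin containing its TM-score.
    """
    def bin_of(tm):
        index = None
        for i, edge in enumerate(bin_edges):
            if tm >= edge:
                index = i
        return index

    thresholds = {}
    for i in reversed(range(len(bin_edges))):  # highest TM-score bin first
        pool = [h for h in hits if bin_of(h[1]) is not None and bin_of(h[1]) >= i]
        in_bin = [h for h in pool if bin_of(h[1]) == i]
        # -inf means "predict nothing in this bin".
        candidates = [-math.inf] + sorted({e for _, _, e in in_bin})
        best_score, best_thr = -2.0, -math.inf
        for thr in candidates:
            trial = dict(thresholds)
            trial[bin_edges[i]] = thr
            tp = tn = fp = fn = 0
            for is_binder, tm, energy in pool:
                predicted = energy <= trial[bin_edges[bin_of(tm)]]
                if predicted and is_binder:
                    tp += 1
                elif predicted:
                    fp += 1
                elif is_binder:
                    fn += 1
                else:
                    tn += 1
            score = mcc(tp, tn, fp, fn)
            if score > best_score:
                best_score, best_thr = score, thr
        thresholds[bin_edges[i]] = best_thr
    return thresholds


# Hypothetical hits: (is_binder, TM-score, energy).
toy_hits = [
    (True, 0.90, -8.0), (True, 0.85, -6.0), (False, 0.82, -2.0),
    (True, 0.70, -14.0), (True, 0.75, -12.0), (False, 0.65, -10.0),
    (True, 0.50, -25.0), (False, 0.45, -20.0), (False, 0.55, -15.0),
]
toy_thresholds = fit_bin_thresholds(toy_hits, bin_edges=[0.4, 0.6, 0.8])
```

In the toy data the fitted thresholds become stricter as the TM-score drops, which is the qualitative trend one expects from the argument that less similar structures should meet a stronger binding-energy requirement.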

There are 114 complexes predicted as DNA binding proteins by DDNA3O. For these 114 complexes, predicted DNA binding residues are compared to native complexes. The specificity, accuracy, precision, sensitivity and MCC are 95%, 90%, 77%, 69% and 0.67, respectively.

3.3 Test on the APO104/HOLO104 datasets

The methods trained above (DDNA3 and DDNA3O) are applied to predict DNA binding proteins in the APO104/HOLO104 datasets. The numbers of positive predictions are 50 by DDNA3 and 53 by DDNA3O (out of 104) for the APO set, and 61 by DDNA3 and 62 by DDNA3O (out of 104) for the HOLO set. That is, using monomer structures rather than complex structures reduces sensitivity by 11 percentage points for DDNA3 (from 59% for the HOLO set to 48% for the APO set) and by 9 percentage points for DDNA3O (from 60% to 51%). The corresponding sensitivity values for DDNA2 are 43.3% (45/104) and 53.8% (56/104) for the APO and HOLO sets, respectively. The performance of DBD-Hunter (47% for the APO and 55% for the HOLO sets) lies between those of DDNA2 and DDNA3. The test confirms an increase in sensitivity of DDNA3O over DDNA3, for the APO set in particular.

A more detailed analysis on predictions made by DDNA3O shows that there is an overlap of 50 predictions between the APO and HOLO sets. Figure 2 shows one example of the test on target proteins 1mjkA (contained in APO104) and 1mjmA (contained in HOLO104). 1mjkA and 1mjmA are the structures of the same methionine repressor protein in the absence and presence of DNA fragment, respectively. There is a small conformational change before and after DNA binding (TM-score between the two is 0.93). This small conformational change apparently does not prohibit the successful match to the same template protein 1ea4A with strong binding affinity.

Fig. 2.
(a) Structural comparison between APO target protein 1mjkA (green) and template protein 1ea4A (red). The TM-score between them is 0.79 and the interaction energy between 1mjkA and template DNA is −20.9. (b) Structural comparison between HOLO target ...

On the other hand, there are 12 correctly predicted HOLO targets but incorrectly predicted APO targets as shown in Table 2. The difference is caused by significant local conformational change in binding regions (high TM-align score but low binding affinity). An example (1le8A in HOLO and corresponding 1f43A in APO) is shown in Figure 3a, where significant change in binding regions (from red in APO to green in HOLO) leads to incorrect prediction despite insignificant structural change in non-binding regions of the protein. In another more extreme case (Fig. 3b), disordered region in APO structure (1jyfA) changes to ordered binding domain in HOLO structure (1efaA).

Table 2.
Targets predicted as DNA binding in the HOLO set but not in the APO set

Fig. 3.
(a) Structural comparison between APO target 1f43A and HOLO target 1le8A. Red: fragment of binding domain of 1f43A. Green: fragment of binding domain of 1le8A. Orange: template DNA of 2bamB. (b) Structural comparison between APO target 1jyfA (red) and ...

Another cause of incorrect prediction in APO and correct prediction in HOLO is large overall structural change. The large overall structural changes lead to poor structural alignment to templates so that their TM-scores are lower than the threshold. For example, despite 90% sequence identity, TM-score between 1q39A in APO and 1k3w in HOLO structures is only 0.55 and leads to the poor alignment of APO structure to template (best is 0.48 in TM-score). We also discovered a technical reason for an APO target (1rxr_). We are unable to use the template employed for the corresponding HOLO target because the sequence identity between the template and its respective APO target is slightly higher than 35%.

There are also three targets identified as DNA binding proteins correctly in the APO set but not in the HOLO set. All three (1llzA, 1bf5A and 1esgA) are just outside of arbitrary boundaries generated by optimization. This highlights the empirical nature of the proposed approach.

3.4 Test on the DB71 dataset

The additional 71 proteins contained in the updated protein/DNA complex structural dataset (DB71) offer a challenging test set. DDNA3 (DDNA3O) predicts 34 (39) out of 71 proteins as DNA binding proteins. Thus, the sensitivity is 34/71 (48%) by DDNA3 and 55% by DDNA3O. DDNA3O continues to make significant improvement in sensitivity over DDNA3. This 55% sensitivity is 5% lower than the sensitivity of 60% for the HOLO dataset but is higher than the sensitivity of 51% for the APO dataset. This suggests that >50% new complex structures are recognizable by DDNA3O with DB179 as templates for protein–DNA complexes for all the sets tested (APO, HOLO and DB71).

3.5 The effect of a larger, updated dataset of DNA binding proteins (DDNA3U)

To examine the effect of a larger dataset of DNA binding proteins, we use DB250 and NB3797 as the training set. We found that for this larger, updated dataset, the highest MCC is 0.75 with the same or similar values for three parameters (β = 0.4, TM-score threshold of 0.55 and energy threshold of −13.7) as DDNA3. This result highlights the stability of trained parameters with a 40% increase in DNA binding proteins. The corresponding accuracy, precision and sensitivity are 97%, 87% and 67%, respectively. In particular, 45 out of 71 additional proteins outside DB179 are recognized as DNA binding by DB250-trained DDNA3 (DDNA3U), compared to 34 by DB179-trained DDNA3. However, the sensitivity increases at the cost of more false positives (26, more than doubled from 11 for DB179-trained DDNA3).

Application of this newly trained method to APO104 and HOLO104 sets leads to 52 (50%) and 64 (62%) predicted DNA binding proteins, respectively. That is, a 40% expansion of DNA binding proteins (from 179 to 250) leads to about 3% improvement in sensitivity. However, as Figure 1 indicates, newly trained DDNA3 (labeled as DDNA3U) yields higher sensitivity only when false positive rate >0.005. That is, at a lower false positive rate, a larger template database in fact decreases sensitivity and precision.

One can employ TM-score-dependent energy thresholds to the updated DB250/NB3797 databases. The resulting DDNA3UO further increases the number of true positives from 167 to 176 but the number of false positives also increases from 26 to 34. Since we are interested in predicting DNA binding proteins with very low false positive rate (<0.005), we will employ the methods (DDNA3 and DDNA3O) trained by DB179 to structural genomics targets.

To further examine the possibility of overfitting in DDNA3U, we performed a 10-fold cross-validation test on DB250/NB3797. That is, the binding and non-binding sets are randomly divided into 10 folds. Each time, one fold is chosen as the test set while the other nine folds are employed for all training, including the statistics of the potential energy function, the structure templates for protein–DNA binding and re-training of the threshold parameters. The test is repeated 10 times. The method performance is analyzed by 1000 rounds of bootstrap resampling (Angarica et al., 2008). We found that the average MCC value is 0.70±0.02 with an accuracy of 97%, a precision of 88% and a sensitivity of 58%. It is clear that the only significant change from the leave-one-out results is the reduction of sensitivity from 67% to 58%. This is likely caused by the reduced number of templates in the 10-fold cross-validation. Indeed, if 249 templates are allowed, the average MCC value is 0.72±0.02. Thus, our results are reasonably robust to different training sets.

3.6 Application to structural genomics targets

As shown in Table 3, application of DDNA3 identifies 32 DNA binding proteins in SG1697. Among them, 19 out of 32 proteins (59%) are putative DNA binding proteins, 3 out of 32 proteins (9%) are annotated as having other functions, while the others (31%) have unknown function. DDNA3O decreases the number of predicted DNA binding proteins from 32 to 27 without changing the number of putative DNA binding proteins (19), while reducing the number of proteins with other annotated functions from 3 to 1 and the number with unknown function from 10 to 7. This result further confirms the improvement of DDNA3O over DDNA3. By comparison, DBD-Hunter predicts 37 DNA binding proteins. Among these 37 proteins, there are 18 (48.6%) putative DNA binding proteins, 3 (8.1%) with other putative functions and 16 (43.2%) with unknown function. All the putative functions are from the annotations in the NCBI database.

Table 3.
Structural genomics targets (SG1697) predicted as DNA binding proteins by DBD-Hunter, DDNA3 and DDNA3O

The overlap between the proteins predicted by DDNA3O and by DBD-Hunter is only 19 proteins, 15 (79%) of which are putative DNA binding proteins. The large fraction of putative DNA binding proteins among the overlapping predictions shows that the confidence of prediction improves significantly when a consensus prediction is made. Meanwhile, the fact that only 70% (19/27) of the proteins predicted by DDNA3O overlap with those predicted by DBD-Hunter shows that the energy function plays a significant role in prediction. There are four putative DNA binding proteins (1ug2A, 1y9bA, 2cqxA and 2fb1A) predicted by DDNA3O but missed by DBD-Hunter. Similarly, there are three putative DNA binding proteins (2hytA, 2iaiA and 2od5A) predicted by DBD-Hunter but missed by DDNA3O. The complete list of predicted DNA binding proteins is shown in Table 4. Table 4 includes 10 additional predicted proteins from SG2235, 8 of which are putative DNA binding proteins. That is, 73% (27/37) of the predicted proteins from SG2235 are putative DNA binding proteins. This result supports the prediction quality of the proposed DDNA3O technique.

Table 4.
Targets predicted as DNA binding proteins by DDNA3O from SG1697 and SG2235, with functions annotated in the NCBI database

4 DISCUSSION

We have developed a highly accurate method (DDNA3O) to predict DNA binding proteins. This is accomplished by developing a new statistical energy function for predicting DNA binding proteins. We found that introducing an atom-type-dependent volume-fraction correction and DNA mutation into the DFIRE statistical energy function leads to a significant improvement in performance in predicting DNA binding proteins (MCC = 0.76 for DB179/NB3797 by DDNA3O). This is a significant improvement over the MCC of 0.69 given by the optimized DBD-Hunter. Application of DDNA3O to structural genomics targets supports the accuracy of the proposed method, with 73% potentially correct predictions of DNA binding proteins (annotated as putative DNA binding), 3% potential false positives (function annotated but not DNA binding) and the rest unknown.

For DDNA3, the effect of DNA mutation is small for improving the MCC value of the training set (from 0.72 to 0.73) but is significant for improving the sensitivity from 46/104 (44%) to 50/104 (48%) of the APO test set. We further find that the mutation leads to no significant improvement in sequence identity between template DNA sequence and wild-type DNA sequence. The sequence identities to wild-type DNA sequences before and after mutation are both close to the random value of 25%. One possible reason is the absence of structural refinement for protein during mutation. This result also suggests that DDNA3 is not yet specific enough to identify binding DNA bases.

In principle, exhaustive mutations of DNA base pairs can lead to significant increase in computing time for a long DNA segment. However, because our energy function does not consider base–base interaction by assuming a rigid DNA structure before and after binding, the computing requirement for the exhaustive mutations of DNA base pairs is only four times more than that without base mutations.
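The cost argument above follows from additivity: if the protein–DNA energy is a sum of independent per-position terms, the best of the 4^n sequences can be found by choosing the best of the four base pairs at each position separately. The sketch below checks this on a toy energy table against a brute-force enumeration. The table values, the base-pair labels and the function names are hypothetical, and the per-position energies stand in for whatever the energy function would return for each mutated base pair.

```python
from itertools import product

BASE_PAIRS = ("AT", "TA", "GC", "CG")


def best_sequence_independent(energy_table):
    """Pick the lowest-energy base pair at each position independently.

    energy_table[p][bp] is the protein interaction energy of base pair bp
    at position p; cost is 4 evaluations per position.
    """
    choice = [min(BASE_PAIRS, key=lambda bp: row[bp]) for row in energy_table]
    total = sum(row[bp] for row, bp in zip(energy_table, choice))
    return choice, total


def best_sequence_bruteforce(energy_table):
    """Enumerate all 4**n sequences; only feasible for short DNA."""
    best = None
    for seq in product(BASE_PAIRS, repeat=len(energy_table)):
        total = sum(row[bp] for row, bp in zip(energy_table, seq))
        if best is None or total < best[1]:
            best = (list(seq), total)
    return best


# Hypothetical per-position energies for a 3 bp fragment.
toy_table = [
    {"AT": -1.0, "TA": -0.5, "GC": -2.0, "CG": 0.0},
    {"AT": -0.25, "TA": -3.0, "GC": 0.5, "CG": -1.5},
    {"AT": 0.0, "TA": -0.75, "GC": -0.5, "CG": -4.0},
]
fast = best_sequence_independent(toy_table)
slow = best_sequence_bruteforce(toy_table)
```

The equivalence would no longer hold if base–base interactions were added to the energy, since the choice at one position would then depend on its neighbors.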

One potential concern is insufficient statistics due to the small number of complex structures for deriving the DDNA3 energy function. We have addressed this question by employing the leave-one-out (for both DB179 and DB250 sets) and 10-fold cross-validation (for the DB250 set) techniques. The consistency between different training and test sets provides the confidence about the energy functions obtained.

Another concern is potential overfitting due to the five threshold parameters in DDNA3O, given the small number of true positives in each TM-score bin (Table 1). This concern is reduced somewhat because the energy thresholds mostly satisfy the expectation that less similar structures (low TM-scores) require stricter energy thresholds. Moreover, there is a consistent improvement in sensitivity from training (DB179) to test (APO/HOLO104, DB71 and structural genomics targets). This consistency suggests that the improvement is statistically significant. However, one certainly cannot completely rule out overfitting. More studies are needed as larger datasets become available.

One advantage of the proposed structure-based prediction method is the prediction of protein–DNA complex structures. The predicted complex structures allow prediction of DNA binding residues. High specificity and accuracy (>90%) are achieved for binding residue prediction even for the APO structures (protein structures in the absence of DNA).

The success of DDNA3O is limited by the availability of protein–DNA complexes as templates. A 40% expansion of the template database from 179 to 250 proteins leads to a significant improvement in sensitivity if the false positive rate is >0.005 (Fig. 1) but slightly decreases sensitivity if the false positive rate is <0.005. Thus, there is a clear need to further improve the energy function that discriminates binding from non-binding proteins. The rigid-body approximation employed here has likely limited the performance of DDNA3O. Work on introducing DNA and protein flexibility into DDNA3 is in progress.

Funding: National Institutes of Health (grant R01 GM 085003).

Conflict of Interest: none declared.

REFERENCES

  • Ahmad S, et al. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486.
  • Angarica VE, et al. Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinformatics. 2008;9:436.
  • Bhardwaj N, et al. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005;33:6486–6493.
  • Burley SK. An overview of structural genomics. Nat. Struct. Biol. 2000;7:932–934.
  • Cai Y-D, Lin SL. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim. Biophys. Acta. 2003;1648:127–133.
  • Cheatham TE, et al. A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn. 1999;16:845–862.
  • Ferrer-Costa C, et al. PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics. 2005a;21:3176–3178.
  • Ferrer-Costa C, et al. HTHquery: a method for detecting DNA-binding proteins with a helix-turn-helix structural motif. Bioinformatics. 2005b;21:3679–3680.
  • Gao M, Skolnick J. DBD-hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res. 2008;36:3978–3992.
  • Jaroszewski L, et al. Exploration of uncharted regions of the protein universe. PLoS Biol. 2009;7:e1000205.
  • Kumar M, et al. Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008;71:189–194.
  • Langlois RE, et al. Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins. Ann. Biomed. Eng. 2007;35:1043–1052.
  • Lee D, et al. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 2007a;8:995–1005.
  • Lee D, et al. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 2007b;8:995–1005.
  • Lee H, et al. Diffusion kernel-based logistic regression models for protein function prediction. Omics. 2006;10:40–55.
  • Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659.
  • Lu X-J, Olson WK. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res. 2003;31:5108–5121.
  • Mahony S, et al. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Netw. 2006;19:950–962.
  • Pazos F, Sternberg MJE. Automated prediction of protein function and detection of functional sites from structure. Proc. Natl Acad. Sci. USA. 2004;101:14754–14759.
  • Punta M, Ofran Y. The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS Comput. Biol. 2008;4:e1000160.
  • Sadowski MI, Jones DT. The sequence-structure relationship and protein function prediction. Curr. Opin. Struct. Biol. 2009;19:357–362.
  • Shanahan HP, et al. Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res. 2004;32:4732–4741.
  • Stawiski EW, et al. Annotating nucleic acid-binding function based on protein structure. J. Mol. Biol. 2003;326:1065–1079.
  • Szilagyi A, Skolnick J. Efficient prediction of nucleic acid binding function from low-resolution protein structures. J. Mol. Biol. 2006;358:922–933.
  • Tjong H, Zhou H-X. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Res. 2007;35:1465–1477.
  • Watson JD, et al. Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol. 2005;15:275–284.
  • Xu B, et al. An all-atom knowledge-based energy function for protein-DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins. 2009;76:718–730.
  • Yang Y, Zhou Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins. 2008a;72:793–803.
  • Yang Y, Zhou Y. Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci. 2008b;17:1212–1219.
  • Zhang C, et al. A knowledge-based energy function for protein-ligand, protein-protein, and protein-DNA complexes. J. Med. Chem. 2005;48:2325–2335.
  • Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11:2714–2726.

Articles from Bioinformatics are provided here courtesy of Oxford University Press