Copyright © 2007 The Author(s).

DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces

Department of Physics and Institute of Molecular Biophysics and School of Computational Science, Florida State University, Tallahassee, FL 32306, USA

*To whom correspondence should be addressed. Phone: +1 850 6451336, Fax: +1 850 6447244, Email: zhou/at/sb.fsu.edu

Received August 24, 2006; Revised November 25, 2006; Accepted December 27, 2006.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Structural and physical properties of DNA provide important constraints on the binding sites formed on surfaces of DNA-targeting proteins. Characteristics of such binding sites may form the basis for predicting DNA-binding sites from the structures of proteins alone. Such an approach has been successfully developed for predicting protein–protein interfaces. Here this approach is adapted for predicting DNA-binding sites. We used a representative set of 264 protein–DNA complexes from the Protein Data Bank to analyze characteristics and to train and test a neural network predictor of DNA-binding sites. The input to the predictor consisted of PSI-blast sequence profiles and solvent accessibilities of each surface residue and 14 of its closest neighboring residues. Predicted DNA-contacting residues cover 60% of actual DNA-contacting residues and have an accuracy of 76%. This method significantly outperforms previous attempts at DNA-binding site prediction.
Its application to the prion protein yielded a DNA-binding site that is consistent with recent NMR chemical shift perturbation data, suggesting that it can complement experimental techniques in characterizing protein–DNA interfaces. INTRODUCTION Protein–DNA interactions play central roles in a wide range of biological processes such as gene regulation and DNA replication and repair. A fundamental question is how recognition is achieved, both on the DNA side and on the protein side. On the DNA side, recognition by a protein involves features that distinguish a short stretch of nucleotides, to which the protein specifically binds, from other nucleotide sequences on the DNA. On the protein site, recognition by a DNA involves features that distinguish a patch of residues, to which the DNA binds, from other areas on the protein surface. This article presents a method for predicting the DNA-binding site on a protein surface. The method is called DISPLAR, or DNA-Interaction Site Prediction from a List of Adjacent Residues. DISPLAR is built on our method, PPISP, developed previously for protein–protein interaction site prediction (1,2). The approach is based on a number of distinguishing features that residues in protein–protein or protein–DNA interfaces have over non-interface residues on the protein surface. In the case of protein–DNA interfaces, such distinguishing features have been reported before. These include enrichment of positively charged Arg and Lys residues (3–8) and sequence conservation (9). The former can be easily rationalized by the negatively charged phosphate group on each nucleotide; the latter can be rationalized by structural and functional requirements on the interface. In DISPLAR, like in PPISP, these two features are captured by position-specific sequence profiles as obtained by running PSI-blast (10). 
In addition, the solvent accessibility of interface residues is also distinct from that of non-interface residues and is used as an input for DISPLAR as well as for PPISP. These input parameters are used to train a neural network for prediction. The approach of PPISP seems to be ideally suited for adaptation to DNA-binding site prediction. The binding partners, i.e. different DNA, have common structural and physical properties. All DNA share the basic double-helix architecture; structural variability due to local bending and twisting is much less compared to variability in the case of proteins from different folds. Variability among nucleotides also seems to be much less than among amino acids. There are only four different nucleotides compared to 20 amino acids. More importantly, the variable part of each nucleotide, i.e. the base, is involved in base pairing and less exposed than the constant part, i.e. the phosphate. The latter, as noted earlier, carries a negative charge. In contrast, in the case of amino acid, the variable part, i.e. the side chain, is usually more exposed than the constant part, i.e. the backbone, in folded proteins. In short, the partners in protein–DNA recognition are much more uniform than those in protein–protein recognition. Since a neural network is trained to learn common features of interface residues and has been found to work well for protein–protein interface prediction, it may be expected that the same approach would work well for predicting DNA-binding site. Several attempts at predicting DNA-binding have been made previously. Stawiski et al. (4) used percentages of Arg and Lys residues and other physical properties to classify whether a protein is nucleic-acid binding. Jones et al. (5) used electrostatic potential surface of DNA-binding proteins to predict the binding surface patch. Of a set of 56 proteins, 38 (i.e. 68%) had top-ranked patches with more than 70% residues that are actually DNA-contacting. Keil et al. 
(11) used electrostatic potential and other physical properties to classify protein surface patches as protein, DNA, ligand or non-binding. Ferrer-Costa et al. (12) used electrostatic potential to classify whether proteins with the helix-turn-helix motif are DNA-binding. Tsuchiya et al. (13) also used electrostatic potential to classify whether a protein is DNA-binding. Recently Kummerfeld and Teichmann (14) used homology to predict transcription factors. Two methods that have the most resemblance to DISPLAR are by Ahmad et al. (6) and by Kuznetsov et al. (15). Like our method, the predictions by these two groups are at the residue level (i.e. whether a residue is DNA-contacting), instead of the patch or protein level. Like our method, Ahmad et al. also used PSI-blast sequence profiles and solvent accessibility as input to train neural networks, while Kuznetsov et al. used similar input to train a support vector machine predictor. However, there are important differences between these two methods and DISPLAR, resulting in the latter's much higher accuracy. Ahmad et al. reported coverage of 40% of actual DNA-contacting residues by their predictions, and from their reported data, a very low accuracy for positive prediction, at 13%, is obtained. The method of Kuznetsov et al. applied to our set of 264 proteins has a coverage of 60% of actual DNA-contacting residues and an accuracy of 56% for positive prediction. In comparison, DISPLAR test results show a coverage of 60% and an accuracy of 76%. The high level of prediction accuracy suggests that DISPLAR can complement experimental techniques in characterizing protein–DNA interfaces. As an illustration, we applied the method to the prion protein, which has recently been shown to interact with DNA (16). The predicted DNA-binding site agrees well with NMR chemical shift perturbation data. 
MATERIALS AND METHODS

Generation of the data set

All 1091 entries containing both protein chains and DNA chains were downloaded from the Protein Data Bank (May 2006 release) (http://www.rcsb.org/). To obtain a representative data set, sequence alignment between protein chains from different PDB entries was made by the PSI-blast program (10) with the default e-value (10⁻³). When a match was identified, the ratio of the number of aligned identical residues to the total length of the query entry was calculated as the sequence identity. Redundant entries were removed manually at an identity threshold of 50%, with the entry having the highest resolution typically retained as representative. In addition, entries with all protein chains shorter than 40 residues were not included; such chains could not yield a position-specific scoring matrix by PSI-blast. At the end a representative set of 264 PDB entries was obtained (listed in Supplementary Table S1). Not included in this set were two nonhomologous entries (1i6h and 1w36) of protein complexes, each with more than 2000 residues in total; these two entries were later used in an additional test of DISPLAR.

Throughout this study, only protein chains constituting a single copy of a complete biologically significant multimer in each PDB entry were used. Such chain information was found from ‘REMARK 350 BIOMOLECULE’ of PDB files (or similar remarks in older PDB files). All protein chains with fewer than 40 residues were discarded, again because they could not yield a position-specific scoring matrix by PSI-blast. Among the 264 PDB entries, 139 have a single protein chain and the remaining 125 have at least two chains. In all there are 428 protein chains. The total number of residues is 80 983. For each PDB entry, protein residues that contact DNA chains were found. A contact was defined as a pair of heavy atoms across the protein–DNA interface with a distance less than 5 Å.
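The contact definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `contacting_residues` and the input layout (a list of residue-id and coordinate pairs for protein heavy atoms, and a list of coordinates for DNA heavy atoms) are assumptions made for the example. Only the 5 Å cutoff comes from the text.

```python
import math

CONTACT_CUTOFF = 5.0  # Å; the heavy-atom distance cutoff stated in the text


def contacting_residues(protein_atoms, dna_atoms, cutoff=CONTACT_CUTOFF):
    """Return the ids of residues with a heavy atom closer than `cutoff` to DNA.

    protein_atoms: list of (residue_id, (x, y, z)) for protein heavy atoms
                   (hypothetical input layout).
    dna_atoms:     list of (x, y, z) for DNA heavy atoms.
    """
    contacts = set()
    for res_id, xyz in protein_atoms:
        if res_id in contacts:
            continue  # one close atom pair is enough for the residue
        if any(math.dist(xyz, d) < cutoff for d in dna_atoms):
            contacts.add(res_id)
    return contacts
```

A residue is counted once, however many of its atoms fall within the cutoff, and a pair at exactly 5 Å is not a contact because the text specifies "less than".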
There are a total of 11 305 DNA-contacting, or interface, residues.

Within the data set of 264 protein entries, 140 were found to not have any homologs. Of the remaining 124 entries, those having homologs with sequence identities in the brackets of <10, 10–20, 20–30, 30–40 and 40–50% numbered 6, 17, 38, 29 and 34, respectively.

We focused on protein surface residues. For this purpose, exposed surface areas of residues in each protein multimer were calculated using the DSSP program (17), and surface residues were taken to be the ones with exposed surface areas at more than 10% of maximum values (1). The ratio of exposed surface area and the maximal value will be referred to as the solvent accessibility for each residue. With the threshold of 10% solvent accessibility, 56 093 were classified as surface residues; among these, 10 062 were interface residues. The percentage of interface residues among surface residues is 18%. The 10 062 interface residues will be collectively referred to as the interface group; the remaining 46 031 non-interface surface residues will be referred to as the non-interface group.

Statistics of interface and non-interface surface residues

Residues in the interface and non-interface groups were separately collected according to amino acid types. From these the percentages of the 20 types of amino acids in the interface and non-interface groups were calculated. For each type of amino acid in either the interface or non-interface group, the average solvent accessibility was calculated. As already alluded to, sequence profiles were obtained as the position-specific scoring matrix produced by PSI-blast (10). The search was limited to three rounds with the default e-value threshold (10⁻³). The database consisted of 3 625 149 non-redundant protein sequences (May 2006 release of NCBI nr at ftp://ftp.ncbi.nlm.nih.gov/blast/db/). The substitution matrix was BLOSUM62 (18).
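The surface-residue selection and the split into interface and non-interface groups described above can be sketched as follows. The record layout (`id`, `exposed`, `max_area`, `contacts_dna`) and the function names are hypothetical. The exposed areas would come from DSSP and the per-residue maximum areas from the earlier work cited as (1); neither is computed here. The final lines only re-derive the group sizes reported in the text.

```python
SURFACE_THRESHOLD = 0.10  # surface residue: more than 10% of the maximum exposed area


def solvent_accessibility(exposed_area, max_area):
    """Ratio of the exposed surface area to the maximal value for that residue type."""
    return exposed_area / max_area


def split_surface(residues, threshold=SURFACE_THRESHOLD):
    """Split surface residues into interface and non-interface groups.

    residues: iterable of dicts with hypothetical keys
              'id', 'exposed', 'max_area', 'contacts_dna'.
    Residues at or below the threshold are treated as buried and dropped.
    """
    interface, non_interface = [], []
    for r in residues:
        if solvent_accessibility(r["exposed"], r["max_area"]) > threshold:
            (interface if r["contacts_dna"] else non_interface).append(r["id"])
    return interface, non_interface


# Consistency check on the counts reported in the text.
n_surface, n_interface = 56093, 10062
n_non_interface = n_surface - n_interface                # 46031
interface_percent = 100.0 * n_interface / n_surface      # about 17.9, i.e. 18%
```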
The position-specific scoring matrix for each query sequence has Q × A elements, where Q is the length of the query sequence and A is the size (i.e. 20) of the amino acid alphabet. If position q (=1 to Q) of the query sequence is occupied by amino acid type a (=1 to A), then sequence conservation at this position was measured by the (q, a) element of the scoring matrix. The higher this element, the less frequent the query amino acid's substitution in the multiple sequence alignment and hence the more conserved the amino acid for the particular position. For type a amino acid, the conservation score was taken as the average of the (q, a) elements over query positions which were occupied by type a amino acid and were either in the interface or non-interface group.

Neural network architecture

DISPLAR was largely adapted from the latest implementation of PPISP (2). Unless otherwise indicated, model parameters were inherited from that implementation. The predictor had two types of input: solvent accessibility and sequence profile. Prediction for each residue was based on the input variables of the residue itself plus 14 of its closest spatial neighbors. The solvent-accessibility input for each residue was averaged over the residue and six of its closest spatial neighbors. The sequence-profile input for each residue (say at position q) consisted of the 20 elements in the qth row of PSI-blast position-specific scoring matrix. Two feed-forward, back-propagation neural networks were used consecutively as before. The first network had 15 × 21 input nodes, in which the first quantity was the window size, i.e. one for the residue under consideration plus 14 for its spatial neighbors, and the second quantity was the number of input variables for each residue in the window (one for solvent accessibility plus 20 for sequence profile).
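To make the 15 × 21 input layout concrete, the sketch below assembles the 315-element input for one residue. It rests on assumptions not stated in the text: each residue is represented by a single coordinate for the neighbor search, the accessibility values passed in have already been averaged as described, and the function names are hypothetical. It covers only the input layer, not the networks themselves.

```python
import math

WINDOW = 15    # the residue itself plus 14 closest spatial neighbors
FEATURES = 21  # 1 solvent accessibility + 20 PSSM elements


def nearest_neighbors(index, coords, k=WINDOW - 1):
    """Indices of the k residues closest to residue `index`, nearest first.

    Assumption: one representative coordinate per residue.
    """
    others = (j for j in range(len(coords)) if j != index)
    return sorted(others, key=lambda j: math.dist(coords[index], coords[j]))[:k]


def input_vector(index, coords, accessibility, pssm):
    """Flatten the window around residue `index` into WINDOW * FEATURES values.

    accessibility: per-residue solvent accessibility (assumed already averaged).
    pssm:          per-residue list of 20 scoring-matrix elements.
    """
    window = [index] + nearest_neighbors(index, coords)
    vector = []
    for j in window:
        vector.append(accessibility[j])  # 1 value
        vector.extend(pssm[j])           # 20 values
    return vector
```

The residue under consideration comes first in the window, followed by its neighbors in order of increasing distance. A protein with fewer than 15 residues would yield a shorter vector, which this sketch does not pad.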
The first network was completed with a hidden layer of 150 nodes, and an output layer of two nodes (one for predicting interface and one for predicting non-interface). The input layer of the second network had 15 × 3 nodes, in which the first quantity was window size and the second quantity consisted of the two output values of the first network plus the solvent accessibility. The second network had 30 hidden nodes and again two output nodes. Training of the neural networks amounted to modifying the weight matrix, which was assigned random values initially. Training, cross-training and test sets In most previous prediction studies, the same proteins were used for selecting the optimal protocol and also for reporting the prediction performance (1,2,15,19,20). The dual use of the test proteins likely leads to inflated performance scores. To avoid this pitfall, for the purpose of reporting prediction performance, we randomly divided the data set of 264 protein entries into 10 groups. In turn, 8 groups were pooled for training; one of the two remaining groups was used for cross-training; and the last group was used for testing. Training resulted in a list of weight matrices (up to 20 rounds). Cross-training entailed selecting an optimal collection of weight matrices from different rounds for building consensus predictions (described below). Testing involved obtaining predictions for the group not used either in training or cross-training. With the three-tier division of the data set into 10 groups, each group was part of a training set 45 times, and used for cross-training 9 times and for testing also 9 times. For each residue, the majority outcome of the 9 test results was taken as the final prediction. The three-tier division of the data set avoids the use of the same proteins for both optimizing prediction protocol and reporting performance scores. A priori it was not clear this division was the best use of the data set for making new predictions. 
Therefore we also investigated using the data set in the more traditional way (1,2), with 239 of the 264 entries constituting a single training set and the remaining 25 entries reserved for cross-training. For unequivocal identification, these training and cross-training sets are referred to as ‘two-tier.’ To lessen any possible cross-contamination between training and cross-training, in selecting the two-tier cross-training set, we set an upper bound of 30% sequence identity. That is, we ensured that all entries in the two-tier cross-training set either are nonhomologous or have no more than 30% sequence identities among themselves or with any entry in the training set. The two-tier cross-training set has a total of 5004 surface residues, of which 870 are DNA-contacting (Table 1).
Trimming of non-interface residues

Because there is an imbalance between interface and non-interface residues (the former account for just 18% of all surface residues in our data set of 264 proteins), randomly trimming some of the non-interface residues in the training process may improve accuracy (2,15). Training was carried out without and with one-third trimming of non-interface residues. Both sets of results were used to build consensus predictions (described next).

Consensus prediction from different neural-network weight matrices

Either with or without non-interface trimming, different rounds of neural network training result in different coverage and accuracy. Typically, the number of DNA-contacting predictions would initially increase with the increase in the round of training, leading to increasing coverage but decreasing accuracy; excessive training then leads to a decrease in coverage. Our last implementation of PPISP (2) suggested that taking the consensus of positive predictions from different weight matrices may enhance accuracy at a given coverage. This approach was taken here.

The consensus approach consisted of two steps: (1) clustering of all positive predictions using different weight matrices, and (2) selecting a cluster or clusters as the final predictions. In the first step, each positively predicted residue was assigned a consensus score, defined as the number of times positive predictions were made by the different weight matrices. These residues were then sorted according to consensus score. Starting with the batch having the highest consensus score, residues were clustered if they were among the 19 nearest neighbors of each other. Then the next batch of residues with the second highest consensus score was used to grow the clusters and add new clusters. The process was continued until all the positive predictions were clustered.
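The scoring and batching part of the first step can be sketched as below. The function names and the representation of each weight matrix's output as a set of residue ids are assumptions for illustration. The spatial clustering within and across batches is not reproduced, because the text does not give enough detail to implement it faithfully.

```python
from collections import Counter, defaultdict


def consensus_scores(predictions):
    """Count, for each residue, how many weight matrices predicted it positive.

    predictions: list of sets of residue ids, one set per weight matrix
                 (hypothetical representation).
    """
    return Counter(res for positives in predictions for res in positives)


def batches_by_score(scores):
    """Group residues by consensus score, highest score first."""
    by_score = defaultdict(list)
    for res, score in scores.items():
        by_score[score].append(res)
    return [(s, sorted(by_score[s])) for s in sorted(by_score, reverse=True)]
```

With four weight matrices, for example, a residue predicted positive by all of them has a consensus score of 4 and falls in the first batch to be clustered.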
When a cluster was composed of predictions from different batches, the highest consensus score among all predictions within the cluster was assigned to the cluster. For later reference the maximum consensus score among all clusters is denoted as $\sigma_{\max}$. The number of predictions in a cluster is referred to as the cluster size. In the second step, clusters were selected according to consensus score and cluster size. First of all, clusters were eliminated if their consensus scores were less than $\sigma_{\max} - 5$. The largest size ($s_{\max}$) of the remaining clusters was then found. All clusters with the maximum consensus score were automatically retained. Clusters with consensus scores between $\sigma_{\max} - 5$ and $\sigma_{\max} - 1$ were then eliminated if their sizes were less than either 4 or $s_{\max} - 4$.

Assessment of predictions

The performance of DISPLAR was assessed by coverage and accuracy. If $N_{pr}$ residues are predicted to be DNA-contacting, of which $n_{tp}$ are true positives (i.e. among the $N_{dc}$ actual DNA-contacting residues) and the remaining $n_{fp}$ are false positives, then coverage is $n_{tp}/N_{dc}$. For defining accuracy, we loosened the criterion of ‘true positive’ by counting as positive the four nearest neighbors of the $N_{dc}$ actual DNA-contacting residues. If the number of true positives using this loose criterion is $n'_{tp}$, then accuracy is $n'_{tp}/N_{pr}$.

Optimal collection of weight matrices

We attempted to exhaustively search for the optimal collection of weight matrices. This was done in two stages. The first stage involved only training without non-interface trimming. All possible combinations of weight matrices from the first round to the round in which coverage reached maximum (as reported on the cross-training set) were applied to the cross-training set. Among those with coverage above a threshold, the combination of weight matrices with the highest accuracy was selected. There were three possible coverage thresholds.
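The coverage and accuracy measures defined under "Assessment of predictions" can be sketched as follows. This is an illustration under assumptions: residues are identified by arbitrary ids, the nearest neighbors of each residue are supplied in a precomputed dictionary (nearest first), and the function name is hypothetical.

```python
def coverage_and_accuracy(predicted, actual, neighbors, n_loose=4):
    """Coverage (strict criterion) and accuracy (loosened criterion).

    predicted: set of residue ids predicted to be DNA-contacting (N_pr of them).
    actual:    set of residue ids that actually contact DNA (N_dc of them).
    neighbors: dict mapping a residue id to its spatial neighbors, nearest first
               (hypothetical precomputed input).
    """
    # Strict true positives: predicted residues that actually contact DNA.
    n_tp = len(predicted & actual)
    coverage = n_tp / len(actual)

    # Loosened criterion: the four nearest neighbors of each actual
    # DNA-contacting residue also count as positives.
    loose = set(actual)
    for res in actual:
        loose.update(neighbors.get(res, [])[:n_loose])
    n_tp_loose = len(predicted & loose)
    accuracy = n_tp_loose / len(predicted)
    return coverage, accuracy
```

Coverage uses the strict definition and accuracy the loosened one, so a prediction adjacent to the true interface raises accuracy without raising coverage. The sketch assumes both sets are non-empty.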
The highest was 58%; when prediction did not reach this coverage, the threshold was successively lowered to 50 and 40%. In the second stage, the selected list of weight matrices from the first stage was added to all possible combinations of weight matrices of training with one-third non-interface trimming, again from the beginning round to the round in which coverage reached maximum. Applied to the cross-training set, the combination of weight matrices with the highest coverage among those with accuracies within 1 or 3 percentage points of the highest accuracy was selected as the final collection of weight matrices. The same two-stage optimization procedure was used for both the three-tier and two-tier divisions of the data set. The only difference was in the final collection of weight matrices, with the three-percentage-point accuracy window for the former and the one-percentage-point accuracy window for the latter. The optimal collection for the two-tier cross-training set was composed of weight matrices from rounds 3 and 13 of training without non-interface trimming and rounds 5 and 6 of training with one-third non-interface trimming. RESULTS AND DISCUSSION Characteristics of DNA-contacting residues As noted in the Introduction, a number of properties distinguishing DNA-contacting residues from non-contacting residues on proteins have been reported in previous studies. As such distinctions form the basis of DISPLAR, the database for constructing the prediction method was analyzed to find the level of contrast between interface and non-interface residues. Figure 1
The contrast in solvent accessibility between the interface and non-interface groups also shows an interesting pattern (Figure 1). Compared to similar statistical analysis for protein–protein interfaces (1,2), the contrasts between the interface and non-interface groups shown here appear to be significantly stronger and better correlated among the three different measures. The hope is then that the neural network approach will work even better for DNA-binding site prediction.

Overall assessment of predictions

With the three-tier division of the data set, the accuracies of the 10 test sets averaged 76.4%, with a standard deviation of 4.7%; the corresponding coverages averaged 60.1%, with a standard deviation of 5.3%. The variations of accuracy and coverage were partly related to their anticorrelation: higher accuracy corresponded to lower coverage. The test-set results were regrouped according to homology levels of the protein entries. The 140 entries without homologs had an accuracy of 73.7% and coverage of 51.3%. In comparison, protein entries with homologs had higher accuracy, averaging 79.7%, and higher coverage, averaging 71.6%, among identity brackets of 10–20, 20–30, 30–40 and 40–50%. Variations of accuracy and coverage among identity levels fell within the standard deviations. The insensitivity to identity level suggests that the better predictions were not due to homology between test protein and training set per se. Instead it points to benefits from alignments with other DNA-binding proteins in the generation of PSI-blast sequence profiles. Nevertheless it is quite encouraging that DISPLAR yielded prediction accuracy over 70% at a coverage of over 50% for DNA-binding proteins without other homologs in the PDB.

We can now list a number of important differences between DISPLAR and the method of Ahmad et al.
(6) [these authors have recently adapted their method for predicting DNA-binding sites from protein sequence only (21); such predictions were also done in two other studies (19,20)]. We eliminated buried residues from the data set. They included only two sequential neighbors whereas we included 14 spatial neighbors. We added a second neural network. Our method benefited from a much more exhaustive training set and a much more exhaustive sequence database for generating sequence profiles. Another technical reason for the poor performance of their method is that they used a 3.5-Å cutoff for defining DNA-contacting whereas we used 5 Å. The shorter cutoff distance leads to an excessively small fraction (~6.5%) of interface residues among the data set. Such a small interface fraction makes it trivial to predict non-interface residues and leads to a tendency for over-predicting interface residues to ensure a reasonable coverage (Ahmad et al.'s positive interface predictions were three times the actual interface residues). The over-prediction was masked in their study because they chose to include negative predictions in accuracy assessment. In our opinion, only positive interface predictions are meaningful for accuracy assessment, since the goal is to identify DNA-binding sites. This point is especially important because of the imbalance between interface and non-interface residues.

An obvious difference between DISPLAR and the method of Kuznetsov et al. (15) is the use of neural networks versus support vector machine (SVM). In implementing the predecessor of DISPLAR, i.e. PPISP, we compared neural network and SVM predictions and did not find the latter to be better (2), even though in another study we found the two methods to be competitive in predicting solvent accessibility (22). The more substantive difference between DISPLAR and the method of Kuznetsov et al. lies in the use of structural information. As noted, the list of 14 spatial neighbors is coded in DISPLAR.
In contrast, Kuznetsov et al. used six sequential neighbors and included information of spatial neighbors in the form of occurrence frequencies for the 20 types of amino acids within a 12-Å sphere around each residue. This use of spatial information appears to have limited value, improving accuracy by just a few percentage points (15). Kuznetsov et al. have provided their method in a web server (http://lcg.rit.albany.edu/dp-bind/). Applying their method on our data set of 264 protein entries, the coverage and accuracy (calculated in the same way as for our predictions) are found to be 60 and 56%, respectively. At the same coverage of 60%, the gap of 20 percentage points from our average prediction accuracy is over four times the latter's standard deviation, thus clearly demonstrating better performance of our method.

Predictions for the two-tier cross-training set

To help resolve whether the three-tier division or the two-tier division was a better use of the data set, test results and cross-training results from the three-tier training were gathered for the two-tier cross-training set of 25 protein entries. The coverage and accuracy of the test results were 64.7 and 79.6%, respectively. The cross-training results showed only slight increases in coverage and accuracy, at 64.8 and 80.2%, respectively. In comparison, the cross-training results from the two-tier training had coverage and accuracy of 63.9 and 84.2%, respectively (Table 1). While the difference in accuracy of 4 percentage points is within the standard deviation (4.7%) found from the three-tier test sets, other comparisons also consistently showed modestly better performance for the two-tier training. These included interface predictions for the prion protein and two large DNA-binding proteins and classification of proteins into DNA binding and non-binding. We therefore concluded that the two-tier training was superior and from here on, results from the two-tier training are reported.
We also used the two-tier cross-training set to investigate contributing factors to the performance of DISPLAR. One such factor is consensus prediction, based on the weight matrices from rounds 3 and 13 of training without non-interface trimming and rounds 5 and 6 of training with one-third non-interface trimming. Without non-interface trimming, the highest coverage was obtained in round 14; that coverage was 58.5% and the corresponding accuracy was 79.7%. With one-third non-interface trimming, the highest coverage was expectedly raised, to 64.0%, in round 12, but the corresponding accuracy was lowered, to 75.9%. The consensus prediction had statistically higher coverage than the best single training without non-interface trimming and statistically higher accuracy than the best single training with non-interface trimming. For a multimeric protein, in generating the position-specific scoring matrix there are two alternatives. One is to use the individual chains of the protein as separate query sequences and then concatenate the resulting scoring matrices. The other is to concatenate the sequences first and then generate a single scoring matrix. We found the scoring matrix of the first alternative to be more robust. The contrast in sequence conservation between the interface and non-interface groups is stronger, and the predictions of interface residues are more accurate. Apparently using the separate chains as queries allows PSI-blast to focus the search on the chains, generating higher quality alignments. This method is what was used in generating the results reported in Table 1. A similar method was used for predicting protein–protein interfaces of a multimeric protein complex (2). We also found the second neural network to be very useful. The idea of a second network was inherited from neural network predictions of protein secondary structures (23). 
The second network plays the role of reconciling conflicting predictions for (sequentially or spatially) neighboring residues. We found that in DISPLAR the second neural network indeed plays this role. The predictions from the first network tend to be scattered throughout the protein surface. After the second network, the predictions are more clustered, and the accuracy is much higher. Detailed comparison of predicted and actual interface residues on four proteins The accuracies and coverages of the 25 protein entries in the two-tier cross-training set are listed in Table 1. The coverages for individual entries range from 30% (for 1s40A) to 100% (for 1dh3A,C), while the accuracy is above 50% for all but two entries (1briL and 1rfiB). To illustrate the range of prediction quality, we now present detailed comparison between predicted and actual DNA-contacting residues for four proteins. They include a worst-case scenario (PDB 1brn), for which both the coverage and accuracy were low; a representative (PDB 1gd2) of the successful cases with both high coverage and high accuracy; and two (PDB 1s40 and 1u1q) of the more typical situations with medium coverage and high accuracy. Figure 2
Pap1 is a basic region leucine zipper transcription factor that binds the consensus DNA sequence TTACGTAA. In the structure of the complex (PDB 1gd2; Figure 2 In the complex between the DNA-binding domain of yeast telomere-binding protein Cdc13 and a cognate telomeric single-stranded DNA (PDB 1s40; Figure 2 The UP1 region (residues 1–195) of heterogeneous ribonucleoprotein A1 contains two RNA recognition motifs, which have high affinity for both single-stranded RNA and the telomeric sequence d(TTAGGG)n. In the complex between UP1 and d(TTAGGG)2 (Figure 2 Prediction with unbound protein structures Fourteen of the 25 proteins in the two-tier cross-training set have unbound structures deposited in the PDB (see Table 1). These provided an opportunity to apply DISPLAR in a real situation. At the outset it should be noted that DISPLAR and its predecessor PPISP by design only include input parameters that are not particularly sensitive to binding-induced conformational changes, and the preservation of prediction coverage and accuracy using unbound structures has been demonstrated for PPISP (1,2). Indeed, the solvent accessibility, the property that is most likely to be affected by conformational changes upon binding DNA, calculated using the 14 unbound structures show the same distinction between interface and non-interface residues as seen in Figure 1 For 12 of the 14 proteins, the root-mean square deviations (RMSD) of Cα atoms between bound and unbound structures are below 2.5 Å. The two exceptions are 1f5e/2alc and 1lei/1ikn, representing two different types of gross conformational changes. The former is a case of global distortion, with the overall RMSD of 5.8 Å distributed throughout the protein structure (Figure 3
The second type of gross conformational change is rearrangement between protein domains. The DNA-bound structure 1lei has two different chains (A and B), both consisting of two domains. Only three of the domains are present in the unbound structure 1ikn (missing the N-terminal domain of chain B; the C-terminal domain of chain B is labeled as chain C in 1ikn). The N-terminal domain of chain A (residues 19–188) in 1lei superimposes on its counterpart in 1ikn with an RMSD of 1.1 Å; the C-terminal domain of chain A (residues 191–291) and the C-terminal domain of chain B (residues 245–350) together in 1lei superimpose on their counterparts in 1ikn with an RMSD of 0.8 Å. However, these two portions experience a relative rotation of about 180° upon binding DNA (Figure 3 Å for the three domains together. In the bound structure, all four domains contact DNA, with the N- and C-terminal domains of chain A contributing 21 and 4 residues and the N- and C-terminal domains of chain B contributing 9 and 5 residues, respectively, to the DNA-binding site. Correspondingly DISPLAR predicted 14, 9, 22 and 3 DNA-contacting residues for these four domains using the bound structure. With the N-terminal domain of chain B missing in the unbound structure, DISPLAR predicted 19 residues and 1 residue, respectively, for the N- and C-terminal domains of chain A, and nothing for the C-terminal domain of chain B. The results using the unbound structure are probably as good as can be expected based on those using the bound structure, demonstrating that DISPLAR also performs well when binding-induced domain rearrangements occur.
Specific versus non-specific DNA binding
In addition to specific DNA sequences, many proteins also bind to non-specific DNA. In the dataset of 264 proteins, there are structures for a few non-specific complexes.
Three of the proteins, the λ Cro repressor, the lac repressor headpiece dimer and the DNA-adenine methyltransferase, have structures for both specific and non-specific complexes (28–32). The binding sites for specific and non-specific DNA largely overlap, with the same set of residues switching from electrostatic interactions with the DNA backbone in a non-specific complex to specific interactions with base pairs in a cognate DNA sequence. There are also many additional residues that interact with DNA in the specific complexes. The numbers of DNA-contacting surface residues are 18, 57 and 33 in the three specific complexes, compared to 12, 49 and 17 in the non-specific complexes. The numbers of interface residues that are common to the specific and non-specific complexes are 7, 38 and 5, respectively. The lac repressor headpiece dimer is in the two-tier cross-training set. Using the unbound structure (PDB 1lqc), DISPLAR predicted mostly residues that contact DNA in both the specific and non-specific complexes. With the specific complex (PDB 1l1m) as target, the coverage was 77% and the accuracy was 100%. Another protein in the two-tier cross-training set is Sac7d (PDB 1xyi and 1xx8 for the bound and unbound structures, respectively), which is a small chromatin protein that binds to DNA without any particular sequence preference (33,34). DISPLAR predictions using both the bound and the unbound structures had a coverage of >40% and an accuracy of >90%. These values fall within the range of DISPLAR performance shown in Table 1.
Application to prion protein
Lima et al. (16) recently showed that the prion protein binds DNA and used NMR chemical shift perturbation to characterize the binding interface. In the structured region (residues 125–228) of the protein, 15 residues are implicated in DNA binding, as indicated by large changes in 1H or 15N chemical shifts upon DNA binding. We applied DISPLAR to this protein (PDB 1b10) and obtained 23 predicted DNA-contacting residues.
As shown in Figure 4
Classification of DNA-binding and non-binding proteins
An inherent assumption in using DISPLAR is that a protein is known to bind DNA. What would DISPLAR predict if it were applied to a non-binding protein? Can DISPLAR prediction results be used to predict whether a protein is a binder or non-binder? To answer these questions, we applied DISPLAR to the full set of 264 DNA-binding proteins and to a set of 250 non-binders collected by Stawiski et al. (4). For this purpose we used consensus predictions based on weight matrices from rounds 3 and 13 of training without non-interface trimming. This consensus approach resulted in fewer positive predictions than the one that also includes weight matrices from rounds 5 and 6 of training with non-interface trimming; we thought fewer positive predictions would be helpful for obtaining more balanced success rates for classifying both binders and non-binders. An immediate difference in DISPLAR results between the binders and non-binders is that only three of the 264 proteins in the former group had no positive predictions, but 100 of the 250 proteins in the latter group had no positive predictions. Of the proteins with positive predictions, the predicted residues also show very different characteristics. First, the binder group has a total of 80 983 residues, of which 56 093 are on the surface, and 11 050 (or 20%) were predicted as DNA-contacting. In contrast, the non-binder group has a total of 41 091 residues, of which 25 820 are on the surface, and only 2307 (or 9%) were predicted as DNA-contacting. Second, the distributions of the positive predictions among the 20 types of amino acids were different. The distribution of the binder group reflected that of the DNA-binding interface, whereas the distribution of the non-binder group appeared similar to the non-interface of DNA-binding proteins (see Figure 1
These large differences between the two groups motivated us to develop a classifier using DISPLAR prediction results. First of all, a protein without any positive predictions was automatically classified as a non-binder. If positive predictions were obtained, then the results were processed and fed to a neural network for further classification. This neural network had 23 inputs for each protein, 20 of which were the percentages of the 20 types of amino acids among the positive predictions. The remaining three inputs were: the average number of neighboring predictions, the percentage of positive predictions among all surface residues and the percentage of surface residues among all residues. One hundred and forty-nine binders were randomly picked to mix with 149 of the non-binders to train the neural network. The one non-binder that was left out was then tested. In all, only 42 of the non-binders were misclassified, giving a success rate of 83% for classifying non-binders. A final training was carried out with 149 binders and all the 150 non-binders. Tested on the remaining 112 binders, 15 were found to be misclassified. Taking into account the three binders that were misclassified due to the lack of positive DISPLAR predictions, the overall success rate for classifying binders was 86%. These success rates are competitive with those of classification methods that directly use structural data (4–6). That this level of success was achieved using prediction results provides another demonstration of the accuracy of DISPLAR.
Application to RNA-binding proteins
Many proteins bind to both RNA and DNA. It is thus interesting to see whether DISPLAR could also predict RNA-binding sites. We collected a representative set of 106 RNA-binding proteins with sequence identity less than 50%. DISPLAR had modest success on these proteins, with a coverage of 31.3% of the 3695 actual RNA-contacting residues and a prediction accuracy of 54.1%.
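The 23 inputs of the binder/non-binder network described in the classification section above can be assembled as in the following sketch. It is an illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, and "average number of neighboring predictions" is read here as the mean, over predicted residues, of how many of their neighbors are also predicted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def classifier_inputs(predicted_types, neighbor_counts, n_surface, n_total):
    """Build the 23 classifier inputs for one protein.

    predicted_types : one-letter codes of the positively predicted residues
    neighbor_counts : per predicted residue, the number of its neighbors that
                      are also predicted (assumed interpretation)
    n_surface       : number of surface residues in the protein
    n_total         : total number of residues in the protein
    Returns None when there are no positive predictions; such proteins are
    classified as non-binders without consulting the network.
    """
    n_pred = len(predicted_types)
    if n_pred == 0:
        return None
    # Inputs 1-20: percentage of each amino acid type among the predictions.
    composition = [100.0 * predicted_types.count(aa) / n_pred for aa in AMINO_ACIDS]
    # Inputs 21-23: the three summary values named in the text.
    avg_neighbors = sum(neighbor_counts) / n_pred
    pct_predicted = 100.0 * n_pred / n_surface
    pct_surface = 100.0 * n_surface / n_total
    return composition + [avg_neighbors, pct_predicted, pct_surface]


# Hypothetical protein: 5 predicted residues out of 50 surface and 80 total residues.
features = classifier_inputs(list("RKRKS"), [3, 2, 4, 1, 0], n_surface=50, n_total=80)
print(len(features), features[-3:])
```

Keeping the "no positive predictions" case outside the feature vector matches the rule stated in the text, since the composition percentages are undefined when nothing is predicted.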
Using the two-tier approach of DISPLAR, we randomly picked 86 of the RNA-binding proteins to train a neural network. When tested on the remaining 20 proteins (with less than 30% identity among themselves and with the training set), the coverage and accuracy improved to 57.1 and 63.3%, respectively. The accuracy is significantly lower than its counterpart for DNA-binding proteins. The difference may be partly due to the smaller training set and partly due to higher diversity among RNA-binding proteins than among DNA-binding proteins.
Application to large protein–nucleic acid complexes
Most nucleic-acid-targeting proteins form multi-subunit complexes in their biological processes. Two such large complexes, the RNA polymerase II elongation complex and the RecBCD–DNA complex, have had their structures determined (PDB 1i6h and 1w36) (35,36). These two complexes, without any homologs in the data set of 264 protein entries, were not included in the development of DISPLAR, partly because of the concern that the scarcity of large complexes would not allow for accurate predictions on them and partly because of the thought that alternative approaches, such as one focusing on one subunit or one domain therein at a time, might be more suited. Once DISPLAR was found to be quite accurate on the test sets, we became curious about its applicability to the two large protein–DNA complexes. The two complexes (1i6h and 1w36) have 10 and 3 protein subunits, respectively; we took each subunit as a separate test protein. The prediction of DNA-contacting residues appears very encouraging.
Figure 6
Predictions for the RecBCD–DNA complex are shown in Figure 6
We also applied DISPLAR to the largest complex in the PDB, the ribosome, which turned out to be a very easy target. Using training with RNA-binding proteins, we predicted 1560 of the 2938 surface residues on the large ribosomal subunit of Haloarcula marismortui (37) to be RNA-contacting (Figure 6
Further studies
We have shown that protein residues making up a binding site for DNA have strong characteristics, such as enrichment of Arg and Lys and depletion of Asp and Glu, and based on these characteristics we have developed a method, DISPLAR, for predicting residues that form the DNA-binding site. Mutations of DNA-contacting residues, such as those on the tumor suppressor protein P53 (38), may be directly involved in human diseases. DISPLAR can thus be used to predict such disease mutations. Perhaps most importantly, the predictions of DISPLAR can be used to guide the docking of a protein and its cognate DNA to build a structure for the complex (39). Such an approach has already been shown to be successful for protein–protein complexes (40) and seems promising for protein–DNA complexes. The performance of DISPLAR can be further improved in several respects. Dividing the data set into subgroups with similar properties for separate training has been found to be useful in PPISP (2). Such a strategy may be adapted for protein–DNA complexes; the division could be based on clustering the interfaces through spatial relations of protein residues and DNA bases (41,42). Additional spatial features, such as the electrostatic potential surface (5,11–13) and the protein surface curvature (11), may also increase accuracy. Besides neural networks, the input data can be used to train other predictors such as support vector machines (15), and the results of different predictors can be pooled to give an ensemble prediction (22). These improvements will be explored in the future.
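As an illustration of how the outputs of several predictors might be pooled into an ensemble prediction, the sketch below uses simple majority voting over per-predictor sets of predicted residues. Majority voting is an assumption chosen for clarity; the text does not specify the pooling rule, and the function name and residue numbers are hypothetical.

```python
from collections import Counter


def pool_predictions(predictions, min_votes=None):
    """Pool residue predictions from several predictors by voting.

    predictions : list of sets of residue identifiers, one set per predictor
    min_votes   : votes required to keep a residue; defaults to a strict
                  majority of the predictors (an assumed rule)
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    votes = Counter(res for pred in predictions for res in set(pred))
    return {res for res, n in votes.items() if n >= min_votes}


# Hypothetical outputs from three predictors (e.g. a neural network, an SVM, ...).
pooled = pool_predictions([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}])
print(sorted(pooled))
```

Raising `min_votes` trades coverage for accuracy, which is the same trade-off noted earlier when a stricter consensus produced fewer positive predictions.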
The prediction of DNA-binding sites on protein surfaces by DISPLAR complements work on prediction of protein-binding sites on DNA. A number of methods have been developed to predict DNA sequences recognized by transcription factors (TF), including position-specific weight matrices (43) and threading of DNA sequences through a TF–DNA complex either by a statistical potential (44,45) or by an atomistic energy function (7,46,47). Work on both the protein side and the DNA side will contribute to our understanding of their interactions. The DISPLAR web server can be found at http://pipe.scs.fsu.edu/displar.html.
SUPPLEMENTARY DATA
Supplementary Data is available at NAR Online.
ACKNOWLEDGEMENTS
This work was supported in part by NIH grant GM58187. Funding to pay the Open Access publication charge was provided by the NIH (grant GM58187).
Conflict of interest statement. None declared.
REFERENCES
1. Zhou H-X, Shan Y. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins. 2001;44:336–343.
2. Chen H, Zhou H-X. Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins. 2005;61:21–35.
3. Luscombe NM, Laskowski RA, Thornton JM. Amino acid-base interactions: a three-dimensional analysis of protein–DNA interactions at an atomic level. Nucl. Acids Res. 2001;29:2860–2874.
4. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y. Annotating nucleic acid-binding function based on protein structure. J. Mol. Biol. 2003;326:1065–1079.
5. Jones S, Shanahan HP, Berman HM, Thornton JM. Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucl. Acids Res. 2003;31:7189–7198.
6. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486.
7. Havranek JJ, Duarte CM, Baker D. A simple physical model for the prediction and design of protein-DNA interactions. J. Mol. Biol. 2004;344:59–70.
8. Lejeune D, Delsaux N, Charloteaux B, Thomas A, Brasseur R. Protein-nucleic acid recognition: statistical analysis of atomic interactions and influence of DNA structure. Proteins. 2005;61:258–271.
9. Luscombe NM, Thornton JM. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol. 2002;320:991–1009.
10. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 1997;25:3389–3402.
11. Keil M, Exner TE, Brickmann J. Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network. J. Comput. Chem. 2004;25:779–789.
12. Ferrer-Costa C, Shanahan HP, Jones S, Thornton JM. HTHquery: a method for detecting DNA-binding proteins with a helix-turn-helix structural motif. Bioinformatics. 2005;21:3679–3680.
13. Tsuchiya Y, Kinoshita K, Nakamura H. PreDs: a server for predicting dsDNA-binding site on protein molecular surfaces. Bioinformatics. 2005;21:1721–1723.
14. Kummerfeld SK, Teichmann SA. DBD: a transcription factor prediction database. Nucl. Acids Res. 2006;34:D74–D81.
15. Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins. 2006;64:19–27.
16. Lima LMTR, Cordeiro Y, Tinoco LW, Marques AF, Oliveira CLP, Sampath S, Kodali R, Choi G, Foguel D, et al. Structural insights into the interaction between prion protein and nucleic acid. Biochemistry. 2006;45:9180–9187.
17. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
18. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 1992;89:10915–10919.
19. Wang L, Brown SJ. BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucl. Acids Res. 2006;34:W243–W248.
20. Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V. Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006;7:262.
21. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33.
22. Chen H, Zhou H-X. Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucl. Acids Res. 2005;33:3193–3199.
23. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993;232:584–599.
24. Buckle AM, Fersht AR. Subsite binding in an RNase: structure of a barnase-tetranucleotide complex at 1.76-Å resolution. Biochemistry. 1994;33:1644–1653.
25. Fujii Y, Shimizu T, Toda T, Yanagida M, Hakoshima T. Structural basis for the diversity of DNA recognition by bZIP transcription factors. Nat. Struct. Biol. 2000;7:889–893.
26. Mitton-Fry RM, Anderson EM, Theobald DL, Glustrom LW, Wuttke DS. Structural basis for telomeric single-stranded DNA recognition by yeast Cdc13. J. Mol. Biol. 2004;338:241–255.
27. Myers JC, Shamoo Y. Human UP1 as a model for understanding purine recognition in the family of proteins containing the RNA recognition motif (RRM). J. Mol. Biol. 2004;342:743–756.
28. Albright RA, Matthews BW. Crystal structure of λ-Cro bound to a consensus operator at 3.0 Å resolution. J. Mol. Biol. 1998;280:137–151.
29. Albright RA, Mossing MC, Matthews BW. Crystal structure of an engineered Cro monomer bound nonspecifically to DNA: possible implications for nonspecific binding by the wild-type protein. Protein Sci. 1998;7:1485–1494.
30. Kalodimos CG, Bonvin AM, Salinas RK, Wechselberger R, Boelens R, Kaptein R. Plasticity in protein-DNA recognition: lac repressor interacts with its natural operator O1 through alternative conformations of its DNA-binding domain. EMBO J. 2002;21:2866–2876.
31. Kalodimos CG, Biris N, Bonvin AMJJ, Levandoski MM, Guennuegues M, Boelens R, Kaptein R. Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science. 2004;305:386–389.
32. Horton JR, Liebert K, Hattman S, Jeltsch A, Cheng X. Transition from nonspecific to specific DNA interactions along the substrate-recognition pathway of Dam methyltransferase. Cell. 2005;121:349–361.
33. Chen CY, Ko TP, Lin TW, Chou CC, Chen CJ, Wang AH. Probing the DNA kink structure induced by the hyperthermophilic chromosomal protein Sac7d. Nucl. Acids Res. 2005;33:430–438.
34. Bedell JL, Edmondson SP, Shriver JW. Role of a surface tryptophan in defining the structure, stability, and DNA binding of the hyperthermophile protein Sac7d. Biochemistry. 2005;44:915–925.
35. Gnatt AL, Cramer P, Fu J, Bushnell DA, Kornberg RD. Structural basis of transcription: an RNA polymerase II elongation complex at 3.3 Å resolution. Science. 2001;292:1876–1882.
36. Singleton MR, Dillingham MS, Gaudier M, Kowalczykowski SC, Wigley DB. Crystal structure of RecBCD enzyme reveals a machine for processing DNA breaks. Nature. 2004;432:187–193.
37. Schmeing TM, Huang KS, Kitchen DE, Strobel SA, Steitz TA. Structural insights into the roles of water and the 2′ hydroxyl of the P site tRNA in the peptidyl transferase reaction. Mol. Cell. 2005;20:437–448.
38. Bullock AN, Fersht AR. Rescuing the function of mutant P53. Nat. Rev. Cancer. 2001;1:68–76.
39. van Dijk M, van Dijk ADJ, Hsu V, Boelens R, Bonvin AMJJ. Information-driven protein–DNA docking using HADDOCK: it is a matter of flexibility. Nucl. Acids Res. 2006;34:3317–3325.
40. van Dijk ADJ, de Vries SJ, Dominguez C, Chen H, Zhou H-X, Bonvin AMJJ. Data-driven docking: HADDOCK's adventures in CAPRI. Proteins. 2005;60:232–238.
41. Pabo CO, Nekludova L. Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? J. Mol. Biol. 2000;301:597–624.
42. Siggers TW, Silkov A, Honig B. Structural alignment of protein-DNA interfaces: insights into the determinants of binding specificity. J. Mol. Biol. 2005;345:1027–1045.
43. Bulyk M. Computational prediction of transcription-factor binding site locations. Genome Biol. 2003;5:201.
44. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nucl. Acids Res. 1998;26:2306–2312.
45. Kono H, Sarai A. Structure-based prediction of DNA target sites by regulatory proteins. Proteins. 1999;35:114–131.
46. Endres RG, Schulthess TC, Wingreen NS. Toward an atomistic model for predicting transcription-factor binding sites. Proteins. 2004;57:262–268.
47. Paillard G, Lavery R. Analyzing protein-DNA recognition mechanisms. Structure. 2004;12:113–122.