Proc Natl Acad Sci U S A. Jun 3, 2008; 105(22): 7708–7713.
Published online May 29, 2008. doi:  10.1073/pnas.0707032105
PMCID: PMC2409390

Coevolution at protein complex interfaces can be detected by the complementarity trace with important impact for predictive docking


Protein surfaces are under significant selection pressure to maintain interactions with their partners throughout evolution. Capturing how selection pressure acts at the interfaces of protein–protein complexes is a fundamental issue with high interest for the structural prediction of macromolecular assemblies. We tackled this issue under the assumption that, throughout evolution, mutations should minimally disrupt the physicochemical compatibility between specific clusters of interacting residues. This constraint drove the development of the so-called Surface COmplementarity Trace in Complex History score (SCOTCH), which was found to discriminate with high efficiency the structure of biological complexes. SCOTCH performances were assessed not only with respect to other evolution-based approaches, such as conservation and coevolution analyses, but also with respect to statistically based scoring methods. Validated on a set of 129 complexes of known structure exhibiting both permanent and transient intermolecular interactions, SCOTCH appears as a robust strategy to guide the prediction of protein–protein complex structures. Of particular interest, it also provides a basic framework to efficiently track how protein surfaces could evolve while keeping their partners in contact.

Keywords: evolution, prediction, protein–protein interaction

The modular assembly of proteins is a key determinant in the regulation of biological systems. Combinations of inter- and intramolecular interactions hold the cell machineries and govern the flow of information transmitted through cell signaling pathways. To unravel the complexity of cell organization, atlases of the physical interactome have been obtained for several model organisms (1, 2). To further elucidate the competitions and synergies ruling the molecular logic of these protein–protein interaction networks, a critical step relies on the structural characterization of the protein complexes. However, there is still a huge gap between the proteome-wide data accumulating and the available structural details of macromolecular complexes.

A number of studies have tackled the large-scale analysis of protein–protein complexes from a structural perspective (3–7). They have emphasized that size, shape, and the physicochemical complementarities at the interfaces are key descriptors that could be used to develop computational methods able to predict protein-binding sites from sequences or structures (8–14). In the context of evolution, seminal studies compared the binding modes of domain–domain interactions between homologous proteins and concluded that they tend to interact similarly, even when sequence identity is as low as 30% (15, 16). Such a low conservation threshold suggests that interaction surfaces can evolve significantly while maintaining sufficient specificity between the binding partners. The paradox between sequence divergence and structural conservation of macromolecular assemblies has been recently related to the notion of superfamily, and the existence of “heterodimer superfamilies” has been proposed (17). Although the hydrophobicity of buried positions is a major evolutionary constraint for the stable maintenance of a fold within a superfamily, it is not the only driving force at interfaces because these are usually made of heterogeneous physicochemical textures. In that context, understanding and capturing how selection pressure is exerted at protein-binding interfaces constitutes a fundamental challenge.

Different evolution-based methods, such as conservation and coordinated mutations analyses, have been examined to address this issue. Sensitive conservation analyses have been proposed to detect functional sites within proteins (18–20) and have further been used for the identification of protein-binding sites (21–26). However, conservation does not provide mutual information between interacting partners so as to identify the residue pairs in contact. The goal of coordinated mutations analyses is to identify correlated or compensatory mutations between regions that are close in space (27–29). These analyses were shown to be potent in the prediction of the interacting sites within RNA (30) and were also used to predict intra- or intermolecular contacts between domains (23, 31, 32). Based on a large complex database, the Weng group (23) analyzed correlated mutations across permanent and transient interfaces and concluded that transient interfaces show very little evidence of coevolution. The limits of the correlated mutation signals for predicting intermolecular interactions were also emphasized in ref. 32. Overall, these results suggest that evolution-based methods may lack accuracy to assist in large-scale intermolecular-contact prediction and in docking.

The high rate of mutations occurring at protein surfaces should severely challenge the cohesion of protein complexes across evolution. It is difficult to conceive how these perturbations can be accommodated with the delicate balance of elementary interactions that ensure affinity and selectivity (33). To address this issue, we explored an alternative approach, distinct from conservation and coordinated mutations analyses, to inspect how the physicochemical complementarities between contacting positions were constrained throughout evolution. Our analysis was first performed on a database of 86 complexes restricted to intramolecular, mostly permanent, interactions between domains of multidomain proteins. It constitutes a suitable dataset to bypass the problems related to the definition of orthology. In a second stage, a database of 129 intermolecular complexes comprising 75 transient and 54 permanent interactions was analyzed.

A remarkable versatility was found in the way physicochemical complementarities were maintained, likely accounting for the difficulty of capturing specific evolutionary traces at protein–protein interfaces. In many cases, complementarity is found to be preserved not only in a pairwise manner but also through clusters of neighboring residues. The complementarity appears strongly constrained only when mutations are considered in the framework of these clusters. Interactions belonging to different complementarity classes such as hydrophobic or short-range electrostatic are then found minimally disrupted throughout evolution. Based on this analysis, a powerful predictive approach called SCOTCH, as an acronym for Surface COmplementarity Trace in Complex History, was developed. It provides a robust predictive method for the identification of native and near-native interfaces which outperforms other evolution-based or statistically based methods tested. The SCOTCH method provides an appealing strategy for the structural prediction of macromolecular assemblies and a basic framework to efficiently track how protein surfaces could evolve while keeping their partners in contact.

Results


Structural Neighbors Are Key to Capture Evolutionary Constraints at Interfaces.

We first wondered whether the physicochemical complementarity of interacting residue pairs was frequently disrupted through the evolution of intramolecular interfaces. To that end, a database of 86 intramolecular complexes was built from the PSIMAP database (34) following the protocol described in Methods and supporting information (SI) Tables S1 and S2. We tested a simple model restricted to three types of complementarity between amino acids whose disruption is expected to be highly detrimental to the stability of a complex: (i) hydrophobic–hydrophobic, (ii) polar–polar, and (iii) short-range oppositely charged complementarities. Based on this model, ≈62% of the residues in an interface have at least one contact with a complementary residue (see distribution Fig. S1A). For each complex between domains A and B, we analyzed all pairs (i,j) formed by each residue i in interface A with all its contacting residues j in interface B. For every pair (i,j), we defined a parameter called “complementarity ratio” as follows: the ratio of sequences inside a multiple sequence alignment for which the physicochemical complementarity between two sites is observed (Fig. 1) (see Methods). If this ratio is above a fixed threshold of 95%, the pair is considered as “significantly complementary” (the complementarity is disrupted in <5% of the sequences of the multiple-sequence alignment). Considering our intramolecular dataset, the proportion of pairs (i,j) at an interface exhibiting a given complementarity ratio was analyzed (Fig. 2A, white bars). Only a small proportion (11%) of significantly complementary pairs was found.
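As an illustration of the pairwise complementarity ratio defined above, the following minimal Python sketch computes it for two aligned columns. The residue classes are those listed in Methods; the function names, the toy alignment columns, and the choice to count gaps as non-complementary are assumptions made for this example and are not taken from the original implementation.

```python
# Residue classes as listed in the Methods section of the paper.
HYDROPHOBIC = set("GAVLIMCFPWY")
POLAR = set("STNQ")
POSITIVE = set("KRH")
NEGATIVE = set("DE")

# Threshold above which a pair is called "significantly complementary".
THRESHOLD = 0.95


def is_complementary(a: str, b: str) -> bool:
    """True if residues a and b match one of the three complementarity types."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return True
    if a in POLAR and b in POLAR:
        return True
    return (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE)


def complementarity_ratio(column_i: str, column_j: str) -> float:
    """Fraction of aligned sequences in which positions i and j are complementary.

    column_i and column_j hold one residue per sequence, in the same sequence order.
    Gap characters belong to no class, so they count as non-complementary
    (an assumption of this sketch).
    """
    if len(column_i) != len(column_j) or not column_i:
        raise ValueError("columns must be non-empty and of equal length")
    hits = sum(is_complementary(a, b) for a, b in zip(column_i, column_j))
    return hits / len(column_i)


# Toy example (hypothetical data): one sequence out of ten carries K instead of
# a hydrophobic residue, so the pair falls below the 95% threshold.
col_i = "LLLLIVLLLK"
col_j = "FFFYFFFFFF"
ratio = complementarity_ratio(col_i, col_j)
print(ratio, ratio > THRESHOLD)
```

In this toy case the ratio is 0.9, so the pair would not be counted as significantly complementary when neighbors are ignored.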

Fig. 1.
Principle of the complementarity ratio analysis as calculated in the SCOTCH score. The interacting surfaces of partners A and B and a schematic representation of their multiple sequence alignment (circles represent positions) are colored with respect ...
Fig. 2.
Complementarity at native intramolecular interfaces. (A) Proportion of pairs at an interface exhibiting a given complementarity ratio. The complementarity ratio of a pair of contacting residues is the proportion of sequences in a multiple-sequence alignment ...

A limitation of the preceding analysis is that the disruption of the physicochemical complementarity between two residues may be rescued by the mutation of neighboring residues. Complementarity may not be maintained in a strict pairwise manner. To account for that scenario, we further considered the k structural neighbors of each residue in the procedure (as in Fig. 1 for the fish sequence). By using the structural neighbors of the contacting pairs (with k = 2), a drastic shift of the distribution can be noticed (Fig. 2A, black bars). As many as 50% of the pairs (i,j) found at the interface of domain–domain interactions are detected as significantly complementary. Varying the number of structural neighbors from one to three, the proportion of significantly complementary pairs reaches a plateau with two neighbors (Fig. S2A, gray bars). The optimal number of structural neighbors was thus fixed to two in the rest of the study. Outside the interface, a random selection of noninteracting pairs exhibited a much lower percentage of significantly complementary pairs (Fig. S2A, black line), suggesting an evolutionary complementarity specific to real interacting pairs.

To test the specificity of the complementarity ratio analysis to predict protein-binding sites, we generated a set of decoys for each complex of the intramolecular database. Ten thousand docking solutions for each of the 86 pairs of interacting domains were generated by using the FTDOCK rigid body docking program and the bound structures of the domains as input (35). This program uses shape complementarity rules and a fast Fourier transform algorithm to cover the protein surface as thoroughly as possible with physically realistic solutions. A stringent distance criterion was used to define a solution as near-native by setting the backbone RMSD threshold to 3 Å from the structure of the native complex. On average, 7.4 near-native solutions were generated for each set of 10,000 decoys.

For each of the 860,000 solutions, the proportion of pairs (i,j) detected as significantly complementary was calculated. A sharp discrimination was obtained between near-native models and false alternative docking models (Fig. 2B). Without considering the structural neighbors (k = 0) the discrimination is significantly lower (Fig. S2B).

Discriminative Power of the Complementarity Trace (SCOTCH) for Intermolecular Interactions.

From the above analysis, a predictive method was developed combining the complementarity ratio with two other parameters accounting for the size of the interface and the existence of highly variable positions inside (see Methods for details). Two-thirds of the intramolecular database was randomly selected and used to adjust the weights of these three factors by a logistic-regression procedure. The logistic-regression model estimates the probability that a given docking solution belongs to the class of near-native or false complexes. In the following, the estimate of the class membership probability was denoted as the SCOTCH score. The remaining third of the intramolecular database gathering 28 complexes was used in the validation dataset. To assess the power of the SCOTCH score, the validation dataset also integrated 55 permanent and 77 transient intermolecular complexes. For each case, 10,000 models were generated with FTDOCK by using the bound structures as input (see Methods for details). FTDOCK did not produce any near-native solution for 3 of the 132 intermolecular complexes (these 3 cases were discarded from our analysis for the rest of the study).

For each of these cases, the docking models were scored by using the SCOTCH score as well as scores based on conservation or coordinated mutation analyses. The conservation analysis was performed by using the rate4site algorithm, which was shown to provide highly sensitive results and which infers the rate of evolution at each site of a multiple sequence alignment by using a probabilistic-based evolutionary model (36). Two different methods were used to account for coordinated mutations: (i) detection of correlated mutations by using substitution matrices as described in refs. 27 and 31 and (ii) detection of compensatory mutations, which quantifies the compensatory events between pairs of residues considering the physicochemical properties as described in refs. 28 and 37.

The quality of the predictions was evaluated for every method by counting the number of cases for which a near-native complex (RMSD from native <3 Å) has been selected within their best 10, 100, 1,000, and 10,000 solutions. Detection of a near-native solution in these sets of solutions can be considered as very good, good, acceptable, and bad, respectively. After this evaluation procedure, the SCOTCH method was found to outperform all of the others for intra- and intermolecular interactions (Fig. 3). The discriminative power is particularly high for intermolecular complexes with 93 of 129 complexes recognized in the Top 10 category (72%), against 8 of 129 complexes, at best, with the other approaches. Intermolecular permanent interfaces are better predicted than transient ones, with 80% against 67% of the solutions in the Top 10 category. Intramolecular predictions are less accurate, with only 46% of the solutions predicted in the Top 10 category, still significantly above alternative approaches. For coordinated mutation analyses, a sharp decrease in accuracy is found going from intra-, permanent inter-, to transient intermolecular complexes, consistent with previously published work (23). Conservation analysis does not follow this trend and performs, on average, better with transient intermolecular complexes (30% in the Top 10 + Top 100 categories) than with the others (≈20%). The detailed performances of the different approaches can also be appreciated from the global receiver operating characteristic (ROC) curves (see Fig. S3), which report the ability of each score to recognize true biological interfaces when all of the 1.29 × 10⁶ intermolecular complexes are mixed together. The ability of SCOTCH to filter out false complexes (specificity) and to retrieve all of the near-native complexes (sensitivity) has been calculated for each case of the database, and the average values are presented in Table 1.
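The evaluation procedure described above can be sketched in a few lines of Python. The sketch ranks the decoys of one complex by score, finds the rank of the first near-native decoy (RMSD <3 Å), and assigns it to a Top N category. The function names and the toy scores are hypothetical, and treating the categories as mutually exclusive bins (each complex placed in the smallest cutoff it satisfies) is an assumption of this example.

```python
from typing import Optional, Sequence

# Rank cutoffs used in the paper for the Top 10 / 100 / 1,000 / 10,000 categories.
CUTOFFS = (10, 100, 1000, 10000)


def first_near_native_rank(scores: Sequence[float],
                           rmsds: Sequence[float],
                           rmsd_cutoff: float = 3.0) -> Optional[int]:
    """Return the 1-based rank of the best-scored near-native decoy, or None.

    Decoys are sorted by decreasing score; a decoy is near-native when its
    backbone RMSD from the native complex is below rmsd_cutoff (in Å).
    """
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if rmsds[idx] < rmsd_cutoff:
            return rank
    return None


def category(rank: Optional[int]) -> str:
    """Map a rank to the smallest Top N category that contains it."""
    if rank is None:
        return "none"
    for cutoff in CUTOFFS:
        if rank <= cutoff:
            return f"Top {cutoff}"
    return "none"


# Toy example (hypothetical values): three decoys, only the second is near-native.
toy_scores = [0.9, 0.2, 0.8]
toy_rmsds = [12.0, 2.1, 25.0]
toy_rank = first_near_native_rank(toy_scores, toy_rmsds)
print(toy_rank, category(toy_rank))
```

Counting, over all complexes, how many fall into each category gives the kind of histogram summarized in Fig. 3.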

Fig. 3.
Discriminative power of the SCOTCH method. The number of complex cases for which a near-native solution is selected among the top10, top100, top1,000, and top10,000 solutions is shown. The complex database consists of 28 intramolecular domain interactions ...
Table 1.
Ability of SCOTCH, RPScore, and SCOTCH optimized with RPScore scores to filter out false complexes (specificity) and to retrieve all the near-native complexes (sensitivity)

SCOTCH Highlights the Versatile Adaptation of Complex Interfaces.

The specificity of the predictions obtained so far suggests that SCOTCH captures important constraints that apply to biological interfaces. The analysis of the complementarity ratio can thus be seen as a powerful tool to further probe the evolutionary events likely to occur at an interface. The native structures of the complexes correctly scored by using SCOTCH were used for that analysis (93 complexes scored in the Top 10 category in the intermolecular dataset, see Table S3). First, we analyzed to what extent structural neighbors are required to maintain complementarity between interacting pairs. On average, ≈40% of the positions of an interface were found involved in at least one significantly complementary pair (Fig. S1B). Among these, a majority (62%) is detected as significantly complementary only thanks to the structural neighbors, highlighting their importance to capture the evolution of interfaces (Fig. S1C).

We wondered further whether it was possible that a position in an interface changed its physicochemical nature drastically over evolution while maintaining a high complementarity. To do so, we tracked the existence of unambiguous complementarity switches between hydrophobic–hydrophobic and charged–charged interacting clusters across evolution. To ensure significance, we imposed the restriction that at least 10% of the sequences in the multiple-sequence alignment exhibit the switch event. Interestingly, such switches were observed at least once in ≈40% of the 93 analyzed structures (Table S3). These results illustrate the versatility of the elementary interactions that build up an interface and underscore the complexity of the underlying evolutionary history.
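One possible reading of the switch criterion is sketched below in Python, under clearly stated assumptions. A pair of aligned columns is flagged when both the hydrophobic–hydrophobic and the oppositely charged states occur, and the rarer of the two states is present in at least 10% of the sequences. The original analysis works on clusters that include structural neighbors; this sketch is restricted to a single pair, and the function names and toy columns are hypothetical.

```python
from typing import Optional

# Residue classes as listed in the Methods section of the paper.
HYDROPHOBIC = set("GAVLIMCFPWY")
POSITIVE = set("KRH")
NEGATIVE = set("DE")


def pair_type(a: str, b: str) -> Optional[str]:
    """Classify a residue pair as 'hydrophobic', 'charged', or None."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return "hydrophobic"
    if (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE):
        return "charged"
    return None


def has_switch(column_i: str, column_j: str, min_fraction: float = 0.10) -> bool:
    """True if both complementarity states occur and the rarer one reaches min_fraction.

    This interpretation of the 10% rule is an assumption of the sketch.
    """
    types = [pair_type(a, b) for a, b in zip(column_i, column_j)]
    if not types:
        return False
    n = len(types)
    hydrophobic_fraction = types.count("hydrophobic") / n
    charged_fraction = types.count("charged") / n
    return min(hydrophobic_fraction, charged_fraction) >= min_fraction


# Toy example (hypothetical data): 8 hydrophobic pairs and 2 salt bridges out of 10.
print(has_switch("LLLLLLLLKK", "FFFFFFFFDE"))
# Only 1 charged pair out of 20 sequences: below the 10% requirement.
print(has_switch("L" * 19 + "K", "F" * 19 + "D"))
```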

Optimization of SCOTCH by Using a Statistical Pairwise Potential.

In the field of predictive docking, most methods take into account the physicochemical complementarity of interfaces to discriminate near-native structures. Among them, the widely used residue level pair potential score (RPScore) was derived from a statistical analysis of pairwise interactions at complex interfaces and was reported to efficiently select near-native structures from a set of decoys (38). All of the docking solutions generated from the intramolecular and intermolecular databases were scored with the RPScore. The histogram (Fig. 4) reporting the ranks of the first near-native solution of 10,000 decoy structures, shows that the RPScore performs significantly better than the conservation or the coordinated mutations analyses reported in Fig. 3. To further assess the precision of the methods, a stringent Top 1 category was added in Fig. 4, reporting the number of cases for which the model with the highest score is a near-native structure. As regards the intermolecular dataset, the SCOTCH score ranked 57% of the 129 complexes in the Top 1, whereas the RPScore reached the same precision in only 26% of the cases. Of interest, the combination of the SCOTCH and the RPScore scores improved significantly the recognition of the near-native solutions, raising these percentages up to 66% (black bar, Fig. 4) (the RPScore was added as an additional factor to the logistic regression procedure, see SI Text). The high predictive power of this hybrid score emphasizes that the evolutionary signals detected by SCOTCH can be further amplified by a statistical representation of the complementarity (Table 1).

Fig. 4.
Optimization of SCOTCH with the statistically based RPScore (38). The number of complex cases for which a near-native solution is selected among the top1, top10, top100, top1,000, and top10,000 solutions is shown. The complex database consists of 28 intramolecular ...

Discussion


Evolutionary properties of protein-complex interfaces have been intensively studied to elucidate the key factors governing macromolecular assemblies. To identify the constraints exerted at interfaces, we developed an approach, SCOTCH, that exploits the sequence properties of two interacting partners throughout evolution. The SCOTCH assumption relies on the sensitive detection of any mutation that would disrupt the physicochemical complementarity to discriminate unlikely interfaces from possible interacting clusters of residues. To specifically address this feature, we did not assign to every possible contacting pair a specific interaction score (as in RPScore-like strategies) but rather considered them all equivalent and counted the number of pairs whose complementarity remained highly conserved throughout evolution within the framework of the structural neighbors (corresponding to the 95% threshold of the complementarity ratio).

Until now, coevolutionary signals at intermolecular interfaces were thought to be too weak to accurately assist in large-scale docking predictions (23, 32). Although coordinated mutations analyses may detect some coevolutionary signals for permanent interactions, we confirmed the absence of almost any signal for transient interactions (23). Thanks to the dramatic increase in sensitivity provided by the SCOTCH approach, our study unambiguously attests to the existence of such coevolutionary signals in both permanent and transient complexes. Similarly to ref. 23, SCOTCH coevolution signals are higher for permanent than for transient interfaces. With regard to coordinated mutations analyses, their performance decreases dramatically as the mutation rate increases, and their use for predicting complex structures may require further development to cope with these limitations.

Although SCOTCH is an evolution-based approach, it does not require many homologs to extract mutual information from the multiple-sequence alignments. In half of our intermolecular cases, only 10–20 homologs were enough to reach high predictive accuracy (Table S3). Certain failures seem to be related either to small interfaces [badly predicted intramolecular interfaces cover 1,050 Å², on average, significantly lower than the “standard size” of 1,600 (±400) Å² as found in ref. 6] or to flawed multiple-sequence alignments that diverged during the automatic PSI-BLAST procedure. A typical example is the NAS6–RPT3 complex in which NAS6 belongs to the widespread superfamily of ankyrin repeats (PDB:2DZN). A careful rebuild of its alignments by selecting the proper orthologs boosts the SCOTCH score of the near-native model from the Top 1,000 to the Top 10 category. This example validates our initial use of a learning dataset that only comprised intramolecular interactions devoid of any orthology issues. Given the simplicity of the complementarity rules used here, there is probably room for improving the description of the complementarity classes. For instance, the charged-residues complementarity class is restricted to short-range interactions that are only a fraction of the global electrostatic contribution to the binding selectivity and affinity (7, 39, 40). More sophisticated treatment of the electrostatic contribution in the light of the evolutionary principles used in SCOTCH may therefore improve further the performance of the method (41).

From the specificity obtained on the intermolecular dataset, SCOTCH is expected to be useful in predicting the association mode of two partners when only sparse experimental data are available (such as a point mutation disrupting the complex or an NMR-based chemical shift mapping). In terms of protein engineering, back-tracking the mutations that generated the switches between two distinct complementarity classes (see Results) may provide guidelines for the rational alteration of interfaces so as to disrupt, recover, or strengthen them. As an absolute score, consistent with Fig. S4, SCOTCH optimized with RPScore may also be considered as an indicator of true protein–protein interaction. For instance, among the 2 × 10⁶ decoys generated, scores >11 or 7 for permanent or transient complexes, respectively, are very good indicators that a model is a near-native structure. Last, because SCOTCH captures fundamental constraints in the evolution of protein interfaces, it should contribute to a better understanding of the events that relate to the general evolution of macromolecular assemblies.

Methods



A set of 319 nonredundant intramolecular interactions was extracted from the PSIMAP database (34). A second set of 469 intermolecular protein complexes was extracted from previously published studies (23, 32, 42, 43) and from our own update of the PDB between 2004 and 2007. All the complexes exhibiting a buried surface area >700 Å² were further analyzed. For every case in the intra- and intermolecular datasets, homologous sequences were retrieved from the SPTREMBL database by using a PSI-BLAST search (44) (SI Text). To be selected for further analysis, a minimal number of 25 and 10 homologous sequences was required for intra- and intermolecular interactions, respectively (23, 45). At the end, the dataset gathered 86 intramolecular interactions (Tables S1 and S2) and 132 intermolecular interactions (Table S3). The permanent or transient nature of the interactions was assigned according to the classifications and rules published in ref. 23.

Decoys Dataset.

Intra- and intermolecular decoy datasets were generated by using the FTDOCK program (35). In each case, 10,000 docking models were generated and classified as either near-native or false solutions if the backbone RMSD between the model and the native complex was below or above 3 Å, respectively. For the intramolecular dataset and the bound states of the intermolecular database, 860,000 and 1,320,000 solutions were produced, respectively.

Scoring Decoys by Using Conservation Analysis.

The conservation scores for each amino acid position in the multiple-sequence alignments (MSA) were computed by using the rate4site program (36). The Bayesian method was applied for the calculation of the conservation scores by using the Jones–Taylor–Thornton amino acid-substitution model (46). The conservation scores computed by rate4site are a relative measure of evolutionary conservation at each position in the MSA: The lowest score represents the most conserved position. Each score S was rescaled between 0 and 9 by using

equation image

where Rscore is the rescaled score, Lscore is the lowest score, and Δscore is the amplitude of all the scores. The rescaled scores were then partitioned into a scale of nine bins. Bins 1–3 correspond to the most variable sites, whereas bins 7–9 correspond to the most conserved positions. To assess the performance of the conservation analysis, each model of the decoys dataset was scored by counting the percentage of conserved positions (bins 7–9) in the interface of the complex.
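The rescaling and binning step can be illustrated with the hedged Python sketch below. Because the rescaling equation itself is not reproduced in this text, the sketch assumes a plain linear rescaling of each score onto the 0 to 9 range using the lowest score and the amplitude, followed by an inversion so that the lowest rate4site score (most conserved) lands in bin 9. The inversion, the bin boundaries, the function names, and the toy scores are all assumptions of this example, not the published formula.

```python
import math
from typing import List, Sequence


def conservation_bins(scores: Sequence[float]) -> List[int]:
    """Map rate4site-like scores to bins 1..9, with bin 9 the most conserved.

    Assumed procedure: linear rescaling to [0, 9] from the lowest score and
    the amplitude, then inversion so that low scores give high bins.
    """
    lowest = min(scores)
    amplitude = max(scores) - lowest
    if amplitude == 0:
        raise ValueError("all scores are identical; cannot rescale")
    bins = []
    for s in scores:
        rescaled = 9.0 * (s - lowest) / amplitude   # 0 = lowest score, 9 = highest
        conserved_scale = 9.0 - rescaled            # assumed flip: 9 = most conserved
        bins.append(min(9, int(math.floor(conserved_scale)) + 1))
    return bins


def percent_conserved(bins: Sequence[int], interface_positions: Sequence[int]) -> float:
    """Percentage of interface positions that fall in bins 7 to 9."""
    if not interface_positions:
        raise ValueError("interface must contain at least one position")
    conserved = sum(1 for p in interface_positions if bins[p] >= 7)
    return 100.0 * conserved / len(interface_positions)


# Toy example (hypothetical scores for five alignment positions).
toy_scores = [-1.0, -0.5, 0.0, 1.0, 2.0]
toy_bins = conservation_bins(toy_scores)
print(toy_bins)
print(percent_conserved(toy_bins, [0, 1, 3, 4]))
```

The decoy score used in the conservation analysis is then the percentage returned by the second function, evaluated on the positions forming the interface of each docking model.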

Scoring Complexes by Using Coevolution Analyses.

Correlated mutations analysis.

Correlated mutations were calculated according to the Göbel algorithm (27) and the PAM70 substitution matrix. Positions with >10% gaps or that were completely conserved were discarded from the correlated mutation analysis. The pairs of positions were sorted according to their correlation value, and the top M residues were defined as predicted contacts, with M proportional to the protein size. M was set to half of the sequence length L.

Compensatory mutations analysis.

Compensatory mutations were analyzed by assigning a scalar metric to each type of amino acid and calculating correlation coefficients of these quantities between different positions of the interacting proteins as introduced in different studies (28, 37). By considering a physicochemical amino acid property f, compensatory mutations between two positions i and j were calculated by using the formula

equation image

where sij is the covariance between the position i and j, and fmi is the physicochemical property value of the sequence m at position i. Two physicochemical scales were considered, one based on the amino acid isoelectric point (47) to account for compensations involving an inversion of charges and one based on a hydropathy scale (48) to account for hydrophobic/polar compensations. Mutations were considered as compensatory when a pair of positions in contact i and j exhibited a value of rij < −0.4 with isoelectric point scale or rij > 0.4 with the hydropathy scale (0.4 is a standard cut-off defined in ref. 27). Positions with >50% gaps, highly variable positions (bins 1–3 from the conservation analysis), or strictly conserved positions were discarded from the compensatory mutation analysis. The robustness of the procedure was verified by sequentially removing the most distant sequences and rerunning the analysis. In a final step, the pairs were ranked according to their correlation, and the n best pairs involving accessible positions [solvent-accessible threshold 25%, as calculated by WHATIF (49)] were extracted and predicted as real contacts (a value of n = 15 was found to give the best results). No compensatory mutations could be predicted by following that protocol in some intramolecular (SCOP ID codes: d1bgla1_d1bgla5, d1fgs_1_d1fgs_2, d1e5ta1_d1e5ta2, d1e1ca1_d1e1ca2) and intermolecular cases (PDB ID codes: 1FQJ, 1TNR, 1TOC, 1ICF, 1FQK, 1FLE, 1E0O, 1BDJ, 2OMZ, 1I1Q, 1H32, 2FDB, 1JNR, 1JB0_C:D, 1GXD, 1DJ7, 1STF, 1DKF, 1H2V).
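The correlation step can be sketched in Python as follows, assuming that the coefficient is the usual Pearson correlation, that is, the covariance between the two columns of property values divided by the product of their standard deviations. The property scale used in the toy example is a hypothetical stand-in (a signed charge), not the isoelectric point or hydropathy scales of refs. 47 and 48, and the function names are illustrative.

```python
import math
from typing import Dict, Optional

# Cut-off on the correlation coefficient quoted in the text.
CUTOFF = 0.4


def property_correlation(column_i: str, column_j: str,
                         scale: Dict[str, float]) -> Optional[float]:
    """Pearson correlation of a physicochemical property between two columns.

    Returns None when either column has zero variance (strictly conserved
    property), since the coefficient is undefined in that case.
    """
    x = [scale[a] for a in column_i]
    y = [scale[b] for b in column_j]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return None
    return s_xy / math.sqrt(s_xx * s_yy)


def is_compensatory(r: Optional[float], scale_kind: str) -> bool:
    """Apply the sign convention from the text: anticorrelation for the charge-like
    scale (r < -0.4), positive correlation for the hydropathy scale (r > 0.4)."""
    if r is None:
        return False
    if scale_kind == "isoelectric":
        return r < -CUTOFF
    if scale_kind == "hydropathy":
        return r > CUTOFF
    raise ValueError("scale_kind must be 'isoelectric' or 'hydropathy'")


# Hypothetical toy scale standing in for the isoelectric point scale.
TOY_CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}

# Toy columns (hypothetical data): every charge inversion at i is mirrored at j.
r = property_correlation("KKDDKD", "DDKKDK", TOY_CHARGE)
print(r, is_compensatory(r, "isoelectric"))
```

The normalization of the covariance (dividing by N or by N − 1) cancels in the ratio, so it does not affect the coefficient.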

Scoring the models.

For correlated and compensatory mutations analyses, decoys were ranked with respect to the Xd formula as proposed in ref. 31. A positive Xd indicates cases where the population of correlated pairs is shifted to smaller distances than the population of all pairs.

Scoring Complexes by Using Complementarity Analysis.

Complementarity analysis.

The complementarity analysis was performed for every pair of interacting positions i and j at a complex interface based on the amino acid types (distance threshold for contact at 4.5 Å between any atom of i and j). The 20 amino acids were clustered into four classes with respect to their major physicochemical characteristics (1) hydrophobic or neutral residues (G, A, V, L, I, M, C, F, P, W, Y), (2) polar residues (S, T, N, Q), (3) positively charged residues (K, R, H), (4) negatively charged residues (D, E). Considering N sequences aligned and two positions i and j, these positions were considered as complementary if they involved either two amino acids of the class 1, or two amino acids of the class 2, or one amino acid of class 3 together with one amino acid of class 4. This complementarity was calculated for the N sequences either restricting to positions i and j (no neighbors) or integrating the effect of neighboring positions. In the latter, the k nearest structural neighboring positions of the site i within the first protein i1, i2,…, ik and the k nearest structural neighboring positions of the site j within the second protein j1, j2,…, jk were also taken into account in the complementarity evaluation of the positions i–j, such as i–jk, or ik–j but not as ik–jk, which are seen as alternative pairs. An additional constraint applied to neighbors so that distances d(ik–j) and d(jk–i) were <4.5 Å and that the solvent-accessible surfaces of i1, i2,…, ik and of j1, j2,…, jk were modified upon the binding of the partner. If the positions i–j were found complementary in >95% of the sequences of the MSA (the percentage of sequences exhibiting complementarity is called the complementarity ratio), these sites were considered as “significantly complementary.”
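The neighbor-aware rule can be illustrated with the Python sketch below. For each pair of matched sequences, the pair i–j counts as complementary if i–j itself is complementary, or if a structural neighbor of i is complementary with j, or if i is complementary with a structural neighbor of j; neighbor–neighbor pairs are not used. The sketch assumes that the neighbor lists have already been filtered by the distance and solvent-accessibility constraints described above, and the function names and toy alignments are hypothetical.

```python
from typing import Sequence

# Residue classes as listed in the text.
HYDROPHOBIC = set("GAVLIMCFPWY")
POLAR = set("STNQ")
POSITIVE = set("KRH")
NEGATIVE = set("DE")


def is_complementary(a: str, b: str) -> bool:
    """True for class 1 with class 1, class 2 with class 2, or class 3 with class 4."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return True
    if a in POLAR and b in POLAR:
        return True
    return (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE)


def complementary_in_sequence(seq_a: str, seq_b: str, i: int, j: int,
                              neighbors_i: Sequence[int],
                              neighbors_j: Sequence[int]) -> bool:
    """Check i-j, then neighbor(i)-j and i-neighbor(j), but never neighbor-neighbor."""
    if is_complementary(seq_a[i], seq_b[j]):
        return True
    if any(is_complementary(seq_a[n], seq_b[j]) for n in neighbors_i):
        return True
    return any(is_complementary(seq_a[i], seq_b[n]) for n in neighbors_j)


def complementarity_ratio(msa_a: Sequence[str], msa_b: Sequence[str], i: int, j: int,
                          neighbors_i: Sequence[int] = (),
                          neighbors_j: Sequence[int] = ()) -> float:
    """Fraction of matched sequence pairs in which positions i and j are complementary."""
    if len(msa_a) != len(msa_b) or not msa_a:
        raise ValueError("alignments must be non-empty and hold matched sequences")
    hits = sum(complementary_in_sequence(a, b, i, j, neighbors_i, neighbors_j)
               for a, b in zip(msa_a, msa_b))
    return hits / len(msa_a)


# Toy example (hypothetical data): in the third sequence the hydrophobic residue
# has moved from position 0 to its structural neighbor at position 1.
msa_a = ["LK", "LK", "KL", "LK"]
msa_b = ["F", "F", "F", "F"]
print(complementarity_ratio(msa_a, msa_b, 0, 0))                   # no neighbors
print(complementarity_ratio(msa_a, msa_b, 0, 0, neighbors_i=[1]))  # with one neighbor
```

In the toy alignment the ratio rises from 0.75 to 1.0 once the neighbor is considered, which is the kind of rescue the neighbor-aware rule is meant to capture.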

Scoring the models.

The SCOTCH score is based on the proportion of significantly complementary pairs and was defined as follows. The database of 86 intramolecular complexes was randomly split into two subsets: 58 cases (2/3 of the database) as a training dataset, and 28 complexes (1/3) as part of a blind test dataset. Based on the training dataset, a logistic regression was used to fit the weights of three features extracted from each interface: (i) the proportion of significantly complementary pairs of residues (with k = 2), (ii) the number of interacting pairs, and (iii) the percentage of highly variable positions (bins 1–3 as calculated in the conservation analysis). The resulting equation defines the SCOTCH score, which estimates the probability that the decoy belongs to the class of the near-native solutions (SI Text and Table S4). Decoys were ranked with respect to this score and were predicted as near-native for positive scores (corresponding to a probability >0.5 of being near-native from the logistic regression).
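How a logistic-regression score of this kind is evaluated can be sketched in Python as follows. The fitted weights are given in Table S4, which is not reproduced here, so the weights below are hypothetical placeholders chosen only to make the arithmetic easy to follow. The sketch also assumes that the score used for ranking is the linear predictor (log-odds), which is consistent with a positive score corresponding to a probability above 0.5.

```python
import math
from typing import Dict

# Hypothetical weights for illustration only; the fitted values are in Table S4.
TOY_WEIGHTS: Dict[str, float] = {
    "intercept": -6.0,
    "complementary": 10.0,   # proportion of significantly complementary pairs (k = 2)
    "pairs": 0.02,           # number of interacting pairs
    "variable": -0.05,       # percentage of highly variable positions (bins 1-3)
}


def linear_score(prop_complementary: float, n_pairs: int, pct_variable: float,
                 weights: Dict[str, float]) -> float:
    """Linear predictor (log-odds) of a decoy being near-native."""
    return (weights["intercept"]
            + weights["complementary"] * prop_complementary
            + weights["pairs"] * n_pairs
            + weights["variable"] * pct_variable)


def probability(score: float) -> float:
    """Logistic function: converts the log-odds into a class probability."""
    return 1.0 / (1.0 + math.exp(-score))


def predicted_near_native(score: float) -> bool:
    """A positive score is equivalent to a probability above 0.5."""
    return score > 0.0


# Toy decoy (hypothetical features): 50% significantly complementary pairs,
# 50 interacting pairs, 10% highly variable positions.
toy_score = linear_score(0.5, 50, 10.0, TOY_WEIGHTS)
print(toy_score, probability(toy_score), predicted_near_native(toy_score))
```

Adding a further feature, such as a statistical pair potential, only requires one more weighted term in the linear predictor.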

Scoring Complexes by Using Statistically Based Pair Potential.

The RPScore (38) was used with default parameters to score every decoy of the intra- and intermolecular datasets.



We thank E. Becker, J.-B. Charbonnier, B. Gilquin, P. Legrain, C. Mann, M.-C. Marsolier-Kergoat, F. Ochsenbein, A. Peyroche, and the three anonymous reviewers for their constructive comments on the manuscript. This work was supported in part by the Action Concertée Incitative Informatique, Mathématique, Physique en Biologie Moléculaire (ACI IMPBIO) 2004. H.M. was supported by a DGA fellowship.


The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0707032105/DCSupplemental.


1. Bork P, et al. Protein interaction networks from yeast to human. Curr Opin Struct Biol. 2004;14:292–299.
2. Krogan NJ, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440:637–643.
3. Janin J, Chothia C. The structure of protein–protein recognition sites. J Biol Chem. 1990;265:16027–16030.
4. Jones S, Thornton JM. Principles of protein–protein interactions. Proc Natl Acad Sci USA. 1996;93:13–20.
5. Jones S, Thornton JM. Analysis of protein–protein interaction sites using surface patches. J Mol Biol. 1997;272:121–132.
6. Lo Conte L, Chothia C, Janin J. The atomic structure of protein–protein recognition sites. J Mol Biol. 1999;285:2177–2198.
7. Shaul Y, Schreiber G. Exploring the charge space of protein–protein association: a proteomic study. Proteins. 2005;60:341–352.
8. Jones S, Thornton JM. Prediction of protein–protein interaction sites using patch analysis. J Mol Biol. 1997;272:133–143.
9. Zhou HX, Shan Y. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins. 2001;44:336–343.
10. Fariselli P, Pazos F, Valencia A, Casadio R. Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur J Biochem. 2002;269:1356–1361.
11. Ofran Y, Rost B. Predicted protein–protein interaction sites from local sequence information. FEBS Lett. 2003;544:236–239.
12. Keil M, Exner TE, Brickmann J. Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network. J Comput Chem. 2004;25:779–789.
13. Neuvirth H, Raz R, Schreiber G. ProMate: A structure based prediction program to identify the location of protein–protein binding sites. J Mol Biol. 2004;338:181–199.
14. Bradford JR, Westhead DR. Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics. 2005;21:1487–1494.
15. Aloy P, Ceulemans H, Stark A, Russell RB. The relationship between sequence and interaction divergence in proteins. J Mol Biol. 2003;332:989–998.
16. Kim WK, Ison JC. Survey of the geometric association of domain–domain interfaces. Proteins. 2005;61:1075–1088.
17. Lukatsky DB, Shakhnovich BE, Mintseris J, Shakhnovich EI. Structural similarity enhances interaction propensity of proteins. J Mol Biol. 2007;365:1596–1606.
18. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996;257:342–358.
19. Landgraf R, Xenarios I, Eisenberg D. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol. 2001;307:1487–1502.
20. Pupko T, et al. Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18(Suppl 1):S71–S77.
21. Caffrey DR, et al. Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 2004;13:190–202.
22. Jimenez JL. Does structural and chemical divergence play a role in precluding undesirable protein interactions? Proteins. 2005;59:757–764.
23. Mintseris J, Weng Z. Structure, function, and evolution of transient and obligate protein–protein interactions. Proc Natl Acad Sci USA. 2005;102:10930–10935.
24. Duan Y, Reddy BV, Kaznessis YN. Physicochemical and residue conservation calculations to improve the ranking of protein–protein docking solutions. Protein Sci. 2005;14:316–328.
25. Chelliah V, Blundell TL, Fernandez-Recio J. Efficient restraints for protein–protein docking by comparison of observed amino acid substitution patterns with those predicted from local environment. J Mol Biol. 2006;357:1669–1682.
26. de Vries SJ, van Dijk AD, Bonvin AM. WHISCY: What information does surface conservation yield? Application to data-driven docking. Proteins. 2006;63:479–489.
27. Gobel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Proteins. 1994;18:309–317.
28. Neher E. How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci USA. 1994;91:98–102.
29. Choi SS, Li W, Lahn BT. Robust signals of coevolution of interacting residues in mammalian proteomes identified by phylogeny-aided structural analysis. Nat Genet. 2005;37:1367–1371.
30. Dutheil J, Pupko T, Jean-Marie A, Galtier N. A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol. 2005;22:1919–1928.
31. Pazos F, Helmer-Citterich M, Ausiello G, Valencia A. Correlated mutations contain information about protein–protein interaction. J Mol Biol. 1997;271:511–523.
32. Halperin I, Wolfson H, Nussinov R. Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families. Proteins. 2006;63:832–845.
33. Reichmann D, et al. The molecular architecture of protein–protein binding sites. Curr Opin Struct Biol. 2007;17:67–76.
34. Kim WK, Bolser DM, Park JH. Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (PSIMAP). Bioinformatics. 2004;20:1138–1150.
35. Gabb HA, Jackson RM, Sternberg MJ. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol. 1997;272:106–120.
36. Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: Empirical Bayesian methods are superior. Mol Biol Evol. 2004;21:1781–1791.
37. Afonnikov DA, Oshchepkov DY, Kolchanov NA. Detection of conserved physico-chemical characteristics of proteins by analyzing clusters of positions with co-ordinated substitutions. Bioinformatics. 2001;17:1035–1046.
38. Moont G, Gabb HA, Sternberg MJ. Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins. 1999;35:364–373.
39. Sheinerman FB, Honig B. On the role of electrostatic interactions in the design of protein–protein interfaces. J Mol Biol. 2002;318:161–177.
40. Dong F, Zhou HX. Electrostatic contribution to the binding stability of protein–protein complexes. Proteins. 2006;65:87–102.
41. Man-Kuang Cheng T, Blundell TL, Fernandez-Recio J. pyDock: Electrostatics and desolvation for effective scoring of rigid-body protein–protein docking. Proteins. 2007;68:503–515.
42. Chen R, Mintseris J, Janin J, Weng Z. A protein–protein docking benchmark. Proteins. 2003;52:88–91.
43. Mintseris J, Weng Z. Atomic contact vectors in protein–protein recognition. Proteins. 2003;53:629–639.
44. Altschul SF, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
45. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng. 2001;14:609–614.
46. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992;8:275–282.
47. White A, et al. Principles of Biochemistry. New York: McGraw–Hill; 1978.
48. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–132.
49. Vriend G. WHAT IF: A molecular modeling and drug design program. J Mol Graphics. 1990;8:29, 52–56.
