Deciphering Protein–Protein Interactions. Part II. Computational Methods to Predict Protein and Domain Interaction Partners

Fran Lewitter, Editor
Whitehead Institute, United States of America

* To whom correspondence should be addressed. E-mail: panch/at/ncbi.nlm.nih.gov

Copyright: © 2007 Shoemaker and Panchenko. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Recent advances in high-throughput experimental methods for the identification of protein interactions have resulted in a large amount of diverse data that are somewhat incomplete and contradictory. As valuable as they are, such experimental approaches studying protein interactomes have certain limitations that can be complemented by the computational methods for predicting protein interactions. In this review we describe different approaches to predict protein interaction partners as well as highlight recent achievements in the prediction of specific domains mediating protein–protein interactions. We discuss the applicability of computational methods to different types of prediction problems and point out limitations common to all of them.

Introduction

In our companion review published in the last issue [1], we outlined the experimental techniques for the identification and characterization of protein interactions. We showed that high-throughput experimental methods produce a large amount of data which needs to be analyzed and verified. Despite this, interactomes of many organisms are far from complete. The low interaction coverage along with the experimental biases toward certain protein types and cellular localizations reported by most experimental techniques call for the development of computational methods to predict whether two proteins interact.
These methods can be very useful for choosing potential targets for experimental screening or for validating experimental data (see [1]) and can provide information about interaction details (in the case of domain prediction methods) which might not be apparent from the experimental techniques. Many methods use combinations of experimental and computational techniques to different extents (for example, gene co-expression and synthetic lethality methods were covered among experimental approaches in our companion paper [1]) and do not predict physical interactions directly but rather infer the functional associations between potentially interacting proteins.

In this review, we report on several methods to predict protein or domain interaction partners. Some computational methods are based on the co-localization of potentially interacting genes in the same gene clusters or protein chains (gene cluster, gene neighborhood, and Rosetta stone methods), on co-evolution patterns in interacting proteins (sequence co-evolution methods), and on the co-expression of genes. Some methods find patterns of co-occurrences in interacting proteins, protein domains, and phenotypes (phylogenetic profiles and synthetic lethality methods), while others use the presence of sequence/structural motifs characteristic only of interacting proteins (classification methods, association methods). To analyze interaction specificity at the domain level, in this second paper of the review we describe methods that are aimed at identifying specific domains mediating interactions in an interacting protein pair.

Methods for Predicting Protein Interaction Partners

Table 1 lists different protein interaction methods, and Figure 1
Gene neighbor and gene cluster methods. Genes with closely related functions encoding potentially interacting proteins are often transcribed as a single unit, an operon, in bacteria and are co-regulated in eukaryotes. Different methods have been developed trying to predict operons based on intergenic distances [2–6] (Figure 1).

Phylogenetic profile methods. The phylogenetic profile (PP) method is based on the hypothesis that functionally linked and potentially interacting nonhomologous proteins co-evolve and have orthologs in the same subset of fully sequenced organisms [9,15–19]. Indeed, components of complexes and pathways should be present simultaneously in order to perform their functions. A phylogenetic profile is constructed for each protein, as a vector of N elements, where N is the number of genomes (Figure 1).

Some drawbacks of PP include its high computational cost, its dependence on high information profiles, and homology detection between distant organisms. For example, ubiquitous unlinked proteins present in all genomes (profiles with all “1”s) will be counted by PP as correlated. The same is true for proteins that are specific to a given genome (profiles with all, but one, “0”s). Shared phylogenetic relationships between two proteins can also produce false correlations between profiles. This issue has recently been addressed by incorporating the phylogenetic trees in the analysis of correlated gains and losses of pairs of proteins [23].

Rosetta Stone method. The Rosetta Stone approach infers protein interactions from protein sequences in different genomes [24–27]. It is based on the observation that some interacting proteins/domains have homologs in other genomes that are fused into one protein chain, a so-called Rosetta Stone protein (Figure 1).

Sequence-based co-evolution methods.
As was mentioned earlier, interacting proteins very often co-evolve so that changes in one protein leading to the loss of function or interaction should be compensated by the correlated changes in another protein. The orthologs of coevolving proteins also tend to interact, thereby making it possible to infer unknown interactions in other genomes [28]. It has been argued that co-evolution can be reflected in terms of the similarity between phylogenetic trees of two non-homologous interacting protein families (Figure 1).

The similarity between two phylogenetic trees is influenced by the speciation process, and therefore there is a certain “background” similarity between trees of any proteins, whether or not they interact. Different statistical techniques have been developed to account for “phylogenetic subtraction” [35]. Simplified versions of this approach were introduced recently to account for the background similarity in protein interaction prediction [36–38]. According to one of them [36], the “background” tree is constructed from the 16S rRNA sequences and is considered to be a canonical tree of life. The final distance matrices are obtained by subtracting the rescaled rRNA-based distances from the evolutionary distances obtained from the original phylogenetic trees. It has been shown that this method finds 50% of real interacting proteins at a 6.4% false positive rate compared with the 16.5% false positive rate obtained using methods which do not take into account evolutionary distances and the “background” canonical tree [29,30].

One example of how co-evolution studies could be used in confirming and predicting putative interaction partners is the case of DNase colicins and their immunity proteins [39].
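The background-subtraction idea described above can be illustrated with a small sketch. It assumes that each protein family is summarized by a symmetric matrix of pairwise evolutionary distances over the same ordered set of species, that tree similarity is measured as the Pearson correlation between the upper triangles of two such matrices, and that the background (for example, rRNA-based) distances are rescaled by a least-squares factor before being subtracted. The rescaling rule, the function names, and the toy matrices are assumptions made for illustration and are not taken from the cited methods.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def upper_triangle(matrix):
    """Flatten the strictly upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def subtract_background(dist, background):
    """Remove the rescaled background distances from a family's distances.

    ASSUMPTION: the background is rescaled by a least-squares factor;
    the text only says that the rRNA-based distances are "rescaled".
    """
    d, b = upper_triangle(dist), upper_triangle(background)
    scale = sum(x * y for x, y in zip(d, b)) / sum(y * y for y in b)
    return [x - scale * y for x, y in zip(d, b)]


def tree_similarity(dist_a, dist_b, background=None):
    """Correlate two distance matrices, optionally after background removal."""
    if background is None:
        a, b = upper_triangle(dist_a), upper_triangle(dist_b)
    else:
        a = subtract_background(dist_a, background)
        b = subtract_background(dist_b, background)
    return pearson(a, b)


# Toy data for four species (hypothetical values, not from any real family).
background = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
family_a = [[0, 1.5, 1.5, 3.5], [1.5, 0, 3.5, 5.5],
            [1.5, 3.5, 0, 5.5], [3.5, 5.5, 5.5, 0]]
family_b = [[0, 1.5, 2.5, 2.5], [1.5, 0, 3.5, 5.5],
            [2.5, 3.5, 0, 5.5], [2.5, 5.5, 5.5, 0]]

raw = tree_similarity(family_a, family_b)
corrected = tree_similarity(family_a, family_b, background)
print(f"raw similarity: {raw:.2f}, background-corrected: {corrected:.2f}")
```

In this toy case both families follow the background distances closely, so the uncorrected correlation is high, while the corrected value reflects only what remains once the shared speciation signal has been removed.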
Colicins consist of an N-terminal domain participating in translocation across the membrane of the target cell, the central domain which specifically binds to the extracellular surface receptor, and the C-terminal domain responsible for the toxic activity of colicin. Each DNase colicin has a specific immunity protein, which binds to the toxic domain and inhibits its cytotoxic activity. Co-evolution studies showed that there is a significant correlation between the two families of DNase colicins and their immunity proteins (with a correlation coefficient of 0.67), with weaker correlation between Im2, Im8, and Im9 immunity proteins and their corresponding binding partners. Experimental studies indicated that there is indeed a cross-reactivity between colicin E9 and Im8 and Im2 proteins [40].

Classification methods. Different classification methods have been successfully applied to the prediction of protein and domain interactions [41–54]. These methods use various data sources to train a classifier to distinguish positive examples of truly interacting protein/domain pairs from negative examples of non-interacting pairs. Kernel methods are particularly useful in this respect as they provide a vectorial representation of data in the feature space through the set of pairwise comparisons [54]. Each protein or protein pair can be encoded as a feature vector where features may represent a particular information source on protein interactions, domain compositions, or evidence coming from various experimental methods. As a result of a comparison of different classifiers, it has been shown that Random Forest Decision (RFD) consistently ranks as a top classifier, with Support Vector Machines being in second place [55].

Figure 1

Multiple sources of direct and indirect data on protein–protein interactions can be combined in a supervised learning framework or integrative scoring scheme to predict protein–protein and domain–domain interactions [47,53,56–60].
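One simple way to picture an integrative scoring scheme is a naive Bayes style sum of log-likelihood ratios, one per evidence source. The sketch below is an illustration under stated assumptions, not the scheme of any cited work: each protein pair is encoded as a dictionary of binary features, the evidence sources are treated as independent, and the feature names (`coexpressed`, `similar_profile`, `fused_homolog`) and training pairs are hypothetical.

```python
import math


def train_log_likelihood_ratios(examples, pseudocount=1.0):
    """Estimate one log-likelihood ratio per (feature, value) combination.

    `examples` is a list of (features, interacts) tuples, where `features`
    maps a feature name to True/False. The pseudocount keeps the ratios
    finite when a value never occurs in one of the two classes.
    """
    positives = [feats for feats, interacts in examples if interacts]
    negatives = [feats for feats, interacts in examples if not interacts]
    names = sorted({name for feats, _ in examples for name in feats})
    llr = {}
    for name in names:
        for value in (True, False):
            p = (sum(f.get(name, False) == value for f in positives)
                 + pseudocount) / (len(positives) + 2 * pseudocount)
            q = (sum(f.get(name, False) == value for f in negatives)
                 + pseudocount) / (len(negatives) + 2 * pseudocount)
            llr[(name, value)] = math.log(p / q)
    return llr


def score(features, llr):
    """Sum the log-likelihood ratios of all observed evidence for one pair.

    ASSUMPTION: evidence sources are independent given the class.
    """
    return sum(llr.get((name, bool(value)), 0.0)
               for name, value in features.items())


def pair(coexpressed, similar_profile, fused_homolog):
    """Build the feature dictionary for one (hypothetical) protein pair."""
    return {"coexpressed": coexpressed,
            "similar_profile": similar_profile,
            "fused_homolog": fused_homolog}


# Hypothetical training set: four interacting and four non-interacting pairs.
training = [
    (pair(True, True, True), True),
    (pair(True, True, False), True),
    (pair(True, False, False), True),
    (pair(False, True, False), True),
    (pair(False, False, False), False),
    (pair(False, False, False), False),
    (pair(False, False, False), False),
    (pair(True, False, False), False),
]

llr = train_log_likelihood_ratios(training)
s_none = score(pair(False, False, False), llr)
s_one = score(pair(True, False, False), llr)
s_two = score(pair(True, True, False), llr)
s_all = score(pair(True, True, True), llr)
print(s_none, s_one, s_two, s_all)
```

With this toy training set the score rises as more independent sources support a pair, which is the behaviour an integrative scheme is meant to capture; the score itself can also serve as a rough confidence value for a predicted interaction.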
It has been shown that the prediction accuracy is improved when several sources of data are used, and, in addition, integrative approaches can provide means to justify the confidence of inferred interactions.

Predicting Domain Interactions from Protein Interactions

By far the greatest coverage of experimental data describing protein interaction networks comes from high-throughput experiments, which give us the identity of interacting protein pairs (see our previous review [1]). Unfortunately, these experiments reveal no structural details about the interaction interfaces and the formation of protein complexes. To deal with these limitations, several approaches have been developed to predict which domains in a protein pair interact given a set of experimental protein interactions; some of them focus on interactions involving specific mediating domains/peptides (SH2, SH3, PDZ domains) [61,62]. The following section gives an overview of domain prediction methods which are listed in Table 1. Note that some of the approaches already mentioned for protein interaction prediction, namely the sequence co-evolution, phylogenetic profiles, and classification methods, are also applicable to domain interaction prediction.

Most methods begin by annotating protein sequences with domains that can be defined by Pfam, SCOP, CDD, or other domain databases [63–65]. The methods are typically trained on high-throughput protein interaction data. Predicted domain interactions are evaluated using structural data or higher-quality interaction sets such as MIPS [66]. Moreover, accounting for domains in proteins and domain interaction networks can in turn help in predicting protein interactions [67–70].

Association methods. This group of methods looks for the characteristic sequence or structural motifs which distinguish interacting proteins from non-interacting ones [71–74].
For this purpose association methods can use different classifiers (see previous section), and some of them are tuned specifically to identify domains responsible for protein interactions. For example, as shown in Figure 2
Bayesian network models and maximum likelihood methods. The association method which uses correlated sequence signatures [71] considers each pair of interacting domains separately, ignoring other domains in a given pair of interacting proteins (Figure 2).

Domain pair exclusion analysis. The domain pair exclusion analysis method extends the previously described MLE method and can detect specific domain interactions (see Figure 2).

A high E-score value shows the high propensity of two domains to interact, while a low value indicates that competing domains from the same protein pair are more likely to be responsible for this interaction. Therefore, specific domain interactions can be found by screening for low θ values and high E-scores. Although this model does not account for false positives and negatives in the experimental data, it was shown that the E-score performs better than its constituent quantities, finding 2.9 times more true positives than random assignment; for comparison, θ values yield 1.4 times more true positives than random assignment [77].

p-Value method. The p-value method tests a null hypothesis that the presence of a particular domain pair in a protein pair has no effect on whether two proteins interact [78]. To test this hypothesis, a statistic is calculated for each domain pair which takes into account experimental error (fraction of false positives) and incompleteness of the dataset (fraction of false negatives). The reference distribution is simulated by shuffling domains in proteins so that the network of protein interactions remains fixed. Obtained p-values show the reliability of domain interactions given that two proteins interact, and the domain pair with the lowest p-value is most likely to interact. The p-value method performs reasonably well when there are nine or more domains in a protein pair.
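The shuffling procedure behind the p-value method can be sketched as a permutation test. This is a simplified illustration under stated assumptions, not the published method: the statistic here is just the number of interacting protein pairs that contain the domain pair, so it ignores the false positive and false negative rates that the actual statistic accounts for; domains are shuffled across proteins while each protein keeps its original number of domains; and the protein names, domain names, and interactions are hypothetical.

```python
import random


def supporting_interactions(domains, interactions, pair):
    """Count interacting protein pairs that contain the given domain pair.

    ASSUMPTION: this raw count stands in for the error-aware statistic
    described in the text.
    """
    first, second = pair
    count = 0
    for p, q in interactions:
        dp, dq = domains[p], domains[q]
        if (first in dp and second in dq) or (second in dp and first in dq):
            count += 1
    return count


def shuffle_domains(domains, rng):
    """Reassign domains at random, keeping each protein's domain count.

    The interaction network itself is not touched, so it stays fixed.
    """
    proteins = sorted(domains)
    pool = [d for p in proteins for d in domains[p]]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for p in proteins:
        size = len(domains[p])
        shuffled[p] = pool[start:start + size]
        start += size
    return shuffled


def domain_pair_p_value(domains, interactions, pair, n_shuffles=2000, seed=0):
    """Empirical p-value: how often a shuffle matches or beats the observed count."""
    rng = random.Random(seed)
    observed = supporting_interactions(domains, interactions, pair)
    hits = sum(
        supporting_interactions(shuffle_domains(domains, rng), interactions, pair)
        >= observed
        for _ in range(n_shuffles)
    )
    return (hits + 1) / (n_shuffles + 1)  # add-one correction avoids p = 0


# Hypothetical toy network: domains A and B co-occur in three interactions.
domains = {
    "P1": ["A", "X"], "P2": ["A", "Y"], "P3": ["A", "Z"],
    "P4": ["B", "W"], "P5": ["B", "V"], "P6": ["B", "U"],
    "P7": ["X", "Y"], "P8": ["W", "V"],
}
interactions = [("P1", "P4"), ("P2", "P5"), ("P3", "P6"), ("P7", "P8")]

p_ab = domain_pair_p_value(domains, interactions, ("A", "B"))
p_yu = domain_pair_p_value(domains, interactions, ("Y", "U"))
print(f"p(A,B) = {p_ab:.4f}, p(Y,U) = {p_yu:.4f}")
```

In this toy network the pair (A, B) is found in three of the four interactions and receives a small p-value, whereas (Y, U) is never observed in an interacting pair and receives a p-value of 1.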
However, interestingly enough, for the majority of test cases, random domain prediction outperforms all methods tested, pointing to the low accuracy of all prediction methods of domain interactions.

The methods of domain interaction prediction described in this section have varying degrees of success, but most of them share common limitations. First, domains are assumed to interact independently, although their interactions can depend on other domains in a protein pair. Second, incomplete domain assignments, due to insufficient coverage of domain databases and limited searching ability of domain profiles, can lead to false positive and negative interaction predictions. Finally, protein interaction data are not complete, whereas domain prediction methods are based on these data.

In this paper, we reviewed various computational methods to predict protein and domain interaction partners [79]. All of these methods use experimental data sources, some of them to a larger extent (gene co-expression, synthetic lethality) than others. As a result, they all suffer from the limitations of experimental approaches and incompleteness of observed data. Despite the fact that there is a certain circularity in testing the computational methods on experimental data, their prediction accuracy has been increasing, which makes them useful for the validation and analysis of diverse protein interactomes. The majority of presented prediction methods do not rely on protein structures and potentially can be applied on the genome-wide scale, while structural analysis can provide further details of protein–protein and domain–domain interfaces and give clues on their modeling.

Acknowledgments

The authors thank Elena Zotenko and Teresa Przytycka for helpful discussions and Robert Yates for graphic design of the figures.
Footnotes

Benjamin A. Shoemaker and Anna R. Panchenko are with the Computational Biology Branch of the National Center for Biotechnology Information in Bethesda, Maryland, United States of America.

Competing interests. The authors have declared that no competing interests exist.

Funding. This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health of the US Department of Health and Human Services.

Author contributions. BAS and ARP analyzed the data and wrote the paper.