Copyright: © 2007 Betel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Structure-Templated Predictions of Novel Protein Interactions from Sequence Information

1 Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto, Ontario, Canada
2 Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada
3 Department of Medical Genetics and Microbiology, University of Toronto, Toronto, Ontario, Canada

Luhua Lai, Editor, Peking University, China

* To whom correspondence should be addressed. E-mail: betel/at/cbio.mskcc.org

Received April 4, 2007; Accepted August 2, 2007.

Abstract

The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information.
Author Summary

Many functions performed within a living cell are mediated by specific interactions between proteins. Precise geometric and chemical matches between segments of the protein structures facilitate those interactions. Such binding surfaces are often evolutionarily conserved elements of protein structures known as conserved domains that recognize specific binding elements on the interacting proteins. Binding domains and their corresponding interacting profiles constitute basic interacting modules that are replicated in multiple protein pairs, where they mediate similar interactions. Although many conserved domains are identified, only a handful have known, well-characterized binding elements. This paper describes a computational method that aims to elucidate the binding specificity of many domains. The utility of the derived binding specificity is demonstrated by predicting new interactions between yeast proteins. The predictions are based solely on sequence information by identifying the conserved domains and their corresponding binding sequences. A number of the predicted interactions were confirmed experimentally, demonstrating the feasibility of this approach.

Introduction

The interaction between two proteins is a geometric and electrostatic match between two polypeptide surfaces that results in a stable set of bonds between amino acid side chains or backbone atoms. The interacting amino acids are often part of conserved sequence features such as domains or short linear motifs that constitute the interaction site between the two proteins. Despite the increased coverage and sensitivity of experimental techniques for detecting protein interactions [1–6] (reviewed in [7]), elucidating the precise interacting residues remains experimentally difficult. In most cases, all that is known about an interaction is the identity of the two interacting proteins, with little information about the underlying binding site.
However, detailed knowledge of interaction specificity is important for understanding reaction mechanism, interaction prediction, and drug development.

Interacting domains are autonomous structural elements that exhibit distinct binding specificity to a multitude of target polypeptides. Such domains act as independent elements that can be “plugged” into a new protein and thereby introduce new functionality to the emerging protein [8]. From an evolutionary perspective, such rearrangements and the multiplication of existing conserved domains are a likely mechanism by which organisms generate new proteins, pathways, and novel functionalities [9,10].

Several protein interaction prediction methods exploit the conservation of protein-binding interfaces by identifying domain pairs that consistently co-occur in interacting proteins or coevolve, which are then used to predict new interactions [11–16]. Structure-based prediction methods use known protein complexes to model interactions between proteins that are homologous to the complex components [17,18]. Other prediction methods use integrative approaches that incorporate interaction experiments with additional functional information such as correlated expression level, common functional annotation [19,20], and cross-species comparisons [21]. Alternative approaches attempt to identify correlated sequence motifs that represent generic interacting sequence elements that may or may not be components of conserved domains [22–25]. In a few limited cases, detailed experimental data are used to generate high-resolution definitions of domain binding profiles; however, such information is available only for a small number of domains [26,27].

Our primary objective is to predict interactions between proteins strictly from sequence information. Our approach is based on identifying the binding specificity of interacting domains that can then be used to predict new interactions.
Here, we use existing physical interaction data to derive sequence profiles of the binding sequences that are presumed to determine the binding specificity of interacting domains. Our method, called domain–motif interactions from structural topology (D-MIST), is based on a two-step approach. First, potential domain-binding motifs are extracted from structural data. Second, these motifs are converted to sequence profiles in the form of position-specific scoring matrices (PSSMs). These PSSMs are derived using a subset of experimentally determined binary interactions that contain the domain of interest (Figure 1).
Results

The library of 3-D structures of protein complexes contains a detailed description of the binding interfaces between interacting proteins, including atom contacts and residue side-chain interactions [28]. Using more than 10,000 structural complexes, we identified the domains in the binding sites and extracted their associated sequence motifs on the opposing chain. Interacting residues were defined as two residues on opposite polypeptide chains separated by a maximum of 5 Å (Figure 1).

The binding specificity of a domain is determined by a combination of physicochemical properties and structural constraints at the binding site that can be satisfied by multiple variations of the consensus sequence motif [29]. The interacting sequence motifs extracted from the protein structures represent a first approximation of the binding specificity of the interacting domains, but do not represent the full evolutionary variations of the residue–residue interactions available in one binding topology. A more informative representation of the possible motif variations is a sequence profile in the form of a PSSM that captures the compositional variance by assigning probabilities to each amino acid at each position. These sequence variations of the binding profiles can be learned from proteins that are known to interact through the same domain.

We collected a set of 87,894 nonredundant protein interactions from four databases containing binary protein interactions from multiple species. Interactions derived from structural studies were excluded to preclude self-identification, as were interactions from high-throughput protein complex identification experiments [30,31] (see Methods). Gibbs sampling [32] was used to learn the PSSM binding profiles for a specific domain by sampling positions in the set of proteins that interact with proteins that contain the domain of interest.
The majority of the proteins in the learning set are assumed to interact through the common domain, and the generated PSSM will represent its binding profile (Figure 1).

The learned PSSMs were used to predict interactions for 703 yeast proteins with domains for which we successfully derived binding profiles. A physical interaction was predicted between proteins containing interacting domains and proteins with one or more of the interacting profiles associated with those domains (Figure 1).

Experimental verification of a subset of the predicted interactions was performed by a one-step immunoaffinity purification of one of the two interaction partners, followed by mass spectrometric identification of associated proteins (IP-MS) as previously described [31]. The IP-MS method confirmed 37 predicted interactions, including 23 novel interactions (Figure 2).
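The profile-matching step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_pssm`, `best_window`, `predict_partners`), the pseudocount, the uniform background frequency, and the score threshold are all assumptions introduced here, and the PSSM is built by simple counting over pre-aligned motifs of equal length rather than by Gibbs sampling.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(motifs, pseudocount=1.0):
    """Build per-position amino acid probabilities from aligned motifs.

    Assumes all motifs have the same length and use only standard residues.
    """
    width = len(motifs[0])
    if any(len(m) != width for m in motifs):
        raise ValueError("all motifs must have the same length")
    total = len(motifs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(width):
        counts = Counter(m[pos] for m in motifs)
        pssm.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return pssm


def best_window(pssm, sequence, background=1.0 / len(AMINO_ACIDS)):
    """Return (score, start) of the best log-odds window in the sequence."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Unknown residues contribute a log-odds of zero.
        score = sum(
            math.log2(pssm[i].get(aa, background) / background)
            for i, aa in enumerate(window)
        )
        best = max(best, (score, start))
    return best


def predict_partners(pssm, proteins, threshold):
    """Names of proteins whose best window scores at or above the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return [
        name for name, seq in proteins.items()
        if best_window(pssm, seq)[0] >= threshold
    ]
```

In this sketch, a protein carrying the domain would be predicted to interact with every protein returned by `predict_partners` for that domain's profile; in practice the threshold would need to be calibrated against known interactions.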
Experimentally Confirmed Predictions

Among the experimentally confirmed predictions were interactions between the five components of the PRS complex, which together compose the 5-phosphoribosyl-1(α)-pyrophosphate synthetase enzyme (EC number 2.7.6.1). This complex is a key component in the production of the precursors for purine, pyrimidine, and pyridine nucleotides [34]. An additional interaction was confirmed between the alcohol dehydrogenase (NADP+) Adh7 and Prs5, the latter being a member of the PRS complex. This result suggests a possible direct link between NADP/NADPH balance, which is controlled by Adh7 [35], and the biosynthesis of the purine and pyrimidine precursors.

A predicted interaction between the histone H2A protein Hta1 and God1, a component of the SWR-C protein complex that incorporates Htz1 into the chromatin, was also confirmed. Chromatin remodelling by the exchange of Hta1 with Htz1 is thought to induce chromatin restructuring that favours gene transcription, RNA polymerase II recruitment, and gene expression induction near silent heterochromatin [36].

Another confirmed interaction is between a member of the HSP40 family (Apj1) and two HSP70 proteins (Ssa1, Ssa2). HSP40 family members form complexes with HSP70 chaperone proteins, which facilitate the folding of specific proteins at various cellular locations [37].

We also identified a new interaction between the RNA polymerase II subunit Rpb2 and Rpb10, which is a common subunit of all three RNA polymerases [38]. An additional interaction was demonstrated between Rpc40, a known shared subunit of RNA polymerases I and III, and Rpb2, an exclusive component of RNA polymerase II. It is possible that some of these interactions are bridged or stabilized by other RNA polymerase subunits [39].
One might argue that the above successful predictions could be easily inferred from the orthology of the interacting proteins to the structural complexes used, such as the interactions between members of the PRS complex. We therefore tested several nonobvious predicted interactions that cannot be easily inferred from structural or sequence homology to other interacting pairs. The critical downstream effector of the mitotic exit network is the phosphatase Cdc14, which activates Clb degradation and Sic1 accumulation by dephosphorylation of key substrates [40]. We confirmed an unexpected predicted interaction between Cdc14 and the protein kinase Cbk1, which functions in a parallel pathway (called RAM [regulation of Ace2p activity and cellular morphogenesis]) at the end of mitosis to facilitate cytokinesis and mother–daughter abscission [41]. The Cdc14–Cbk1 interaction suggests that the activity of the mitotic exit network and RAM pathways may be coordinated via Cdc14-mediated dephosphorylation of RAM components and/or Cbk1-mediated phosphorylation of mitotic exit network components [42].

Other nonobvious interactions between known components of the clathrin-associated (AP-1) complex, Apm1 and Apl2, as well as between components of the RNA splicing complex, Smd2 and Lsm2, were detected by the IP-MS experiments but not by IP-western under the conditions used. Given the strength of the D-MIST predictions for these latter interactions, further investigation using more sensitive reagents seems warranted. These confirmed predictions of nonobvious interactions illustrate the potential of the D-MIST approach to generate new biological hypotheses.

Discussion

As noted previously, we excluded additional experimental evidence, such as localization and expression data, from our prediction method.
Although additional experimental information and functional annotation would likely improve prediction accuracy, it may also limit predictions only to those proteins with prior experimental or functional information. In addition, the use of functional annotation such as Gene Ontology terms (assigned by human experts or predicted computationally) in a prediction method will penalize predicted interactions between proteins with unrelated functions. Therefore, it restricts the ability to predict interactions between apparently unrelated proteins that could illuminate new cellular functions [43].

The D-MIST method for identifying domain-binding modules is currently limited in a number of ways. The first limitation is the availability of detailed binding information, as attained primarily through structural studies and peptide-based approaches such as phage display [44] and random peptide libraries [45]. In addition, several studies have concluded that the repertoire of protein structures in the Protein Data Bank is significantly biased in that trans-membrane and disordered domains are underrepresented due to limitations in structure determination [46,47]. Consequently, D-MIST analysis that depends on structural representation of protein interactions is similarly biased. The existing detailed examples of interactions are therefore sparse and noncomprehensive, with only a small subset of all possible domains represented. The second limitation is that the derived motifs do not represent the entire repertoire of all possible domain-binding sequences, even for those domains where structural data exist. The third limitation arises from the statistical framework of the Gibbs sampling method, which requires a sufficient number of proteins to sample from in order to converge towards a meaningful PSSM. We restricted the analysis to domains with five or more putative interactors, thereby excluding domains that are infrequently found in our set of protein interactions.
Fourth, some domains are not amenable to this type of analysis due to the diverse nature of their binding motifs that lack sequence conservation [29]. Last, many interactions are governed by posttranslational modifications or precise physiological states, which may also hamper the accuracy of D-MIST predictions.

Despite the above limitations, we have shown that novel protein interactions can be predicted strictly from primary sequence information. D-MIST not only predicts interactions between proteins but also provides sequence-level predictions about the binding sites that can be verified experimentally. Predicting protein interactions without the need for additional information or prior experiments is particularly valuable when studying uncharacterized proteins and for predicting interactions in poorly studied organisms where typically only sequence information and predicted open reading frames are available. The sole dependence on sequence information allows for interaction prediction in other organisms without further modifications to the method or input datasets. With the advent of structural genomics initiatives [48], the power of the D-MIST approach will certainly increase.

Methods

Extracting motifs. The domain-binding motifs were extracted from BIND protein interaction records that were generated from 10,064 structures [28]. Interactions were filtered for crystal-packing artifacts using the PQS server [49], and all the interactions are available as a subset of the BIND database. Domain annotation was assigned to the protein structures using our in-house adaptation of CDD [50] with an e-value cutoff of 10 × 10⁻⁶ and then converted to InterPro identifiers [51]. Binding motifs are defined as polypeptide segments of five residues or longer in which the amino acid side chains are <5 Å from the interacting domain's side chains on the opposing protein.
Two motif residues that are in direct contact with the interacting domain can be separated by a maximum of two noncontacting residues. For example, the first motifs in Figure 1

Learning the binding modules. A total of 87,894 nonredundant protein interactions were collected from 204 species from four database sources: BIND [52], DIP [53], Mint [54], and IntAct [55]. We excluded all interactions that were derived from 3-D studies, high-throughput protein complex identification studies [30,31], or interactions inferred from synthetic lethal experiments. The interactions were indexed in a relational database by domain annotation such that a single query can provide the full list of proteins that interact with a domain of interest (Figure 1).
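The motif definition given under "Extracting motifs" (contacts closer than 5 Å, at most two noncontacting residues between consecutive contacts, and a minimum segment length of five) can be sketched as a simple scan over per-residue distances. This is an illustrative assumption-laden sketch, not the authors' code: the function names, the input format (one minimum side-chain distance per residue), and the choice to count gap residues towards the segment length are all hypothetical.

```python
def contact_flags(min_distances, cutoff=5.0):
    """Mark residues whose closest approach to the domain is below the cutoff (in Å)."""
    return [d < cutoff for d in min_distances]


def extract_motifs(sequence, in_contact, min_length=5, max_gap=2):
    """Return (start, segment) pairs for candidate binding motifs.

    A segment starts and ends on a contacting residue; consecutive contacts
    may be separated by at most `max_gap` noncontacting residues. Segment
    length here includes the gap residues (an assumption of this sketch).
    """
    if len(sequence) != len(in_contact):
        raise ValueError("sequence and contact flags must have equal length")
    motifs = []
    start = last = None
    for i, flag in enumerate(in_contact):
        if not flag:
            continue
        if start is None:
            start = last = i          # open the first segment
        elif i - last - 1 <= max_gap:
            last = i                  # extend the current segment
        else:
            if last - start + 1 >= min_length:
                motifs.append((start, sequence[start:last + 1]))
            start = last = i          # gap too long: open a new segment
    if start is not None and last - start + 1 >= min_length:
        motifs.append((start, sequence[start:last + 1]))
    return motifs
```

The per-residue distances would come from the atomic coordinates of a structural complex; computing them is outside the scope of this sketch.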
Predicting new interactions. Two proteins were predicted to interact if one protein had a domain and a second protein matched one or more of the binding profiles for that domain (Figure 1).

Experimental verification. Recombination-based cloning, culture growth, and protein complex isolation were performed essentially as described [31] with minor modifications. Each uncharacterized open reading frame was tagged at the 3′-end with the FLAG-tag epitope using the Gateway recombination-based cloning system (Invitrogen, http://www.invitrogen.com). Bait complexes were immunopurified on anti-FLAG M2 antibody resin, resolved by denaturing gel electrophoresis, and visualized by colloidal Coomassie stain. Protein identification by automated liquid chromatography tandem mass spectrometry on a Finnigan LCQ DECA ion trap (Thermo Finnigan, http://www.thermo.com) mass spectrometer was performed as described previously [31]. Predicted protein interactions were also confirmed by IP-western [31] using interaction partners tagged either as C-terminal HA or Myc3 epitope fusions and detection with 12CA5 anti-HA or 9E10 anti-Myc monoclonal antibodies, respectively (Figure S2).

Overlap with literature. The predicted interactions were compared to a new set of yeast curated interactions collected from more than 50,000 abstracts and publications [33] (available at www.thebiogrid.org). The probability of the observed overlap between the predicted interactions and the literature-curated interactions is approximated by a Poisson distribution. A random variable Y has a Poisson distribution if
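$$P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \ldots$$

where λ is the mean of the distribution. Given these parameters, P(Y ≥ 609) under a Poisson distribution is 1.0 × 10⁻¹³. A similar calculation using a hypergeometric distribution (sampling without replacement) yields a p-value of 1.0 × 10⁻⁸.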
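As a hedged illustration of the two tail calculations named above, the sketch below computes an upper-tail Poisson probability and an upper-tail hypergeometric probability using only the Python standard library. The function names and arguments are assumptions introduced here; the text does not state the Poisson mean or the population sizes used, so no attempt is made to reproduce the reported p-values.

```python
import math


def poisson_upper_tail(k, lam):
    """P(Y >= k) for Y ~ Poisson(lam).

    This sketch only handles the enrichment case k > lam > 0, where the
    terms of the sum decrease monotonically from the first one.
    """
    if lam <= 0 or k <= lam:
        raise ValueError("this sketch only handles k > lam > 0")
    # First term computed in log space to avoid overflow in lam**k and k!.
    term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    total = 0.0
    y = k
    while term > total * 1e-17:   # stop once further terms are negligible
        total += term
        y += 1
        term *= lam / y           # pmf(y) = pmf(y - 1) * lam / y
    return total


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population containing `successes` marked items."""
    upper = min(draws, successes)
    numerator = sum(
        math.comb(successes, x) * math.comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return numerator / math.comb(population, draws)
```

For an overlap test of this kind, `k` would be the observed number of shared interactions, while `lam` (or the hypergeometric population, marked items, and draws) would have to be derived from the sizes of the predicted and curated networks.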
Dataset S1: Cytoscape Session File Containing the Validated and Predicted Protein Interactions
A Cytoscape session file containing the complete set of predicted interactions as well as the networks in Figures 2 (2.0 MB ZIP)

Figure S1: The Overlap between the Predicted Interaction Network and a Comprehensive Set of Literature-Curated Interactions [33]
The predicted interactions were compared to a new and exhaustive set of curated interactions extracted from the literature that includes physical interactions from both high-throughput and directed studies as well as genetic interactions. The overlap contains 609 interactions that represent ~3% of the predicted interactions. Proteins are coloured according to Gene Ontology biological process annotation. (519 KB PDF)

Figure S2: IP-Western Results for the Novel Interactions Predicted by D-MIST
Bait proteins were purified using FLAG antibodies, and their interacting proteins were detected by antibodies specific to C-terminal HA or Myc3 epitopes. (325 KB PDF)

Text S1: The Domain-Binding Profiles Derived by D-MIST
Each domain-binding profile is specified as a list of sequence motifs. The sequence motifs are used as input to a PSSM search program [56]. Source code available at http://www.people.fas.harvard.edu/~junliu/index1.html#Computational_Biology. (2.7 MB TXT)

Acknowledgments

We thank Mai Vo, Brett Larsen, Pavel Metalnikov, and Howard Feldman for technical assistance.
Footnotes

¤ Current address: Computational and Systems Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America

Author contributions. DB and MT conceived and designed the experiments and wrote the paper. KEB and DDD performed the experiments. DB and RI analyzed the data. RI contributed reagents/materials/analysis tools. DB conceived and designed the project, and performed the computational work. MT and CWVH directed the study.

Funding. DB was supported by an Ontario Graduate Scholarship, and KB was supported by a Canadian Institute of Health Research (CIHR) Training Grant. MT's research is supported by CIHR and Genome Canada; MT holds a Canada Research Chair in Bioinformatics and Functional Genomics. CWVH's research is funded by the Ontario R&D Challenge Fund and by Genome Canada through the Ontario Genomics Institute.

Competing interests. The authors have declared that no competing interests exist.