Copyright: © 2007 Shoemaker and Panchenko. This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.

Deciphering Protein–Protein Interactions. Part I. Experimental Techniques and Databases

Fran Lewitter, Editor
Whitehead Institute, United States of America

* To whom correspondence should be addressed. E-mail: panch/at/ncbi.nlm.nih.gov

Proteins interact with each other in a highly specific manner, and protein interactions play a key role in many cellular processes; in particular, the distortion of protein interfaces may lead to the development of many diseases. To understand the mechanisms of protein recognition at the molecular level and to unravel the global picture of protein interactions in the cell, different experimental techniques have been developed. Some methods characterize individual protein interactions while others are advanced for screening interactions on a genome-wide scale. In this review we describe different experimental techniques of protein interaction identification together with various databases which attempt to classify the large array of experimental data. We discuss the main promises and pitfalls of different methods and present several approaches to verify and validate the diverse experimental data produced by high-throughput techniques.

Introduction

It is now becoming clear that protein interactions determine the outcome of most cellular processes [1–4]. Therefore, identifying and characterizing protein–protein interactions and their networks is essential for understanding the mechanisms of biological processes on a molecular level.
Despite the fact that protein interactions are remarkably diverse, all protein interfaces share certain common properties. Protein interactions can be classified into different types depending on their strength (permanent and transient), specificity (specific or nonspecific), the location of interacting partners within one or on two polypeptide chains, and the similarity between interacting subunits (homo- and hetero-oligomers). It has been shown that interface types are significantly different in amino acid composition so that it is possible to predict the type of interaction interface from amino acid composition alone [5]. Earlier structural analysis of interfaces showed that most interfaces consist of completely buried cores surrounded by partially accessible rims [6,7] with the overall size of about 1600 ± 400 Å² (a “standard size” patch) [8]. It has been found that certain amino acids are preferred on protein interfaces and that the amino acid composition of the core differs considerably from the rim [6,7,9,10]. More recent models suggested that the protein binding site consists of a few independent highly packed regions, so-called “hot spots,” which contribute significantly to the free energy of binding [11–13]. Hot spots were found to be structurally conserved [14], and the energetics of interactions at the hot spots have been analyzed in several studies [15–18].

In many cellular processes, proteins recognize specific targets and bind them in a highly regular manner. The specificity of interactions in these cases is determined by structural and physicochemical properties of two interacting proteins. As a result, there should be a certain degree of conservation in the interaction patterns between similar proteins and domains. Indeed, it has been found that close homologs almost always interact in the same way and protein–protein interactions place certain evolutionary constraints on protein sequence and structural divergence [19–24].
Recent studies confirm that the total number of interaction types or modes is limited and rather small [25–27]. On the other hand, remotely related proteins/domains can have different interaction modes [21,26,28], and the conservation of such protein interfaces is similar to the average conservation of the rest of the protein [29–32].

In this review and its companion review in the April issue [33], we attempt to classify and systematize the array of experimental and theoretical data on the identification and prediction of protein interactions. In this review we focus on the generic experimental techniques for identifying protein interactions and the databases storing the information obtained from these experiments. In the second review, we present different methods to predict protein and domain interactions and discuss various challenges faced in this field with respect to limited prediction accuracy.

Experimental Methods for Identifying and Characterizing Protein Interactions

Protein interactions can be analyzed by different genetic, biochemical, and physical methods, which are listed in Table 1 and shown in Figure 1.
Yeast two-hybrid method. The development of the Y2H technique has considerably accelerated the screening of protein interactions in vivo. Y2H is based on the fact that many eukaryotic transcription activators have at least two distinct domains, one that directs binding to a promoter DNA sequence (BD) and another that activates transcription (AD) (Figure 1).

For screening entire genomes, the Y2H method has been advanced into two main approaches [44–46]: matrix-based and library-based. In the matrix approach, a matrix of prey clones is created where each clone expresses a particular prey protein in one well of a plate. Then each bait strain is mated with an array of prey strains, and those diploids where two chimeric proteins interact are selected based on the expression of a reporter gene and the position on a plate. In the library approach, each bait is screened against an undefined prey library containing random cDNA fragments or open reading frames (ORFs). Diploid positives are selected based on their ability to grow on specific substrates, and interacting proteins are identified by DNA sequencing.

The first two genome-wide analyses of the yeast “interactome” revealed 692 and 841 putative interactions, respectively [47,48]. The overlap between these two experimental studies was quite small; both methods shared only 141 interactions, about 20% of the interaction data [48]. Recently, Y2H has been used to identify interactions in worm [2], fly [1], and human [49,50]. The small overlap between Y2H experiments can be explained by different factors, among them differences in protein interaction sampling, Y2H bias towards nonspecific interactions [51], and limitations of the Y2H method itself. For example, proteins initiating transcription by themselves cannot be targeted in Y2H experiments, and the use of sequence chimeras can cause difficulties because fusion can change the structure of a target protein.
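As a worked check of the overlap figure quoted above, the following sketch recomputes the shared fraction from the three reported counts (692, 841, and 141). The function name `overlap_stats` and the decision to also report a Jaccard index are illustrative assumptions, not part of the cited studies; the text does not state which denominator "about 20%" refers to, so all three ratios are shown.

```python
# Counts quoted in the text for the two genome-wide yeast Y2H screens.
SCREEN_A = 692  # putative interactions reported by the first screen
SCREEN_B = 841  # putative interactions reported by the second screen
SHARED = 141    # interactions reported by both screens


def overlap_stats(n_a: int, n_b: int, n_shared: int) -> dict:
    """Return the shared fraction of each dataset and the Jaccard index.

    Hypothetical helper for illustration only.
    """
    union = n_a + n_b - n_shared  # size of the combined, de-duplicated set
    return {
        "fraction_of_a": n_shared / n_a,
        "fraction_of_b": n_shared / n_b,
        "jaccard": n_shared / union,
    }


stats = overlap_stats(SCREEN_A, SCREEN_B, SHARED)
for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```

Relative to the smaller dataset the overlap is about 20%, relative to the larger one about 17%, and relative to the union of both about 10%.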
In addition, protein folding and posttranslational modifications can differ between yeast and other organisms. This makes it difficult to use Y2H to screen proteins from mammalian and prokaryotic cells, as well as cytoplasmic and membrane proteins. To validate Y2H interactions detected in vivo, different in vitro techniques can be used.

Mass spectrometry. MS is a powerful method of studying macromolecular interactions in vitro. The principle of the MS method is to produce ions which can be detected based on their mass-to-charge ratios, thereby allowing the identification of polypeptide sequences [36,52,53] (Figure 1).

TAP method of complex purification. A TAP tag consists of two IgG binding domains of Staphylococcus protein A and a calmodulin binding peptide separated by the tobacco etch virus protease cleavage site [61,62] (Figure 1).

Gene co-expression. Since the function of a protein complex depends on the functionality of all subunits, subunits should be present in stoichiometric amounts and gene expression levels of subunits in a complex should be related. Gene expression profiles can be obtained, for example, from cell cycle experiments and from expression levels of a gene under different conditions. Expression profile similarity can be calculated as a correlation coefficient between relative expression levels of two genes/proteins, as the normalized difference between their absolute expression levels, or using other methods [65–69] (Figure 1).

Synthetic lethality method. It is not well understood how genetic variation influences phenotype and how genes interact with each other, producing different phenotypes in different strains of the same species [77,78]. These problems can be addressed by using various genetic interaction methods, the most common of which is the synthetic lethality method (Figure 1).

Monitoring specific protein interactions.
The most detailed information about protein interaction interfaces at the atomic level can be provided by X-ray crystallography and NMR spectroscopy, but the number of solved protein complexes remains low [84]. At the same time, the real-time characterization of interacting proteins in vivo can be achieved with various spectroscopic techniques requiring the attachment of a spectroscopic label to a target protein [87,88] (Table 1). A powerful technique in this respect is fluorescence resonance energy transfer (FRET), which can occur only if two fluorophores are located close to each other [89]. Another effective method, surface plasmon resonance (SPR), does not require spectroscopic labeling and can detect interactions between soluble ligands and immobilized receptors [90,91], while the isothermal titration calorimetry (ITC) technique allows for direct measurement of the enthalpy of binding [92]. Recently, new methods have been developed to analyze protein interactions at the single-molecule level. For example, atomic force microscopy can fairly accurately measure interaction forces [93], while fluorescence techniques can characterize conformational changes in proteins upon binding [94].

Protein interaction networks derived from experiments. The fast development of experimental techniques for protein interactions has enabled the construction and systematic analysis of interaction networks [1,2,95]. Interaction maps obtained for one species can be used to predict interaction networks in other species, to identify functions of unknown proteins, and to get insight into the evolution of protein interaction patterns. The interaction map analyses and comparisons are based on the observation that many interactions are conserved among species (“interologs”) [46]. Sequence-based searches for “interologs” were able to identify 16%–31% of true “interologs” (tested using the Y2H system) even between remotely related species such as yeast and worm [96].
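Co-expression networks of the kind discussed in this section rest on a pairwise similarity between expression profiles, which the text describes as, for example, a correlation coefficient between relative expression levels. The sketch below is a minimal illustration under stated assumptions: the gene names, the expression values, the 0.8 threshold, and the helper names `pearson` and `coexpression_edges` are all hypothetical, and Pearson correlation is only one of the similarity measures the cited studies may use.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.8):
    """Link every gene pair whose profile correlation reaches the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Made-up relative expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}

edges = coexpression_edges(profiles)
for g1, g2, r in edges:
    print(f"{g1} - {g2}: r = {r:.3f}")
```

With these toy values only the geneA and geneB pair is linked; geneC is anti-correlated with both and is left out. A real analysis would also need to handle missing values and choose the threshold from the data.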
Analysis of conservation in the networks produced by gene co-expression data revealed that interologs correspond to the functionally related genes responsible for core biological processes [77]. Moreover, a multiple-species network has been constructed by identifying pairs of genes with correlated expression in different organisms. A multiple-species network has been shown to perform better than a single-species network in linking together functionally related genes.

Verification of protein interactions. Validation of protein interaction data is difficult; except for small datasets on protein interactions provided by the Protein Data Bank (PDB) [84] and the Munich Information Center for Protein Sequences (MIPS) [97], there is no comprehensive gold standard interaction set. Several methods have been proposed for verification of protein interaction data [66,67,76,98,99], and some of them are described here.

The expression profile reliability (EPR) method [66] is based on the observation that interacting proteins are coexpressed. Two distributions of expression distances are defined, one for noninteracting and one for reliably interacting proteins. The distribution of expression distances for a protein set of interest is assumed to be a linear combination of these two predefined distributions, with a linear coefficient that characterizes the accuracy of the given dataset.

The paralogous verification method (PVM) [66] is based on the observation that if two proteins interact, their paralogs most likely interact. It gives more reliability to the interaction of two families that contain a greater number of interactions between paralogous proteins. This method identified ~40% true interactions at a 1% error rate.

The protein localization method (PLM) [98] defines true positives as interacting proteins that are localized in the same cellular compartment and/or interacting proteins that are annotated to have a common cellular role.
PLM showed that the accuracy of experimental data strongly depends on the method, with up to 50% true positives detected in Y2H experiments and up to 100% true positives detected in immunoprecipitation experiments [100].

Protein and domain interaction databases. A large variety of databases exists to study binary protein interactions and the higher order interactions in protein complexes. A summary of some available databases is given in Tables 2 and 3. Different databases contain interactions obtained by direct submission from experimentalists and by mining literature and other data sources; in some cases the data are verified using automated algorithms or manual curation. In addition to direct detection of physical protein interactions, indirect methods can be used to predict the functional association between proteins or to predict the location of the interaction interface itself. The interactions available from different databases vary widely in their level of detail. For example, Y2H data give the identity of interacting proteins, electron microscopy provides relative positional information of interacting proteins, and crystallography provides full atomic detail of interaction surfaces. In addition, interacting proteins can be studied either as complete units or by domains used as the units of interaction. Consequently, in this review we group all databases into protein and domain-related databases.
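Because databases draw on different sources, the same binary interaction may be recorded more than once, and in either orientation (A with B, or B with A). Comparing datasets therefore usually starts by normalizing each pair. The following is a minimal sketch of that idea, not the procedure of any particular database: the protein identifiers, source labels, and the helper names `canonical_pair` and `merge_sources` are all hypothetical.

```python
from collections import defaultdict


def canonical_pair(a, b):
    """Represent an undirected interaction as an order-independent tuple."""
    return tuple(sorted((a, b)))


def merge_sources(datasets):
    """Map each canonical pair to the set of sources that report it.

    `datasets` maps a source label to an iterable of (protein, protein) pairs.
    """
    support = defaultdict(set)
    for source, pairs in datasets.items():
        for a, b in pairs:
            support[canonical_pair(a, b)].add(source)
    return dict(support)


# Made-up identifiers and sources, for illustration only.
datasets = {
    "source_1": [("P1", "P2"), ("P2", "P3")],
    "source_2": [("P2", "P1"), ("P4", "P5")],  # P2-P1 is the same pair as P1-P2
}

support = merge_sources(datasets)
shared = sorted(pair for pair, sources in support.items() if len(sources) > 1)
print("pairs reported by more than one source:", shared)
```

Keeping the set of supporting sources per pair, rather than a simple count, makes it possible to ask later which pairs are backed by independent evidence. In practice the harder step is mapping different identifier schemes onto one another before pairs can be compared at all.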
In spite of the interaction data diversity, there are considerable overlaps in the datasets contained in the databases, making it difficult to recommend a single resource for a particular type of information. In one effort to deal with this redundancy, the International Molecular Exchange Consortium (IMEx) has been formed, in which databases agree to share their data in a consistent and timely fashion (Table 2). In addition, a standard data model has been proposed for the representation and exchange of protein interaction data [101]. A few example databases from Table 2 will now be highlighted to illustrate different types of interaction data available.

Protein Interaction Databases

Database of Interacting Proteins. The Database of Interacting Proteins (DIP) contains experimentally determined protein interactions and includes a core subset of interactions that have passed a quality assessment [102]. Interaction data are obtained from the literature, from PDB, and from high-throughput methods such as Y2H, DNA and protein microarrays, and TAP–MS analysis of protein complexes. Several methods are employed to assess the quality of interaction data and are offered as a service for query interactions. DIP links to related databases, including LiveDIP, which records information about the state of a biological interaction, such as covalently modified, conformational, or cellular location states [103]. Another database related to DIP is Prolinks, which brings together four methods of linking proteins: phylogenetic profiles, Rosetta Stone, gene neighbors, and gene clusters [104]. The database includes a Proteome Navigator tool to browse the linkages and view accompanying data.

Biomolecular Interaction Network Database. The Biomolecular Interaction Network Database (BIND) includes high-throughput experimental datasets and protein complexes from PDB [105,106]. It contains a variety of curated experimental data.
A generalized data specification handles not only various types of protein interaction data, but also protein–small molecule interactions and protein–nucleic acid interactions. An interaction viewer is provided to browse the interaction space. BIND can also distinguish different functional types of interactions.

Munich MPact/MIPS database. MPact is a resource to access MIPS, which contains a manually curated yeast protein interaction dataset [97] collected by curators from the literature. The resource also includes high-throughput results for yeast, but keeps these data separate. MIPS is often used as a standard-of-truth database for evaluating the quality of data and the accuracy of interaction prediction methods.

Domain Interaction Databases

PIBASE database. PIBASE is a database of domain interactions from the protein structure data [107]. It uses SCOP and CATH domain definitions to find putative domain interactions. Several methods are employed to remove redundancy in structural data; for example, structural comparisons of interfaces are made between domains within one structure. The database combines physicochemical properties of protein binding sites and has a link to MODBASE [108], containing models of three-dimensional structures that allow use of PIBASE for modeling of putative domain interfaces.

3did database. 3did allows one to explore the details of domain interactions from protein structure data (yeast interactions are also included) [109]. For each domain, an overview is given of all its interactions with other domains, showing different interaction types. In some cases, dot plots of structural comparisons between interaction interfaces show the variance of the interactions between pairs of domain families. Database entries are also supplied with the GO-based functional annotations.
InterPreTS is a Web-based service associated with 3did that predicts domain interactions based on sequence homology of query proteins to a database of interacting domains (DBID) [21].

Conserved Binding Mode database. The Conserved Binding Mode (CBM) database is a collection of domain interactions from the structure data where domains are defined by the Conserved Domain Database [110]. Unlike other structure-based databases, domain interactions are grouped by geometry into conserved interaction modes for each pair of domain families across all PDB structures [26]. Structural superpositions are used to infer CBMs from different members of interacting domain families docking in the same way. Domain interactions with such recurring structural themes are more likely to be biologically relevant than spurious crystal packing interactions. CBMs can also assist in analyzing protein interaction network topology by emphasizing connections made in a biological context. Finally, the CBM database can be used to categorize the specific interaction surfaces that have evolved from conserved domains and thereby allows for the homology modeling of protein interaction interfaces. A similar approach for grouping interaction patterns for SCOP domains was recently undertaken with the SCOPPI database [111].

Domain Interaction Map database. The Domain Interaction Map (DIMA) database is a domain interaction map derived from phylogenetic profiling of Pfam domains [97]. Instead of looking at entire protein sequences, the algorithm compares the occurrences of domains across genomes and associates similar patterns of occurrences with functional associations. The method works well for domains with moderate information content that have distinct phylogenetic profiles.

In this paper we have reviewed a wide spectrum of experimental techniques for identifying and characterizing protein interactions; each technique can provide a piece in the puzzle of mechanisms of protein recognition [112].
Despite enormous efforts in this field, the overall picture is still incomplete, which is not surprising given the enormous complexity of a cell. Indeed, proteins can behave differently in different parts of the cell, and many proteins form transient complexes that are difficult to identify. Moreover, evolutionarily conserved proteins have much better coverage in experiments than the proteins restricted to a certain organism. The low coverage together with the small overlap between different experimental methods calls for the development of theoretical approaches for interaction data verification and prediction, the topic we address in our companion review [33].

Acknowledgments

The authors thank Lewis Geer for helpful discussions and Robert Yates for graphic design of the figures. This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health of the US Department of Health and Human Services.
Footnotes

Benjamin A. Shoemaker and Anna R. Panchenko are with the Computational Biology Branch of the National Center for Biotechnology Information in Bethesda, Maryland, United States of America.

Competing interests. The authors have declared that no competing interests exist.

Funding. The authors received no specific funding for this article.

Author contributions. BAS and ARP analyzed the data and wrote the paper.