Copyright © 2008 Yang et al; licensee BioMed Central Ltd.

Deducing topology of protein-protein interaction networks from experimentally measured sub-networks

1 Departments of Medicine (Cardiology), David Geffen School of Medicine at the University of California, Los Angeles, California 90095, USA
2 Physiology, David Geffen School of Medicine at the University of California, Los Angeles, California 90095, USA
3 Anesthesiology, David Geffen School of Medicine at the University of California, Los Angeles, California 90095, USA

Corresponding author.
Ling Yang: lyang/at/mednet.ucla.edu; Thomas M Vondriska: TVondriska/at/mednet.ucla.edu; Zhangang Han: zhan/at/bnu.edu.cn; W Robb MacLellan: rmaclellan/at/mednet.ucla.edu; James N Weiss: jweiss/at/mednet.ucla.edu; Zhilin Qu: zqu/at/mednet.ucla.edu

Received February 22, 2008; Accepted July 3, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Protein-protein interaction networks are commonly sampled using yeast two-hybrid approaches. However, whether topological information reaped from these experimentally-measured sub-networks can be extrapolated to complete protein-protein interaction networks is unclear.

Results

By analyzing various experimental protein-protein interaction datasets, we found that they are not random samples of the parent networks. Based on the experimental bait-prey behaviors, our computer simulations show that these non-random sampling features may affect the topological information. We tested the hypothesis that a core sub-network exists within the experimentally sampled network that better maintains the topological characteristics of the parent protein-protein interaction network.
We developed a method to filter the experimentally sampled network to obtain a core sub-network that more accurately reflects the topology of the parent network. These findings have fundamental implications for large-scale protein interaction studies and for our understanding of the behavior of cellular networks.

Conclusion

The topological information from experimentally measured networks, taken as is, may not be the correct source for topological information about the parent protein-protein interaction network. We define a core sub-network that more accurately reflects the topology of the parent network.

Background

Biological systems are characterized by extremely complex interacting networks of nucleotides, proteins, metabolites and other molecules. It has become increasingly clear that to understand the function of a cell, one must understand the function of these networks. Because the topological characteristics of a network are believed to determine basic properties of its function [1-4], a primary goal in analyzing biological networks is to determine how the interacting elements (nodes) are connected to each other (edges or links).

The commonly used large-scale experimental approaches (yeast two-hybrid and affinity pull-down combined with mass spectrometry) for mapping protein-protein interaction networks are extremely useful for sampling portions of the entire network; however, they have well-recognized limitations: (i) some interactions are missed (false negatives); (ii) spurious interactions are detected (false positives); (iii) interactions are assumed to be direct (binary analyses lose hierarchical information); and (iv) some proteins function better than others in a protein interaction assay [5,6]. "Sticky" proteins may be less likely to have false negatives, but it remains an empirical question whether these proteins are also more likely to have false positives.
Other factors contributing to these limitations include effects of affinity tag interactions, effects of antibody binding, influence of subcellular localization and protein activity, and post-translational modifications.

A general theoretical question is whether there is a way to sample a network so that the topological information of a sub-network faithfully reflects that of the original network. This issue was addressed by recent theoretical studies of Stumpf and colleagues [7,8], who showed that a randomly-sampled sub-network from an Erdös-Rényi random network is still an Erdös-Rényi random network; the same is true for an exponential network. When the original network is scale-free, however, the randomly sampled sub-network is not truly scale-free, but the degree distribution is still very close to a power-law. These findings suggest that a randomly-sampled sub-network may still largely maintain the topological information of the original scale-free network. Besides the maintenance of degree distribution, we also numerically analyzed the network motifs and found that the motif structures were also maintained after random sampling (Additional file 1, Fig. S1).

Therefore, a practical question that arises is whether the sub-networks measured by the large-scale experimental approaches can be used to deduce topological information of the original networks. The answer to this question remains largely unclear. In a recent computational analysis [9], it was found that the power-law degree distributions of sampled networks reported in previous studies [3,4,10-13] may be a consequence of the manner in which the data are acquired and the low coverage of the complete (i.e., the "actual") protein-protein interaction networks.
Besides the degree distribution and network motifs, other topological properties of the randomly sampled network, such as degree exponent, average path length and clustering coefficient, can be quite different from those of the original network when the sampled network is smaller than the original one [14,15]. Nevertheless, based on these previous studies [7-9] and our simulations (Additional file 1, Fig. S1), a sample that reflects the degree distribution and percentage of network motifs of the original network should be randomly acquired and should cover a large fraction of the parent network.

By analyzing several experimentally measured protein-protein interaction networks in the present study, we demonstrate that these experimental samples do not constitute random samples, likely due to the aforementioned experimental considerations. This observation highlights that the experimentally-measured sub-networks may not be the correct source for topological information about the parent protein-protein interaction network, raising the distinct possibility that previous analyses of biological networks [3,4,10-13,16-22] drew inappropriate conclusions about topology. Although we conclude in this study that the current experimental datasets cannot be used directly for deducing topological information of the original network, we hypothesized that there is a core sub-network (CSN) within the experimentally sampled network that can better retain the topological information of the original protein-protein interaction network.
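The random-sampling baseline discussed above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' code: nodes of an Erdös-Rényi graph are kept independently with a fixed probability and the induced sub-network is examined. The function names (`erdos_renyi`, `induced_subgraph`, `mean_degree`) and all parameter values are hypothetical choices made for illustration.

```python
import random

def erdos_renyi(n, p_edge, rng):
    """Build an undirected G(n, p) graph as a dict: node -> set of neighbours."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_edge:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the links among them."""
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

def mean_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj)

rng = random.Random(0)
parent = erdos_renyi(2000, 0.01, rng)            # illustrative size, mean degree near 20
sample_nodes = [v for v in parent if rng.random() < 0.4]   # assumed sampling rate 0.4
sample = induced_subgraph(parent, sample_nodes)

print("parent mean degree:", mean_degree(parent))
print("sample mean degree:", mean_degree(sample))
```

In this sketch the sampled sub-network keeps the same link probability between any two retained nodes, so it is again an Erdös-Rényi graph, but its mean degree shrinks roughly in proportion to the sampling rate. That is one concrete way in which some topological quantities of a small random sample can differ from those of the parent even when the family of the degree distribution is preserved.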
Results

Properties of experimentally-measured protein-protein interaction networks

Despite the insights obtained by Stumpf and colleagues [7,8] regarding degree distribution and our numerical analyses of network motifs in randomly sampled networks (Additional file 1, Fig. S1), one is still faced with the problem that experimental sampling may not be random, for one or more of the following reasons: (i) some proteins are used as either bait or prey, but not both; (ii) experimental results often contain data from different laboratories, species, techniques, etc.; and (iii) even if all proteins under analysis are used as both baits and preys (e.g., large-scale yeast two-hybrid approaches), the relative ability of a protein to "behave as a bait" may not be equivalent to (and sometimes is completely different from) its ability to "behave as a prey". For example, in the yeast protein-protein interaction network by Ito et al [23], all 6,000 proteins were used both as baits and preys, but in the resultant network many proteins exhibited a preferential capacity to act as either a bait or a prey, while some did both.

Figure 1a [...]
Here we first defined the sub-network composed of the proteins which have both bait and prey functions, and the links among these proteins (red dots and links in Fig. 1c), [...]

Ideally, if the interactions between the proteins were completely sampled, there would be no pure baits or pure preys. (In this study, we count A–B as one link, but A → B with A as bait and B as prey, and B → A with B as bait and A as prey, as two interactions.) One can attribute the occurrence of the asymmetrical properties to the limitations of experimental systems or to the proteins being artificially sorted by the way the experiments were carried out. However, the asymmetrical bait and prey properties can also occur with random sampling if the sampling of the interactions is incomplete. To exclude the possibility that the measured network is a randomly sampled sub-network of the original network, we did further analyses of the experimental datasets. Firstly, if the experimental sampling were indeed random, then the numbers of observed "pure bait" and "pure prey" proteins following an incomplete sampling should be approximately equal; in fact, however, these numbers are quite different in the experimental datasets (Fig. 1d).
The results in Fig. 1d [...]

Effects of experimental sampling on network topology

To show how the experimental sampling affects the topological information, we first studied the effects of the ratio of the three types of nodes in the sampled network on the degree distribution and motif structure. We generated three theoretical networks (15,000 nodes each) with different topologies (Erdös-Rényi random distribution with an average connectivity of 40, exponential distribution $p(k) \sim e^{-0.025k}$, and scale-free distribution $p(k) \sim k^{-1.4}$) and used the Drosophila protein-protein interaction (DPPI) network by Giot et al [20] as if it were a theoretical network without the original bait and prey information. To mimic the experimental sampling, we randomly selected 6000 nodes from the 15,000-node parent networks (for the DPPI network, 5980 proteins were randomly sampled from the original 7049 proteins) as the experimental libraries, and randomly assigned proteins in the libraries (independent of degree/link number) to be pure baits, pure preys, or BPs (proteins that can act as both bait and prey), with certain probabilities. Different ratios between these three types were thus obtained. We then applied the following rules to the interactions: (i) any interaction originating from a pure prey or terminating on a pure bait is forbidden (see Additional file 1 Fig. 2); [...]
Figure 3a [...]

We also counted the sub-graphs of the networks as performed in previous studies [27-29]. Theoretically, a randomly sampled sub-network retaining all links ($q = 1$) should maintain the ratios between different types of motifs, based on the following argument: a given four-node motif (for example) in the parent network remains intact in the sampled sub-network if and only if all 4 nodes are in the sub-network. If the sub-network is sampled by selecting nodes with a probability $p$, then a four-node motif survives with probability $p^4$. Since all motifs have the same survival probability, the percentage of different motif types will not change in the randomly sampled sub-network. On the other hand, in the simulated experimental network, the three node types (BP, pure bait, pure prey) may change the survival probability, i.e. the probability that the link is maintained in the sample. For example, for the two motifs: Motif 1 (A–B, A–C, A–D) and Motif 2 (A–B, A–C, B–D) (see Additional file 1 Fig. 2b), [...]

Figure 3b [...]

Figure 3c [...]

Filtering core sub-network within an experimental dataset

Based on our analysis above, it is not surprising that the bait/prey preference affects the network topology, so that the sampled network cannot be used directly to predict the topology of the parent network. It is also intuitive that the core sub-network (CSN), which is composed of only BPs (the red dots and lines in Fig. 1c), [...]

According to our analysis above, exclusion of pure baits and pure preys does not eliminate the biased behavior of proteins from the CSN. To further refine this network, we first define two quantities, the bait score and the prey score, to quantitatively characterize the experimental behavior of individual proteins. These two quantities are empirically defined as: bait score = $m/n_1$, prey score = $n/m_1$ (each truncated to 1 if greater than 1). The rationale for these definitions is as follows.
For the hypothetical Protein X, $m$ is the number of preys to which Protein X links when it is a bait protein, among which $m_1$ proteins are themselves also baits in the experiment. The number of baits to which Protein X links when it is a prey protein is denoted by $n$. In the perfect experiment, when Protein X functions as a prey it should therefore link to at least $m_1$ proteins (i.e. $m_1$ should be equal to $n$). This of course is not the case in a real experiment, and therefore a protein's behavior as a prey is quantified by $n/m_1$, i.e., the prey score. In the experimental setting, $n$ can be larger than $m_1$, and $m_1 = 0$ for the pure preys; therefore, whenever $n > m_1$, we set the prey score to its maximum value of 1.

Similar nomenclature is used to label proteins from the prey perspective. For a given Protein X, $n$ is the number of baits to which it links when it is a prey, among which $n_1$ proteins are themselves also preys in the same experiment. As with the prey score above, the experimental data do not show the idealized relationship in which all interactions are detected from both directions, and therefore the bait score is calculated as $m/n_1$.

Relating these two scores together: in the idealized scenario, a BP protein has bait score = prey score = 1, pure baits have bait score = 1 and prey score = 0, and pure preys have bait score = 0 and prey score = 1. For the proteins shown as red nodes in Fig. 1c, [...]

Figure 4a [...]
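The bait and prey scores defined above can be computed directly from a list of directed (bait, prey) interactions. The following is a minimal sketch, not the authors' implementation. The function names `bait_prey_scores` and `core_subnetwork` are hypothetical, and the handling of a zero numerator (score 0, which reproduces the stated values for pure baits and pure preys) is an assumption made here for the 0/0 case that the text does not spell out.

```python
from collections import defaultdict

def bait_prey_scores(interactions):
    """interactions: iterable of (bait, prey) pairs, one per detected direction.
    Returns {protein: (bait_score, prey_score)}."""
    preys_of = defaultdict(set)   # bait -> preys it detected
    baits_of = defaultdict(set)   # prey -> baits that detected it
    for bait, prey in interactions:
        preys_of[bait].add(prey)
        baits_of[prey].add(bait)

    def ratio(num, den):
        if num == 0:              # assumption: no observations in this role gives 0
            return 0.0
        if den == 0:              # nothing to compare against, so use the maximum
            return 1.0
        return min(1.0, num / den)   # truncated to 1 if greater than 1

    scores = {}
    for x in set(preys_of) | set(baits_of):
        preys = preys_of.get(x, set())
        baits = baits_of.get(x, set())
        m = len(preys)                              # preys of X when X is a bait
        m1 = sum(1 for p in preys if p in preys_of) # of those, how many are also baits
        n = len(baits)                              # baits of X when X is a prey
        n1 = sum(1 for b in baits if b in baits_of) # of those, how many are also preys
        scores[x] = (ratio(m, n1), ratio(n, m1))
    return scores

def core_subnetwork(scores, threshold=0.5):
    """Proteins whose bait score and prey score both reach the threshold."""
    return {x for x, (b, p) in scores.items() if b >= threshold and p >= threshold}

# Toy data: X detects four preys that are all baits themselves,
# but only P1 detects X in the reverse direction.
toy = [("X", "P1"), ("X", "P2"), ("X", "P3"), ("X", "P4"),
       ("P1", "X"), ("P2", "Q"), ("P3", "Q"), ("P4", "Q")]
toy_scores = bait_prey_scores(toy)
print(toy_scores["X"], sorted(core_subnetwork(toy_scores)))
```

In the toy data, X has $m = 4$, $m_1 = 4$ and $n = 1$, so its prey score is $1/4$ and it falls below the 0.5 threshold used in the text, while Q, which never acts as a bait, comes out as a pure prey.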
In the DPPI dataset by Giot et al [20], a confidence score was assigned to each link in the measured network on the basis of experimental data.

Figure 5a [...]
When the original DPPI dataset was filtered into the high-confidence one [20], the protein number collapsed from 7048 to 4679 (66% of initial value) and the link number from 20405 to 4780 (23% of initial value). For the CSN generated with bait and prey scores ≥ 0.5 before filtering with confidence score, there were 1149 proteins with 1834 links, of which 130 links were bidirectional, and the average confidence score was 0.438. After the filtering, 702 (61%) proteins, 854 (47%) links, and 126 (97%) bidirectional links remained, and the average confidence score was 0.747. This exercise demonstrates that the links in the CSN have a much higher retention rate (47% vs. 23%) when filtered with confidence, in further agreement with the higher average confidence score of interactions in the CSN.

This conclusion is further substantiated if we regenerate the CSN (with the same bait and prey scores) after filtering the DPPI network to the high-confidence DPPI network on the basis of the experimental data: this new CSN has 937 proteins (602 are identical to those in the unfiltered CSN), 902 links (450 identical), 223 bidirectional links, and an average confidence score of 0.753, which is increased in comparison to when the filtering is done after the CSN is defined from the DPPI network. Interestingly, 84% (223/266) of the bidirectional links were retained when the CSN was defined after filtering the DPPI network to the high-confidence DPPI network, versus 47% (126/266) retention of bidirectional links when defined from the DPPI network prior to confidence score filtering. Thus, this CSN approach is an independent (and complementary) method to identify high-confidence links more likely to harbor accurate topological information. We also compared the motif distributions of the DPPI dataset and their CSNs (Fig. 5c).

Based on the analyses above, we hypothesize that the CSN within the experimentally sampled sub-network is a closer approximation of a random sample and thus retains the topological information of the original network better than the entire experimental sample. Theoretically, by filtering the experimental datasets using our method with higher bait score and prey score thresholds, one can obtain a better CSN. However, due to the limited number of proteins in the network, higher bait and prey scores result in fewer proteins in the CSN, which may cause the CSN to be too small to faithfully retain the topological information of the parent network.

What are the degree distributions of protein-protein interaction networks?

A number of studies have suggested that protein-protein interaction networks are scale-free [3,4,10-13,18], whereas other studies have contested this interpretation [19-22]. Han et al [9] showed that the scale-free nature may be due to the low sampling rate and imperfect sampling methods, which can cause a sub-network from an Erdös-Rényi random network to appear scale-free. For this to happen, a key feature is the loss of the peak in the binomial distribution of the random network. Since the peak is located at $[N\gamma] \sim [(N-1)\gamma] = [\langle k \rangle]$ (where $[x]$ is the integer part of $x$, $N$ is the size of the original network, $\gamma$ is the sampling rate, and $\langle k \rangle$ is the average connectivity of the sampled sub-network; see Additional file 1 text for details), the peak will disappear when $\langle k \rangle < 2$. However, the average connectivity $\langle k \rangle$ of most of the measured networks is greater than 2, even for some of the CSNs we examined (Additional file 1, Table S1), indicating that the protein-protein interaction networks may not be random networks.
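The loss of the binomial peak at low average connectivity can be checked numerically. This is a minimal sketch under an assumption that is not stated in the text: nodes with degree 0 are taken to be invisible in an interaction dataset, so only degrees $k \geq 1$ are inspected. The helper names `binomial_pmf` and `has_interior_peak`, and the value 5000 for the number of potential partners, are hypothetical.

```python
import math

def binomial_pmf(n, p, k):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def has_interior_peak(n, mean_k, k_max=50):
    """True if the degree distribution over k = 1..k_max rises before it falls."""
    p = mean_k / n
    pmf = [binomial_pmf(n, p, k) for k in range(1, k_max + 1)]
    return any(later > earlier for earlier, later in zip(pmf, pmf[1:]))

for mean_k in (1.5, 5.0):
    print("mean degree", mean_k, "-> interior peak:", has_interior_peak(5000, mean_k))
```

With a mean degree below 2 the distribution over observable degrees decreases monotonically and can be mistaken for a heavy-tailed curve on a log-log plot, whereas a mean degree of 5 leaves a clear peak that a random network would be expected to show.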
On the other hand, our analysis shows that if the protein-protein interaction networks are scale-free (that is, if they have a power-law distribution), the degree distributions of a random sample, an experimental sample and the CSN all closely resemble the same power-law distribution of the original network (see Fig. 3). [...] $k^{-\delta} e^{-\varepsilon k}$ (see Additional file 1, Fig. S4), and for the DPPI dataset (Fig. 6a), [...] $k^{-1.2} e^{-0.038k}$ as shown by Giot et al [20]. A CSN with both bait and prey scores greater than or equal to 0.5 has a degree distribution close to $p(k) \sim k^{-0.6} e^{-0.22k}$, which has a larger exponential component but smaller power-law component than the DPPI network. For the high-confidence dataset of the DPPI network (Fig. 6b), [...] $k^{-1.26} e^{-0.27k}$, while the CSN defined by both bait and prey scores greater than or equal to 0.5 has a degree distribution $p(k) \sim k^{-0.01} e^{-0.75k}$, which is almost completely exponential.

To show that this effect is not due solely to the reduction in network size, we also show the degree distributions of two random subsets of the experimentally sampled network: one where the protein number is the same as that of the CSN (called random sample 1) and the other in which the link number is the same as that of the CSN (called random sample 2), both of which have degree distributions that are very different from the CSN. In other datasets we analyzed, the degree distributions of CSNs all have a smaller power-law component and a larger exponential component as compared to the original datasets (Additional file 1, Fig. S4). However, we are not able to completely rule out that the reduction in network size contributes to the enhancement of the exponential component. The two randomly sampled networks in Fig. 6a [...]
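Degree distributions of the form $p(k) \sim k^{-\delta} e^{-\varepsilon k}$ can be fitted in several ways; the fitting procedure used for the values quoted above is not described in this text. As one possible approach, the sketch below assumes an ordinary least-squares fit of $\log p(k) = c - \delta \log k - \varepsilon k$ and recovers the exponents from synthetic, noise-free data. The names `solve_linear` and `fit_power_law_with_cutoff` are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_power_law_with_cutoff(ks, pks):
    """Least-squares fit of log p(k) = c - delta*log(k) - eps*k; returns (delta, eps)."""
    points = [(k, p) for k, p in zip(ks, pks) if k > 0 and p > 0]
    rows = [(1.0, -math.log(k), -float(k)) for k, _ in points]
    ys = [math.log(p) for _, p in points]
    # Normal equations (A^T A) x = A^T y for the three unknowns c, delta, eps.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    _, delta, eps = solve_linear(ata, aty)
    return delta, eps

# Synthetic check using the exponents quoted for the DPPI dataset.
ks = list(range(1, 60))
pks = [k ** -1.2 * math.exp(-0.038 * k) for k in ks]
delta, eps = fit_power_law_with_cutoff(ks, pks)
print(round(delta, 3), round(eps, 3))
```

On real, noisy degree counts a plain log-space fit gives heavy weight to the sparse tail, so binning or a maximum-likelihood fit may be preferable; the sketch only shows how the power-law and exponential components are separated.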
Discussion

The present study provides an improved method for extracting accurate topological information about real protein-protein interaction networks from experimentally-obtained sub-networks. The fundamental conclusions of this study can be summarized as follows: (i) random sampling of networks preserves topological information, regardless of the type of network analyzed; (ii) experimental protein-protein interaction studies have well-established limitations that make their method of sampling non-random; however, (iii) definition of a CSN that contains proteins that behave experimentally as both baits and preys better approximates a random sample and therefore increases the accuracy of topological assessment of protein-protein interaction networks.

We show that sampling theoretical protein interaction networks with exponential, random or scale-free topology in a manner that takes into account experimental limitations can (and indeed, usually does) produce a sample with scale-free topology. Samples of protein interaction networks do appear scale-free; from this, however, it cannot be concluded (as has been previously attempted) that the protein interaction networks themselves are scale-free. Based on our method of defining the CSN from the experimental datasets, we show that the degree distribution of the original network may not be scale-free, but may in fact exhibit an exponential distribution.

Protein interaction analyses have unavoidable limitations, including false positive and false negative identifications [30-33] and assumed binary interactions, as mentioned above.
We suspect that these false positives may contribute to the observed power-law component of the protein-protein interaction networks based on the following rationale: (i) the high-confidence Drosophila network (purportedly containing fewer false positives [32]) has a stronger exponential component (also verified by Przulj and colleagues [21]) and the CSN has an even higher confidence score and stronger exponential component (Fig. 5 [...]

Determining with high confidence topological information about protein-protein interaction networks from the properties of smaller, experimentally measured sub-networks has been challenging [35-37]. However, the topologies of the networks are extremely important for their function and robustness [1-4,38,39].

Conclusion

In this study, we have developed an improved method for extracting topological information for cellular protein-protein interaction networks from experimentally-obtained datasets. As structure, or network anatomy, is a necessary precursor to understanding function, or network physiology, these findings enhance our ability to use existing experimental methods for protein-protein interaction analysis to investigate the behavior of these networks in vivo.

Methods

Experimental datasets

Theoretical networks

Theoretical networks were generated following the method of Bender and Canfield [43]: we assigned a desired number of edges to each node following the theoretical distribution, then randomly linked a pair of nodes to make an edge and decreased the link number for both nodes by one, repeating until all edges were assigned to nodes without repetition. Random networks were generated according to the Erdös-Rényi model binomial degree distribution represented by:

Simulated experimental networks

To mimic the experimental sampling, we first generated the theoretical parent networks with N nodes by the method mentioned above.
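The stub-pairing procedure described under "Theoretical networks" (the method referred to above) can be sketched as follows. This is a hedged illustration rather than the authors' code: the name `stub_matching`, the retry loop, and the choice to discard stubs that cannot be placed without creating a self-loop or a repeated edge are assumptions, as are the network size and the degree cut-off of 400. The desired degrees are drawn from the exponential form $p(k) \sim e^{-0.025k}$ quoted in the Results.

```python
import math
import random

def stub_matching(degrees, rng, max_tries=100):
    """Pair degree 'stubs' at random, skipping self-loops and repeated edges.
    Returns a set of undirected edges (u, v) with u < v."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    edges = set()
    for _ in range(max_tries):
        rng.shuffle(stubs)
        leftover = []
        for u, v in zip(stubs[::2], stubs[1::2]):
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:
                leftover += [u, v]        # retry these stubs in the next round
            else:
                edges.add(edge)
        if len(stubs) % 2:                # an unpaired stub is carried over
            leftover.append(stubs[-1])
        stubs = leftover
        if len(stubs) < 2:
            break
    return edges

rng = random.Random(1)
n_nodes = 3000                            # illustrative size, smaller than 15,000
k_values = list(range(1, 401))
weights = [math.exp(-0.025 * k) for k in k_values]
desired = rng.choices(k_values, weights=weights, k=n_nodes)

edges = stub_matching(desired, rng)
realized = [0] * n_nodes
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
print(len(edges), sum(realized) / sum(desired))
```

A few stubs may remain unplaced, so realized degrees can fall slightly short of the desired ones; for sparse networks of this kind the shortfall is small.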
Then we randomly selected M (M < N) nodes from the N-node parent network, and randomly assigned the nodes in the M-node network to be pure baits, pure preys, or both baits and preys, with different probabilities independent of the number of links of the nodes. We then applied the following rules to the links of the selected nodes:

1) Any interaction that starts from a pure prey or ends at a pure bait is forbidden;
2) For the allowed interactions, each has a probability q (in the simulations in Fig. 3, [...]
3) A link A–B exists when at least one of the interactions A → B and B → A is detected.

Motif detection

We detected the motifs using the software mfinder1.2 developed by U. Alon's lab [44].

Authors' contributions

LY carried out the computer simulations and participated in research design and in drafting the manuscript. TMV participated in design and discussion of the research, and helped to draft the manuscript. ZH, WRM and JNW participated in discussion of the research. ZQ designed and directed the research, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This study was supported by grants from the NIH/NHLBI and by the Laubisch and Kawata Endowments at UCLA.

References