Copyright © 2007 by The National Academy of Sciences of the USA

Evolution

Reconstruction of ancestral protein interaction networks for the bZIP transcription factors

*Faculty of Life Sciences and §School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom; and ‡VIB/Ghent University, Technologiepark 927, B-9052 Ghent, Belgium

†To whom correspondence may be addressed. E-mail: john.pinney/at/manchester.ac.uk or david.robertson/at/manchester.ac.uk

Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA, and approved October 30, 2007

Author contributions: J.W.P., G.D.A., M.R., and D.L.R. designed research; J.W.P. performed research; J.W.P. analyzed data; and J.W.P. and D.L.R. wrote the paper.

Received July 6, 2007.

Abstract

As whole-genome protein–protein interaction datasets become available for a wide range of species, evolutionary biologists have the opportunity to address some of the unanswered questions surrounding the evolution of these complex systems. Protein interaction networks from divergent organisms may be compared to investigate how gene duplication, deletion, and rewiring processes have shaped the evolution of their contemporary structures. However, current approaches for comparing observed networks from multiple species lack the phylogenetic context necessary to reconstruct the evolutionary history of a network. Here we show how probabilistic modeling can provide a platform for the quantitative analysis of multiple protein interaction networks. We apply this technique to the reconstruction of ancestral networks for the bZIP family of transcription factors and find that excellent agreement is obtained with an alternative sequence-based method for the prediction of leucine zipper interactions.
Further analysis shows our probabilistic method to be significantly more robust to the presence of noise in the observed network data than a simple parsimony-based approach. In addition, the integration of evidence over multiple species means that the same method may be used to improve the quality of noisy interaction data for extant species. The ancestral states of a protein interaction network have been reconstructed here by using an explicit probabilistic model of network evolution. We anticipate that this model will form the basis of more general methods for probing the evolutionary history of biochemical networks.

Keywords: biological networks, computational biology, molecular evolution, probabilistic modeling

The complex relationship between an organism's genotype and phenotype is mediated by many interrelated biochemical networks. As our knowledge of these network structures improves, we can start to ask questions about the evolution of cellular systems as a whole, as opposed to studying individual genes and their products in isolation (1, 2). This article extends recent work on the reconstruction of ancestral protein sequences (3, 4) by focusing on interactions between ancestral proteins.

Greater understanding of the ancestral configurations of interaction networks would be of immense value in uncovering the processes involved in the evolution of cellular systems. Analogous to the inference of evolutionary history at the level of the DNA or amino acid sequence, evolutionary biologists would like to be able to infer ancestral protein interactions based only on their observations of networks from extant species. However, current methods of network alignment are generally lacking in any phylogenetic context (5–8). Hence, they have only limited value as quantitative tools for the study of evolution.
Here we report the development of a general methodology for the reconstruction of ancestral protein–protein interaction networks by inference over a probabilistic model of interaction network evolution. By applying these methods to the bZIP transcription factor interaction network in chordates, we are able to predict ancestral networks with much greater robustness to measurement error than would be possible by using a naive parsimony-based approach.

The bZIP transcription factors are a family of homo- and heterodimerizing proteins involved in the regulation of development, metabolism, circadian rhythm, and many other cellular processes (9). The characteristic bZIP domain consists of a basic region (contacting the DNA major groove) and a leucine zipper (LZ) that mediates dimerization specificity. Gene duplication has played a major role in the evolution of the bZIP subfamilies, which are known to have broadly conserved patterns of interactions with each other (10, 11).

The relative strengths of pairwise interactions between bZIP proteins have previously been measured experimentally for humans and the yeast Saccharomyces cerevisiae (12). In addition, the relatively simple biophysics of the coiled-coil LZ interaction means that pairs of proteins that form strongly interacting dimers can be predicted reliably from their LZ sequences alone by using computational methods (13): 93% sensitivity at 98% specificity based on a subset of human bZIP pairs with unambiguous experimental results. This combination of accurate genome-scale experimental data (12) and the capacity for highly accurate computer-based interaction prediction directly from amino acid sequences (13) makes the bZIP system useful as a model for investigating methods for ancestral network inference.
Inference of Ancestral Protein Interaction Networks

Our method starts with the assumption that it is possible both to construct a reliable phylogeny for the gene family of interest and to reconcile this phylogeny with the known species tree, such that all internal nodes are labeled as gene duplication or speciation events (14). Although it would be possible to incorporate phylogenetic uncertainty into a probabilistic model of network evolution, doing so would add greatly to the computational burden. Complete sets of bZIP protein sequences from four chordates [Ciona intestinalis (sea squirt), Takifugu rubripes (pufferfish), Danio rerio (zebrafish), and Homo sapiens (human)] were used to construct such a reconciled gene tree (Fig. 1
The two different types of arcs in Fig. 1

Using the sequence-based predictions for every possible interaction between a pair of proteins in each of our four extant species as input, we compute the probability of a strong interaction between each pair of proteins in each ancestral species (labeled “Teleost,” “Vertebrate,” and “Chordate”) according to the model. The tree-like structure of our probabilistic graphical model has the consequence that the inference of the ancestral network states is tractable by using belief-propagation techniques (16).

Of course the inference of ancestral states for traits not directly related to gene sequence is not a new problem, and it might reasonably be expected that a parsimony-based approach would yield comparable results without the complications of parameter estimation that are required in the case of the probabilistic method. As a comparison, we therefore reconstructed the ancestral networks by applying an algorithm for finding maximally parsimonious evolutionary histories (17) to our interaction tree (Fig. 1

With most protein–protein interaction datasets, it would be impossible to determine which method for ancestral network inference was the more successful because we do not have protein interaction data for the ancestral species. However, the ability to make reliable predictions (13) of interaction strength directly from pairs of LZ sequences permits the construction of a benchmark dataset for the bZIP family by using inferred probability distributions for the amino acid sequences at each ancestral node.

Results and Discussion

Fig. 2
For comparison with these networks, results from the benchmark sequence inference method are shown in SI Fig. 7 and from the alternative, parsimony-based method in SI Fig. 8. To compare the performance of the probabilistic and parsimony-based methods fairly, receiver-operator characteristic (ROC) curves were plotted for the Chordate results (Fig. 3
Clearly the parsimony-based approach performs well for this system, and this finding is attributable, in part, to the high quality of the input data provided by the LZ interaction prediction software. However, experimental protein–protein interaction datasets usually have many false-positive and false-negative observations (1). We can simulate this situation by adding Gaussian noise with different variances to the interaction scores from which the input data are derived. The results, summarized by the ROC curves for Chordate in Fig. 3

Given these extremely high experimental error rates, there is currently a great deal of interest in methods for increasing the accuracy of protein interaction datasets (19, 20). In addition to predicting interactions for ancestral species, our probabilistic inference method offers a principled way to combine multiple interaction datasets to improve interaction predictions in extant species.

Fig. 4
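The noise experiment described above can be illustrated with a small sketch. It is not the authors' code: the toy scores, labels, noise levels, and function names are all hypothetical, and for brevity it evaluates the noisy scores directly against the labels rather than passing them through an ancestral reconstruction first. It only shows the two ingredients named in the text, Gaussian noise added to interaction scores and an ROC curve computed by sweeping a threshold.

```python
import random

def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit pairs from the highest score down; each step admits one more prediction.
    for _, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def add_gaussian_noise(scores, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each score."""
    return [s + rng.gauss(0.0, sigma) for s in scores]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical toy data: 20 true interactions with high scores, 80 non-interactions.
labels = [1] * 20 + [0] * 80
clean = [40.0] * 20 + [10.0] * 80
for sigma in (0.0, 10.0, 30.0):  # assumed noise levels, not those of the paper
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(f"sigma={sigma}: AUC={auc(roc_points(noisy, labels)):.3f}")
```

With no noise the two classes are perfectly separated; as the standard deviation grows, the classes overlap and the area under the curve tends to fall, which is the kind of degradation the ROC comparison is designed to expose.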
The availability of a comprehensive experimental dataset (12) for the bZIP transcription factor system has enabled us to calculate parameters for modeling the network rewiring process as a function of evolutionary distance, which is an approach that may prove applicable to more general types of protein interaction. Our probabilistic method for the inference of ancestral networks is significantly more robust to experimental noise than a naive maximum parsimony approach. In addition, the same inference process could be used to improve the quality of network datasets for extant organisms by combining evidence across multiple species and/or experiments.

The probabilistic model presented in this study represents an important step forward in the evolutionary analysis of biochemical interaction networks. We currently have a detailed understanding of the important contribution of both small-scale- and whole-genome-mediated gene duplication to evolution both generically (21, 22) and specifically to transcription factor networks (11, 23). However, gene duplication only contributes by providing the raw material for innovation. Functional evolution is a consequence of changes in specificity between proteins, resulting in both the gain and/or loss of interactions. Our approach permits the inference of the evolutionary history of these rewiring events. In conclusion, detailed knowledge of the ancestral states of protein interaction networks will bring insights into the functional evolution of the interactome and the nature of conservation and change in divergent evolutionary lineages.

Methods

The identification of protein sequences for bZIP transcription factors and LZ regions from H. sapiens, D. rerio, T. rubripes, and C. intestinalis was described in our previous study (24). An interaction score was calculated for each potential bZIP interaction within each species by using the software of Fong and Singh (13) with base-optimized weights.
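As a rough illustration of what scoring "each potential bZIP interaction within each species" involves, the sketch below enumerates every unordered pair of proteins in one species and applies a scoring function to their LZ sequences. The scoring function here is a deliberately trivial placeholder, not the Fong and Singh software (13); the protein names and sequences are hypothetical, and the inclusion of homodimer pairs is an assumption based on the family's homo- and heterodimerizing behavior.

```python
from itertools import combinations_with_replacement

def score_all_pairs(lz_by_protein, score_fn):
    """Score every unordered protein pair within one species, homodimers included."""
    names = sorted(lz_by_protein)
    return {
        (a, b): score_fn(lz_by_protein[a], lz_by_protein[b])
        for a, b in combinations_with_replacement(names, 2)
    }

def toy_score(seq_a, seq_b):
    """Placeholder score: number of aligned positions with identical residues.

    This stands in for a real LZ interaction predictor and has no biophysical meaning.
    """
    return float(sum(x == y for x, y in zip(seq_a, seq_b)))

# Hypothetical proteins and LZ fragments for a single species.
species = {"P1": "LEEKV", "P2": "LKEKV", "P3": "AEEKL"}
scores = score_all_pairs(species, toy_score)
for pair, value in scores.items():
    print(pair, value)
```

For n proteins this produces n(n+1)/2 scores per species, so the table grows quadratically with family size.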
Because of atypical features, interactions involving smMaf, lgMaf, and CNC proteins could not be predicted reliably (13), so these families were excluded from the analysis. LZ regions from all species were aligned by using MUSCLE (25). A consensus maximum likelihood (ML) phylogeny was built by using PROML (26) with the JTT model of amino acid replacement (27). Branch lengths for this consensus tree were calculated by using PAML (28). This tree was reconciled with the species phylogeny by using NOTUNG2 (14) with default settings.

A probabilistic graphical model (Fig. 1

An implementation of the PARS algorithm (17) was applied to the interaction tree to infer the presence of ancestral interactions by maximum parsimony based on the binary interaction evidence for extant species as described earlier. The ratio of penalties (loss of interaction:gain of interaction) used was 1:11.4, corresponding to the relative probabilities in the plateau regions of SI Fig. 5.

A benchmark set of interactions was constructed by using ancestral sequence inference, against which the interactions inferred by the probabilistic and parsimony-based methods could be compared. Probability distributions for the amino acid sequences of every ancestral bZIP protein in the gene tree were inferred by using PAML (28). Taking 1,000 random samples from these distributions, the Fong/Singh software (13) was then used to predict a mean interaction score for every bZIP pair in each ancestral species. Each pairwise score was then converted to a binary prediction of interaction by using a threshold of 30.6. ROC curves were plotted to evaluate performance of the probabilistic and parsimony inference methods against these predictions for the ancestral Chordate (Fig. 3

ACKNOWLEDGMENTS. We thank Mona Singh and Jessica Fong for kindly providing us with their software, two anonymous referees for constructive comments, and Simon Lovell for additional helpful suggestions.
This work was supported by Biotechnology and Biological Sciences Research Council Grant BB/C515412/1 (to D.L.R.).

Footnotes

The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0706339104/DC1.

References

1. Stumpf MP, Kelly WP, Thorne T, Wiuf C. Trends Ecol Evol. 2007;22:366–373.
2. Sharan R, Ideker T. Nat Biotechnol. 2006;24:427–433.
3. Thornton JW. Nat Rev Genet. 2004;5:366–375.
4. Thornton JW, Need E, Crews D. Science. 2003;301:1714–1717.
5. Berg J, Lassig M. Proc Natl Acad Sci USA. 2006;103:10967–10972.
6. Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S. Genome Res. 2006;16:1169–1181.
7. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T. Nucleic Acids Res. 2004;32:W83–W88.
8. Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A. J Comput Biol. 2006;13:182–199.
9. Hurst HC. Protein Profile. 1995;2:101–168.
10. Deppmann CD, Alvania RS, Taparowsky EJ. Mol Biol Evol. 2006;23:1480–1492.
11. Amoutzias GD, Veron AS, Weiner J, III, Robinson-Rechavi M, Bornberg-Bauer E, Oliver SG, Robertson DL. Mol Biol Evol. 2007;24:827–835.
12. Newman JR, Keating AE. Science. 2003;300:2097–2101.
13. Fong JH, Keating AE, Singh M. Genome Biol. 2004;5:R11.
14. Durand D, Halldorsson BV, Vernot B. J Comput Biol. 2006;13:320–335.
15. Berg J, Lassig M, Wagner A. BMC Evol Biol. 2004;4:51.
16. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann; 1988.
17. Mirkin BG, Fenner TI, Galperin MY, Koonin EV. BMC Evol Biol. 2003;3:2.
18. Hughes AL, Friedman R. Evol Dev. 2005;7:196–200.
19. D'Haeseleer P, Church GM. Proc IEEE Comput Syst Bioinform Conf. 2004:216–223.
20. Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T. BMC Bioinformatics. 2006;7:360.
21. Guan Y, Dunham MJ, Troyanskaya OG. Genetics. 2007;175:933–943.
22. Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL. Genome Biol. 2007;8:R209.
23. Amoutzias GD, Robertson DL, Oliver SG, Bornberg-Bauer E. EMBO Rep. 2004;5:274–279.
24. Amoutzias GD, Bornberg-Bauer E, Oliver SG, Robertson DL. BMC Genomics. 2006;7:107.
25. Edgar RC. Nucleic Acids Res. 2004;32:1792–1797.
26. Felsenstein J. Cladistics. 1989;5:164–166.
27. Jones DT, Taylor WR, Thornton JM. Comput Appl Biosci. 1992;8:275–282.
28. Yang Z. Comput Appl Biosci. 1997;13:555–556.
29. Murphy KP. Comput Sci Statist. 2001;33:331–350.
30. Cavallini F. College Math J. 1993;24:247–253.
31. Holden BJ, Pinney JW, Lovell SC, Amoutzias GD, Robertson DL. BMC Bioinformatics. 2007;8:289.