Copyright © 2007 Evlampiev and Isambert; licensee BioMed Central Ltd. Modeling protein network evolution under genome duplication and domain shuffling 1RNA dynamics and Biomolecular Systems Lab, CNRS UMR168, Institut Curie, Section de Recherche, 11 rue P. & M. Curie, 75005 Paris, France Corresponding author.Kirill Evlampiev: kirill.evlampiev/at/curie.fr; Hervé Isambert: herve.isambert/at/curie.fr Received March 23, 2007; Accepted November 13, 2007. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background Successive whole genome duplications have recently been firmly established in all major eukaryote kingdoms. Such exponential evolutionary processes must have largely contributed to shape the topology of protein-protein interaction (PPI) networks by outweighing, in particular, all time-linear network growths modeled so far. Results We propose and solve a mathematical model of PPI network evolution under successive genome duplications. This demonstrates, from first principles, that evolutionary conservation and scale-free topology are intrinsically linked properties of PPI networks and emerge from i) prevailing exponential network dynamics under duplication and ii) asymmetric divergence of gene duplicates. While required, we argue that this asymmetric divergence arises, in fact, spontaneously at the level of protein-binding sites. This supports a refined model of PPI network evolution in terms of protein domains under exponential and asymmetric duplication/divergence dynamics, with multidomain proteins underlying the combinatorial formation of protein complexes. 
Genome duplication then provides a powerful source of PPI network innovation by promoting local rearrangements of multidomain proteins on a genome wide scale. Yet, we show that the overall conservation and topology of PPI networks are robust to extensive domain shuffling of multidomain proteins as well as to finer details of protein interaction and evolution. Finally, large scale features of direct and indirect PPI networks of S. cerevisiae are well reproduced numerically with only two adjusted parameters of clear biological significance (i.e. network effective growth rate and average number of protein-binding domains per protein). Conclusion This study demonstrates the statistical consequences of genome duplication and domain shuffling on the conservation and topology of PPI networks over a broad evolutionary scale across eukaryote kingdoms. In particular, scale-free topologies of PPI networks, which are found to be robust to extensive shuffling of protein domains, appear to be a simple consequence of the conservation of protein-binding domains under asymmetric duplication/divergence dynamics in the course of evolution. Background Gene duplication is considered the main evolutionary source of new protein functions [1]. Although long suspected [2,3], whole genome duplications have only been recently confirmed [4-12] through large scale comparisons of complete genomes. Whole genome duplications are rare evolutionary transitions followed by random nonfunctionalization of many gene duplicates, resulting in characteristic reciprocal gene loss patterns [4,9,13], on time scales of about 100 MY (with large variations between genes, see Discussion). 
Whole genome duplications presumably provide unique opportunities to evolve many new functional genes at once through accretion of functional domains [14-20] from contiguous pseudogenes (or redundant genes) and may also promote speciation events by preventing genetic recombinations between close descendants with different reciprocal gene loss patterns [13,21]. Consecutive whole genome duplications (WGDs) have now been firmly established in all major eukaryote kingdoms within the last 300–500 MY, i.e. about 10–15% of life history. WGDs have been more frequent in plants [22] due to their widespread polyploidy; for instance, there were 3 consecutive WGDs in the recent evolution of the flowering plants Arabidopsis thaliana [7] and Populus trichocarpa [23] while 4 WGDs can be identified in Solanum (potato), Gossypium (cotton) and Brassica genomes [22]. Overall, there were between 2 and 4 WGDs in plants in the last 300 MY and many extant species like Solanum (potato), Glycine (soybean) or Saccharum (sugarcane) have undergone a recent WGD and are still essentially pseudotetraploid plants with about twice as many gene loci as their close relatives lacking this recent WGD. They are living examples of the dramatic simultaneous changes a single WGD event produces on a genome. No other genome rearrangement is known to have a comparable immediate impact on the evolution of genomes (with the exception of endosymbiotic events). Successive genome duplications have also occurred in animal genomes, even though most extant species are diploids. In vertebrates (chordates), there are, for instance, 4 consecutive WGDs between the seasquirt Ciona intestinalis and the common carp, Cyprinus carpio, with most tetrapods (including mammals) in between at +2WGDs from seasquirt and -2WGDs from carp and most bony fish at +3WGDs from seasquirt and -1WGDs from carp [11,12,24,25]. 
In fact, the common carp, Cyprinus carpio, and other bony fish from the Salmonidae family (salmon, trout) as well as the amphibian Xenopus laevis and even the mammal Tympanoctomys barrerae (red vizcacha rat from Argentina [26]) are all pseudotetraploid vertebrates. [Constitutive tetraploidy is even occasionally observed in humans where it is responsible for 1 to 2% of early miscarriages but may lead, in rare cases, to liveborn infants reaching the age of two [27].] Amongst invertebrates, examples of polyploid species are also suspected or confirmed in most phyla, as in annelids (e.g., leeches [28]), flatworms (e.g., Stenostomum [29]), mollusks (e.g., Pacific oyster, Crassostrea gigas [30]) and in the major classes of arthropods, including insects (e.g., Nabis pallidus [31]), maxillopods (e.g., copepods [29]) and branchiopods (e.g., brine shrimp [32]). Finally, WGDs have also occurred in protists; in particular, there were at least 3 consecutive WGDs in the ciliate Paramecium tetraurelia [33]. Other WGDs will likely be uncovered as more eukaryote sequences become available. Extrapolating from these 2 to 4 consecutive WGDs in the last 300–500 MY for typical eukaryote genomes, one roughly expects a few tens of consecutive WGDs (or equivalent "doubling events") since the emergence of eukaryotes, if not the origin of life itself. [While WGDs do not seem readily traceable in extant prokaryote genomes, they cannot be ruled out either over long evolutionary time scales (e.g. > 500 MY). In fact, wildtype subpopulations of bacteria with stable diploid genomes are known to exist [34]. In addition, viable whole genome recombinants between different prokaryotes have also been successfully engineered [35].] These rare but dramatic evolutionary transitions due to whole genome duplications must have had major consequences on the long time scale evolution of large biological networks, such as protein-protein interaction (PPI) networks. 
In this paper, we first discuss some experimental evidence (Fig. 1)
Results

Effect of WGD on PPI network evolution

Direct experimental evidence for the effect of WGD on PPI network evolution is illustrated in Fig. 1. From a more theoretical point of view and on longer evolutionary time scales (e.g. > 500 MY), we also expect that alternating WGDs and extensive gene deletions lead to exponential dynamics of PPI network evolution. In the long time limit, this should outweigh all time-linear dynamics that have been assumed in PPI network evolution models so far [36,41-45] (see, however, Discussion). In fact, the prevailing exponential dynamics of genome evolution is already clear from the wide distribution of genome sizes [1,3] and proliferation of repetitive elements [46]: it is hard to imagine that the $10^4$-fold span in lengths of eukaryote genomes could have solely arisen through time-linear increases (and decreases) in genome sizes. [There is even a $10^5$-fold span in genome lengths when including prokaryotes and $10^8$-fold including viruses.]

Overview of the model

We propose a simple model of PPI network evolution focussing on the effect of whole genome duplication (extensions to local or partial genome duplication are presented in ref [47] and confirm the conclusions of this paper, see also Discussion). In the present model, each time step n corresponds to a whole genome duplication and leads to a complete duplication of the PPI network, whereby each node is duplicated (×2) and each interaction quadrupled (×4) as depicted in Fig. 2, where γo, γ and γn denote the probabilities to preserve the duplication-derived interactions between "old" and "new" duplicated nodes (see Method). "Old" and especially "new" duplicates that lose all their interactions with previous partners are then eliminated from the PPI network, while the "old" and "new" labels of selected duplicates are eventually all reset (to "old") before the next WGD iteration. 
Hence, "old" and "new" labels are only transient notations reflecting the asymmetric divergence of duplicated pairs after each WGD event (see Method).The PPI network evolution resulting from these successive WGDs is first solved analytically in the asymptotic limit of large PPI networks and then numerically for comparison with the available data on the yeast PPI network. Finally, an extension of this model is proposed to include the role of protein domains and their extensive shuffling between multidomain proteins over long evolutionary time scales. Modelling PPI network evolution under WGD The interaction network is characterized at each WGD step n by its number of nodes with k neighbors ![]() for k ≥ 0 and L(n) . In addition, because evolutionary changes in the averages ![]() are coupled to one another for all node degrees k ≥ 0, it is convenient to model the evolution of these averages ![]() by introducing a linear transform of ![]() in the form of a "generating function",which includes all nodes of the network according to their connectivity k ≥ 0. Permanently disconnected nodes (k = 0) need, however, to be removed from the list of relevant nodes, as they correspond to proteins that have in fact lost all previous interactions and presumably their function, and are eventually eliminated from the genome. To this end, we redefine the graph size as, The use of generating functions is a standard method [49] that enables to characterize distributions ![]() and pk from their successive moments, e.g. Asymmetric divergence of duplicated proteins In the following, we consider a general model of PPI network evolution under WGD which allows for asymmetric divergence of duplicated proteins, Fig. 
2. Actually, asymmetric divergence between duplicated genes is well supported by the reciprocal gene loss patterns arising after WGD [4,6,9]; this demonstrates that many, if not most, of the initially duplicated genes are eventually retained as single genes in the duplicated genome, reflecting clearly the asymmetric fate of duplicated genes after WGD (see, however, Discussion). Indeed, while duplicated genes are initially equivalent and experience, at first, the same functional constraints [51], their divergence eventually becomes asymmetric [52-54]. This occurs as one duplicate is more constrained to retain "old" interactions, while the other duplicate is less constrained and thus accumulates more mutations, with the likely outcome of becoming nonfunctional by losing all its duplication-derived interactions, unless some of them are eventually retained by selection. Note that the only interaction changes considered in this model are deletions of duplication-derived interactions (e.g. interactions arising from horizontal gene transfer are more characteristic of prokaryote evolution [55] and neglected here [45]). As outlined in the model overview above, divergence asymmetry is introduced by assigning different evolutionary parameters γo and γn to "old" and "new" duplicated nodes, corresponding respectively to a larger and a lower chance to conserve instances of their parent-node interactions, Fig. 2. We have solved this mathematical model of PPI network evolution under WGD illustrated in Fig. 2.

• Non-conserved, exponential regime The case Γo < 1 (and Γn < 1) implies an exponentially decreasing degree distribution, $p_k \propto \exp(-\mu k)$ for large $k \gg 1$, corresponding to a regular, infinitely differentiable generating function, p(x). From an evolutionary perspective, we find that this exponential topology arises while the links emerging from each node (Fig. 2)

• Conserved, scale-free regime The case Γo > 1 > Γn implies a "scale-free" topology with a power law decrease of the node degree distribution, $p_k \propto k^{-\alpha-1}$ for large $k \gg 1$. This corresponds to a singular generating function, p(x), which is not infinitely differentiable, with the following asymptotic expansion in the vicinity of x = 1,
where r ≥ 1 is an integer and α > 1 the solution of the following characteristic equation (with r ≤ α < r + 1).

In summary, whole genome duplication with asymmetric divergence of duplicated proteins leads to the emergence of two main classes of PPI networks: i) PPI networks with an exponential degree distribution and with neither protein nor topology evolutionary conservation and ii) PPI networks with a scale-free limit degree distribution and protein conservation together with at least some local topology conservation. All other evolution scenarios are unlikely to model biologically relevant cases; they correspond either to an exponential disappearance of the whole PPI network (i.e. if Γn + Γo < 1) or to an exponential shift of all proteins towards higher and higher connectivities, i.e. dense regime in Fig. 3A.

Fitting PPI network data with a one-parameter model

Scale-free degree distributions have been widely reported for large biological networks and other exponentially growing networks like the WWW. We showed in the previous discussion that scale-free limit degree distributions require an asymmetric divergence of duplicated proteins (Γo - Γn = γo - γn > 0) which corresponds to the probability difference between conservation of old interactions (γo) and coevolution of new binding sites (γn). The expected range of parameters for actual biological networks is 1 ≥ γo ≥ γ ≥ γn ≥ 0. In particular, the most conservative (γo = 1) and least correlated (γn = 0) evolution scenario corresponds to the strongest divergence asymmetry between duplicated proteins (Γo - Γn = 1, upper border in Fig. 3A). Adding and removing up to 30% of links randomly, or drawing γ from a uniform distribution between 0 and 0.52 (with average

Direct vs indirect protein-protein interactions

The protein-protein interactions we have considered so far correspond to direct physical contact between protein pairs derived, for instance, from two-hybrid expression assays [56]. 
However, we expect from the proposed scale-free fit of the degree distribution (Fig. 3B) In fact, many biological functions are known to rely on multiple direct and indirect interactions within protein complexes. Moreover, the combinatorial complexity of multiple-protein interactions is likely responsible for the remarkable diversity amongst living organisms [72], despite their rather limited and largely shared genetic background (i.e. a few (tens of) thousands of genes built from a few hundred families of homologous protein domains [18-20,73,74]). High-throughput studies using affinity precipitation methods coupled to mass spectrometry [75-77] have proposed some 80,000 direct and indirect protein interactions for S. cerevisiae (raw data) and similar data are now becoming available for several other species. Yet, from a theoretical point of view, the evolution of indirect interactions is expected to depend not only on locally conserved network topology but also on the actual "combinatorial logic" between direct interactions [78,79]. This cannot be readily defined on the traditional PPI network representation (e.g. Fig. 2)

Redefining PPI network evolution in terms of protein domains

Indirect protein interactions reflect the occurrence of simultaneous direct interactions within protein complexes. This requires that some proteins have more than one binding site to simultaneously interact with several protein partners. Indeed, proteins with a single protein-binding site can only bind to one partner at a time, underlying a simple "XOR"-like combinatorial logic. By contrast, proteins with several protein-binding sites greatly increase the combinatorial complexity of biological processes (like gene regulation or cell signaling) by adding "AND" operators to the computational logic between multiple direct interactions. 
In addition, we note that binding sites are likely the primary source of asymmetric divergence in PPI network evolution, as mutations on a shared binding site will generally affect the interactions with all its binding partners (Fig. 2). In the following, we propose to highlight this central role of protein domains in the evolution of PPI networks by simply redefining our initial asymmetric divergence model (Fig. 2)

Combining whole genome duplication and extensive domain shuffling

As noted in the introduction, whole genome duplication is thought to promote efficient shuffling of multi-domain proteins by enabling many accretion and deletion events of functional domains after each genome doubling. In fact, we will assume in the following that the overall shuffling of multi-domain proteins is so efficient that protein domains encoded along the genome are effectively randomly shuffled over long evolutionary time scales (e.g. > 500 MY to 1 GY, as suggested by the different multi-domain combinations typically observed across distant living kingdoms [19]). Indeed, our aim, here, is not to model the fine details of domain shuffling events on short evolutionary time scales, but instead to check the robustness of PPI network scale-free topology against the extensive shuffling of protein domains that effectively occurs over long evolutionary time scales. Assuming a random shuffling of individual protein domains implies that their evolutionary dynamics is ultimately averaged over a long series of single- and multi-domain proteins. Hence, the integrated connectivity of individual protein domains can be assumed to have evolved independently from their current position inside a specific single- or multi-domain protein. Besides, a more elaborate model of protein evolution detailing domain accretion and deletion events leads to virtually identical asymptotic results (not shown). 
Assuming a random shuffling of independent protein domains over long evolutionary time scales is also a more stringent condition with regard to the robustness of PPI network topology against domain shuffling events. The overall topology of PPI networks is expected to be a fortiori less affected by actual domain shuffling events. Finally, the assumption of random shuffling of independent protein domains is simple enough to be amenable to an exact mathematical extension of the initial model neglecting multidomain protein structures. Indeed, in the asymptotic limit, the generating function for the connectivity distribution of the global multidomain protein network, Hence, summing over all possible multidomain proteins finally yields for the overall generating function Although non-protein-binding domains are omitted here for simplicity, they can readily be taken into account by including a fraction of disconnected, non-protein-binding domains in p(x). Eq. (5) implies, in particular, an exponential distribution of multi-domain proteins, in agreement with actual distributions [80,81], with an average of 1/(1 - λ) protein-binding sites per protein. While p(x) now reflects the independent evolution of single protein-binding domains, Eq. (5) shows that it also controls the asymptotic properties of the derived multi-domain networks which implies that degree distributions of multi-domain protein networks Hence, domain shuffling of multi-domain proteins provides a powerful, yet non-disruptive source of combinatorial innovation, as it preserves essential topological features inherited from the underlying protein-domain interaction network evolution. Finally, comparison with experimental data sets including indirect protein-protein interactions [75-77] is made by adopting a statistical implementation of the "combinatorial logic" discussed above (see Supporting Information). 
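The multidomain generating function referred to above as Eq. (5) is not legible in this copy of the text. The following is only a sketch of one form consistent with the stated properties (an exponential distribution of the number of protein-binding domains per protein, with average 1/(1 - λ)), assuming that a protein with m independently shuffled domains has connectivity generating function $p(x)^m$. The symbol $q(x)$ and the weights $w_m$ are notations introduced here for illustration and are not taken from the original.

```latex
% Assumed geometric weights for a protein with m >= 1 binding domains:
%   w_m = (1 - \lambda)\,\lambda^{m-1}, \qquad \sum_{m \ge 1} w_m = 1 .
\begin{align}
  q(x) &= \sum_{m \ge 1} (1-\lambda)\,\lambda^{m-1}\, p(x)^m
        = \frac{(1-\lambda)\, p(x)}{1 - \lambda\, p(x)}, \\
  \langle m \rangle &= \sum_{m \ge 1} m\,(1-\lambda)\,\lambda^{m-1}
        = \frac{1}{1-\lambda}, \\
  q'(1) &= \frac{(1-\lambda)\, p'(1)}{\bigl(1 - \lambda\, p(1)\bigr)^2}
        = \frac{p'(1)}{1-\lambda}
        \qquad \text{since } p(1) = 1 .
\end{align}
% Near x = 1, q(x) is a smooth function of p(x), so a singular term
% A_\alpha (1-x)^\alpha in p(x) reappears in q(x) with the same exponent,
% rescaled by dq/dp at p = 1, i.e. by 1/(1-\lambda).
```

Under these assumptions, the mean protein degree is the mean domain degree multiplied by the mean number of binding domains, and the exponent α of the singular term is unchanged, which is in line with the robustness to domain shuffling argued in the text.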
It is based on a Dijkstra algorithm that estimates the relative importance of all possible indirect interactions between multi-domain (and single-domain) proteins for each PPI network realization. Figs. Figs.4B4B Discussion In this paper, we establish the statistical consequences of successive whole genome duplications and divergence asymmetry between gene duplicates on both i) evolutionary conservation and ii) emerging topological properties of PPI networks. The evolutionary dynamics of non-conserved networks implies that all evolutionary traces are erased exponentially fast from the network and its underlying genome over typical WGD time scales (e.g. 100 MY). Hence, evolutionary conserved networks are presumably the only biologically relevant PPI networks that may arise through whole genome duplications. We have also demonstrated that they necessarily present a scale-free topology that is robust to extensive domain shuffling of their multiple domain proteins. Other evolutionary processes than WGD and domain shuffling have not been included in the main text above, for simplicity. Yet, additional PPI network features can also be taken into account. We have investigated, in particular, the roles of 3 additional well-documented features of PPI network evolution, which we discuss below. They are i) protein homo-oligomerization, ii) protein domains with multiple binding sites and, finally, iii) other duplication-divergence events at smaller genomic scale than entire genome (i.e. from single gene to partial genome duplication). Yet, we have found that none of these additional PPI network features significantly affect the general conclusions of the present study. • i) Protein homo-oligomerization The possibility of protein homo-oligomerization can be explicitly taken into account by introducing 2 types of nodes corresponding respectively to i) self-interacting proteins with self-link loops and ii) non-self-interacting proteins without self-link loops, see Fig. 
S2 and Supporting Information. Available data on PPI networks reveal that about 10 to 15% of interacting proteins are self-interacting [38,39]. Empirical evidence has also been reported on the higher overall connectivity and interconnectivity of homodimer proteins in PPI networks [82]. In principle, the detailed evolution of PPI network conservation and topology is affected by self-link loops which provide a source of duplication-derived de novo interactions between "old" and "new" copies of duplicated self-interacting proteins, Fig. S2. However, the general conservation and topological properties of PPI networks turn out to be little affected by the presence of self-link loops, in the asymptotic limits of large PPI networks and large node degrees (see Supporting Information for detailed proof). In a nutshell, this is because conservation and topology of PPI networks are controlled by the exponential increase of their node degrees while the contribution of de novo interactions arising from duplicated self-interacting proteins can at most lead to a linear increase of node degrees, with a maximum increment of +1 link per duplication event and protein. Thus, although an abundance of self-interacting proteins would significantly affect the evolution of low connectivity proteins, it could not lead to a change of topological regimes for the highly connected nodes of the PPI networks (e.g. from exponential to scale-free node degree distribution or vice versa). Hence, to a first approximation, self-interacting proteins can be simply ignored to establish the asymptotic conservation and topology regimes of PPI network evolution, as we have done in the main text and Fig. 3A.

• ii) Protein domains with multiple binding sites The possibility of having protein interfaces involving more than two proteins at a time (e.g., the hetero-trimeric fibrinogen) is not currently included in the model. 
Actually, the average number of binding sites per protein-binding domain is around 1.3, with about 80% of protein-binding domains having a single binding site [83] (except for self-interacting domains forming homo-oligomeric self-assemblies, which require, as expected, at least 2 binding sites, see table 2 in [83]). Yet, in principle, the evolution of protein-binding domains with multiple binding sites can be taken effectively into account, at least numerically, by introducing a strong physical correlation between successive single-binding-site "domains". However, we want to stress that our main results regarding protein-binding domains neither concern nor rely on the detailed evolutionary correlation of binding sites and domain shuffling mechanisms. Indeed, by assuming only single-binding-site domains, we have demonstrated that even the most extensive shuffling of binding site/domain orders, implying the loss of all correlation along the primary sequence, does not qualitatively affect the general conservation and topological properties of emerging PPI networks under whole genome duplications. Hence, it is quite clear and confirmed by simulations (not shown) that introducing physical correlation between successive binding sites/domains has a fortiori even less effect on the general evolutionary regimes we have predicted above.

• iii) Duplication-divergence events at smaller genomic scales Finally, beyond whole genome duplication, duplication-divergence events are also known to occur at smaller genomic scales from single gene to partial genome duplication. Moreover, local duplications/deletions may also lead to exponential dynamics of PPI network evolution if they are selected independently in parallel. 
A general model for PPI network evolution under duplication-divergence processes at any genomic scale (from single gene to whole genome) and allowing also for variations in all evolutionary parameters Interestingly, recent evolutionary records (< 500 MY) for specific eukaryotes from various kingdoms, e.g. [23,33], suggest that whole genome duplications have been a significant factor in the overall expansion of ancestral genomes [23,33], while local duplications have been mainly responsible for the expansion of specific gene families. It will be interesting to see whether this is a general trend or not as new complete eukaryote sequences will become available. This difference in typical selection pattern of gene duplicates from either whole genome or local duplications may possibly reflect their opposite dosage effects on cellular activity and ultimately correspond to two evolutionary paradigms reminiscent to Monod's "chance and necessity" principles [84]. Indeed, random local duplications of essential genes are thought to be generally detrimental by the dosage imbalance they initially induce, thereby raising the odds for their rapid nonfunctionalization [85-87], unless they specifically happen to be beneficial under concomitant environmental changes [51]. Hence, the typical fate of random local duplications might be primarily driven by immediate "necessity" rather than "chance" and eventually lead to the expansion of specific gene families through series of beneficial local duplications. By contrast, rapid nonfunctionalization of duplicates following a whole genome duplication should be typically opposed by dosage effect, in particular, for highly expressed genes and for genes involved in multiprotein complexes or metabolic pathways [33]. This is because whole genome duplications initially preserve correct relative dosage between expressed genes, while subsequent random nonfunctionalizations disrupt this initial dosage balance. 
Preventing rapid asymmetric divergence between duplicates from recent whole genome duplications appears, in the end, to increase their chance of neo- or subfunctionalization by favoring longer genetic drift rather than early functional loss. Hence, by contrast with local duplications, the typical fate of gene duplicates under whole genome duplication might be largely driven by (long term) chance rather than (immediate) necessity. It is also reflected in the random pattern of reciprocal gene loss associated with multiple speciation events that typically follow a whole genome duplication [13,21]. This prevalence of chance over necessity following whole genome duplications further supports the stochastic and statistical framework we have adopted here to model the evolution of PPI networks under whole genome duplication. Conclusion In this paper, we argue that, large scale topological features of PPI networks emerge spontaneously in the course of evolution under simple duplication/deletion events [45], regardless of the specific evolutionary advantages individual proteins might have been selected for. While other selection drives than mere protein domain conservation might have also played a role, they do not appear to have been necessary nor prevailing factors to shape the large scale topology of PPI networks. For instance, the repulsion of protein hubs into largely independent network modules (i.e. the so-called network "disassortativity" property [42,50]) is predicted here (Figs. (Figs.3C3C From a more general perspective, the context of accelerating genome sequencing projects calls for a broader and inevitably more statistical understanding of biological network evolution, beyond the accumulation of details for particular evolutionary transitions of specific species. The analysis of PPI networks over broad evolutionary scales can only be based on a few well-established evolutionary mechanisms shared across a wide variety of organisms. 
As novel whole genome duplications are now routinely discovered in newly sequenced eukaryote genomes, e.g. [23,33], it is clear that these rare but dramatic simultaneous changes in genome content must have had a major impact on the long time scale evolution of eukaryote genomes and, hence, resulting biological networks. This study demonstrates the expected biological implications of such successive genome duplications in terms of both conservation and topology of PPI networks. In particular, it shows from first principles, that scale-free topologies of PPI networks are a simple consequence of their evolutionary conservation. It also highlights the importance and origin of the divergence asymmetry between gene duplicates, as well as the overall robustness of the resulting scale-free topology to domain shuffling of multi-domain proteins. Method Mathematical solution of the model Our formal approach is based on the use of generating functions to capture the statistical properties of emerging PPI networks under WGD. In particular, the generating function of the average number of protein nodes ![]() with k binding partners after n WGD steps is defined as,As discussed in Results, a general model for PPI network evolution under WGD allows for an asymmetric divergence of duplicated genes, Fig. Fig.2.2
where $A_i(x) = (\gamma x + \delta)(\gamma_i x + \delta_i)$ for i = n, o, and γ, γn and γo [resp. δ, δn and δo] correspond to the probabilities to preserve [resp. delete] the duplication-derived interactions between "old" and "new" duplicated nodes, as depicted in Fig. 2. Note that there are two types of time scales in this model of PPI network evolution: one, which is slow, corresponds to the long time decay of ancestral interactions between "old" genes, while the other one is faster (e.g. 10–100 MY) and corresponds to the spontaneous symmetry breaking between "old" and "new" duplicate copies and the concomitant deletion of many "new" duplicates. In particular, we do not introduce distinct time scales for spontaneous symmetry breaking and deletion of "new" genes, since these two steps are not assumed to be distinct phenomena but rather simultaneous processes that cannot be formally decoupled. The overall graph dynamics through successive global duplications is clearly exponential as anticipated; in particular, the total number of nodes grows as $F^{(n)}(1) = A \cdot 2^n$, where A is the initial number of nodes, and the number of links scales as $L^{(n)} \propto (2\gamma + \gamma_o + \gamma_n)^n$. We remove permanently disconnected nodes from the list of relevant nodes, assuming that they correspond to proteins that have in fact lost their function and are eventually eliminated from the genome. To this end, we redefine the graph size as, Absolute and relative generating functions are related through, Inserting this expression (10) in recurrence (8) gives a closed relation between successive where $\Delta^{(n)}$ is the ratio between consecutive numbers of connected nodes, $\Delta^{(n)} = N^{(n+1)} / N^{(n)} = 2 - p^{(n)}(\delta\delta_n) - p^{(n)}(\delta\delta_o) \le 2$. The evolution of the mean degree is obtained by taking the first derivative of (11) at x = 1, where $\Gamma_n = \gamma + \gamma_n$ and $\Gamma_o = \gamma + \gamma_o$. We will limit the discussion here to degree distributions approaching a stationary regime $p^{(n)}(x) \to p(x)$ with a finite mean degree $1 \le p'(1) < \infty$. 
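As a numerical illustration of the quantities above, here is a minimal sketch that iterates the expected degree counts through successive WGD steps. Recurrence (8) is not legible in this copy, so the sketch assumes it takes the form $F^{(n+1)}(x) = F^{(n)}(A_o(x)) + F^{(n)}(A_n(x))$, and that the connected graph size is $N^{(n)} = F^{(n)}(1) - F^{(n)}(0)$; both are assumptions, chosen because they reproduce the node doubling, the link growth factor $2\gamma + \gamma_o + \gamma_n$ and the ratio $\Delta^{(n)}$ quoted in the text. The function names and the parameter values are hypothetical and are not fitted values from the paper.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def compose(F, A):
    """Coefficients of F(A(x)), built with Horner's scheme on polynomials."""
    out = np.zeros(1)
    for c in F[::-1]:
        out = P.polyadd(P.polymul(out, A), [c])
    return out


def wgd_step(F, gamma, gamma_o, gamma_n):
    """Apply one whole genome duplication to expected degree counts.

    F[k] is the expected number of nodes with k partners, i.e. the
    coefficients of F(x). Assumed recurrence (not shown in the text):
    F_next(x) = F(A_o(x)) + F(A_n(x)),
    with A_i(x) = (gamma*x + delta) * (gamma_i*x + delta_i), delta = 1 - gamma.
    """
    F = np.asarray(F, dtype=float)
    A_o = P.polymul([1 - gamma, gamma], [1 - gamma_o, gamma_o])
    A_n = P.polymul([1 - gamma, gamma], [1 - gamma_n, gamma_n])
    return P.polyadd(compose(F, A_o), compose(F, A_n))


def summary(F):
    """Return (all nodes F(1), connected nodes F(1) - F(0), links F'(1) / 2)."""
    k = np.arange(len(F))
    return F.sum(), F.sum() - F[0], 0.5 * (k * F).sum()


# Illustrative parameters only.
gamma, gamma_o, gamma_n = 0.3, 0.9, 0.1
F = np.array([0.0, 2.0])  # start from two nodes joined by a single link
for n in range(5):
    F_next = wgd_step(F, gamma, gamma_o, gamma_n)
    link_growth = summary(F_next)[2] / summary(F)[2]
    node_ratio = summary(F_next)[1] / summary(F)[1]
    print(n, link_growth, node_ratio)
    F = F_next
```

In this sketch the link growth factor is the same at every step, while the ratio of connected nodes varies from step to step and stays at or below 2, as stated for $\Delta^{(n)}$.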
This seems to cover the most biologically relevant networks; for completeness, other cases are discussed elsewhere [47]. From (12) and the condition of finite mean degree, we readily obtain that Δ(n) → Γn + Γo ≤ 2, which implies that the network evolution is asymptotically equivalent in terms of connected nodes and links,
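The growth laws quoted above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the recurrence has the form $F^{(n+1)}(x) = F^{(n)}(A_o(x)) + F^{(n)}(A_n(x))$ with $\delta = 1 - \gamma$ and $\delta_i = 1 - \gamma_i$, and it stores $F^{(n)}$ as a vector of polynomial coefficients. The function names `duplicate` and `stats`, the seed network and the parameter values are hypothetical choices made for illustration only.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def duplicate(coeffs, gamma, gamma_o, gamma_n):
    """One WGD step on the coefficients of F(x) = sum_k N_k x^k.

    Implements F(x) -> F(A_o(x)) + F(A_n(x)) with
    A_i(x) = (gamma*x + delta)(gamma_i*x + delta_i), delta = 1 - gamma
    (assumed form of the recurrence).
    """
    new = np.zeros(2 * len(coeffs) - 1)      # maximal degree doubles
    for g_i in (gamma_o, gamma_n):
        # A_i(x) as coefficients, lowest order first
        a_i = P.polymul([1 - gamma, gamma], [1 - g_i, g_i])
        power = np.array([1.0])              # A_i(x)**0
        for n_k in coeffs:
            new[: len(power)] += n_k * power
            power = P.polymul(power, a_i)    # next power of A_i(x)
    return new


def stats(coeffs):
    """Return (all nodes, connected nodes, links) for a coefficient vector."""
    k = np.arange(len(coeffs))
    total = coeffs.sum()                     # F(1)
    connected = total - coeffs[0]            # N = F(1) - F(0)
    links = (k * coeffs).sum() / 2           # F'(1) / 2
    return total, connected, links


# Hypothetical seed: 3 proteins with 2 partners each; illustrative parameters.
coeffs = np.array([0.0, 0.0, 3.0])
gamma, gamma_o, gamma_n = 0.3, 0.85, 0.1
for step in range(1, 7):
    coeffs = duplicate(coeffs, gamma, gamma_o, gamma_n)
    total, connected, links = stats(coeffs)
    print(step, total, connected, links)
```

In this sketch the total node count doubles at every step, while the link count is multiplied by $2\gamma + \gamma_o + \gamma_n = \Gamma_n + \Gamma_o$; the ratio of successive connected-node counts can be compared with this factor to see how quickly the two growth rates converge for a given parameter set.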
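This condition can be shown [47] to ensure that the evolution of the ensemble average of networks (Eq. 7) indeed reflects the "typical" evolution of PPI networks under global duplication. The stationary degree distribution is then a solution of the functional equation,

$$p(x) = \frac{p\big(A_o(x)\big) + p\big(A_n(x)\big) - p\big(A_o(0)\big) - p\big(A_n(0)\big)}{\Delta}, \qquad (14)$$

with $\Delta = \Gamma_n + \Gamma_o$, which can be differentiated $k$ times to express the $k$th derivative $\partial_x^k p(1)$ in terms of lower derivatives,

$$\left[ \Gamma_n + \Gamma_o - \Gamma_n^k - \Gamma_o^k \right] \partial_x^k p(1) = \sum_{m < k} \alpha_m \, \partial_x^m p(1),$$

where the coefficients $\alpha_m \equiv \alpha_m(\gamma_n, \gamma_o, \gamma)$ are all positive from the definition (9). The finite or infinite nature of these derivatives distinguishes two regimes:

• Non-conserved, exponential regime. If both $\Gamma_o < 1$ and $\Gamma_n < 1$, then $\Gamma_n^k + \Gamma_o^k < \Gamma_n + \Gamma_o$ for all $k \ge 2$, and the factor in front of $\partial_x^k p(1)$ remains positive at every order, so that all derivatives of $p$ at $x = 1$ are finite and the degree distribution decays faster than any power law.

• Conserved, scale-free regime. If $\Gamma_o > 1 > \Gamma_n$, then the factor in front of $\partial_x^k p(1)$ becomes negative beyond some order $k > r$, implying that all derivatives $\partial_x^k p(1)$ with $k > r$ are infinite. This suggests an ansatz of the form

$$p(x) = 1 + \sum_{m=1}^{r} A_m (1 - x)^m + A_\alpha (1 - x)^\alpha + \dots$$

for some appropriate $r < \alpha < r + 1$. This ansatz is then inserted in (14) using $(\gamma x + \delta)(\gamma_{n,o} x + \delta_{n,o}) = 1 - \Gamma_{n,o}(1 - x) + \gamma\gamma_{n,o}(1 - x)^2$ to obtain an equation on the coefficients $A_1, \dots, A_r$. The term $A_\alpha$ does not mix with previous terms and gives the following equation for $\alpha$,

$$\Gamma_n^\alpha + \Gamma_o^\alpha = \Gamma_n + \Gamma_o.$$

The limit degree distribution follows a power law in this case,

$$p_k \propto k^{-\alpha - 1}.$$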
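The exponent condition has no closed-form solution in general, but it is easy to solve numerically. The sketch below is an illustration rather than the authors' procedure: it assumes the condition reads $\Gamma_n^\alpha + \Gamma_o^\alpha = \Gamma_n + \Gamma_o$, which is what matching the $(1 - x)^\alpha$ terms in the stationary functional equation gives, and it looks for the non-trivial root $\alpha > 1$ by bisection. The function name `scale_free_exponent` and the parameter values are hypothetical. Because $\alpha = 1$ is always a trivial root, a root above 1 exists only when $\Gamma_o \ln \Gamma_o + \Gamma_n \ln \Gamma_n < 0$; this extra check is a property of the equation itself, not a statement taken from the text.

```python
import math


def scale_free_exponent(gamma, gamma_o, gamma_n):
    """Return the root alpha > 1 of G_o**a + G_n**a = G_o + G_n, or None.

    G_i = gamma + gamma_i. None is returned outside the regime
    G_o > 1 > G_n, or when no root above 1 exists.
    """
    g_o, g_n = gamma + gamma_o, gamma + gamma_n
    if not (g_o > 1.0 > g_n >= 0.0):
        return None

    def f(a):
        return g_o ** a + g_n ** a - (g_o + g_n)

    def xlogx(g):
        return g * math.log(g) if g > 0.0 else 0.0

    # f(1) = 0 trivially and f is convex, so a second root above 1
    # requires f to be decreasing at a = 1.
    if xlogx(g_o) + xlogx(g_n) >= 0.0:
        return None

    # Bracket the root: f <= 0 on [1, alpha], f > 0 beyond.
    hi = 2.0
    while f(hi) <= 0.0:
        hi *= 2.0
    lo = hi / 2.0 if hi > 2.0 else 1.0

    for _ in range(200):                     # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative parameters only: G_o = 1.15, G_n = 0.4.
alpha = scale_free_exponent(0.3, 0.85, 0.1)
print("alpha =", alpha, "-> degree exponent alpha + 1 =", alpha + 1)
```

With both $\Gamma_o < 1$ and $\Gamma_n < 1$ the function returns `None`, which corresponds to the exponential regime; it also returns `None` when $\Gamma_o > 1 > \Gamma_n$ but the only other root lies below 1, a case that falls outside the stationary, finite-mean-degree regime considered here.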
When [...] Note that scale-free degree distributions emerge under successive, global network duplications only if the "old" node copy has its links more likely duplicated than lost at each round of global duplication (as $\Gamma_o = \gamma + \gamma_o > 1$ is equivalent to $\gamma\gamma_o > \delta\delta_o$). Thus, "old" nodes statistically keep on increasing their connectivity once they have emerged as "new" nodes by duplication. From a biological perspective, this implies that most nodes and their surrounding links are conserved throughout the evolution process, thereby ensuring that local topologies of previous networks remain embedded in subsequent networks.

In summary, whole genome duplication with asymmetric divergence of duplicated proteins leads to the emergence of two classes of PPI networks with finite asymptotic degree distributions: i) PPI networks with an exponential degree distribution and with neither protein nor topology conservation, and ii) PPI networks with a scale-free limit degree distribution and protein conservation, together with at least some local topology conservation. All other evolution scenarios, which do not lead to finite asymptotic degree distributions, are unlikely to model biologically relevant cases; they correspond either to an exponential disappearance of the whole PPI network (i.e. if $\Gamma_n + \Gamma_o < 1$) or to an exponential shift of all proteins towards higher and higher connectivities (i.e. dense regime in Fig. 3A).

Abbreviations

WGD: Whole Genome Duplication; PPI network: Protein-Protein Interaction network.

Authors' contributions

HI conceived the research; KE and HI performed the research and wrote the paper.

Additional File 1

Supporting Information (6 pages, PDF, 140K).
I. Model of PPI network evolution under WGD with symmetric divergence and link "complementation".
II. Proof of Functional Recurrences (Eq. 8 and Eq. S1).
III. Gene functionalization patterns in different models of PPI network evolution under WGD.
IV. Statistical weighting of indirect interactions from protein complexes.
V. Evolution of PPI networks including self-interacting proteins under WGD.

Acknowledgements

We thank U. Alon, J. Berg, M. Bornens, M. Cosentino-Lagomarsino, T. Fink, L. Hirschbein, R. Monasson, M. Vergassola and P. Wincker for discussion. This work was supported by CNRS and Institut Curie.