# Conservation and topology of protein interaction networks under duplication-divergence evolution

Author contributions: H.I. designed research; K.E. and H.I. performed research; K.E. and H.I. analyzed data; and K.E. and H.I. wrote the paper.

## Abstract

Genomic duplication-divergence processes are the primary source of new protein functions and thereby contribute to the evolutionary expansion of functional molecular networks. Yet, it is still unclear to what extent such duplication-divergence processes also restrict by construction the emerging properties of molecular networks, regardless of any specific cellular functions. We address this question, here, focusing on the evolution of protein–protein interaction (PPI) networks. We solve a general duplication-divergence model, based on the statistically necessary deletions of protein–protein interactions arising from stochastic duplications at various genomic scales, from single-gene to whole-genome duplications. Major evolutionary scenarios are shown to depend on two global parameters only: (*i*) a protein conservation index (*M*), which controls the evolutionary history of PPI networks, and (*ii*) a distinct topology index (*M′*) controlling their resulting structure. We then demonstrate that conserved, nondense networks, which are of prime biological relevance, are also necessarily scale-free by construction, irrespective of any evolutionary variations or fluctuations of the model parameters. It is shown to result from a fundamental linkage between individual protein conservation and network topology under general duplication-divergence evolution. By contrast, we find that conservation of network motifs with two or more proteins cannot be indefinitely preserved under general duplication-divergence evolution (independently from any network rewiring dynamics), in broad agreement with empirical evidence between phylogenetically distant species. All in all, these evolutionary constraints, inherent to duplication-divergence processes, appear to have largely controlled the overall topology and scale-dependent conservation of PPI networks, regardless of any specific biological function.

**Keywords:**evolutionary constraint, scale-free graph, functional motif, orthology, statistical model

The primary source of new protein functions is generally considered to originate from *duplication* of existing genes followed by functional *divergence* of their duplicate copies (1–3). In fact, duplication-divergence events have occurred and continue to occur at a wide range of genomic scales, from many independent duplications of individual genes^{†} [10^{−3} fixed events per gene per million years (MY) (4)] to rare but evolutionary dramatic duplications of entire genomes [one fixed event per 100–200 MY (5)]. For instance, there have been between two and four *consecutive* whole-genome duplications in all major eukaryote kingdoms in the past 300–500 MY (5). This actually amounts to a more-or-less similar contribution of new genes from whole-genome duplication as from individual gene duplications [i.e., one fixed event per 100–200 MY 10^{−3} fixed events per gene per MY, assuming a 10% fixation rate after a whole-genome duplication with ≈10,000 genes (5)].

This succession of whole-genome duplications, together with the accumulation of individual gene duplications, must have greatly contributed to shaping the global structure of large biological networks, such as protein–protein interaction (PPI) networks, that control cellular activities. In fact, concordant empirical evidence reveals the evolutionary persistence of duplication-derived protein–protein interactions. For instance, there are clear enrichments of recent protein duplicates around common protein partners compared with randomly picked pairs of proteins (5, 6), although the fraction of proteins identified as having undergone a (recent) duplication (<200 MY) remains typically small in absolute terms, for example, 10% (4). Similarly, protein residues implicated in protein–protein interaction are generally the most conserved at the surface of proteins (7), revealing their duplication-derived origin,^{‡} with typically little more than one conserved binding interface per protein-binding domains.^{§}

Ispolatov *et al.* (10) proposed an interesting local duplication-divergence model of PPI network evolution based on (*i*) the *statistical* deletion of individual, duplication-derived interactions and (*ii*) a *time-linear* increase in genome and PPI network sizes. Clearly, the deletion of redundant interactions arising from duplication is necessary to avoid the emergence of biologically irrelevant, densely connected PPI networks, lacking low-degree connectivities. Yet, we expect that *independent* local duplications and, *a forteriori*, partial- or whole-genome duplications all lead to *exponential*, not time-linear, evolutionary dynamics of PPI networks. In the long time limit, exponential dynamics should outweigh all time-linear processes that have been assumed in earlier PPI network evolution models (10–15). Models based on time-linear processes also assume that local evolutionary dynamics remain essentially frozen, as long as they are not directly affected by a local modification of the network. Yet, in reality, sequence mutations and environmental changes continue to affect the evolution of whole PPI networks, not just in the immediate surroundings of recently duplicated proteins.

In this article, we propose and asymptotically solve a general duplication-divergence model based on prevailing exponential dynamics^{¶} of PPI network evolution under local, partial, or global genome duplications. The only interaction changes that are considered are *deletions* of duplication-derived interactions. In particular, the rewiring dynamics of PPI networks by *de novo* creation of protein-binding interfaces (4) is neglected (10), as suggested by the empirical evidence mentioned earlier (see also *Evolution of PPI network motifs*). Indeed, our aim here is to establish a theoretical baseline from which other evolutionary processes beyond strict gene duplication and interaction loss events, such as shuffling of protein domains (5) or horizontal gene transfers, can then be considered.

A visual overview of the model is shown in Fig. 1 including its two main effective parameters, *M* and *M′*, that control, respectively, the evolutionary history or *conservation* (*M*) and resulting structure or *topology* (*M′*) of PPI networks under duplication-divergence evolution. In this article, we demonstrate a fundamental relation between protein conservation (*M*) and network topology (*M′*), that is, *M* ≤ *M′*, that is strictly independent from *any* evolutionary variations or fluctuations of the model parameters. We then discuss simple consequences in terms of evolutionary *linkage* between individual protein conservation and PPI network topology. The approach is also extended to outline the evolutionary statistics of small-network motifs including two or more proteins. In particular, we show that network motifs, unlike individual proteins, cannot be indefinitely conserved under general duplication-divergence evolution, regardless of any network-rewiring dynamics. Throughout the article, theoretical assumptions and results are commented on with brief discussions highlighting their biological relevance.

## Results

### General Duplication-Divergence Model.

The general duplication-divergence (GDD) model is designed to capture PPI network properties caused by evolutionary constraints, inherent to duplication-divergence processes and independent of selective adaptation (3) or any specific biological function. Concretely, the GDD model analyzes the deletion statistics of protein–protein interactions that arise from stochastic duplications at various genomic scales, from single-gene to whole-genome duplications. This deletion statistics of duplication-derived interactions is indeed a necessary “background” dynamics of PPI network evolution to prevent the emergence of biologically irrelevant, densely connected PPI networks, lacking low-degree connectivities.

In practice, a fraction *q* of extant genes is randomly *duplicated* at each time step of the GDD model. The divergence of both duplicated and nonduplicated genes then leads to the stochastic deletion or conservation of their related interactions, before another round of duplication-divergence occurs (Fig. 1). In the following, we first solve the GDD model assuming that *q* is constant over evolutionary time scales. We then study more realistic scenarios combining, for instance, rare whole-genome duplications (*q* = 1) with more frequent local duplications of individual genes (*q* 1), and including also stochastic fluctuations in *all* microscopic parameters of the GDD model (see Fig. 1 and below). To analyze the deletion statistics of duplication-derived interactions, we assume that ancient and recently duplicated interactions are stochastically conserved with distinct probabilities γ* _{ij}*'s, depending only on the recently duplicated or nonduplicated state of each protein partners, as well as on the

*asymmetric*divergence between “old” and “new” (or more “conserved” and more “divergent”) gene duplicates (5), see the Fig. 1 legend (“

*s*” for “singular,” nonduplicated genes and “

*o*”/“

*n*” for old/new asymmetrically divergent duplicates). Here, we consider nonoriented PPI networks, that is, γ

*= γ*

_{ij}*, for*

_{ji}*i*,

*j*=

*s*,

*o*,

*n*.

The first effective parameters derived from these microscopic evolutionary parameters are the average rates of connectivity change Γ* _{i}* (i.e.,

*k*→

*k*Γ

*) for each type of node*

_{i}*i*=

*s*,

*o*,

*n*, where Γ

*= (1 −*

_{i}*q*)γ

*+*

_{is}*q*(γ

*+ γ*

_{io}*) is independent from node connectivity*

_{in}*k*. In the following, we assume Γ

*≥ Γ*

_{o}*by definition of old and new duplicates caused by asymmetric divergence. Note that self-interacting proteins, corresponding to self-link loops, are not taken into account, for simplicity, in the main text, because they can be shown to have little effect on the asymptotic evolutionary regimes of the connectivity distribution (see*

_{n}*SI Appendix*, Fig. S3 and

*SI Text*, for details).

We study the GDD evolutionary dynamics of PPI networks in terms of ensemble averages *Q*^{(n)} defined as the mean value of a feature *Q* over all realizations of the evolutionary dynamics after *n* successive duplications. This does not imply, of course, that all network realizations “coexist,” but only that a random selection of them is reasonably well characterized by the theoretical ensemble average. Although it is generally not the case for exponentially growing systems, here, we can show that ensemble averages over all evolutionary dynamics indeed reflect the properties of typical network realizations for biologically relevant regimes (see *Statistical Properties of GDD Models* in *SI Appendix*).

In the following, we focus on the number of proteins (or “nodes”) *N _{k}* of connectivity

*k*in PPI networks, while postponing the analysis of GDD models for simple network motifs to the end of the article and the

*SI Appendix*. The total number of nodes in the network is noted N = Σ

_{k≥0}

*N*and the total number of interactions (or “links”)

_{k}*L*= Σ

_{k≥0}

*kN*/2. The dynamics of the ensemble averages

_{k}*N*

_{k}

^{(n)}after

*n*duplications is analyzed by using a generating function,

The evolutionary dynamics of *F*^{(n)}(*x*) corresponds to the following recurrence deduced from the microscopic definition of the GDD model (see *SI Appendix*),

where we note for *i* = *s*, *o*, *n*,

where δ* _{ij}* = 1 − γ

*are deletion probabilities (*

_{ij}*i*,

*j*=

*s*,

*o*,

*n*) and

*A′*(1) = (1 −

_{i}*q*)γ

*+*

_{is}*q*(γ

*+ γ*

_{io}*) = Γ*

_{in}*, average rates of connectivity change for each type of nodes i =*

_{i}*s*,

*o*,

*n*(Fig. 1).

### Network Expansion (Γ) and Protein Conservation (*M*).

The total number of nodes generated by the GDD model, *F*^{(n)} (1), growths exponentially with the number of partial duplications, *F*^{(n)} (1) = C·(1 + *q*)* ^{n}*, where

*C*is the initial number of nodes, as a constant fraction of nodes

*q*is duplicated at each time step. Yet, some nodes become completely disconnected from the rest of the graph during divergence and rejoin the disconnected component of size

*F*

^{(n)}(0). From a biological point of view, these disconnected nodes represent genes that have presumably lost all biological functions and become pseudogenes before being simply eliminated from the genome. We neglect the possibility for nonfunctional genes to reconvert to functional genes again after suitable mutations, and remove them at each round of partial duplication

^{‖}focusing solely on the connected part of the graph.

In particular, the link growth rate Γ = (1 − *q*)Γ* _{s}* +

*q*Γ

*+*

_{o}*q*Γ

*obtained by taking the first derivative of Eq.*

_{n}**2**at

*x*= 1, controls whether the connected part of the graph is exponentially growing (Γ > 1) or shrinking (Γ < 1).

Let us now introduce another rate of *prime* biological interest, *M* = (1 − *q*)Γ* _{s}* +

*q*Γ

*. It is the*

_{o}*average rate of connectivity increase (M >1) of decrease (M < 1) for the most conserved duplicate lineage*, which corresponds to a stochastic alternance between singular (

*s*) and most conserved (

*o*) duplicate descents. In particular, we have by construction,

*M*< Γ =

*M*+

*q*Γ

*, independently from any evolutionary parameters,*

_{n}*q*and γ

*> 0. This implies three main evolutionary regimes from the perspective of network expansion (Γ) and protein conservation (*

_{ij}*M*) (Fig. 2):

- If
*M*< Γ < 1. PPI networks are vanishing in this regime with seemingly little biological relevance. - If
*M*< 1 < Γ. PPI networks are expanding, in this case, but their proteins are*not*conserved over long evolutionary time scales. This implies that the networks forget their evolutionary history exponentially fast, as most nodes eventually disappear and, with them, all traces of network evolution. These networks are*not*preserved over time, but instead are continuously renewed from duplication of the (few) most connected nodes (Fig. 2). Individual proteins of a given network realization are thus more similar to one another than to any protein of other network realizations, which can be seen, from a speciation perspective, as PPI networks of phylogenetically distant organisms. This is in sharp contrast to the widespread structural orthology observed across all extant life forms, even though functions of orthologs often differ (see*Evolution of PPI Network Motifs*). - If 1 <
*M*< Γ. By contrast, PPI networks remember their past evolution from the very beginning, in this case, as proteins statistically keep on increasing their connectivity once they have emerged from a duplication-divergence event. This implies that most proteins are conserved*throughout*the evolution process and preserve some interaction partners. This is indeed in broad agreement with empirical evidence, because traces of protein conservation are even observed within the core transcriptional and translational machineries across all three major living kingdoms (16).

### Evolution of PPI Network Degree Distribution.

We now turn to the evolution of the degree distribution and other topological properties of PPI networks, which correspond to the technical core of the GDD model. To this end, we rescale the exponentially growing connected graph by introducing a normalized generating function for the average degree distribution,

where *N*^{(n)} = Σ_{k≥1} *N _{k}*

^{(n)}, that is, after removing

*N*

_{0}

^{(n)}.

*F*

^{(n)}(

*x*) can be reconstructed from the shifted degree distribution,

^{(n)}(

*x*) =

*p*

^{(n)}(

*x*) − 1, as

which yields the following recurrence for ^{(n)}(*x*),

where Δ^{(n)} is the ratio between two consecutive graph sizes in terms of connected nodes, that is, Δ^{(n)} = *N*^{(n + 1)}/*N*^{(n)},

Although Δ^{(n)} is not known *a priori* and should, in general, be determined self-consistently with ^{(n)}(*x*) itself, it is directly related to the evolution of the mean degree ^{(n)} = Σ_{k≥1} *kp*_{k}^{(n)} obtained by taking the first derivative of Eq. **6** at *x* = 1,

Hence, although connected networks grow exponentially both in terms of number of links (link growth rate Γ) and number of connected nodes (node growth rate Δ^{(n)}), features normalized over these growing networks, such as node mean connectivity (Eq. **8**) or distributions of node degree (or simple network motifs, see below), exhibit richer evolutionary dynamics in the asymptotic limit *n* → ∞, as we will now discuss.

### Asymptotic Analysis of Node Degree Distribution (*M′*).

The node degree distribution can be shown (see *SI Appendix*) to converge toward a limit function *p*(*x*), with (*x*) = *p*(*x*) − 1 solution of the functional Eq. **6**.

where Δ = lim_{n→∞}Δ^{(n)} with both Δ ≤ 1 + *q*, the maximum node growth rate, and Δ ≤ Γ, the link growth rate, because the number of connected nodes cannot increase faster than the number of links. Asymptotic regimes with Δ = Γ correspond to the same exponential growth of the network in terms of connected nodes and links, and will be referred to as *linear* regimes, hereafter, whereas Δ < Γ corresponds to *nonlinear* asymptotic regimes, which imply a diverging mean connectivity ^{(n)} → ∞ in the asymptotic limit *n* → ∞ (Eq. **8**).

To determine Δ and *p*(*x*) self-consistently, we first express successive derivatives of *p*(*x*) at *x* = 1 in terms of lower derivatives by using Eq. **9**,

where α* _{k,l}* are positive functions of the 1 + 6 parameters. Inspection of this expression readily defines two classes of asymptotic regimes,

*regular*and

*singular*regimes, depending on the value of a

*topology index M′*= max

*(Γ*

_{i}*), for*

_{i}*i*=

*s*,

*o*,

*n*. The detailed analysis relies on the “characteristic function”

*h*(α) = (1 −

*q*)Γ

_{s}

^{α}+

*q*Γ

_{o}

^{α}+

*q*Γ

_{n}

^{α}, as outlined below and in Fig. 3(see

*SI Appendix*,

*Asymptotic Methods*, for proof details).

Regular regimes, if *M′* = max_{i}(Γ_{i}) < 1, for *i = s, o, n*. In this case, the only possible solution is Δ = *h*(1) (i.e., linear regime). Hence, since *M′* < 1, *h*(1) > *h*(*k*), and successive derivatives _{x}^{k}*p*(1) are thus finite and positive for all *k* ≥ 1. This corresponds to an exponential decrease of the node degree distribution for *k* 1, *p _{k}*

*e*

^{−μk}with a power law prefactor. The limit average connectivity (Eq.

**8**) is finite in this case, < ∞.

Singular regimes, if *M′* = max_{i}(Γ_{i}) > 1, for *i* = *s, o, n*. In this case, Eq. **10** suggests that there exists an integer *r* ≥ 1 for which the *r*th derivative is negative, _{x}^{r}*p*(1) < 0, which is impossible by definition. This simply means that neither this derivative nor any higher ones exist (for *k* ≥ *r*). We thus look for self-consistent solutions of the “characteristic equation” *h*(α) = Δ (with *r* − 1 < α ≤ *r*) corresponding to a singularity of *p*(*x*) at *x* = 1 and a power law tail of *p _{k}*, for

*k*1 (17),

where the singular term (1 − *x*)^{α} is replaced by (1 − *x*)* ^{r}* ln(1 −

*x*) for α =

*r*exactly. Several asymptotic behaviors are predicted from the convex shape of

*h*(α) (

_{α}

^{2}

*h*≥ 0), depending on the signs of its derivatives

*h′*(0) and

*h′*(1) (Fig. 3

*Inset*).

- If
*h′*(0) < 0 and*h′*(1) < 0. There exists an α^{}> 1 so that*h*(α^{}) =*h*(1) and the condition Δ ≤*h*(1) implies that α^{}≥ α ≥ 1. The solution α = 1 requires*h′*(1) = 0 and should be rejected in this case. Hence, because < ∞ for α > 1, we must have Δ =*h*(1) (linear regime) and a scale-free limit degree distribution with a*unique*α = α^{}> 1,*p*_{k}*k*^{−α−1}for*k*1. - If
*h′*(0) < 0 and*h′*(1) > 0. The general condition Δ ≤ min(*h*(0),*h*(1)) leads*a priori*to a whole range of possible α ]0, 1] corresponding to stationary scale-free degree distributions with diverging mean degrees^{(n)}→ ∞. Yet, numerical simulations suggest that there might still be a unique asymptotic node growth rate Δ regardless of initial conditions or evolutionary trajectories, although convergence is extremely slow (see*SI Appendix*,*Numerical simulations*). - If
*h′*(0) ≥ 0 and*h′*(1) > 0. Δ =*h*(0) = 1 +*q*, implying that all duplicated nodes are selected in this case. No suitable α exists as the node degree distribution is exponentially shifted toward higher and higher connectivities. This is a dense, nonstationary regime with seemingly little relevance to biological networks.

Finally, note that the characteristic equation Δ = *h*(α) can be recovered directly from the average change of connectivity *k* → *k*Γ* _{i}* and the following continuous approximation (by using

*N*

^{(n)}= Σ

_{k}N_{k}

^{(n)}∫

_{u}N_{u}

^{(n)}

*du*and

*N*

_{k}

^{(n)}

*k*

^{−α−1}),

### Local (*q* 1) and Global (*q* = 1) Duplication Limits.

The asymptotic degree distribution of the GDD model can be conveniently mapped into the (Γ, *M*) plane for two limit regimes of prime biological relevance: (*i*) for local duplication events (*q* 1 and γ* _{ss}* = 1; Fig. 4

*A*) and (

*ii*) for whole-genome duplication events (

*q*= 1; Fig. 4

*B*). See

*SI Appendix*for details.

*A*) Local duplication-divergence limit (

*q*1 and γ

*= 1). (*

_{ss}*B*) Whole-genome duplication-divergence limit (

*q*= 1). Boxed figures are power law exponents (α + 1) of scale-free

**...**

The local duplication-divergence limit leads to scale-free limit degree distributions for both conserved and nonconserved networks, with power law exponents 1 < α + 1 ≤ 3 if γ* _{so}* 1 (i.e., which ensures that most previous interactions are conserved in at least one copy after duplication). By contrast, the whole-genome duplication-divergence limit leads to a wide range of asymptotic behaviors from nonconserved, exponential regimes to conserved, scale-free regimes with arbitrary power law exponents. Conserved, nondense networks require, however, an asymmetric divergence between old and new duplicates (γ

*≠ γ*

_{oo}*) (5) and lead to scale-free limit degree distributions with the same range of exponents 1 < α + 1 ≤ 3 for maximum divergence asymmetry (γ*

_{nn}*1 and γ*

_{oo}*0).*

_{nn}### Evolutionary Variations of Model Parameters.

The previous analysis with fixed parameters {*q*, γ* _{ij}*} can be readily extended to combine local and global PPI network duplications (Fig. 4

*A*and

*B*) or even include

*any*evolutionary variations and stochastic fluctuations of the GDD model parameters with arbitrary series {

*q*

^{(n)},γ

_{ij}

^{(n)}}

*(see*

_{R}*SI Appendix*). Protein conservation is then found to be controlled by the cumulated product of connectivity growth/decrease rates following the most conserved, old duplicate lineage,

with conserved (resp. nonconserved) protein evolutionary regimes corresponding to *M* > 1 (resp. *M* < 1).

A similar geometric average also controls the nature of the asymptotic degree distribution as the network topology index now reads,

with *M′* < 1 corresponding to exponential networks and *M′* > 1 to scale-free (or dense) networks with an effective node degree exponent α and effective node growth rate Δ that are self-consistent solutions of the generalized characteristic equation,

where *h*^{(n)}(α) = (1 − *q*^{(n)})Γ_{s}^{(n)α} + *q*^{(n)}Γ_{o}^{(n)α} + *q*^{(n)}Γ_{n}^{(n)α}, as before. This leads to *exactly* the same discussion for singular regimes as with constant *q* and Γ* _{i}* (Fig. 3) because of the convexity of the generalized function

*h*(α) (

_{α}

^{2}

*h*(α) ≥ 0; see

*SI Appendix*for details and discussion on the

*R*→ ∞ limit).

In particular, because (1 − *q*^{(n)})Γ_{s}^{(n)} + *q*^{(n)}Γ_{o}^{(n)} ≤ max* _{i}*(Γ

_{i}

^{(n)}) for

*all q*

^{(n)}and Γ

_{i}

^{(n)}(

*i*=

*s*,

*o*,

*n*), we

*always*have

*M*≤

*M′*. This relation implies a fundamental linkage between protein conservation and network topology under general duplication-divergence evolution, regardless of all possible evolutionary variations of the model parameters,

*q*

^{(n)}and Γ

_{i}

^{(n)}. We expect, in particular, that

*all conserved networks are necessarily scale-free*(or dense) (1 <

*M*≤

*M′*), whereas

*exponential networks can never be conserved*(

*M*≤

*M′*< 1), under general duplication-divergence evolution.

### Evolution of PPI Network Motifs.

The generating function approach, introduced for the one-node degree distribution *p*_{k}^{(n)} (Eqs. **1**–**6**), can be generalized to analyze the evolutionary statistics of multinode correlation functions and related clustering coefficient, distribution of first-neighbor average connectivity *g _{k}* (18) (see Fig. 6) and small-network motifs. Yet, although

*M′*also controls transitions between major evolutionary regimes for multinode correlation functions, their analysis remains technically involved (

*SI Appendix*).

By contrast, the conservation property of network motifs under general duplication-divergence evolution turns out to be remarkably simple, as outlined in Fig. 5. We derive conservation indices for specific network motifs by summing over all possible combinations of *s* nodes (with probability κ* _{s}* = 1 −

*q*) or

*o*nodes (with probability κ

*=*

_{o}*q*) and the corresponding γ

*(*

_{ij}*i*,

*j*=

*s*,

*o*) (Fig. 5). Clearly, network motifs with a larger number of interactions,

*p*≥ 1, have lower conservation indices,

*M*

_{p}*O*(γ

_{ij}

^{p}) (Fig. 5). Moreover, because the probability to conserve a specific interaction γ

*cannot be exactly 1, because of deleterious mutations (i.e., γ*

_{ij}*< 1), motif conservation indices*

_{ij}*M*must

_{p}*all*be <1, regardless of any parameter variations,

*q*

^{(n)}and γ

_{ij}

^{(n)}.

*M*> 1), network motifs including two or more proteins

*cannot*be indefinitely conserved under general duplication-divergence evolution (κ

*= 1 −*

_{o}**...**

Hence, network motifs *cannot* be indefinitely conserved under duplication-divergence evolution, even though their individual proteins *are* typically conserved in the network (if *M* > 1) (Fig. 2). This implies that *structural* orthology between individual proteins from phylogenetically distant species *cannot* indefinitely coincide with *functional* orthology at the level of protein interactions and complexes, in broad agreement with empirical evidences (19). The resulting turnover toward more and more divergent interaction partners is a simple evolutionary consequence of the GDD model, regardless of any network-rewiring dynamics (that have been neglected here). In particular, even the most conserved orthologous proteins (*s*/*o* descents) must eventually perform different functions, but conserved ancestral functions are inevitably passed down to less conserved protein complexes (*s*/*o*/*n* descents) in phylogenetically distant species. This inherent evolutionary constraint of the GDD model sets conceptual and practical limitations on functional annotation transfer between genomes (19).

### Comparison with Empirical PPI Network Data.

Although the GDD model can, in principle, accommodate 1*q* + 6γ* _{ij}* fitting parameters, and any evolutionary variations or fluctuations

*q*

^{(n)}and γ

_{ij}

^{(n)}, we have restricted all quantitative comparisons with the empirical data (20) to biologically relevant regimes, limited to one or two effective fitting parameters, only.

The results, corresponding to 10^{3} to 10^{4} protein nodes and low network growth rates in the conserved, scale-free regime (1< *M* ≤ *M′*), are compared with Yeast PPI network data (20) (Fig. 6), both for the connectivity distribution *p _{k}* and the first-neighbor average connectivity

*g*(18) (see Fig. 6 legend and

_{k}*SI Appendix*for model and fitting details).

Extending these numerical simulations to larger PPI network sizes (>10^{5} nodes) shows little changes in these distributions for small *k* ≤ 20 (*SI Appendix*, Figs. S5–S7). This suggests that the available empirical PPI network for yeast (20) has essentially reached its asymptotic degree distribution, for *k* ≤ 20. By contrast, global convergence of the GDD model is typically quite slow for large growth rate regimes with little biological relevance, as shown in *SI Appendix*, *Numerical simulations*, Figs. S5–S7.

## Conclusions

Although conservation and network topology have *a priori* no reason to be related features of emerging PPI networks in the course of evolution, we demonstrate in this article that they are, in fact, linked properties under general duplication-divergence evolution, because of a fundamental linkage between protein conservation and network topology indices, that is, *M* ≤ *M′*, regardless of any variations or fluctuations of the model parameters.

By contrast, we also showed that conservation of network motifs cannot be indefinitely preserved under general duplication-divergence evolution (independently from any network rewiring dynamics). This underlies intrinsic limitations on functional annotation transfer across phylogenetically distant genomes (19).

All in all, these evolutionary constraints, inherent to duplication-divergence processes and independent from selective adaptation (3), appear to have largely controlled the overall topology and scale-dependent conservation of PPI networks, regardless of any specific biological function.

## Acknowledgments.

We thank U. Alon, R. Bruinsma, M. Cosentino-Lagomarsino, T. Fink, R. Monasson, and M. Vergassola for discussion. This work was supported by the Ministry of Higher Education and Research, Centre National de la Recherche Scientifique, and Institut Curie.

## Footnotes

The authors declare no conflict of interest.

^{†}Duplicated protein domains or subdomains are also quite common even within ancestral proteins, such as the ubiquitous aquaporin membrane proteins in eubacteria, archaea, and eukaryotes, or the TATA-binding protein from archaea and eukaryotes.

^{‡}Except for a few interesting cases of protein-binding mimicry, typically found in virus–host protein–protein interactions (8).

^{§}Except for domains that self-assemble into homo-oligomers, which *must* have at least two binding interfaces, see table 2 in ref. 9.

^{¶}Results from the time-linear duplication-divergence model (10) are recovered as a special limit, see supporting information (SI) *Appendix*.

This article contains supporting information online at www.pnas.org/cgi/content/full/0804119105/DCSupplemental.

^{‖}Note, however, that pseudogenes may still have a critical role in evolution by providing functional domains that can be fused to adjacent genes. This supports a view of PPI network evolution in terms of protein domains instead of entire proteins (*SI Appendix*, Fig. S6*B*, and ref. 5). Yet, we showed in ref. 5 that extensive domain shuffling does not change the resulting network topology from duplication-divergence models.

## References

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (528K) |
- Citation

- Modeling protein network evolution under genome duplication and domain shuffling.[BMC Syst Biol. 2007]
*Evlampiev K, Isambert H.**BMC Syst Biol. 2007 Nov 13; 1:49. Epub 2007 Nov 13.* - The evolutionary dynamics of protein-protein interaction networks inferred from the reconstruction of ancient networks.[PLoS One. 2013]
*Jin Y, Turaev D, Weinmaier T, Rattei T, Makse HA.**PLoS One. 2013; 8(3):e58134. Epub 2013 Mar 20.* - Network topology and the evolution of dynamics in an artificial genetic regulatory network model created by whole genome duplication and divergence.[Biosystems. 2006]
*Dwight Kuo P, Banzhaf W, Leier A.**Biosystems. 2006 Sep; 85(3):177-200. Epub 2006 May 2.* - Evolution in protein interaction networks: co-evolution, rewiring and the role of duplication.[Biochem Soc Trans. 2009]
*Robertson DL, Lovell SC.**Biochem Soc Trans. 2009 Aug; 37(Pt 4):768-71.* - Duplication, divergence and formation of novel protein topologies.[Bioessays. 2006]
*Vogel C, Morea V.**Bioessays. 2006 Oct; 28(10):973-8.*

- Computational Prediction of Protein–Protein Interaction Networks: Algo-rithms and Resources[Current Genomics. 2013]
*Zahiri J, Bozorgmehr JH, Masoudi-Nejad A.**Current Genomics. 2013 Sep; 14(6)397-414* - Evolutionary Pressure on the Topology of Protein Interface Interaction Networks[The journal of physical chemistry. B. 2013]
*Johnson ME, Hummer G.**The journal of physical chemistry. B. 2013 Oct 24; 117(42)10.1021/jp402944e* - A Probabilistic Model of Neutral and Selective Dynamics of Protein Network Evolution[Journal of Computational Biology. 2013]
*Dutkowski J, Tiuryn J.**Journal of Computational Biology. 2013 Sep; 20(9)631-642* - Interface-Resolved Network of Protein-Protein Interactions[PLoS Computational Biology. 2013]
*Johnson ME, Hummer G.**PLoS Computational Biology. 2013 May; 9(5)e1003065* - The Evolutionary Dynamics of Protein-Protein Interaction Networks Inferred from the Reconstruction of Ancient Networks[PLoS ONE. ]
*Jin Y, Turaev D, Weinmaier T, Rattei T, Makse HA.**PLoS ONE. 8(3)e58134*

- Conservation and topology of protein interaction networks under duplication-dive...Conservation and topology of protein interaction networks under duplication-divergence evolutionProceedings of the National Academy of Sciences of the United States of America. 2008 Jul 22; 105(29)9863

Your browsing activity is empty.

Activity recording is turned off.

See more...