• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptNIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Mol Evol. Author manuscript; available in PMC Jan 20, 2006.
Published in final edited form as:
PMCID: PMC1343502
NIHMSID: NIHMS4461

Gene Duplication and the Properties of Biological Networks

Abstract

Patterns of network connection of members of multi-gene families were examined for two biological networks: a genetic network from yeast Saccharomyces cerevisiae and a protein-protein interaction network from Caenorhabditis elegans. In both networks, genes belonging to gene families represented by a single member in the genome (“singletons”) were disproportionately represented among the nodes having large numbers of connections. Of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated genomic segments believed to have originated from an ancient polyploidization event; thus, each of these 28 loci was thus presumably duplicated along with the genomic segment to which it belongs, but one of the two duplicates has subsequently been deleted. Nodes connected to major “hubs” with a large number of connections, tended to be relatively sparsely inter-connected among themselves. Furthermore, duplicated genes, even those arising from recent duplication, rarely shared many network connections, suggesting that network connections are remarkably labile over evolutionary time. These factors serve to explain well-known general properties of biological networks, including their scale-free and modular nature.

With increasing knowledge of gene regulation, protein-protein interactions, and metabolic processes, it has become possible to assemble these and similar sorts of biological information in the form of networks; that is, graphical representations of intermolecular interactions (Kanehisa 2000). Networks have been constructed from information on metabolic pathways, signal transduction, transcriptional regulation, and other cellular processes (Kanehisa 2000). One important generalization regarding biological networks is that they tend to be scale-free (Barabási and Albert 1999). Unlike a random network, a scale-free network has the property that a small proportion of nodes have a large number of connections, while the other nodes have smaller numbers of connections (Barabási and Albert 1999). In intuitive terms, in a scale-free network, nodes are divided into “hubs” (having many connections) and “spokes” (having few connections and connected with one another mainly through hubs). More formally, the scale-free property occurs when P(k), the probability that a node in the network is connected to k other nodes, decays as a power law, following P (k) ~ k−γ, where γ is a positive real number (often about 2.0 in a wide variety of networks known from both the biological and the social sciences) (Barabási and Albert 1999).

An apparent paradox of biological networks is that within many such networks there are numerous small modules of densely interconnected nodes, while connections between modules are sparser (Ravasz et al. 2002). A modular organization would seem to contradict the scale-free property, since in such a network nodes would tend to have roughly equal numbers of connections. However, it has been shown in the case of metabolic networks from a variety of species that the network consists of many small highly connected modules that combine in a hierarchical manner, as a result of a small number of non-random inter-module links that connect modules in a nested fashion (Ravasz et al. 2002). This hierarchical mode of organization explains the observation that biological networks are scale-free and yet consist of functionally distinct modules.

It has been hypothesized that repeated gene duplication over evolutionary time can account for the properties of biological networks (Wagner 2001). However, this will only be true if the duplication process has certain characteristics. Figure 1A illustrates the simplest possible protein-protein interaction network: a network consisting of just two interacting proteins X and Y. This network is not scale-free. Suppose that the genes encoding X and Y are both duplicated, giving rise to two new genes encoding two new proteins (X’ and Y’). If X retains its interaction with Y, while X’ interacts only with Y’ (Figure 1B), the network will not be any more scale-free than was the original network. Similarly, if X interacts with both Y and Y’ while X’ interacts with both Y and Y’ (Figure 1C), the network will not be any more scale-free than was the original network.

Figure 1
Hypothetical scenarios of gene duplication in biological networks. (For explanation, see text.)

On the other hand, the scale-free property is increased if, after duplication, interactions are lost differentially. An example of such a process is illustrated in Figure 1D. Here, after duplication of both the genes encoding X and Y, X retains the capacity to interact with both Y and Y’, whereas X’ interacts only with Y’ (Figure 1D). Likewise, the scale-free property of the network will be increased if only one of the two genes is duplicated, but both duplicates retain the capacity to interact with the unduplicated gene (Figure 1E). The same effect would be produced if both genes were duplicated, but one duplicated gene was subsequently deleted.

These simple examples show that gene duplication will increase the scale-free property of networks if genes involved in networks are duplicated differentially, are deleted differentially after duplication and/or if interactions are retained differentially after duplication. A similar effect would also be produced if different interactions were acquired by each gene independently after duplication, especially if duplicates differered with respect to the number of interactions acquired.

Figure 1F represents a simple “module” of three interacting proteins (X, Y, and Z). If the genes encoding Y and Z are duplicated, and the proteins encoded by the duplicates continue to interact with X (Figure 1G), the result will be a network with two hierarchically combined modules. This simple example shows that differential duplication can, under appropriate circumstances, yield a network having the property of a modular and hierarchical organization. Again the same effect would be produced if both genes were duplicated, but one duplicated gene was subsequently deleted. Consistent with this reasoning, eukaryotic genomes include substantial numbers of unduplicated genes (“singletons”), in spite of the existence of numerous multi-gene families (Friedman and Hughes 2001a,b).

These theoretical considerations lead to the predictions that biological networks will be characterized by three phenomena: (1) After gene duplication, new interactions will be differentially acquired by duplicates and/or ancestral interactions will be differentially retained by duplicates. The latter will be the case particularly when the duplication involves genes constituting multi-connected nodes (“hubs”) in a gene interaction network. (2) Genes corresponding to network hubs will consist disproportionately of singletons. (3) When genes corresponding to network hubs are duplicated, it will often happen that one copy is quickly deleted. Here we test these predictions with data on gene interaction networks from two model organisms: the genetic interaction network of yeast Saccharomyces cerevisiae (Tong et al. 2004) and protein-protein interaction network of the nematode worm Caenorhabditis elegans (Li et al. 2004).

Methods

The complete sets of predicted protein translations for the following organisms were downloaded from the euGenes website, http://iubio.bio.indiana.edu:8089: yeast Saccharomyces cerevisiase (version 06/24/2002) and Caenorhabditis elegans (version 06/24/2002). For each of these genomes, gene families were assembled using the BLASTCLUST computer program available in the BLAST tools (Altschul et al 1997). This program establishes families by BLASTP homology search and the single-linkage method (i.e., if a match is scored between A and B and between B and C, then A, B, and C are placed in the same family). We used a value of 10−6 for the E parameter (representing the probability that a score as high as that observed between two sequences will be found by chance in a database of the size examined) of the BLAST algorithm. To score a match between two proteins, we further required that 30% of amino acids be identical and 50% of aligned amino acid sites be shared. These criteria have been shown to assemble multigene families whose members show evidence of homology throughout the length of the sequence, thus making them suitable for phylogenetic analysis and estimation of evolutionary distances (Hughes and Friedman 2004).

In the case of selected duplicate gene pairs, sequences were aligned at the amino acid level using the CLUSTALW program (Thompson et al. 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site (dS) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site (dN) were estimated by a maximum likelihood method (Yang and Nielsen 2000) using the software package PAML (Yang 1997). Since most synonymous mutations are selectively neutral or nearly so (Kimura 1977), dS is expected to be correlated with the amount of time since duplication of the two genes compared. By contrast, dN reflects the extent to which the two genes are subject to purifying selection arising from functional constraint on the amino acid sequence (Kimura 1977; Nei 1987).

Information for the yeast genetic interaction network was obtained from Tong et al. (2004), who determined about 4000 interactions for about 1000 genes using computational analysis of data from 132 synthetic gene array screens. Information for the C. elegans protein-protein interaction was obtained from Li et al. (2004), who obtained data on over 4000 interactions using high-throughput yeast two-hybrid screens and combined these results with previously described interactions and in silico predictions for a total of about 5500 interactions. Genes included in these networks (820 genes from yeast and 2606 genes from C. elegans) were matched with the gene families determined by homology search. We examined the relationship between the number of connections a gene had in the network and the size of the family to which it belonged. Family sizes were based on the complete sets of predicted proteins rather than on the sets included in the networks. The clustering coefficient for node i with ki links was defined as Ci = 2ni/ [ki(ki-1)] where ni is the number of links between the ki neighbors of i (Ravasz and Barabási 2003). Note that Ci is undefined for nodes with only one connection.

Information on duplicated segments in the yeast genome was obtained from Seoighe and Wolfe (1999). In order to obtain a conservative estimate of duplicated regions, for purposes of our analyses we did not include duplicated regions designated “possible” or “low-scoring” by those authors.

Phylogenetic analyses were conducted by the following methods: (1) the maximum parsimony (MP) method, implemented in the PAUP* program (Swofford 2002); (2) the quartet maximum likelihood method (QML), implemented in the PUZZLE 5.0 program (Strimmer and van Haeseler 1996); (3) the neighbor-joining (NJ) method (Saitou and Nei 1987), implemented in the MEGA2 program (Kumar et al. 2001). The NJ trees were based on the gamma-corrected amino acid distance, with the shape parameter estimated by the PUZZLE 5.0 program. The reliability of clustering in the MP and NJ trees was assessed by bootstrapping (Felsenstein 1985); 1000 bootstrap samples were used. In QML trees, the proportion of puzzling steps supporting a branch provided a similar index of the reliability of clustering patterns.Since all phylogenetic methods produced essentially identical trees, only the MP tree is shown in the following.

Results

Network Properties

Both the yeast gene interaction network and the C. elegans protein-protein interaction network showed patterns characteristic of a scale-free network, with a negative exponential relationship between the number of connections and the number of nodes in the network having that number of connections. In each case, a regression relating the log number of genes to the log number of connections produced a highly significant linear relationship with a negative slope (Figure 2). The slope of the relationship had a much higher absolute value in the C. elegans network (slope = −1.84; Figure 2B) than in the yeast network (slope = −1.16; Figure 2A). This difference is explained by a higher frequency of nodes with a small number of connections in the former network than in the latter network. In the C. elegans network, 1406 of 2606 nodes (54.0%) had a single connection, whereas in the yeast network only 254 of 820 (31.0%) of nodes had a single connection.

Figure 2
Plots of the log10 number of nodes (N) having a given number of connections vs. the log10 number of connections (Connections) for (A) the yeast genetic interaction network; and (B) the C. elegans protein-protein interaction network. In the yeast network, ...

In spite of this difference, clustering coefficients for the two networks were similar. The median clustering coefficient (Ci) for the yeast network was 0.500, while that for the C. elegans network was 0.533. These medians were not significantly different (Mann-Whitney test). In both networks, there was a strong negative correlation between Ci and the number of connections at a node (Figure 3). The Spearman rank correlation coefficient (rS) between Ci and the number of connections was –0.588 (P <0.001) in the case of the yeast network (Figure 3A) and –0.500 (P < 0.001) in the case of the C. elegans network (Figure 3B).

Figure 3
Plots of the clustering coefficient (Ci) of a node vs. the number connections at the node for (A) the yeast genetic interaction network (rS = −0.588; P < 0.001); and (B) the C. elegans protein-protein interaction network(rS = −0.500; ...

Family Size and Connections

In the yeast network, there was a negative rank correlation between the number of connections a gene had and its family size (rS = −0.124; P < 0.001; Figure 4A). This negative correlation was explained by the fact that most genes with large numbers of connections were singletons (Figure 4A). Nine of 10 genes with 100 or more connections were singletons, and 64 of 83 (77.1%) of genes with 25 or more connections were singletons (Figure 4A). In the C. elegans network, there was a similar, though weaker, negative rank correlation between the number of connections a gene had and its family size (rS = −0.053; P = 0.007; Figure 4B). The protein with the highest number of connections (90) was a member of a 2-member family, while the proteins with the next three highest numbers of connections (82, 74, and 66) were all encoded by singletons (Figure 4B). Of 27 proteins with 25 or more connections, 15 (55.6%) were proteins encoded by singletons.

Figure 4
Plots of the number of connections of a node vs. family size for (A) the yeast genetic interaction network (rS = −0.124; P < 0.001); and (B) the C. elegans protein-protein interaction network (rS = −0.053; P = 0.007).

Because of the negative correlation between Ci and number of connections in both networks (Figure 3), we further examined the relationship between family size and number of connections using rank partial correlation, controlling for the effect of Ci (Table 1). Since Ci is not defined for nodes with only one connection, such nodes were not included in the partial correlation analyses. Even excluding such nodes, there was a significant negative partial rank correlation between family size and number of connections in both the yeast network and the C. elegans network (Table 1). Likewise, in both networks there was a highly significant negative rank partial correlation between number of connections and Ci, controlling for the effect of family size (Table 1). On the other hand, in neither network was there a significant partial rank correlation between family size and Ci, controlling for the number of connections (Table 1).

Table 1
Rank partial correlations among three variables describing network nodes, in each case controlling for the other variable.

Gene Duplication and Network Connections

There is evidence of extensive ancient segmental duplication in the yeast genome, which has been attributed to an ancient polyploidization event which occurred about 200 million years ago (Wolfe and Shields 1997; Seoighe and Wolfe 1999; Friedman and Hughes 2001a; Hughes and Friedman 2003; Kellis et al. 2004). Of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated blocks believed to have originated from polyploidization (Seoighe and Wolfe 1999). The fact that, in spite of their location in duplicated regions, these families contain a single member implies that, after segmental duplication, one duplicate member of each of these 28 families was deleted from the genome.

In C. elegans, we compared the connections of 34 two-member families both members of which were included in the protein interaction network (Table 2). Most of the duplication events giving rise to the 34 pairs of paralogs were ancient, as indicated by high mean values of dS and dN (Table 2). In general, the paralogous gene pairs showed little tendency to share network connections; the mean number of connections shared was less than one, and the median number of connections shared was zero (Table 2). Furthermore, there was no significant correlation between either dS or dN and either the number of connections shared or the percent of connections shared (data not shown). In only three of the 34 gene pairs was dS less than 1.0, and in all three of these pairs no network connections were shared between pair members. C38C10.4 and F22B7.13 were the protein pair with the lowest dS (0.038); these two proteins shared none of the 5 connections of the former protein or of the 6 connections of the latter protein.

Table 2
Summary statistics for variables describing 34 two-member families in the C. elegans protein-protein interaction network.

Phylogenetic analyses of families with multiple members in a network were used to examine the relationship between phylogenetic relatedness and sharing of network connections. Figure 5A shows the phylogenetic tree of MAP kinases from the yeast network; this was the family showing the greatest within-family contrast in numbers of connections in either network. The two genes in this family with the highest numbers of connections, YPL031C (with 62 connections) and YHR030C (with 60 connections), were not sisters (Figure 5A). There was strong (100%) bootstrap support for clustering of YHR030C with YLR113W, which had only 24 connections (Figure 5A). The same clustering pattern received strong support in QML and NJ trees (data not shown). When sharing of connections among these genes was examined, YHR030C was found to share only a single connection with the closely related YLR113W (Figure 5B). On the other hand, YHR030C shared 14 connections with YPL031C (Figure 5B). All other members of this family included in the yeast network shared at most a single connection (Figure 5B).

Figure 5
(A) MP tree of yeast MAP kinases included in the genetic interaction network. Numbers in parentheses after each gene name are numbers of network connections. Numbers on the branches show the percentage of 1000 bootstrap samples supporting the branch. ...

Discussion

A number of general patterns emerged from the evolutionary analysis of two biological networks with rather different properties, a genetic interaction network of yeast and a protein-protein interaction network of C. elegans. First, in both networks, genes belonging to gene families represented by a single member in the genome (“singletons”) were disproportionately represented among the nodes having large numbers of connections. Furthermore, in the case of yeast, there was evidence that when singletons with large numbers of connections have been duplicated, one of the two duplicate copies has frequently been deleted. Of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated genomic segments believed to have originated from an ancient polyploidization event (Seoighe and Wolfe 1999). Each of these 28 loci was thus presumably duplicated along with the genomic segment to which it belongs, but one of the two duplicates has subsequently been deleted.

A second property shared by both networks was the strong negative correlation between the clustering coefficient (Ci) of a node and the number of connections at the node. This relationship held even when the effect of family size was controlled for statistically. Nodes connected to major “hubs” with a large number of connections, tended to be sparsely inter-connected among themselves.

Finally, there was evidence that network connections are remarkably labile over evolutionary time. Immediately after gene duplication, it seems reasonable to suppose that gene duplicates have the same network connections, unless one duplicate is partial or an exon-shuffling event or other recombinational event has accompanied gene duplication. But our results suggest that duplicated genes generally have quite distinct sets of connections, and that such changes can happen soon after duplication, as indicated by paralogous gene pairs in C. elegans with relatively low synonymous divergence.

Taken together, these observations paint a picture of the evolutionary process underlying the known characteristics of biological networks; namely, the properties of being scale-free and modular/hierarchical in organization. A multiply connected node is a hierarchical hub if the nodes connected to it have relatively little connection among themselves, whereas a module within a network would consist of a small set of mutually inter-connected nodes. Therefore, evidence of a negative correlation between the clustering coefficient and the number of connections at a node provides an insight into at least one mechanism maintaining the modular/hierarchical nature of biological networks.

There was evidence that duplicated copies of multiply connected genes are frequently deleted, as evidently happened with such genes in duplicated segments of the yeast genome. Such deletion of duplicate genes has happened frequently enough in yeast to suggest that it may result from natural selection against duplication of multiply connected hubs. Moreover, the evidence that network connections are highly labile over evolutionary time suggests that even when multiply-connected genes are duplicated and both duplicates are retained, one duplicate may lose numerous connections, while the other duplicate retains ancestral connections. This process would result in widely different numbers of connections within multi-gene families, as in the MAP kinase family of yeast (Figure 5).

Acknowledgments

This research was supported by grant GM066710 to A.L.H. from the National Institutes of Health.

References

  • Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Barabási A-L, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–512. [PubMed]
  • Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39:783–791.
  • Friedman R, Hughes AL. Gene duplication and the structure of eukaryotic genomes. Genome Research. 2001a;11:373–381. [PMC free article] [PubMed]
  • Friedman R, Hughes AL. Pattern and timing of gene duplication in animal genomes. Genome Research. 2001b;11:1842–1847. [PMC free article] [PubMed]
  • Hughes AL, Friedman R. Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Research. 2003;13:794–799. [PMC free article] [PubMed]
  • Hughes AL, Friedman R. Differential loss of ancestral gene families as a source of genomic divergence in animals. Pro R Soc Lond B (Suppl) 2004;271:S107–S109. [PMC free article] [PubMed]
  • Kanehisa M (2000) Post-genome Informatics Oxford University Press, Oxford
  • Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624. [PubMed]
  • Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature. 1977;267:275–276. [PubMed]
  • Kumar S, Tamura K, Jakobsen IB, Nei M. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics. 2001;17:1244–1245. [PubMed]
  • Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain P-O, Han J-DJ, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual J-F, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berritz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mange SE, Saxton WM, Strome S, van den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M. A map of the interactome network of the metazoan C. elegans. Science. 2004;303:540–543. [PMC free article] [PubMed]
  • Nei M (1987) Molecular evolutionary genetics. Columbia University Press, New York
  • Ravasz E, Barabási A-L. Hierarchical organization in complex networks. Phys Rev E. 2003;67:026112. [PubMed]
  • Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási A-L. Hierarchical organization of modularity in metabolic networks. Science. 2002;297:1551–1555. [PubMed]
  • Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. [PubMed]
  • Seoighe C, Wolfe KH. Updated map of duplicated regions in the yeast genome. Gene. 1999;238:253–261. [PubMed]
  • Strimmer K, van Haeseler A. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol. 1996;13:964–969.
  • Swofford DL (2002) PAUP*: phylogenetic analysis using parsimony (*and other methods). Sinauer, Sunderland MA
  • Thompson JD, Higgins DG, Gibson T. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. [PMC free article] [PubMed]
  • Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesan H, Goldberg DS, Haynes J, Humphries C, He G, Levinson JN, Lu H, Ménard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu A-M, Shapiro J, Sheikh B, Suter B, Wong SL, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C. Global mapping of the yeast genetic interaction network. Science. 2004;303:808–813. [PubMed]
  • Wagner A. The yeast protein interaction network evolves rapidly and contains few redundant duplicated genes. Mol Biol Evol. 2001;18:1283–1292. [PubMed]
  • Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. [PubMed]
  • Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556. [PubMed]
  • Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000;17:32–43. [PubMed]
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links