Copyright © 2005 Wuchty and Almaas; licensee BioMed Central Ltd.

Evolutionary cores of domain co-occurrence networks

1 Northwestern Institute of Complexity, Northwestern University, Evanston, IL, USA
2 Center for Complex Network Research and Department of Physics, University of Notre Dame, Notre Dame, IN 46556, USA

Corresponding author. Stefan Wuchty: s-wuchty/at/northwestern.edu; Eivind Almaas: ealmaas/at/nd.edu

Received November 22, 2004; Accepted March 23, 2005.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The modeling of complex systems, as disparate as the World Wide Web and the cellular metabolism, as networks has recently uncovered a set of generic organizing principles: most of these systems are scale-free while at the same time modular, resulting in a hierarchical architecture. The structure of the protein domain network, where individual domains correspond to nodes and their co-occurrences in a protein are interpreted as links, also falls into this category, suggesting that domains involved in the maintenance of increasingly developed, multicellular organisms accumulate links. Here, we take the next step by studying link-based properties of the protein domain co-occurrence networks of the eukaryotes S. cerevisiae, C. elegans, D. melanogaster, M. musculus and H. sapiens.

Results

We construct the protein domain co-occurrence networks from the PFAM database and analyze them by applying a k-core decomposition method that isolates the globally central (highly connected domains in the central cores) from the locally central (highly connected domains in the peripheral cores) protein domains through an iterative peeling process.
Furthermore, we compare the subnetworks thus obtained to the physical domain interaction network of S. cerevisiae. We find that the innermost cores of the domain co-occurrence networks gradually grow with increasing degree of evolutionary development in going from single-celled to multicellular eukaryotes. The comparison of the cores across all the organisms under consideration uncovers patterns of domain combinations that are predominantly involved in protein functions such as cell-cell contacts and signal transduction. Analyzing a weighted interaction network of PFAM domains of yeast, we find that domains having only a few partners frequently interact with these, while the converse is true for domains with a multitude of partners. Combining domain co-occurrence and interaction information, we observe that the co-occurrence of domains in the innermost cores (globally central domains) strongly coincides with physical interaction. The comparison of the multicellular eukaryotic domain co-occurrence networks with that of the single-celled S. cerevisiae (the overlap network) uncovers small, connected network patterns.

Conclusion

We hypothesize that these patterns, consisting of the domains and links preserved through evolution, may constitute nucleation kernels for the evolutionary increase in proteome complexity. Combining co-occurrence and physical interaction data, we argue that the driving force behind domain fusions is a collective effect caused by the number of interactions and not the individual interaction frequency.

Background

Many complex systems can best be analyzed as networks in which the basic building blocks of the system are represented as nodes and their interactions as links. Recent studies of systems as disparate as the network of scientific co-authorships, sexual contacts and the World Wide Web have revealed unexpected similarities, suggesting that their structure and growth are ruled by a set of generic organizing principles [1,2].
A variety of biological systems, like food webs and the various biochemical interactions between genes, proteins and metabolites, have been found to exhibit similar large-scale traits [3-6]. The most prominent is the scale-free property of the connectivity distribution. When combined with a modular structure, the resulting network consists of a hierarchy of interwoven clusters [7-10].

Protein crystallography reveals that the fundamental unit of protein structure is the domain. Independent of neighboring sequences, this region of a polypeptide chain folds into a distinct structure and mediates biological functionality [11]. Comparisons of domain architectures of proteins in multicellular organisms have provided evidence that preexisting domain architectures have predominantly been supplemented with single domains at their terminal sites [12,13]. Functional links between proteins have also been detected by analyzing the fusion patterns of protein domains. Two separate proteins A and B in one organism may be expressed as a fusion protein in other species. A protein sequence containing both A and B is termed a Rosetta Stone sequence. However, this framework only applies in a minority of cases [14].

The structure of the protein domain network, where individual domains are nodes and their co-occurrences in a protein are interpreted as links, also displays a scale-free structure [15-17]. Domains that are involved in cell-cell interactions, signal transduction and cell differentiation have been found to accumulate links, reflecting the increasing complexity of the organism-specific evolutionary development in going from bacteria to eukaryotes.

In a recent study [18], we classified yeast proteins as being either globally or locally central according to the number and density of links in their network neighborhoods. In particular, we applied an iterative decomposition method (see Methods) that systematically uncovered core networks with nodes having degrees of at least k.
In nesting through the different cores, we defined highly connected proteins in the innermost cores as globally central, while we called proteins placed in cores on the periphery locally central. This categorization allowed us to demonstrate that globally central proteins participate in a substantial number of complexes while simultaneously displaying a high level of evolutionary conservation.

Here, we apply this core decomposition method to study the properties of the protein domain co-occurrence networks of the eukaryotes S. cerevisiae, C. elegans, D. melanogaster, M. musculus and H. sapiens, allowing us to classify the various domains as either globally or locally central. In going from the single-celled yeast to the more highly evolved multicellular organisms considered, we find that the number of globally central domains increases with the organism's level of evolutionary development. Also, the overlap network, which consists only of the nodes and links shared by all the organism-specific cores, reveals those domain fusions that have been preserved through evolution.

Comparing the co-occurrence networks to the physical protein domain interaction network of S. cerevisiae [17,19], we find that links that appear in the innermost cores of the co-occurrence networks of higher eukaryotes strongly coincide with physical interactions. The co-occurrences of domains that place them in the innermost cores of the co-occurrence networks might represent evolutionary patterns that serve as a putative proteome backbone. Since we find the driving force behind fusion events to be not a high frequency of interactions between a given protein domain pair but a large number of individual interaction partners, we conclude that links appearing in the innermost cores of the co-occurrence networks are the result of underlying important domain interactions.
Results

Statistics of domain networks

Table 1 summarizes basic statistics of the domain co-occurrence networks of H. sapiens, M. musculus, C. elegans, D. melanogaster and S. cerevisiae. All the domain co-occurrence networks have a major component containing the vast majority of the nodes, co-existing with many small, connected components. Both the average degree $\langle k \rangle$ and the average clustering coefficient $\langle C \rangle$ of the networks gradually increase with the organism's level of development. Determining the number of domains $N$ that proteins in the proteomes of H. sapiens, M. musculus, C. elegans, D. melanogaster and S. cerevisiae contain, we observe power laws in the frequency distributions thus obtained, $P(N) \sim N^{-\delta}$ (Fig. 1b). The relation $k \sim N^{\varepsilon}$ suggests that, on average, frequently occurring domains are combined with an increasing number of changing partners. All co-occurrence networks display a scale-free degree distribution [15] (Fig. 1d). The form $C(k) = \alpha(\beta + k)^{-\gamma}$ indicates the networks' inherent modularity (Fig. 1e).
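The two summary statistics used above, the average degree and the average clustering coefficient, can be illustrated with a minimal sketch. It assumes the network is stored as a dictionary mapping each node to the set of its neighbors; the function names and the toy network are hypothetical and are not taken from the study.

```python
from itertools import combinations


def average_degree(adj):
    """Mean number of neighbors per node in an undirected network."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering(adj, i):
    """Triangles through node i relative to the possible number of triangles."""
    k = len(adj[i])
    if k < 2:
        return 0.0  # a node with fewer than two neighbors cannot form a triangle
    triangles = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2.0 * triangles / (k * (k - 1))


def average_clustering(adj):
    """Mean of the node clustering coefficients over all nodes."""
    return sum(clustering(adj, i) for i in adj) / len(adj)


# Toy network (hypothetical): a triangle A-B-C with a pendant node D on A.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(average_degree(toy))
print(average_clustering(toy))
```

In the toy network, node A has three neighbors but only one of the three possible neighbor pairs is linked, so its clustering coefficient is 1/3, whereas B and C each sit in a closed triangle and have a coefficient of 1.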
Cores of domain networks

Because the domain co-occurrence networks considered differ in size, we find different numbers of k-cores. While the networks of H. sapiens and M. musculus are decomposed into 8 nested k-cores (k = 1, ..., 8), we find 6 k-cores in D. melanogaster and C. elegans (k = 1, ..., 6). There are only 4 in S. cerevisiae (k = 1, ..., 4). The placement of a node in a certain core allows an assessment of its role in the topology. A hub (a highly connected node) that is only a member of the peripheral k-cores is defined as locally central, while nodes that are members of the innermost cores (not necessarily the biggest hubs of the whole network) are globally central (Fig. 2f).
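The iterative peeling behind the k-core decomposition can be sketched as follows. This is a minimal illustration under the assumption that the network is given as an undirected edge list; the helper names (`build_adjacency`, `core_numbers`, `k_core`) and the toy network are hypothetical and do not reproduce the authors' implementation.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a {node: set(neighbors)} dictionary."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def core_numbers(adj):
    """Map each node to the largest k for which it still belongs to the k-core."""
    remaining = {u: set(vs) for u, vs in adj.items()}  # working copy
    core = {}
    k = 1
    while remaining:
        # Peel nodes with fewer than k links until none are left;
        # removing a node can push its neighbors below the threshold.
        low = [u for u, vs in remaining.items() if len(vs) < k]
        while low:
            for u in low:
                for v in remaining.pop(u):
                    remaining[v].discard(u)
                core[u] = k - 1
            low = [u for u, vs in remaining.items() if len(vs) < k]
        k += 1
    return core


def k_core(adj, k):
    """Return the k-core as an adjacency dictionary (empty if it does not exist)."""
    core = core_numbers(adj)
    keep = {u for u, c in core.items() if c >= k}
    return {u: adj[u] & keep for u in keep}


# Toy network (hypothetical): a triangle A-B-C plus a hub H with three leaves.
toy = build_adjacency([
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("A", "H"), ("H", "x1"), ("H", "x2"), ("H", "x3"),
])
cores = core_numbers(toy)
print(len(toy["H"]), cores["H"])  # degree and core number of the hub
print(len(toy["A"]), cores["A"])  # degree and core number of a triangle node
```

In the toy network the hub H has the highest degree but is peeled away before the triangle, which mirrors the distinction drawn above: H would count as locally central, while A, B and C form the innermost core.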
In Fig. 2a–e [...] Nesting through the innermost cores of the more evolved organisms, we find that the initially small innermost core of yeast is enriched with clusters of densely connected domains (Fig. 2b–e).

Domain interaction network

Information about protein domain interactions from the InterDom database [21] constitutes an undirected network of yeast protein domain interactions. In contrast to the domain co-occurrence networks, each link has a weight that reflects the frequency of the corresponding interaction relative to a random background distribution [21]. The degree distribution of the domain interaction network (Fig. 3a) [...] $C(k) \sim k^{-\delta}$ (Fig. 3a) [...] (Note that the strength $s_i$ of a node $i$ equals the degree $k_i$ in a network where all weights are 1.) In the inset of Fig. 3b, [...]. This measure allows us to observe a decreasing trend of $s(k)$ with $k$ (Fig. 3b).
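The relation between degree and strength in a weighted network can be made concrete with a short sketch. It assumes the interactions are given as `(domain_i, domain_j, weight)` triples; the function name and the toy weights are hypothetical, and the per-link average `s / k` printed at the end is an illustrative normalization chosen here, not necessarily the measure plotted in the figure.

```python
def degree_and_strength(weighted_edges):
    """Return ({node: degree}, {node: strength}) for an undirected weighted network."""
    degree, strength = {}, {}
    for i, j, w in weighted_edges:
        for node in (i, j):
            degree[node] = degree.get(node, 0) + 1           # one more partner
            strength[node] = strength.get(node, 0.0) + w     # add the link weight
    return degree, strength


# Toy interactions (hypothetical weights): P has three partners, Q only one.
toy = [("P", "Q", 4.0), ("P", "R", 1.0), ("P", "S", 1.0)]
degree, strength = degree_and_strength(toy)
for node in sorted(degree):
    # Average weight per link, an assumed normalization for illustration only.
    print(node, degree[node], strength[node], strength[node] / degree[node])
```

Setting every weight to 1 makes the strength of each node equal to its degree, which is the special case noted in the text.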
How, then, is the domain interaction network related to the domain co-occurrence network? In each core of the domain co-occurrence networks, we calculated the fraction of links present in the yeast domain interaction network (Fig. 4a).
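The per-core comparison can be sketched as a simple set intersection. The example assumes both networks are available as undirected edge lists; the function names and the toy data are hypothetical.

```python
def edge_set(edges):
    """Normalize undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def fraction_in_interaction_network(core_edges, interaction_edges):
    """Fraction of co-occurrence links in a core that also appear as interactions."""
    core = edge_set(core_edges)
    if not core:
        return 0.0  # an empty core has no links to compare
    return len(core & edge_set(interaction_edges)) / len(core)


# Toy data (hypothetical): four co-occurrence links, two of them also interact.
core_links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
interactions = [("B", "A"), ("C", "D"), ("X", "Y")]
print(fraction_in_interaction_network(core_links, interactions))
```

Storing each link as a `frozenset` keeps the comparison independent of the order in which the two domains are listed.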
Overlap of domain network cores

Many of the domains appear ubiquitously in the innermost eukaryotic cores of the co-occurrence networks (see e.g. Fig. 2).
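An overlap network of the kind described here, consisting of the links shared by all organism-specific cores, can be sketched as an intersection of edge sets. The function name and the toy cores are hypothetical, and the sketch keeps only nodes that carry at least one shared link, which is an assumption about how isolated shared nodes are treated.

```python
def overlap_network(edge_lists):
    """Return (nodes, links) shared by every network in edge_lists."""
    edge_sets = [{frozenset(edge) for edge in edges} for edges in edge_lists]
    shared_links = set.intersection(*edge_sets)
    # Nodes are taken from the shared links only (assumption, see lead-in).
    shared_nodes = set().union(*shared_links)
    return shared_nodes, shared_links


# Toy cores (hypothetical) for three organisms.
core_1 = [("A", "B"), ("B", "C"), ("C", "D")]
core_2 = [("B", "A"), ("B", "C"), ("D", "E")]
core_3 = [("A", "B"), ("C", "B"), ("A", "C")]
nodes, links = overlap_network([core_1, core_2, core_3])
print(sorted(nodes))
print(sorted(sorted(link) for link in links))
```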
Discussion & conclusions

Although the PFAM database provides comprehensive domain information, it covers only a part of the considered proteomes. Similarly, the determination of putative domain interactions depends on the quality and completeness of the underlying sets of protein interactions. Yet, the heterogeneity of scale-free networks indicates that the general characteristics of domain co-occurrence and interaction networks are independent of the network's actual size [17]. In particular, such networks are governed by the presence of highly connected hubs and cohesive areas, factors that influence not only their integrity but also the determination of k-cores. Since biological networks have been found to be stable upon random perturbations, we expect that the addition of new data will not dramatically impact our findings.

The idea of analyzing the protein domain co-occurrence network as a sequence of nested cores, and comparing the overlap between the central cores of eukaryotic organisms with increasing level of evolutionary development, gives new and fundamental insights into the qualitative arrangement and evolutionary utilization of the proteome. The evolutionary trend toward multicellularity requires proteomes capable of new and additional complex cellular processes such as signal transduction or cell-cell contacts. On a node-based level, this trend toward higher complexity is reflected by a considerably heightened connectivity of domains that support such functions in multicellular organisms [15]. Turning our attention to a link-based level, panels in Fig. 2 [...] In fact, a significant portion of the protein architecture is found to be homologous in H. sapiens and D. melanogaster, while substantial innovation in the creation of new protein architectures has also been detected [12].
The expansion of selected domain families and the accompanying evolution of complex domain architectures by joining presumably pre-existing domains coincide with the increase in the organism's level of evolutionary development. In particular, changes in the domain architectures are the consequence of a cellular mechanism commonly known as 'domain shuffling', which appears in different guises [20]. In simple cases of creating a new domain architecture, domains are inserted into preexisting domain arrangements, a mechanism known as domain insertion, while domain duplication refers to the internal duplication of at least one domain in a gene. Comparisons of domain architectures of proteins in multicellular organisms have provided evidence that preexisting domain architectures have been supplemented with single domains at their terminal sites, another mechanism known as domain accretion [13]. Our results do not favor one mechanism over the other. Yet, the panels in Fig. 2 [...]

The decomposition of the domain co-occurrence networks into k-cores allows us to uncover those sets of domains that are embedded in densely connected areas of the networks. The high connectivity, as well as the nature of the partners those domains appear with, indicates a central topological and functional role in the proteome of the considered organisms. Nesting toward the innermost cores, the significance of these links is supported by the observation that pairs of co-occurring domains are increasingly present as physical interactions in yeast. Utilizing the combined information of the co-occurrence network and the physical interaction network, we also find that domains tend to interact infrequently if they have many different interaction partners. In contrast, we observe that domains interact increasingly frequently when they have only a small number of partners.
Although we considered domain fusions on an indirect and qualitative basis, this series of observations suggests that the driving force behind domain fusion events is not frequent interactions. In fact, it seems that the number of interactors of a domain, its connectivity, mainly influences the domain's propensity to fuse with other interactors. The trend to spatially organize otherwise randomly diffusing domains might help to organize the flow of information in cells. In conclusion, we find that domain fusion is a means to supersede the random diffusive interaction of a domain pair by embedding the pair in an architecture that ensures interactions that would be difficult to achieve by random diffusion in a cell alone.

Methods

Network representation

An undirected unweighted network of $n$ nodes is conveniently represented as a symmetric $n \times n$ adjacency matrix $A = (a_{ij})$, where $a_{ij} = 1$ if there exists an edge between nodes $i$ and $j$ and $a_{ij} = 0$ otherwise. In a weighted network, the adjacency matrix reads $A = (a_{ij} w_{ij})$, where $w_{ij}$ represents the weight of edge $ij$. Consistently, the degree, being the number of neighbors a node $i$ has, is $k_i = \sum_j a_{ij}$. As a generalization of a node's degree, the strength $s_i$ of a node $i$ is defined as $s_i = \sum_j a_{ij} w_{ij}$ [22].

Proteome databases and domain co-occurrence network

The Integr8 database [24,25] provides comprehensive statistical and comparative analyses of the proteomes of fully sequenced organisms. Every predicted protein is annotated with the domains it contains, utilizing the combined efforts of different domain sequence sources. For our analysis, we focused on the domain data retrieved from the PFAM database, a reliable collection of multiple sequence alignments of protein families and profile hidden Markov models [26]. We construct the protein domain networks by considering all PFAM domains (or nodes) that co-occur in a protein to be a fully connected clique of undirected links (see Fig. 1a).
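The clique construction can be sketched in a few lines. The example assumes the annotation is available as a mapping from protein identifier to a list of domain names; the function name, the protein identifiers and the domain combinations are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def cooccurrence_network(protein_domains):
    """Link every pair of distinct domains that appear together in a protein."""
    adj = {}
    for domains in protein_domains.values():
        unique = set(domains)  # a domain repeated within a protein counts once
        for d in unique:
            adj.setdefault(d, set())
        # Each protein contributes a fully connected clique of its domains.
        for a, b in combinations(sorted(unique), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Toy annotation (hypothetical protein identifiers and domain combinations).
toy = {
    "prot1": ["SH3", "SH2", "Pkinase"],
    "prot2": ["SH3", "SH2"],
    "prot3": ["PH"],
}
network = cooccurrence_network(toy)
for domain in sorted(network):
    print(domain, sorted(network[domain]))
```

Because neighbors are stored in sets, a domain pair that recurs in several proteins still yields a single unweighted link, and a protein with only one domain contributes an isolated node.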
In Integr8 we find 19,061 proteins that have a PFAM annotation in H. sapiens, as well as 18,953 in M. musculus, 9,785 in D. melanogaster, 12,587 in C. elegans and 3,791 in S. cerevisiae. Although domain combinations $ij$ potentially occur repeatedly in a proteome, we assign weight $w_{ij} = 1$ to every link between domains $i$ and $j$. Following this procedure, we generated domain networks for the proteomes of H. sapiens, M. musculus, C. elegans, D. melanogaster and S. cerevisiae.

Domain interaction data

The InterDom database [21,27,28] provides computationally derived putative domain interactions of yeast. Based on PFAM domain information [26], the presence of potential domain interactions is determined for each set of protein interactions, including pairwise protein interactions, protein complexes and Rosetta Stone sequences. The occurrence of a domain interaction in each protein interaction set is evaluated by comparing the observed frequencies to a random background model. The score thus obtained reflects the abundance of a particular pairwise domain interaction, allowing an assessment of the reliability and significance of the considered domain interaction. Considering these scores as weights $w_{ij}$ of interactions between protein domains $i$ and $j$, we generate an undirected network of 3,353 domains that are embedded in 28,339 weighted interactions.

Network degree distribution

The simplest way to characterize a network is by the degree $k$ (or connectivity) of the nodes, reflecting the number of neighbors each node has. Accordingly, we define the average degree of a network as $\langle k \rangle = (1/N) \sum_i k_i$, where $N$ is the total number of nodes. Recent studies of biological networks have produced compelling evidence that the network degree distribution (the probability that a node has $k$ neighbors) is scale-free with the functional form $P(k) \sim k^{-\gamma}$ [1,9].
An important feature of the power-law distribution is the presence of a minority of nodes, called 'hubs', that carry a vast number of connections.

Network clustering

Another important feature of biological networks is their tendency to exhibit cohesive areas. The clustering coefficient [31] of a node $i$ measures the actual number of triangles that node $i$ is a member of, relative to the possible number of triangles. Formally, it is defined as $C_i = 2 n_i / (k_i (k_i - 1))$, where $n_i$ denotes the number of triangles. Accordingly, we define the average clustering coefficient as $\langle C \rangle = (1/N) \sum_i C_i$. The clustering coefficient of a network also carries information about its modular nature, since $\langle C \rangle \sim 1$ necessitates the presence of tightly interconnected clusters of nodes. Note that the network has a hierarchical architecture when $C(k) \sim k^{-\alpha}$, allowing the existence of discernible, yet topologically overlapping, functional modules. Apparently, networks with this structure are observed in most types of biological systems, where a small subset of hubs plays the important role of linking, and hence bridging, the various network modules [9,32].

k-cores

The k-core of a graph is defined as the largest subgraph for which every node has at least k links (Fig. 2).

References