Copyright © 2007 Cohen-Gihon et al; licensee BioMed Central Ltd.

Comprehensive analysis of co-occurring domain sets in yeast proteins

1 Sackler Institute of Molecular Medicine, Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
2 Center for Cancer Research Nanobiology Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 21702, USA
3 School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

Corresponding author. Inbar Cohen-Gihon: inbarg/at/tau.ac.il; Ruth Nussinov: ruthn/at/ncifcrf.gov; Roded Sharan: roded/at/post.tau.ac.il

Received December 24, 2006; Accepted June 11, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Protein domains are fundamental evolutionary units of protein architecture, composing proteins in a modular manner. Combinations of two or more, possibly non-adjacent, domains are thought to play specific functional roles within proteins. Indeed, while the number of potential co-occurring domain sets (CDSs) is very large, only a few of these occur in nature. Here we study the principles governing domain content of proteins, using yeast as a model species.

Results

We design a novel representation of proteins and their constituent domains as a protein-domain network. An analysis of this network reveals 99 CDSs that occur in proteins more than expected by chance. The identified CDSs are shown to preferentially include ancient domains that are conserved from bacteria or archaea. Moreover, the protein sets spanned by these combinations were found to be highly functionally coherent, to significantly match known protein complexes, and to be enriched for protein-protein interactions.
These observations serve to validate the biological significance of the identified CDSs.

Conclusion

Our work provides a comprehensive list of co-occurring domain sets in yeast, and sheds light on their function and evolution.

Background

Protein domains are fundamental evolutionary units of protein architecture. They function as independent units and occur in different combinations, formed by duplication, divergence and recombination of genes. In spite of their modularity, the actual number of combinations is only a small fraction of the number of potential combinations, mainly since the evolution of the protein repertoire is based on the expansion of existing protein families rather than on ab initio formation of new proteins [1]. While there is no doubt that the functionality of a protein is derived from its domain composition, the laws governing the domain content of proteins are still largely unknown.

The recent availability of large-scale data on the domain content of proteins (in the form of sequence signatures [2]) allows us to ask fundamental questions regarding protein architecture: What are the common attributes of proteins sharing certain domains? Are domains used independently, or do they form synergistic combinations?

Studies of the combinatorics of domain organization have shown that there are many kingdom-specific two-domain combinations of common domains and that recombinations of these common domain families have been a key factor in the divergence of organisms [3]. Vogel et al. [4] studied combinations of adjacent pairs or triplets of domains, referring to those as supra-domains. About half of the supra-domains were found to be overrepresented within proteins in all kingdoms of life; moreover, these combinations occurred within proteins involved in a variety of functions such as metabolism and regulation.
A follow-up study suggested that these combinations are formed once during evolution of the protein repertoire and are duplicated as a single evolutionary unit [5].

Wuchty et al. [6] and Ye et al. [7] studied domain combinations within proteins using a co-occurrence network of domains, where two domains are linked if they are found within the same protein. Wuchty et al. showed that many domain co-occurrence networks have a giant component containing the vast majority of the nodes. A comparison of domain networks across several genomes revealed that there are similar numbers of domains in higher and lower eukaryotes, while the sizes of highly connected domain subgraphs grow with evolution. This suggests that the increasing complexity of multicellular organisms relates to the formation of new domain combinations. Ye et al. partitioned the co-occurrence network of domains into clusters and showed that domains within the same cluster tend to have similar functions.

Betel et al. [8] devised a method to identify pairs of domains from different proteins that tend to co-occur within the same protein complex. They studied the global properties of the resulting domain networks from two different sources of protein complexes, manually curated collections and large-scale experiments, and found different topologies for these data sources. The former contained large sub-networks corresponding to known biological assemblies, like ribosomal subunits. The latter was typically small-world and contained a few central hubs, mainly of RNA processing and binding domains.

Hegyi and Gerstein [9] investigated the functional similarity of proteins that share domains. They found that about 80% of protein pairs sharing the same domain combination also share the same function. They further showed that about two-thirds of single-domain proteins that share the same domain have the same function.
On the other hand, they found that only 35% of multi-domain protein pairs that share only a single domain have the same function. Müller et al. [10] suggested that changing the repertoire of domain partners in a combination, along with refinement and diversification of the domain repertoire, increases functional complexity.

Other related works focused on identifying and analyzing domain-domain interactions. Several works aimed at inferring domain interactions from protein interactions [11,12] or integrating domain and protein interactions to better explain interactions at the domain level [13]. Others explored the interactions between families of domains, revealing that interactions within families are significantly more frequent than between families [14], or related domain interactions to their co-occurrence within proteins in other organisms [15].

Here we perform a comprehensive study of the domain composition of proteins in yeast. First, we study single domains, characterizing sets of proteins sharing each domain and the distribution of domain connectivities. Second, we use a novel network representation of the domain data to identify combinations of domains that co-occur in proteins more than expected by chance. In contrast to previous works, our framework allows the identification of combinations of any size; moreover, these combinations are allowed to occur non-contiguously along the protein. We study the functional significance of these combinations, which we term co-occurring domain sets (CDSs), and the sets of proteins they induce.

Results

Bipartite graph representation of proteins and domains

We analyzed the domain content of 3,321 S. cerevisiae proteins annotated with 1,588 domains from the InterPro database [2]. We represented these data using a bipartite graph, whose nodes correspond to proteins and domains, and whose edges connect proteins to their constituent domains (Figure 1A).
Next, we investigated the relation between protein degree and domain degree. We identified a significant positive correlation between the degree of a protein and the degrees of its constituent domains (Figure 1C).

Co-occurring domain sets

We used the graph representation to explore the repertoire of CDSs within yeast proteins. In the protein-domain bipartite graph representation, such a combination is represented by a biclique (a fully connected bipartite subgraph; see Figure 1A). In total, we identified 99 significant bicliques in the yeast protein-domain network, each corresponding to a distinct CDS (see Additional file 1). An overview of the identified bicliques is given in Figure 2.
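As a rough illustration of how a CDS maps to a biclique, the following sketch scores domain sets on a toy protein-domain table, using the edge weights and per-protein subset enumeration described in Methods. The protein and domain labels are made up, and the helper names (`edge_weights`, `biclique_scores`) are ours for illustration only. It is not the authors' implementation: it does not filter for maximal bicliques and it omits the randomization step that assigns significance.

```python
import math
from collections import defaultdict
from itertools import combinations


def edge_weights(protein_domains):
    """Weight each protein-domain edge by -log(d' * d'' / m).

    d' is the protein degree, d'' the domain degree, m the number of edges.
    """
    m = sum(len(ds) for ds in protein_domains.values())
    domain_degree = defaultdict(int)
    for ds in protein_domains.values():
        for d in ds:
            domain_degree[d] += 1
    weights = {}
    for p, ds in protein_domains.items():
        for d in ds:
            weights[(p, d)] = -math.log(len(ds) * domain_degree[d] / m)
    return weights


def biclique_scores(protein_domains, min_domains=2, min_proteins=2):
    """Sum edge weights over every domain subset shared by enough proteins."""
    weights = edge_weights(protein_domains)
    members = defaultdict(list)   # domain subset -> proteins containing it
    totals = defaultdict(float)   # domain subset -> summed edge weights
    for p, ds in protein_domains.items():
        for k in range(min_domains, len(ds) + 1):
            for combo in combinations(sorted(ds), k):
                members[combo].append(p)
                totals[combo] += sum(weights[(p, d)] for d in combo)
    return {c: (totals[c], members[c])
            for c in totals if len(members[c]) >= min_proteins}


# Hypothetical toy data: five proteins, four domains.
toy = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"C"},
    "P4": {"D"},
    "P5": {"C", "D"},
}
scores = biclique_scores(toy)
for domains, (score, proteins) in scores.items():
    print(domains, round(score, 3), proteins)
```

In this toy table only the domain set (A, B) is shared by at least two proteins, so it is the only candidate reported. On real data, each candidate score would still have to be compared against scores from degree-preserving random graphs before calling the set significant.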
A direct comparison with the combinations in Vogel et al. [4] is not possible, as the latter focused on kingdom- rather than organism-specific combinations and examined only contiguous domain combinations. However, our organism-specific application yielded 89 new combinations that were not included in [4]. In particular, 14% of the combinations we identified included more than 3 domains, and 20% of the combinations had at least one non-adjacent occurrence (i.e., a protein in which the combination does not occur contiguously). This demonstrates the utility of our method, which can search for CDSs involving any number of possibly non-adjacent domains.

Some of the CDSs we identified were well supported by previous studies. For example, we identified a combination consisting of the VHS domain (IPR002014) and the UIM (ubiquitin interacting motif) domain (IPR003903). The VHS domain has a membrane targeting role in vesicular trafficking in eukaryotic cells [20]. The UIM domain serves as a ubiquitin binding site [21]. The role of the combination of these domains was studied in the STAM2 (signal-transducing adaptor molecule) protein in [22]. It was shown that both VHS and UIM are required for ubiquitin binding. Specifically, the deletion of either one of these domains was shown to dramatically reduce ubiquitin binding, whereas a mutant lacking both domains did not bind ubiquitin at all.

As another example, we identified a combination of the motor region of the myosin head (IPR001609) and the IQ calmodulin-binding region (IPR000048) in the myosin family of proteins. These proteins are responsible for actin-based motility in eukaryotic cells, using ATP hydrolysis to move along actin filaments [23]. They are characterized by three functional subunits: motor head, neck and tail. The head region, located at the N-terminus of the protein, is followed by the neck region.
Both regions are well conserved in evolution (in contrast to the tail region) and are responsible for the actin-based movement. The head is composed of a single motor domain, which contains binding sites for ATP and actin [23]. The attached neck is composed of several repeats of the IQ calmodulin-binding region. This domain forms a rigid structure that serves as a mechanical lever, and the number of such domains in the neck determines the length of the lever arm and, hence, the step size of the myosin motor [24].

Functional annotation of CDSs and the associated proteins

A statistically significant CDS suggests that its associated proteins are involved in similar biological processes. We examined whether the proteins in each of the CDSs exhibited functional coherency according to the Gene Ontology (GO) annotation (Methods). We found that 89 out of the 99 CDSs (90%) were significantly functionally coherent (Figure 2). Bicliques sharing domains or proteins were further found to relate in function, as demonstrated by the biclique network in Figure 2. To investigate the relations between the GO terms characterizing the identified CDSs, we also created a network of GO terms, where each node is a GO slim annotation and two nodes are connected if they significantly share enriched combinations (Figure 3).
Protein-protein interactions within bicliques

The functional coherence of proteins within bicliques led us to investigate the physical connections among them. We expected proteins sharing similar CDSs to interact with one another and to match known yeast complexes. To test whether proteins sharing a particular CDS tend to interact, we compared the fraction of interacting proteins within bicliques to the overall fraction of interacting pairs. We found that proteins sharing a CDS significantly tended to co-interact (p < 2.7e-11 by a hypergeometric test). As we had a reliability estimate for each reported interaction (Methods), we also compared the reliability distributions of within-biclique interactions and all other interactions. We found that the protein-protein interactions within bicliques were significantly more reliable than other reported interactions (p < 0.0014 by a Wilcoxon rank sum test).

As further support for the identified functional relations between proteins sharing a CDS, we tested whether these protein sets are enriched for known protein complexes from the MIPS database [28]. To this end, we computed the fraction of bicliques whose protein sets significantly matched a known complex (Methods). Since the MIPS catalog contains only a limited collection of complexes, we restricted our analysis to bicliques that included at least t proteins annotated as members of some complex in the MIPS catalog, for t = 2 and t = 3. Overall, 73% (16/22) of the protein sets that had at least two MIPS-annotated proteins were significantly enriched for a known complex, and 89% (8/9) of the sets with at least three MIPS-annotated proteins were enriched.

Domain age within combinations

Finally, we studied the age distribution of domains within CDSs. To this end, we classified the yeast domains into ancient domains, which are found also in bacteria or archaea, and new domains, which are specific to yeast (cf. [35]).
CDSs were significantly enriched for ancient domains (p < 9.3e-7, see Methods), and there was an evident correlation between the score of a combination (measuring its overrepresentation) and its enrichment level (p < 0.0047 by Spearman correlation test), as demonstrated in Figure 4.
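The enrichment test for ancient domains described in Methods (a hypergeometric score compared against random ancient/new labelings that respect the class sizes) can be sketched for a single domain set as follows. This is a minimal illustration under our own assumptions: the function names, the toy domain labels and the number of permutations are hypothetical, and the text does not specify how per-CDS results were combined into the overall p-value reported above.

```python
import math
import random


def hypergeom_tail(total, marked, drawn, hits):
    """P(X >= hits) when `drawn` items are sampled without replacement
    from `total` items, `marked` of which carry the label of interest."""
    upper = min(marked, drawn)
    favourable = sum(
        math.comb(marked, i) * math.comb(total - marked, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / math.comb(total, drawn)


def age_enrichment_pvalue(domain_set, ancient, all_domains, n_perm=1000, seed=0):
    """Empirical p-value for enrichment of ancient domains in `domain_set`.

    Labels are reshuffled so that the number of ancient domains is preserved.
    """
    rng = random.Random(seed)
    universe = sorted(all_domains)
    n_ancient = len(ancient)

    def score(labels):
        hits = sum(1 for d in domain_set if d in labels)
        return hypergeom_tail(len(universe), n_ancient, len(domain_set), hits)

    observed = score(set(ancient))
    as_extreme = sum(
        score(set(rng.sample(universe, n_ancient))) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Hypothetical toy universe: ten domains, five labelled ancient.
all_domains = [f"d{i}" for i in range(10)]
ancient = {"d0", "d1", "d2", "d3", "d4"}
observed, empirical_p = age_enrichment_pvalue({"d0", "d1"}, ancient, all_domains)
print(observed, empirical_p)
```

A smaller hypergeometric score means stronger enrichment, so the empirical p-value counts the random labelings that score at least as low as the observed one.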
Discussion

It has previously been shown that the repertoire of domain combinations in an organism's proteome is restricted to only a small fraction of the set of possible combinations [36]. Here we have used a novel representation of proteins and their domains to investigate the landscape of CDSs. We identified global properties of the protein-domain network, as well as specific highly recurrent and biologically significant CDSs. On the global scale, we have shown that the degree distribution of domains in this network follows a power law, and that highly modular proteins tend to contain abundant domains, whereas proteins with a small number of domains tend to contain rare domains. On the local scale, we identified highly recurrent CDSs and investigated the sets of proteins and domains that they induce. We observed that the proteins within these sets significantly tended to interact with one another, participate in similar biological processes, and be associated with the same protein complex. The CDSs were shown to include a significantly high fraction of ancient domains that are conserved from bacteria or archaea.

Our analysis relied on the InterPro database, which includes domain annotations from both structure- and sequence-based sources. In order to investigate the influence of the domain type on our results, we devised a rough classification of domains into two categories: a domain is called sequence-based if it has a PRINTS [29] or SMART [30] source and structure-based if it has a PDB [31], SCOP [32] or CATH [33] source. Out of 1,588 domains in our data set, 359 (22.6%) are sequence-based and 975 (61.4%) are structure-based. Some domains have both sequence and structural annotations (19.2%) and some have neither (35.2%). In addition, 1,488 (93.7%) of the domains have a PFAM [34] source. As PFAM spans most of the domains in InterPro, we focused our analysis on the other two types of domains.
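The reported category sizes can be cross-checked with a few lines of arithmetic. The sketch below recomputes the percentages from the counts given above and applies inclusion-exclusion to the overlap. The variable names are ours, and the percentage of domains with both annotation types is taken as reported, since its raw count is not given in the text.

```python
# Counts reported in the text.
total = 1588
sequence_based = 359
structure_based = 975
pfam = 1488


def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


seq_pct = pct(sequence_based, total)
struct_pct = pct(structure_based, total)
pfam_pct = pct(pfam, total)

both_pct = 19.2  # as reported; raw count not given in the text
# Inclusion-exclusion: neither = 100 - (sequence + structure - both)
neither_pct = round(100 - (seq_pct + struct_pct - both_pct), 1)

print(seq_pct, struct_pct, pfam_pct, neither_pct)  # 22.6 61.4 93.7 35.2
```

The derived value for domains with neither annotation type agrees with the 35.2% stated above, so the four percentages are mutually consistent.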
First, we examined whether the functional coherency of bicliques is more prominent for structure- or sequence-based domains. To this end, we defined a biclique as structure-based if all its constituent domains were structure-based, and sequence-based if it contained at least one sequence-based domain. (We used asymmetric definitions here to overcome the 1:3 bias in the numbers of sequence- and structure-based domains in the data, respectively.) We found that 88.5% of the sequence-based bicliques and 92.1% of the structure-based ones are functionally coherent. These rates are comparable to the observed 90% rate when considering all domains. Furthermore, 28% of the functionally coherent bicliques contained at least one sequence-based domain whose annotated function in InterPro matched the biclique's enriched function. Similarly, 70.8% of the functionally coherent bicliques contained at least one structure-based domain whose annotated function matched the enriched function. These percentages are in line with the frequencies of sequence- and structure-based domains in the InterPro domain collection.

Second, we tested the correlation between the domain type and the tendency of a containing protein to interact. Specifically, we compared the fraction of interacting proteins within either sequence- or structure-based bicliques to the overall fraction of interacting pairs. We found that proteins within bicliques of both types significantly tended to co-interact (p < 0.0059 for sequence-based bicliques and p < 2.5e-5 for structure-based ones). We conclude that the domain type does not significantly bias the interaction enrichment results.

While our work has produced a valuable list of CDSs in yeast, several of its limitations must be acknowledged. First, our method relies on accurate domain annotation of proteins. Even though InterPro is known to have a low false positive rate (0.2%, see [37]), it is far from complete, covering only 67% (3321/4930) of all SwissProt proteins.
Second, in this work we adopted a combinatorial definition of CDSs. That is, a combination was defined as a set of specific domains, and a protein was considered to have this combination only if it contained the exact same set of domains. More general definitions that treat domain occurrences in a probabilistic fashion (e.g., similar to the way that sequence motifs are defined, cf. [38]), together with additional domain data, may uncover additional significant combinations that were missed by the current analysis.

Methods

Data acquisition

We downloaded the InterPro [2] domain annotations for SwissProt proteins [39] in the yeast S. cerevisiae. We considered only InterPro entries of type domain and family, as described in Cohen-Gihon et al. [35]. In total, our data set contained 1,588 domains and 3,321 yeast proteins. Protein-protein interaction data were downloaded from the DIP database [40] (July 2005 download), with a total of 15,147 interactions. The interactions were assigned reliability estimates which were computed using a logistic regression model that takes into account the experimental techniques with which each of the interactions was detected [41]. Manually-curated protein complexes were obtained from the MIPS database [28]. We considered all complexes at the leaves of the MIPS hierarchy (excluding category 550 which includes complexes derived by high-throughput experiments). Levels greater than 3 were collapsed to level 3 (i.e., adding their proteins to the corresponding level 3 complex on the path to the root).

Bipartite graph representation of proteins and domains

We represented the domain content of proteins using a bipartite graph of domain and protein nodes whose edges connect proteins to their constituent domains. We assigned weights to the edges reflecting the chance of observing such edges in a random graph with the same node degrees.
Precisely, for an edge connecting nodes of degrees d' and d'', the edge's weight was set to -log(d'd''/m), where m represents the total number of edges in the graph [42].

Analysis of hub proteins in the yeast PPI network

Hub proteins were classified as in Ekman et al. [18] as proteins involved in 8 or more interactions. To compute the enrichment of multi-domain proteins (with at least 3 domains) in hub proteins we used a hypergeometric score. In detail, let M denote the total number of proteins, let K denote the number of hubs, let N denote the number of multi-domain proteins, and let S denote the number of proteins that are both multi-domain and hub. Then the corresponding p-value is the upper tail of the hypergeometric distribution:

$$p = \sum_{i=S}^{\min(K,N)} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}}$$

Biclique search

A biclique is defined as a fully connected bipartite subgraph, i.e., a subset of proteins P and a subset of domains D, such that each protein in P contains all the domains in D. The score of a biclique reflects the likelihood of observing such a biclique in a random, degree-preserving network, and is defined as the sum of the weights assigned to the biclique's edges. We focused on maximal bicliques that contain at least two domains and at least two proteins. To detect the highest-scoring bicliques we adapted the method of Tanay et al. [19,43]. Briefly, for each protein we enumerated all possible combinations of the domains composing it (up to 2^7 such combinations are possible in our data). Each such combination was assigned a score according to the weight it induces with respect to the protein (i.e., the sum of weights of the edges connecting the member domains to the protein). Iterating over all the proteins in the data set, for each combination C we obtained the total weight of a biclique whose domain set is C. By applying the same algorithm to random protein-domain graphs with the same node degrees (see below), we were able to assign an empirical p-value to each CDS.
This p-value was determined by calculating the biclique's ranking in a list of 100 scores, representing the maximum weight obtained in each of 100 random runs. Only bicliques with p-value < 0.05 were retained (see Suppl. Figure S2). In total, we identified 99 significant bicliques. Randomized protein-domain bipartite graphs were created by starting with the original graph, and iteratively shuffling its edges while maintaining node degrees, using the "switch" method [44].

Functional coherency analysis

Functional coherency of protein sets was based on the Gene Ontology (GO) [45] annotation. The analysis was conducted on the entire GO hierarchy, apart from the analysis related to the GO term network (Figure 3), whose nodes are GO slim annotations.

MIPS complex enrichment analysis

To quantify the correspondence between the bicliques we identified and known complexes from the MIPS database [28], we applied a method described in [46]. Briefly, the set of proteins of each biclique was compared to the known yeast complexes cataloged in MIPS, and the most significant match was selected, using a hypergeometric score. Empirical p-values were calculated by comparing the hypergeometric scores to those obtained for random sets of proteins of the same size. These p-values were further FDR corrected for multiple testing. The fraction of sets with significant matches (p < 0.05) was measured.

Analysis of domain age distribution

For each set of domains in a biclique, the enrichment of ancient domains was measured using a hypergeometric score and compared to the enrichments under random labelings of domains as ancient and new, respecting the size of each class.

Authors' contributions

ICG, RS and RN participated in the design of the study and drafted the manuscript. ICG carried out the analyses. All authors read and approved the final manuscript.

Acknowledgements

We thank Tomer Shlomi and Eitan Hirsh for their help with the MIPS complex enrichment calculations.
This research was supported in part by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under contract number NO1-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. ICG is a fellow of the Edmond J. Safra Bioinformatics Program and of the Ela Kodesz Research and Scholarship Fund at Tel-Aviv University. RS was supported by an Alon fellowship and by the Israel Science Foundation (grant no. 385/06).

References