Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression

Copyright © 2009 De Bodt et al; licensee BioMed Central Ltd.

1 Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, B-9052 Gent, Belgium
2 Department of Plant Biotechnology and Genetics, Gent University, Technologiepark 927, B-9052 Gent, Belgium

Corresponding author. Stefanie De Bodt: stefanie.debodt/at/psb.vib-ugent.be; Sebastian Proost: sebastian.proost/at/psb.vib-ugent.be; Klaas Vandepoele: klaas.vandepoele/at/psb.vib-ugent.be; Pierre Rouzé: pierre.rouze/at/psb.vib-ugent.be; Yves Van de Peer: yves.vandepeer/at/psb.vib-ugent.be

Received February 11, 2009; Accepted June 29, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Large-scale identification of the interrelationships between different components of the cell, such as the interactions between proteins, has recently gained great interest. However, unraveling large-scale protein-protein interaction maps is laborious and expensive. Moreover, assessing the reliability of the interactions can be cumbersome.

Results

In this study, we have developed a computational method that exploits the existing knowledge on protein-protein interactions in diverse species through orthologous relations on the one hand, and functional association data on the other hand to predict and filter protein-protein interactions in Arabidopsis thaliana. A highly reliable set of protein-protein interactions is predicted through this integrative approach making use of existing protein-protein interaction data from yeast, human, C. elegans and D. melanogaster.
Localization, biological process, and co-expression data are used as powerful indicators for protein-protein interactions. The functional repertoire of the identified interactome reveals interactions between proteins functioning in well-conserved as well as plant-specific biological processes. We observe that although common mechanisms (e.g. actin polymerization) and components (e.g. ARPs, actin-related proteins) exist between different lineages, they are active in specific processes such as growth, cancer metastasis and trichome development in yeast, human and Arabidopsis, respectively.

Conclusion

We conclude that the integration of orthology with functional association data is adequate to predict protein-protein interactions. Through this approach, a high number of novel protein-protein interactions with diverse biological roles is discovered. Overall, we have predicted a reliable set of protein-protein interactions suitable for further computational as well as experimental analyses.

Background

The complex regulation of diverse biological processes acting in eukaryotic organisms is only possible through interactions between different components in the cell. Proteins, for instance, can be part of extensive complexes, such as transcription factor complexes for the combinatorial control of their target genes or proteasome complexes for protein degradation. It is the specific timing and location of the activity of these protein complexes that defines their role in the cell. Several attempts have been made to infer protein-protein interaction maps in diverse model organisms through large-scale experimental methods [1,2]. In yeast [3-8], human [9,10], Drosophila [11] and C. elegans [12], genome-wide Y2H screens and large-scale affinity purification/mass spectrometry studies have been performed.
Nevertheless, the reliability of the results of these studies is thought to be relatively poor because in general quite a small overlap between datasets of experimentally identified interactions is observed. However, there is growing evidence that this observation is due to the complementarity of different methods (e.g. sensitivity of Y2H versus TAP, or different experimental conditions) rather than to a high number of false interactions [8]. Yu et al. (2008) conclude that both Y2H and affinity-purification followed by mass spectrometry (AP/MS) data are of equally high quality but of a fundamentally different and complementary nature. These authors show that, compared to interaction maps based on complex purification and identification, the binary interaction map of yeast proteins is enriched for transient signaling interactions and inter-complex connections [8]. In any case, assessment of the data quality is necessary, not only to design future experiments but also to construct high confidence datasets (or gold standard datasets) used for the training and evaluation of computational methods [13-16]. Several efforts have been made to centralize protein-protein interaction data through the construction of databases such as DIP [17], MINT [18], BioGRID [19] and IntAct [20]. To make full use of the currently available interaction data, computational methods are being developed to assess the quality of experimentally generated protein-protein interactions and to predict new interactions [2,21,22]. Whereas earlier analyses focused on the relation between gene expression and protein-protein interaction only [12,23-26], the integration of several lines of evidence (further referred to as genomic features) in the prediction or validation of protein-protein interactions is highly valued in recent studies, as it increases the performance as well as the coverage of the method [15,27-29]. 
Typically used genomic features encompass (1) functional features such as Gene Ontology (GO) annotation of the proteins, co-expression of the encoding genes, coordinated protein abundance and co-essentiality, (2) structural features such as co-occurrence of protein domains and overrepresented sequence motifs, (3) comparative genomics-based features such as orthology and, primarily exploited in prokaryotes, phylogenetic profiles, gene neighborhood, co-evolution, and Rosetta Stone (gene split or fusion), and (4) network topology-based features, such as connectivity [30,31,21,27]. In principle, two approaches can be discerned in the prediction of protein-protein interactions. The first approach starts from protein pairs that are identified to be orthologous to known interacting proteins in other species (interolog detection). The interolog detection strategy was initially developed to transfer information on protein-protein interactions from yeast to higher organisms [32-34]. This method assumes that protein-protein interactions are conserved between organisms and that pairs of proteins whose orthologs are known to interact in other species probably interact in the species of interest as well. Although some shortcomings can be identified, protein complexes do show evolutionary conservation [35,36]. Numerous studies have been published in which, mainly human protein-protein interactions are predicted based on interolog detection [37,38]. Furthermore, predictions are made through integrative approaches in a probabilistic framework [39-41]. Other studies start from all possible protein pairs, often incorporating interolog detection as a genomic feature or do not include interolog detection at all [27,42]. The latter approach has the advantage that interactions do not need to be conserved over long evolutionary distances, but often identify associations between genes rather than protein-protein interactions [28,29,43,44]. 
For the model plant Arabidopsis thaliana, attempts to construct large-scale protein-protein interaction maps as well as the application and critical assessment of computational methods have been rather limited [45,46]. In this study, we aim to predict a reliable set of protein-protein interactions suitable for experimental validation as well as further computational analyses. First of all, we investigate whether the assumptions underlying our approach are valid in the model plant Arabidopsis thaliana: namely, (1) (some) protein-protein interactions in yeast and animals (source organisms) are conserved in Arabidopsis thaliana (target organism), (2) interacting proteins co-localize, (3) interacting proteins function in the same biological process, and (4) genes encoding interacting proteins show similar expression patterns. In doing so, the relative contribution of these features to the prediction of protein-protein interactions in Arabidopsis thaliana is assessed. The prediction of Arabidopsis protein-protein interactions is performed, exploiting the conservation of these interactions between species on the one hand, and utilizing functional association data on the other hand. Finally, protein complexes are delineated from the predicted protein-protein interaction network and the function and evolutionary conservation of these protein complexes are studied.

Results

Integration of orthology, GO annotation and gene expression

The prediction of protein-protein interactions in Arabidopsis thaliana performed in this study is based on the detection of interologs. Interologs are defined as protein-protein interactions that are conserved between two species and that can be detected through the identification of the orthologs in the target organism of the proteins known to interact in the source organism (see Fig. 1).
When identifying interologs, we need to deal with the high number of duplicated genes in the target organism Arabidopsis thaliana (see Methods; Fig. 1).
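As an illustration of the interolog transfer described above, the following is a minimal Python sketch and not the authors' implementation. It assumes the orthologous groups have already been reduced to a dictionary from each source protein to its set of Arabidopsis (co-)orthologs; the function name, the data layout and all identifiers are hypothetical. Because every combination of (co-)orthologs is generated, a single source interaction can yield several predicted pairs when the target genes are duplicated.

```python
from itertools import product


def predict_interologs(source_interactions, orthologs):
    """Transfer source-organism interactions to the target organism.

    source_interactions: iterable of (protein_a, protein_b) pairs known to
        interact in a source organism.
    orthologs: dict mapping each source protein to the set of target
        proteins in the same orthologous group (hypothetical layout).
    """
    predicted = set()
    for a, b in source_interactions:
        # All combinations between the (co-)orthologs of both partners.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if ta == tb:
                continue  # simplification of this sketch: skip self-pairs
            predicted.add(tuple(sorted((ta, tb))))  # store as undirected pair
    return predicted


# Toy input; all identifiers are made up for illustration.
source = [("srcA", "srcB"), ("srcB", "srcC")]
orthologs = {"srcA": {"tgt1", "tgt2"}, "srcB": {"tgt3"}}
print(sorted(predict_interologs(source, orthologs)))
```

In this toy input, srcA has two co-orthologs in the target organism, so the single srcA-srcB interaction produces two candidate pairs, while the srcB-srcC interaction produces none because srcC has no ortholog.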
To decide if two proteins co-localize and/or function in the same biological process, both the GO cellular component (CC) and the GO biological process (BP) annotation of the interacting proteins were evaluated. To measure the similarity between the GO annotations of two proteins, we calculated the maximum depth of the common ancestor of all pairs of GO terms assigned to both proteins (see Methods; Additional file 2; Fig. 2). In summary, 52.6% of the experimentally identified protein-protein interactions (767 out of the 1457 interactions) meet the conditions of the genomic features (GO biological process, GO cellular component, co-expression) (see Table 1 and Fig. 2).

Prediction of protein-protein interactions in Arabidopsis thaliana

Starting from the protein-protein interactions in the source organisms yeast, C. elegans, Drosophila and human, protein-protein interactions were predicted in Arabidopsis using the above-mentioned genomic features. We downloaded the OrthoMCL database containing orthologous groups (OG) of 87 species. From this dataset, we extracted the evolutionary relationships between Arabidopsis, yeast, C. elegans, Drosophila and human genes. The approach taken to build this database has the advantage that in-paralogs (duplicates arisen after speciation) and thus orthologous groups rather than one-to-one orthologs can be identified. Employing these orthologous relationships between yeast, C. elegans, Drosophila and human genes on the one hand, and Arabidopsis genes on the other hand, interologs were detected (see Methods and Fig. 1).
The protein-protein interactions detected in the filtered and predicted interactome were compared to experimentally shown and previously predicted protein-protein interactions (see Fig. 3).
Accessibility of the interactome

We have developed an easy-to-use query and visualization system to represent the inferred interactome. The discussed protein clusters as well as the complete predicted interactome can be observed through a web-start version of Cytoscape that can be found at http://bioinformatics.psb.ugent.be/supplementary_data/stbod/athPPI/. A node and edge attribute system is employed to represent the different types of information. The color of the edge represents the degree of co-expression calculated as the Pearson correlation coefficient (green: correlation, purple: anticorrelation), while the line width of the edge represents the GO biological process similarity score (thick: similar biological process) and the line style of the edge represents the GO cellular component similarity score (solid: similar localization). The color of the nodes corresponds to the protein cluster the protein belongs to (see further). Proteins belonging to small or no clusters are colored in grey. TAIR functional descriptions are shown as node labels. Subsets of the interactome that are of interest to the researcher can be visualized easily by querying the interactome for (a) protein(s) or a functional description.

Delineation of protein clusters in the predicted interactome

In an attempt to reveal the functional repertoire of the predicted interactome, we have delineated highly interconnected regions in the protein-protein interaction network (hereafter called protein clusters). In addition, we tried to assign a function to the identified protein clusters. Finally, we investigated the evolutionary conservation of the protein clusters. Identification of protein clusters is performed using the CAST clustering algorithm (see Methods). This clustering procedure employs the connectivity of the proteins.
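The connectivity-based clustering can be illustrated with a short Python sketch. It is a simplified greedy procedure written from the description in Methods (unweighted edges, a seed protein with the largest number of neighbors, and addition of a neighbor only when it is connected to more than 25% of the proteins already in the cluster). It is not the full CAST algorithm, which also revisits cluster membership, and the function name, parameter name and tie-breaking rule are assumptions of this sketch.

```python
def cast_like_clusters(edges, min_fraction=0.25):
    """Greedy, unweighted clustering sketch inspired by CAST.

    edges: iterable of (protein_a, protein_b) interactions.
    min_fraction: a neighbor joins a cluster only if it is connected to
        more than this fraction of the proteins already in the cluster.
    """
    # Build an undirected adjacency map.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    unassigned = set(adj)
    clusters = []
    while unassigned:
        # Seed: unassigned protein with the most unassigned neighbors
        # (ties broken alphabetically, an arbitrary choice of this sketch).
        seed = max(sorted(unassigned), key=lambda p: len(adj[p] & unassigned))
        members = {seed}
        # Only neighbors of the seed are considered as candidates.
        for cand in sorted(adj[seed] & unassigned):
            linked = len(adj[cand] & members)
            if linked / len(members) > min_fraction:
                members.add(cand)
        unassigned -= members
        clusters.append(members)
    return clusters


# Toy network: a fully connected triangle plus a separate pair.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
print(cast_like_clusters(toy_edges))
```

Because candidates are examined one at a time, the outcome of such a greedy procedure depends on the order in which neighbors are visited; the sketch sorts them only to make the result reproducible.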
Overall, 1802 proteins taking part in 16,498 interactions could be identified in protein clusters, accounting for the majority of originally identified interactions (see Additional file 9). The biological roles of the identified protein clusters were studied through identification of overrepresented GO categories (biological process, molecular function and cellular component) (see Methods). To judge the validity of the protein clusters, we inspected clusters involved in particular biological processes together. Using this approach of clustering and subsequent GO enrichment analysis, we can elegantly pinpoint protein complexes, the relationships between them and the encompassing biological processes they are involved in (see Supplementary Data site and more details below). As could be expected, well-conserved proteins and functions, such as those involved in transcription, translation, and proteolysis, are overrepresented. Typically, proteasome and ribosomal proteins are identified as highly connected (CAST clusters 1 and 3; see Supplementary data site). A number of protein-protein interaction networks involved in particular processes such as organelle organization and biogenesis, lipid metabolism, and ATP binding as well as the biological processes mentioned below can be viewed through our Supplementary data site. In addition, a considerable number of protein-protein interactions with a role in transmembrane transport, membrane receptor activity and vesicle trafficking were detected as previously reported by Geisler-Lee et al. [45] (see Supplementary data-Transmembrane activity). For example, a link was found between protein clusters of interacting VAMPs (Vesicle associated membrane proteins) and SNAREs (CAST clusters 10 and 70) with vacuolar H+ pumping ATPases (CAST clusters 9, 34 and 120) and cation/H+ exchangers (CAST clusters 100, 145 and 185). 
Although not connected to the above-mentioned clusters, a protein cluster containing components of the translocase inner membrane complex (CAST cluster 13) associated with carrier proteins (CAST cluster 94), was retrieved as well. Several links between cell cycle control, protein degradation and related processes are captured in the protein clusters enriched for GO categories containing 'cell cycle' (CAST clusters 3, 12, 18 and 61; see Supplementary data). Whereas ubiquitin-mediated proteolysis regulates the activity of cyclins during the progression through the different phases of the cell cycle, B-type cyclins interact with several microtubule-related components during cytokinesis. Similarly, A-type cyclins, DNA replication proteins (CDC, MCM and ORC subunits) and RBR/WEE proteins are associated during the G1/S transition [48] (see Supplementary data - Cell cycle + DNA repair + DNA replication). Overall, we observe a high similarity in expression patterns for the genes encoding proteins in the 'cell cycle' clusters most probably due to the tight regulation of proteins involved in DNA replication (green edges, see Fig. 4).
More detailed analysis of the 'response to' interaction network identified many proteins functioning in, for instance, response to DNA damage or response to oxidative stress caused by the accumulation of reactive oxygen species (see Supplementary data site). Proteins active in these stress responses are DNAJ heat shock proteins, calcium-dependent and MAP kinases (CAST cluster 1), DNA repair proteins (RAD1, RAD5, RAD50, RAD51, RAD54; CAST cluster 18), superoxide dismutases (CAST cluster 49), glutathione peroxidases (CAST cluster 74) and glutathione S-transferases (CAST cluster 15). These proteins are involved in very diverse biological processes, such as response to heat, response to toxins and response to chemical stimulus, which is reflected in the dashed edges in the network (see Supplementary data). In order to verify that these predicted interactions also reflect actual conserved biological stress responses, we compared the expression patterns of all 500 Arabidopsis genes in these 'response to' protein clusters using a recently compiled comparative stress response matrix [49]. Whereas 11% of all stress-induced Arabidopsis gene families show a conserved stress response in human or yeast (44/390 families in the complete matrix), the 'response to' set (representing 137 gene families) shows a more than five-fold enrichment for conserved stress response (16/25 gene families with responsive Arabidopsis genes are also responsive in human or yeast). These findings confirm that several components as well as protein-protein interactions in Arabidopsis indeed function in diverse responses and that this response is evolutionarily conserved in eukaryotes. Besides interactions functioning in well-conserved biological processes, we could also identify the recruitment of well-conserved proteins and protein-protein interactions in plant-specific processes. In the following examples, the regulatory mechanisms such as chromatin remodeling and actin polymerization are common to different lineages.
However, these mechanisms are put into play in highly lineage-specific biological processes, such as seed and trichome development. In a first example, we predict that the proteins FERTILIZATION INDEPENDENT ENDOSPERM (FIE), CURLY LEAF (CLF), SWINGER (SWN), MULTICOPY SUPPRESSOR OF IRA1 (MSI1), and MEDEA (MEA) interact and that this protein complex functions in flowering and embryonic development (see Fig. 5A).
In a second example, we have predicted interactions between different actin-related proteins (ARPs), constituting the ARP2/3 complex involved in leaf development (see Fig. 5B).

Discussion

In this study, we have predicted protein-protein interactions in the model plant Arabidopsis thaliana through interolog detection with yeast, worm, fly and human as source organisms and using genomic features (expression correlation, localization and biological process) as filters to increase the confidence in our predictions. As such, a set of highly reliable interactions could be delineated that can be further employed in both computational and experimental studies. In contrast to previous efforts to predict protein-protein interactions in Arabidopsis thaliana, we do not merely provide a list of all interologs and their associated genomic feature values, but focus on the subset of protein-protein interactions that is supported by the different genomic features (the so-called filtered interactome). The extensive study of the behavior of the genomic features for experimentally identified protein-protein interactions compared to random protein pairs allowed us to validate the protein-protein interactions adequately. The setup of this study allowed us to rigorously investigate the functional repertoire and evolutionary conservation of the identified protein-protein interactions. We conclude that although this type of protein-protein interaction prediction is highly dependent on the degree of conservation of protein-protein interactions between Arabidopsis and yeasts or animals, we were able to predict interactions with roles in diverse biological processes. Interestingly, these cover interactions in both evolutionarily conserved and plant-specific processes. Future improvements to the prediction of protein-protein interactions are manifold. For instance, it has been shown that not all protein-protein interactions are equally stable.
Some interactions are permanent while others are transient, meaning that proteins only come together at certain time points or locations possibly resulting in different expression profiles of the encoding genes (just-in-time assembly) [55]. Although we could identify at least some transient interactions (e.g. interactions with transmembrane activity), the use of global expression correlation measures such as the Pearson correlation coefficient might be replaced by a measure of "partial" expression similarity that is able to capture even less stable protein-protein interactions. Recent proteomics profiling (e.g. [56]) will allow the consideration of protein activity rather than transcript activity. On the other hand, with the large-scale experimental identification of protein-protein interactions in many species, a gold standard positive set of interactions can be built more rigorously. This gold standard positive set will increase the strength of machine learning methods in protein-protein interaction detection. Nevertheless, a gold standard negative set of interactions remains problematic.

Conclusion

In conclusion, this study showed that the integration of orthology with functional association data, such as localization, biological process and co-expression, is adequate to predict protein-protein interactions. In particular, for organisms with limited existing knowledge on protein-protein interactions, such as Arabidopsis, our approach is very valuable. In contrast, sophisticated machine learning approaches perform poorly because of the lack of gold standard sets of interactions. We could predict a high number of new protein-protein interactions, and analysis of the functional repertoire of identified protein clusters supports the significance of these putative interactions. The approach described here can easily be adapted for estimating the reliability of experimentally identified interactions.
Finally, with the growing availability of expression and gene ontology information, this approach can be applied to the detection of protein-protein interactions in agronomically and economically interesting plants, such as rice, corn and poplar.

Methods

Interaction data

Although numerous efforts have been made to obtain uniformity in interaction databases, interaction datasets for yeast, animals and Arabidopsis are not readily available. Protein interaction data sets were compiled from DIP [17], BioGRID [19], MINT [18] and IntAct [20] for the source organisms Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens and the target organism Arabidopsis thaliana (for Arabidopsis, see Table 1), containing most of the large-scale interaction studies. Binary interaction data were extracted. For each interaction, the method of detection, PMID number and the bait/prey information were downloaded. As such, identical entries in the different databases were identified. In addition to the combined interaction datasets from DIP, BioGRID, MINT and IntAct, two manually curated datasets of literature-derived interactions were employed. For yeast, a set of 15,456 interactions involving 4554 proteins available at MIPS (Munich Information Center for Protein Sequences) was used. Although this curated set is half as large as the interaction dataset downloaded from the four databases, a considerable number of proteins is covered and the quality of the data is believed to be considerably higher [57]. For human, a set of 37,072 interactions involving 9565 proteins is provided by the HPRD (Human Protein Reference Database) [58]. For Arabidopsis, the experimentally identified protein-protein interactions available at TAIR (http://www.arabidopsis.org) were included.
Altogether, 89,537 interactions among 6515 proteins could be found in public databases of Saccharomyces cerevisiae, 8167 interactions among 4126 proteins of Caenorhabditis elegans, 56,088 interactions among 14,112 proteins of Drosophila melanogaster, 60,775 interactions among 15,126 proteins of Homo sapiens, and 3587 interactions among 1722 proteins of Arabidopsis thaliana (see Table 1). Negative datasets were built by randomizing protein pairs. To analyze equal sample sizes and to take into account the availability of genomic feature data, the negative as well as the positive datasets contained 1000 protein pairs for the assessment of individual genomic features and 500 protein pairs for the combined genomic features. This approach has the disadvantage that some positive pairs may be included in the dataset. However, the number of positive pairs will be extremely low taking into account the number of possible combinations between all Arabidopsis proteins and the estimated size of interactomes of higher organisms. An alternative approach would be to consider pairs that consist of proteins that are not present in the same cellular compartment. However, this method can be biased as not all proteins that localize in the same cellular compartment interact. Moreover, in this study, we use the cellular component annotation to identify positive interactions. Consequently, we opted for the randomization approach throughout our study. Positive and negative datasets were compared to choose appropriate thresholds for the genomic features (similarity in expression, biological process and cellular localization, see further). We have compared balanced datasets (equal number of positive and negative interactions) to estimate the reliability of our genomic feature filtering. The thresholds chosen in this study would correspond to a positive predictive value (number of true positives/(number of true positives + number of false positives)) of 95% in a one to one ratio. 
However, in reality, although difficult to estimate, the positive and negative protein pairs do not occur in a one to one ratio. Therefore, the positive predictive value is actually smaller. In a one to ten or one to 100 ratio, the PPV drops to ~50% or ~30%, respectively. However, these positive predictive values are probably not robust and should be considered with caution due to the small sample size (sparse distributions of genomic features of the positive dataset compared to the negative dataset). The number of positive interactions for which the genomic features pass the thresholds is extremely low, probably due to the fact that so few protein-protein interactions have been experimentally identified and/or possess sufficient gene ontology information. Even more importantly, the calculation of these positive predictive values does not take into account the fact that, through the interolog detection, our initial predictions (before application of genomic feature thresholds) are already enriched for true interactions. Through this step, the PPV increases to 88% and only the most likely interactions remain.

Identification of interologs

Whereas earlier methods used BLAST to identify orthologous genes, nowadays more dedicated tools such as OrthoMCL and INPARANOID, which take into account in-paralogs (duplicates arisen after speciation), are applied [59-61]. In this study, orthologous relationships were identified based on the OrthoMCL database containing orthologous groups for 87 species [61]. Like INPARANOID [59], the OrthoMCL software takes into account in-paralogs (genes duplicated in one species after speciation with another species) [62]. We extracted data from five organisms, namely the source organisms Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens, and the target organism Arabidopsis thaliana.
For a certain pair of interacting proteins in the source organism, all combinations between the (co-)orthologous proteins in the target organism were made (see Fig. 1).

Gene ontology information

The Gene Ontology (GO) consortium provides a structured standard vocabulary for describing the function of gene products [63]. It is divided into three ontologies: biological process, molecular function and cellular component, represented by directed acyclic graphs in which nodes correspond to GO terms and edges to their relationships. For each protein, GO terms (GO cellular component and biological process annotation) were extracted from the Gene Ontology database [64] and annotations for Arabidopsis proteins were downloaded from TAIR [65]. These GO terms were used to assess the relatedness of interacting proteins by calculating a GO similarity score (see Additional file 2). We test if interacting proteins are localized in the same cellular compartment and if interacting proteins function in the same biological process. For each protein pair, all GO terms of both proteins are compared to each other. For each pair of GO terms, the depth of the common ancestor of the terms, which is the shortest path of the common ancestor to the root (GO:0003673), is calculated. Subsequently, the maximum value of the calculated depths is taken as the GO similarity score for a certain protein pair. Although this disregards how far away the GO terms are from their common ancestor, this approach has proven valuable for the aims put forward in this study, namely distinguishing between actual protein-protein interactions and random protein-protein pairs (see Results). For both cellular component and biological process annotation, GO term assignments based on physical interactions (IPI, see http://www.geneontology.org/GO.evidence.shtml for details on evidence codes) or electronically assigned and less reliably assigned GO terms (with evidence codes ND, NR, NAS and IEA) are removed.
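The GO similarity score and the evidence-code filter described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code (their calculation is given in Additional file 2): the ontology is assumed to be available as a dictionary mapping each term to its list of parent terms, the function names are hypothetical, the toy term names are made up, and returning -1 when no common ancestor exists is a convention chosen here.

```python
# Evidence codes removed according to the text.
EXCLUDED_EVIDENCE = {"IPI", "ND", "NR", "NAS", "IEA"}


def filter_annotations(annotations):
    """Keep (protein, term, evidence) tuples with a retained evidence code."""
    return [a for a in annotations if a[2] not in EXCLUDED_EVIDENCE]


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def depth(term, parents):
    """Length of the shortest path from a term up to the root (root = 0)."""
    frontier, d = {term}, 0
    while True:
        # A term without parents is treated as the root.
        if any(not parents.get(t) for t in frontier):
            return d
        frontier = {p for t in frontier for p in parents[t]}
        d += 1


def go_similarity(terms_a, terms_b, parents):
    """Maximum depth of a common ancestor over all pairs of GO terms."""
    best = -1  # convention of this sketch: -1 when nothing is shared
    for ta in terms_a:
        anc_a = ancestors(ta, parents)
        for tb in terms_b:
            for common in anc_a & ancestors(tb, parents):
                best = max(best, depth(common, parents))
    return best


# Toy ontology with made-up term names: root -> A -> B -> {C, E}; root -> D.
toy_parents = {"root": [], "A": ["root"], "B": ["A"],
               "C": ["B"], "E": ["B"], "D": ["root"]}
print(go_similarity({"C"}, {"E"}, toy_parents))
print(go_similarity({"C"}, {"D"}, toy_parents))
```

In the toy ontology, terms C and E share the ancestor B at depth 2, whereas C and D share only the root at depth 0, so the first pair would be scored as functionally closer.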
Although the number of proteins with a GO annotation decreases considerably, the reliability of the GO similarity scores increases through this procedure (data not shown). Nevertheless, ISS-based annotations were included. We rigorously assessed the possibility of including annotations based on ISS, as this accounts for a considerable number of annotations, and could conclude that the low reliability of these annotations does not pose a problem to our approach (see Additional file 4; Additional file 5; see Results).

Gene expression data

A heterogeneous set of microarray expression data containing amongst others growth, stress, and mutation experiments (86) was compiled from NASC (Nottingham Arabidopsis Stock Centre) to detect co-expression between Arabidopsis genes [66]. Microarray experiments with at least 2 replicates were taken. Expression values were processed using RMA (robust multichip average) [67,68]. Co-expression was identified through the calculation of Pearson correlation coefficients (PCC) between the expression profiles of genes possibly encoding interacting proteins.

Clustering of protein-protein interactions

The predicted interactome can be represented as a graph of nodes, corresponding to the proteins, and connecting edges, corresponding to the interactions. Protein complexes were delineated from this graph making use of the Cluster Affinity Search Technique (CAST) algorithm [69]. This algorithm was originally designed to identify clusters of co-expressed genes. For this purpose, a measure for co-expression of two genes, e.g. the Pearson correlation coefficient, is used as the weight of an edge. However, in this study, all edges were treated equally, avoiding a bias towards protein-protein interactions for which the encoding genes have highly similar expression profiles. A cluster is initiated by choosing the protein with the maximum number of neighbors using a heuristic independent from the CAST algorithm.
Subsequently, neighbors of that protein are added to the protein cluster if the neighbor is connected to more than 25% of the proteins already present in the cluster. Although the connectivity of the protein clusters depends on the functional role of the cluster, we concluded that a connectivity threshold of 25% (compared to 0%, 50%, 75% and 100%) yielded the most robust and functionally relevant protein clusters (data not shown).

GO overrepresentation analysis

The identified protein complexes were subjected to functional analysis. The assignments of genes to the original GO categories were extended to include parental terms (that is, a gene assigned to a given category was automatically assigned to all the parent categories as well). All GO categories containing fewer than 20 genes were discarded from further analysis. Enrichment values were calculated as the ratio of the relative occurrence in a set of genes to the relative occurrence in the genome. Overrepresentation of GO categories (biological process, molecular function and cellular component) was tested using the Fisher exact test. P values were adjusted using the Bonferroni correction for multiple hypothesis testing. GO categories are considered significantly overrepresented when the corrected P value is smaller than 0.01.

Authors' contributions

SB and PR conceived the study. SB and KV designed the study. SP was responsible for data retrieval and predictions. SB performed the assessment of genomic features, predictions and biological interpretation. KV performed clustering and GO overrepresentation analysis of the predictions. SB and KV drafted the manuscript. YVDP and PR critically revised the manuscript. All authors read and approved the final manuscript.

Additional file 1
False positive rates of different combinations of genomic features. BP = biological process, CC = cellular component, PCC = Pearson correlation coefficient. (34K, doc)

Additional file 2
Calculation of GO similarity score.
All possible GO terms of two proteins are compared in a pairwise manner. For each pair of GO terms (in green), the depth of the common ancestor (in blue) of these terms is calculated. The maximum depth of all pairwise combinations of GO terms is considered as the GO similarity score between two proteins. (8.1M, tiff)

Additional file 3
Figure S2. Assessment of combinations of genomic features. (12M, tiff)

Additional file 6
Filtered interactome. Protein-protein interactions that passed the genomic feature filters: BP score ≥ 8, CC score ≥ 5, or BP score ≥ 5 together with PCC ≥ 0.3. (365K, txt)

Additional file 7
Predicted interactome. Extension of the filtered interactome. Interactions with an orthologous group combination already present in the filtered interactome are included. (1013K, txt)

Additional file 8
Overlap of the different sets of protein-protein interactions. Overlap of the filtered (1) and predicted (2) interactome with previously experimentally shown (A) and predicted protein-protein interactions (B). (663K, tiff)

Acknowledgements

S.D.B. and S.P. are indebted to the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT). S.D.B. and K.V. are postdoctoral fellows of the Fund for Scientific Research Flanders (FWO). This work is also supported by the Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet).