Copyright © 2008 Baudot et al; licensee BioMed Central Ltd.

Defining a Modular Signalling Network from the Fly Interactome

1Institut de Biologie du Développement de Marseille-Luminy (IBDML), UMR6216, CNRS/Université de la Méditerranée, Marseille, France
2Institut de Mathématiques de Luminy (IML), UMR6206, CNRS/Université de la Méditerranée, Marseille, France
3Spanish National Cancer Research Centre (CNIO), Structural Biology and Biocomputing, Melchor Fernández Almagro, 3 E-28029 Madrid, Spain

Corresponding author.
Anaïs Baudot: abaudot/at/cnio.es; Jean-Baptiste Angelelli: angele/at/iml.univ-mrs.fr; Alain Guénoche: guenoche/at/iml.univ-mrs.fr; Bernard Jacq: jacq/at/ibdm.univ-mrs.fr; Christine Brun: brun/at/ibdm.univ-mrs.fr

Received April 25, 2008; Accepted May 19, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

Abstract

Background
Signalling pathways relay information by transmitting signals from cell surface receptors to intracellular effectors that eventually activate the transcription of target genes. Since signalling pathways involve several types of molecular interactions including protein-protein interactions, we postulated that investigating their organization in the context of the global protein-protein interaction network could provide a new integrated view of signalling mechanisms.

Results
Using a graph-theory based method to analyse the fly protein-protein interaction network, we found that each signalling pathway is organized in two to three different signalling modules. These modules contain canonical proteins of the signalling pathways, known regulators as well as other proteins thereby predicted to participate in the signalling mechanisms.
Connections between the signalling modules are prominent compared with those between the network's other modules, and interactions within and between signalling modules are among the most central routes of the interaction network.

Conclusion
Altogether, these modules form an interactome sub-network devoted to signalling with particular topological properties: modularity, density and centrality. This finding reflects the integration of the signalling system into cell functioning and its important role in connecting and coordinating different biological processes at the level of the interactome.

Background
'Interactomes' are novel biological entities that correspond, ideally and formally, to the complete set of interactions existing between all the macromolecules of an organism [1]. Currently, the available interactomes are primarily formed by protein-protein interaction (PPI) networks in which the interactions have been obtained experimentally, either from high throughput experiments (such as large-scale two hybrid screens and affinity purifications/mass spectrometry [2-12]) or from different types of small-scale experiments. Although interactomes are far from complete, the current PPI maps (for yeast, worm, fly and human) form large intricate networks [13-16].

Concurrently with the deciphering of the interactomes, bioinformatics methods allowing their analysis have been developed. Since interaction networks are represented by complex graphs in which nodes correspond to proteins and edges to their interactions, a number of these methods are grounded on principles derived from graph partitioning theory, such as the search for interaction-dense regions [17,18], shortest paths in the graph [19], graph disconnection according to edge betweenness [20,21], the sharing of interactors [22,23] or a combination thereof [24] (see [25,26] for review). All these algorithms partition the interaction network into sub-graphs.
Among those, the PRODISTIN method that we previously proposed [22] allows the functional classification of proteins through the computation of a distance reflecting the sharing of interactors between each possible protein pair. As previously demonstrated, proteins participating in the same cellular processes are clustered by the method into the same PRODISTIN classes [22,27].

Signalling consists of multiple sequential events which relay information by transmitting signals from cell surface receptors to intracellular effectors that eventually activate the transcription of target genes. These events are promoted by specific interactions between signalling molecules (proteins, lipids, ions), among which protein-protein interactions are the most prominent and numerous. Essentially because of the temporal dynamics of signal transduction and the experimental approaches often chosen for investigation (mainly genetics), the signalling mechanisms have long been perceived and described as distinct and isolated linear cascades of reactions, namely the signalling pathways. This vision is progressively changing with the discovery of important crosstalk between pathways [28,29] and the assumption that unknown crosstalk is responsible for the difficulty of predicting output states for particular pathways [29,30]. Finally, the discovery of large numbers of new participants in well known signalling pathways in metazoans, resulting from novel investigations using functional genomics and proteomics methods [31-39], is widening the signalling space [40].

We have taken advantage of our recent efforts 1) in participating in the deciphering [5] and the manual curation of the Drosophila interactome [41] and 2) in developing PRODISTIN, an interactome analysis method [22,27], to investigate the topological organization of the signalling pathways when embedded within a global PPI network.
This may predict the participation of unforeseen actors in the pathways and provide an integrated view of the signalling mechanisms by assessing the existence of important interactions between them. Thus, we have applied the PRODISTIN method to a high quality Drosophila interactome. We established and analyzed the functional classification of the proteins participating in 9 canonical signalling pathways and identified 12 classes which potentially correspond to 12 functional modules. From the detailed analysis of these modules, their composition and interconnections, it appears that the linear perception of the signalling pathways does not withstand a global interactome analysis. Rather, our work delineates a highly modular and interconnected signalling network showing a central and plausibly organizational position within the global interactome.

Results

Functional classification from the protein-protein interaction network: a bi- to tri-partite organization of the signalling pathways based on the sharing of interactors
In Drosophila, at least eleven pathways are crucial for signalling in development and adult cell physiology, namely the Wingless (WG), Hedgehog (HH), Notch (N), Decapentaplegic (TGF), Janus Kinases and Signal Transducers and Activators of Transcription (JAK-STAT), Sevenless (SEV), Torso (TOR), Epidermal Growth Factor Receptor (EGFR), Insulin (INS), Toll (TOL) and Fibroblast Growth Factor (FGF) pathways. These pathways have been chosen for the subsequent analysis. According to our curation of the literature (see Methods for details, Additional file 1), each of them transmits external signals through a cascade of reactions mediated by about ten canonical proteins (10.63 on average).
Given that these proteins are parts of a larger PPI network, studying the structure of these signalling pathways from an interactome perspective may bring new insights not only into their composition and possible regulation but also into their integration into cell functioning. For this purpose, a high confidence Drosophila interactome composed of 2894 binary protein-protein interactions involving 2939 proteins (see Methods) was analyzed with the PRODISTIN method [22,27]. Briefly, this interactome analysis method first calculates a functional distance between each possible pair of proteins in the interaction network with regard to the number of interactors they share (in order to reduce the weight of spurious interactions in the computation, proteins must have a connectivity k ≥ 3 to be considered further). The distance values are then clustered, leading to a classification tree which is subsequently subdivided into formal classes using the tree topology and the functional annotations of the proteins (see Methods and [22] for detailed explanations and statistical assessments). In the present work, we obtained a classification tree containing 472 proteins (Figure 1).
Then, in order to annotate this tree following the PRODISTIN procedure, classes of proteins involved in the same cellular function(s) were defined according to the GO Biological Process ontology. Among the PRODISTIN classes containing fewer than 20 proteins (which are nested into larger ones since the method uses a tree representation and Gene Ontology displays a hierarchical organization), those containing at least one canonical protein have been chosen to be further investigated. Twelve such classes have been identified, containing 56/58 of the canonical proteins present in the tree (Figure 1).

We then investigated the detailed distribution of the canonical proteins by analyzing the 12 PRODISTIN classes. Whereas one could have anticipated the clustering of the proteins participating in the same signalling pathway into one single class, it strikingly turned out that, without exception for the 9 considered pathways, the canonical proteins were distributed into 2 or 3 classes per pathway (2.33 on average). Since the PRODISTIN method clusters proteins sharing common interactors [22], this result means that the canonical proteins of a particular pathway belonging to the same class share more interactors with each other, and with the non-canonical proteins of that class, than with the canonical proteins of the same pathway found in other classes. As a control, ten randomization experiments in which protein names were randomly assigned to tree branches (see Methods) showed that this clustering pattern is unlikely to have arisen by chance, since the canonical proteins of a pathway are then found distributed among 5.75 (± 0.6) classes on average. In conclusion, the analysis based on the sharing of interactors suggests a bi- to tri-partite organization of each fly signalling pathway within the interactome.
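The randomization control described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that class membership is kept fixed while class labels are permuted among the classified proteins, which is a simplification of reassigning protein names to tree branches. The function names, the dictionaries `protein_class` and `protein_pathway`, and the toy data are hypothetical.

```python
import random
from statistics import mean

def classes_per_pathway(protein_class, protein_pathway):
    """Average number of distinct classes containing each pathway's proteins."""
    classes_by_pathway = {}
    for protein, pathway in protein_pathway.items():
        if protein in protein_class:  # ignore proteins absent from the tree
            classes_by_pathway.setdefault(pathway, set()).add(protein_class[protein])
    return mean(len(classes) for classes in classes_by_pathway.values())

def shuffled_control(protein_class, protein_pathway, n_runs=10, seed=0):
    """Permute class labels among proteins n_runs times and average the result."""
    rng = random.Random(seed)
    proteins = list(protein_class)
    labels = [protein_class[p] for p in proteins]
    results = []
    for _ in range(n_runs):
        rng.shuffle(labels)  # random reassignment of proteins to classes
        shuffled = dict(zip(proteins, labels))
        results.append(classes_per_pathway(shuffled, protein_pathway))
    return mean(results)

# Hypothetical toy data: two pathways, each clustered into a single class.
toy_class = {"a1": "C1", "a2": "C1", "b1": "C2", "b2": "C2", "x": "C3"}
toy_pathway = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
observed = classes_per_pathway(toy_class, toy_pathway)
control = shuffled_control(toy_class, toy_pathway)
```

Comparing `observed` with `control` mirrors the comparison in the text between 2.33 classes per pathway and 5.75 classes per pathway after randomization.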
The distribution of GO annotations among PRODISTIN classes reflects the polarity of signal transduction: toward the notion of 'signalling modules'
For a better understanding of the functional significance of this bi- to tri-partite organization of the signalling pathways, we investigated the class composition and the functions of the classified proteins based on Gene Ontology annotations [42]. Using the 'Cellular Component' and the 'Molecular Function' ontologies, we investigated the distribution of protein localizations and molecular functions in each PRODISTIN class (see Methods for details, Figure 2a).
Qualitatively, when considering the main localization of the classes containing the canonical proteins of a given pathway, the sub-cellular polarity of the pathway is perceptible. Indeed, these proteins are subdivided into classes mainly membrane or membrane/cytoplasm-located, and classes mainly cytoplasmic or cytoplasm/nucleus-located (Figure 2a).

The composition of the signalling modules reflects close relationships between pathways
Since only 12 modules have been identified and each of the 9 pathways is split into 2 to 3 modules, some modules must necessarily include components belonging to more than one pathway. Indeed, whereas 8 out of 12 signalling modules contain canonical proteins from a single pathway, the 4 others enclose proteins from several pathways (Figure 1).

The fourth mosaic module, HH2 EGFR, suggests a potential link between the Hedgehog and the EGFR pathways since patched (ptc), the receptor of the Hedgehog pathway, is classified with the membrane/cytoplasmic part of the EGFR cascade (Figure 1).

Finally, the fact that the HH2 EGFR and the SEV RAS1 INS1 classes contain the membrane proteins of four signalling pathways known to use lipid rafts for signal transduction [47] is also remarkable and reinforces the functional consistency of the modules found by the analysis. Overall, these results show that although the analysis is anchored on separate linear cascades (represented by their canonical proteins), the co-classification of some of their components by the PRODISTIN method reveals their close functional relationships.

Signalling modules contain other proteins related to signalling: validations and predictions
The PRODISTIN method clusters proteins participating in the same cellular process by grouping proteins that share interactors.
The obtained clusters then provide a means to identify potential new players in given cellular processes based on their co-classification with proteins clearly involved in these processes [22]. On average, half of the proteins contained in the signalling modules are not canonical proteins: 61 out of the 117 proteins contained within the signalling modules. Although these 61 proteins are candidates to participate in signalling, our results, obtained on a high quality interactome, may depend on its relatively small size. Indeed, a large number of available but possibly false interactions have not been incorporated in the high quality interactome, and an unknown number of physiological interactions that have not yet been detected are probably missing. We therefore addressed the robustness of our predictions based on co-classifications by proposing as putative new members or regulators of the signalling pathways only the proteins systematically found clustered with the canonical proteins both in the high quality network and in a larger one, containing almost all available Drosophila interactions (22819 interactions) (see Methods for details). As a result, 45/61 non-canonical proteins contained in the signalling modules are robustly clustered with the canonical proteins (Table 1, Additional file 4). According to literature and Gene Ontology annotation searches, twenty of them correspond to already known regulators of the pathways, thus validating our approach. In addition, and notably, 15 other proteins are members of alternate or other pathways. Here again, we observed that a given signalling module may contain canonical proteins from a particular pathway as well as proteins from another pathway.
Interestingly, proteins belonging to signalling pathways not chosen to 'anchor' the analysis (such as IMD (= Immune Deficiency) and JNK (= Jun N-terminal kinase)) are classified within signalling modules defined by the 'anchoring' pathways (TGF1 and TOL2 for IMD, HH2 EGFR and TOR RAS2 for JNK). This finding raises the number of classified pathways from 9 to 11, therefore heavily underlining the fact that with a PPI network perspective, signalling pathways are intermingled. The functional consequences of this observation may reside in the integration of the signalling processes and their capacity to rapidly respond to diverse extra-cellular stimuli.
Signalling modules contain new potential actors of the pathways
Finally, among the 45 non-canonical pathway proteins robustly found in the 12 signalling modules, 10 proteins are neither known regulators of the pathways nor members of other pathways (Table 1, Additional file 4). They are thus predicted by the classification to participate in signalling processes. Five of them have been previously described as involved in other biological processes (Additional file 4) but their domain composition is compatible with a possible implication in signalling. The five others (described below) did not have any Gene Ontology Biological Process annotations in FlyBase at the time of the work, but arguments from the literature available for 4 of them suggest their potential role as components, regulators or effectors of the signalling pathways in Drosophila.

Hedgehog pathway
CG32209/Serpentin is found in the HH1 module, containing the hedgehog (hh), smoothened (smo), costal-2 (cos), fused (fu), Suppressor of fused (Su(fu)) and cubitus interruptus (ci) proteins. It has recently been implicated in late tracheal development [48,49]. Its possible involvement in the Hedgehog pathway is suggested by several lines of evidence. First, CG32209 interacts with the receptor patched (ptc) and the transcription factor ci via two different domains with a high confidence score [5]. Second, CG32209 contains a Low Density Lipoprotein (LDL)-like receptor domain and hedgehog is a lipidated molecule. From a genetic point of view, a transposon insertion into CG32209 is presumably lethal [50] and the gene belongs to a complementation group rescuing the jaft mutation, identified in an enhancer/suppressor genetic screen designed to characterize novel components of the Hedgehog pathway [51].

Notch pathway
CG3962/Keap1 is found in the signalling module WG1 N1 containing the Notch (N) and Delta (Dl) proteins and interacts with Dl.
Taken together with the facts that 1) the human ortholog KEAP1 is an adaptor protein regulating steady-state levels of the transcription factor NRF2 in response to oxidative stress [52] and 2) the accumulation of reactive oxygen species in mammalian cells was recently shown to occur following disruption of Notch signalling [53], this suggests that CG3962 may play a role in the Notch pathway.

Ras pathway
In agreement with its classification in the signalling module containing the cytoplasmic part of the Ras cascade (TOR RAS2), CG8965, which contains two RA (Ras-associated) domains, was recently proposed to represent a Ras effector candidate based on its interactors [5].

Wingless pathway
CG3402 belongs to the WG2 module containing the cytoplasmic part of the Wingless pathway (armadillo (arm), shaggy (sgg), Axin (Axn), Adenomatous polyposis coli tumor suppressor homolog 2 (Apc2), Casein Kinase I alpha (CkIalpha) and CG3402). The protein contains a single PDZ domain of 123 amino acids (a domain usually found in diverse signalling proteins), evolutionarily conserved from arthropods to humans, which interacts with arm with a very high confidence score [6]. The interaction is conserved throughout evolution, since it has also been recorded in C. elegans [4], M. musculus and H. sapiens [54]. Intriguingly, CG3402 also interacts with an arrestin-like protein of unknown function (CG18745) [6]; arrestins are regulators of (G-protein)-coupled receptor signalling [55], and G-proteins are involved in both the canonical and the non-canonical Wingless pathways [56]. Finally, the facts that the mammalian ortholog is implicated in (G-protein)-coupled receptor signalling [57] and inhibits the transcriptional activity of β-catenin [54] strongly suggest that CG3402 may represent a new player of the Drosophila Wingless pathway.
Interactions between modules define the signalling network
The composition of the signalling modules defined by our interactome analysis method shows that signalling pathways are closely intertwined. This observation naturally prompted us to investigate the interactions between signalling modules. An interaction linking any protein of a signalling module to a protein belonging to another signalling module is considered here as a link between those modules. We then distinguish intra-pathway links, i.e. links between the different modules assigned to the same signalling pathway (excluding links within a module), from inter-pathway links between modules of different signalling pathways. Among the 30 links connecting the 12 signalling modules together (Additional file 5), 53% correspond to intra-pathway links whereas the other 47% are inter-pathway links (Figure 3).
Is the interaction pattern between signalling modules different from the one between other PRODISTIN classes composing the interactome? To test this, we calculated the density of interactions linking the modules as the number of existing interactions between modules compared to the number of possible ones (see Methods for details). We found that whereas the average density of interactions within modules does not show any important discrepancy between the signalling modules and the other classes (0.39 ± 0.17 vs. 0.35 ± 0.16 respectively), the average density between them shows variations (Table 2). Indeed, the average density of interactions between signalling modules is 6 times higher than between non-signalling PRODISTIN classes. It is also 2 times higher than between all the classes composing the interactome (Table 2). Finally, it is 3 times higher than the average density calculated between 12 classes picked randomly, taken as a control (see Methods for details).
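The density measure used here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that modules are disjoint sets of proteins and that the number of possible interactions between two modules is the product of their sizes. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def between_density(module_a, module_b, interactions):
    """Observed interactions between two disjoint modules / possible ones."""
    a, b = set(module_a), set(module_b)
    edges = {frozenset(pair) for pair in interactions}  # undirected, deduplicated
    observed = sum(1 for e in edges if len(e & a) == 1 and len(e & b) == 1)
    return observed / (len(a) * len(b))

def average_between_density(modules, interactions):
    """Mean density over all unordered pairs of modules."""
    return mean(
        between_density(m1, m2, interactions)
        for m1, m2 in combinations(modules, 2)
    )

# Hypothetical toy network with three modules.
toy_modules = [{"a1", "a2"}, {"b1", "b2"}, {"c1"}]
toy_edges = [("a1", "b1"), ("a2", "b1"), ("a1", "a2"), ("b2", "c1")]
avg = average_between_density(toy_modules, toy_edges)
```

In the toy network, two of the four possible interactions between the first two modules exist, so their density is 0.5; the within-module edge `("a1", "a2")` is ignored by the between-module measure.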
These results were obtained on a high quality but relatively small interactome. Consequently, they might be influenced by the size of the interactome studied. We therefore re-calculated the density measures between modules after considering the interaction patterns of each protein in the larger dataset. Again, the average density of interactions between the signalling modules is 1.6 times higher than between all classes and twice as high as between random classes (Table 2). A numerical bias towards signalling interactions in the studied datasets may have influenced the density results. As a matter of fact, whereas the mean connectivity of the proteins of the high quality PPI network is almost 2, the canonical proteins are connected to 5.17 interactors on average. Are the signalling proteins genuinely more connected than the other proteins of the datasets because of their intrinsic signalling function, or is this explained by the fact that a larger number of interactions is known for signalling proteins due to their extensive investigation? This question has been addressed by analyzing the number of interactors identified for both types of proteins (signalling canonical vs. others) in the same set of experiments, and therefore in a context devoid of this investigation bias. We compared the number of interactors identified for 13 of the canonical proteins when tested as baits in the LS-Y2H screen of Formstecher et al. [5] to the number of interactors found for the 86 other baits tested in the same screen. In these conditions, canonical proteins are 1.6 times more connected than other proteins (9.15 interactors on average vs. 5.89 for non-canonical and 5.97 for randomly picked proteins, Additional file 6). In addition, whereas 7.6% of the interactions of the canonical baits involve another canonical protein, only 1.4% of the interactions of non-canonical proteins do.
Therefore, these results support the conclusion that the observed higher density of links between signalling modules is not due to a bias towards signalling interactions but rather to the naturally higher connectivity of the signalling proteins and their propensity to interact with other proteins involved in signalling. Taken together, these results suggest that we here define a modular sub-network of the interactome devoted to signalling, in which interactions between signalling modules are prominent.

The signalling network lies centrally in the interactome
The betweenness of an edge is the number of shortest paths between all possible pairs of nodes that run along it (Figure 4A).
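The definition of edge betweenness given above can be made concrete with a small Python sketch. It follows the definition literally by enumerating every shortest path between every pair of nodes and counting the edges each path uses. This is an illustrative assumption rather than the authors' procedure: other formulations divide each pair's contribution among its shortest paths, and the explicit enumeration is practical only for small graphs. The function name and the toy graph are hypothetical, and the adjacency dictionary is assumed to be symmetric and free of self-loops.

```python
from collections import deque

def shortest_path_edge_counts(adjacency):
    """Count, for each edge, the shortest paths over all node pairs that use it."""
    counts = {
        frozenset((u, v)): 0
        for u, nbrs in adjacency.items()
        for v in nbrs
    }
    for source in adjacency:
        # Breadth-first search recording distances and shortest-path predecessors.
        dist = {source: 0}
        preds = {source: []}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    preds[nbr] = [node]
                    queue.append(nbr)
                elif dist[nbr] == dist[node] + 1:
                    preds[nbr].append(node)
        # Walk back from every reachable target, enumerating each shortest path.
        for target in dist:
            if target == source:
                continue
            stack = [(target, [])]
            while stack:
                node, path_edges = stack.pop()
                if node == source:
                    for edge in path_edges:
                        counts[edge] += 1
                    continue
                for p in preds[node]:
                    stack.append((p, path_edges + [frozenset((p, node))]))
    # Every unordered pair was visited from both endpoints, so halve the counts.
    return {edge: c // 2 for edge, c in counts.items()}

# Hypothetical toy graph: a three-node path a - b - c.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
eb = shortest_path_edge_counts(toy_graph)
```

In the toy path graph each edge lies on two shortest paths: the one joining its own endpoints and the one joining `a` and `c`.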
Edge-betweenness (EB) is also interpreted as a measure of network centrality [58,59]. Indeed, edges with high EB values have been proposed to control the communications between the network's nodes and to contribute to the cohesiveness of the network. In this respect, signalling processes are expected to contain a large number of edges with a high EB value. In order to test this possibility, we studied the distribution of the EB values on the complete network. Then, considering separately two subsets of edges, internal and external to the signalling network, we determined their distribution among each of the 4 interquartile intervals of the EB values distribution (see Methods for details) (Figure 4).

Discussion
By computational analysis of the Drosophila interactome, we identified a modular signalling network that, judged by its topological properties, lies centrally in the interactome. The deciphering of this highly connected sub-network underlines the topological importance of the interactions between signalling modules for the coherence of the interactome. The study also contributes to identifying potential new players in Drosophila signalling. It is important to note that the PPI network is static and does not contain any spatio-temporal information. Therefore, our conclusions are drawn from the analysis of 'a long-exposure photograph' [60] of the interactions between proteins, i.e. the set of all possible interactions in all possible biological contexts.

Computation of the Drosophila interactions: quality assessment
Protein-protein interaction networks have often been suspected of containing erroneous interactions, of being incomplete and of being biased towards certain types of interactions. For these reasons, our conclusions have been drawn from the analysis of a high quality interaction dataset in order to minimize the weight of potential false positive interactions.
In addition, in order to build robust predictions and conclusions, we have reinforced and validated them on a larger dataset containing almost all the currently known Drosophila PPIs. This step ensures that the results obtained on the high quality dataset are not sensitive to missing interactions and are robust to potential false positive ones. The robustness of the PRODISTIN method against the presence of false interactions was also extensively assessed statistically in a previous study [22].

Functional modules devoted to signalling
The present analysis identified a signalling network formed by 12 groups of proteins organized around signalling proteins, which we refer to as 'signalling modules'. Although several different definitions of 'modules' are found in the literature, from static to dynamic (for examples [61-63]), it is generally accepted that they form groups of molecules, possibly evolutionarily conserved, involved in the same pathway, the same protein complex or the same cellular process. In our experiments as well as in others [17,19,22], modules identified by the computation of interaction networks are generally more than molecular complexes. They may contain proteins belonging to complexes as well as regulatory proteins and/or proteins involved in the same cellular process through interactions. These proteins thus do not necessarily act and bind each other at the same location and time in the cell. As a consequence, a common sub-cellular localization of the proteins of the same functional module is not necessarily expected since, as is known for a number of signalling proteins, they may shuttle and translocate from one sub-cellular compartment to another to perform their function(s).
Indeed, we observed that half of the Drosophila signalling modules are distributed between two sub-cellular localizations and the other half between 3 (Figure 2a).

The identification of functional modules allowed us to predict the participation of 10 potential new actors in Drosophila signalling. None of them was found in the high-throughput RNAi screens recently performed on Drosophila signalling pathways [31-35]. This lack of overlap is probably due to the fact that RNAi screens identify regulators of the pathways which may act not only through protein-protein interactions but also through direct or indirect interactions between proteins and other molecules (nucleic acids, lipids, ions).

Modularity and signalling
By anchoring our analysis on the currently known signalling pathways, herein considered as models [64], and by computing the PPI network they belong to, we showed their systematic bi- to tri-partite modular organization. Here, we generalize an observation made on one pathway of a unicellular eukaryote [19] to the major signalling pathways of a metazoan organism. Moreover, preliminary results obtained on the human interactome (Baudot, Brun, Jacq, unpublished) confirm this organization, at least for the Wnt pathway. Indeed, the human functional homologs of the canonical proteins of the Drosophila Wingless pathway are also distributed between 3 signalling modules (Figure 5).
In theory, the modular organization of biological networks has been proposed to favour and even ensure the insulation necessary for the correct accomplishment of certain cellular processes on the one hand, and the connections needed to integrate information from multiple sources on the other hand [61]. Remarkably, these properties constitute two important needs of the signalling process. So, how does this modular organization of the signalling pathways, delineated by computational means, fit with the functional requirements and principles of the signalling mechanisms? In mammalian cells, Ras/MAPK signalling takes place in several subcellular compartments (plasma membrane, endosomes, endoplasmic reticulum, Golgi apparatus and mitochondria) [65]. It was shown experimentally that the sensitivity to an input of a MAPK module downstream of Ras (composed of RAF, ERK, MEK and KSR) is determined by its spatial localization [66]. In Drosophila, we found the Ras pathway organized in 2 modules: SEV RAS1 INS1, formed by the membrane-bound proteins of the pathway, and TOR RAS2, formed by kinases and the scaffold protein. Interestingly, the latter recapitulates the tested mammalian MAPK module. Indeed, it contains the Drosophila counterparts of the mammalian proteins belonging to the tested module (namely phl for RAF, Dsor1 for MEK and ksr for KSR). Taken together, these results lead to the proposal that the organization of the signalling pathways into different modules may provide the flexibility necessary for the functioning of the same signalling pathway in different spatial, cellular or developmental contexts, probably increasing the complexity of the output repertoire.

High density and high betweenness of edges: two topological features specific to the signalling network
Two graph features, density and betweenness of edges, allowed us to delineate the signalling sub-network from the rest of the network.
These topological characteristics (a high density of edges linking the modules and a high number of shortest paths running through the signalling network's edges) reveal the central position of the signalling network within the global interactome. Hence, the role of the signalling mechanisms in connecting and coordinating the diverse cellular processes is here underlined by graph features. Edge-betweenness is a common concept in graph analysis. However, the question of its exact functional biological meaning remains open. The signalling network encompasses a large number of edges with high EB values. This could reflect, as in social networks [58], an information flow, in spite of the fact that edges in PPI networks are not directed. This interpretation agrees with the recent proposal of Yu and colleagues that nodes linked by such edges correspond to the dynamic components of the PPI network [59].

Conclusion
We propose here a systems-level analysis of signal transduction from a protein-protein network point of view. Overall, our results reflect the integration of the signalling system into cell functioning and its important role in connecting and coordinating the different biological processes at the level of the interactome.

Methods

Protein-protein interactions datasets
The high quality protein-protein interactions dataset is composed of 2894 binary interactions between 2939 proteins. It was created by joining 970 interactions extracted from literature (deposited in the IntAct database [41]) to the interactions identified in two LS-Y2H screens with a high confidence score: 584 interactions from Formstecher et al. (with A, B or C PBS scores) [5] (i.e. 25% of the interactions identified in this screen) and 1395 interactions from Giot et al. (score > 0.8) [6] (i.e. 7% of the interactions identified in this screen).
The large dataset contains 22819 interactions and is intended to represent our present view of the Drosophila interactome (probably largely incomplete and with an unknown proportion of false positive interactions). It contains the 970 interactions extracted from the literature and the complete sets of interactions identified in the two LS-Y2H screens cited above, depleted of 7% of their interactions respectively. These 7% correspond to the interactions with the lowest confidence scores in each of the screens. Finally, 1654 interactions from Stanyon et al. [67] were also added. Furthermore, with the aim of limiting the effect of false positive interactions, each interaction was weighted depending on its reliability. Taking into account that interactions come from different sources and are therefore provided with confidence scores calculated differently, we determined the weights as follows:

- 2894 interactions coming from the high quality dataset are weighted with the maximum value, 1;

PRODISTIN functional classification

Applying the PRODISTIN method on the high quality protein-protein interactions dataset

We used the PRODISTIN method [22] through the PRODISTIN Web Site [27]. Starting with a list of binary interactions, only proteins involved in at least three binary interactions are selected for further classification, in order to reduce the weight of spurious interactions. The server then computes the Czekanowski-Dice distance between all possible pairs of proteins (for details, see [22]). The obtained values are subsequently clustered using the BioNJ algorithm [68], leading to a classification tree containing 472 proteins. PRODISTIN functional classes are identified as the largest possible sub-trees composed of at least 3 proteins sharing the same Gene Ontology (February 29th, 2006 version) Biological Process annotation [42] and representing at least 50% of the class members for which an annotation is available.
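The class criterion stated above (at least 3 proteins sharing an annotation, and those proteins representing at least 50% of the annotated members) can be sketched as a test applied to a single sub-tree. This is a minimal illustration, not the PRODISTIN server code: the function name, the dictionary layout and the parameter names are hypothetical, and the search for the largest qualifying sub-tree in the classification tree is not shown.

```python
from collections import Counter


def class_annotations(members, annotations, min_shared=3, min_fraction=0.5):
    """Return the GO terms for which one sub-tree qualifies as a class.

    members:     proteins at the leaves of the sub-tree
    annotations: dict mapping a protein to its set of GO Biological
                 Process terms (hypothetical layout; parent terms are
                 assumed to be included already)
    """
    # Proteins without any annotation do not count in the 50% threshold.
    annotated = [p for p in members if annotations.get(p)]
    # Each protein contributes at most once per term because terms are sets.
    counts = Counter(term for p in annotated for term in annotations[p])
    return {
        term
        for term, n in counts.items()
        if n >= min_shared and n >= min_fraction * len(annotated)
    }
```

An empty result means that the sub-tree does not form a class under these thresholds.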
For each protein, the complete hierarchy of annotation terms is considered (child and all parents). Given the large number of annotations for proteins, the majority of PRODISTIN classes are nested within other PRODISTIN classes. The result of the computation is then visualized as a coloured classification tree using an integrated TreeDyn module [69] as a tree viewer. The method has previously led to the prediction of the cellular function of uncharacterized yeast proteins [22] and to the definition of a scale of functional divergence for yeast paralogs based on PPIs [13]. It has also recently been used through its automated version [27] to explore a predicted genetic interaction network of C. elegans [70].

Canonical proteins and PRODISTIN classes identification

Canonical proteins (see Additional file 1) were identified from the literature as the main actors of the 'canonical signaling pathways', defined according to STKE as 'idealized or generalized pathways that represent common properties of a particular signalling module or pathway' [71]. Signalling classes are identified as classes containing fewer than 20 proteins and at least one canonical protein. Other PRODISTIN classes are identified as non-overlapping classes of the same size.

PRODISTIN method on the large weighted protein-protein interactions dataset

Aiming at increasing the efficiency and the reliability of the PRODISTIN classification for large networks, we used the interactions' confidence scores provided in the different large-scale experiments in the computation of the distance. These confidence scores can be considered as probabilities of interactions and were used to weight the edges of the interaction graph. In order to enable the PRODISTIN method to be applied to weighted networks, we propose to extend the formula of the Czekanowski-Dice distance as follows [72]. This distance was used to cluster graphs Γ = (X, E) where X is a set of n vertices and E a set of m edges.
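As a point of reference for the extension, the unweighted Czekanowski-Dice distance on such a graph can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the neighbourhood of a protein contains the protein itself in addition to its interactors. The weighted extension is not reproduced here.

```python
def neighbourhoods(edges):
    """Map each protein to its interactors plus itself (assumed convention)."""
    gamma = {}
    for a, b in edges:
        gamma.setdefault(a, {a}).add(b)
        gamma.setdefault(b, {b}).add(a)
    return gamma


def czekanowski_dice(gamma, x, y):
    """Size of the symmetric difference of the two neighbourhoods,
    divided by the size of their union plus the size of their intersection."""
    gx, gy = gamma[x], gamma[y]
    return len(gx ^ gy) / (len(gx | gy) + len(gx & gy))
```

With this convention, two proteins sharing all their interactors lie at a small distance, and two proteins with disjoint neighbourhoods lie at the maximum distance of 1.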
The original distance formula is:

$$D(x, y) = \frac{|\Gamma(x) \,\Delta\, \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)| + |\Gamma(x) \cap \Gamma(y)|}$$

where Δ is the symmetric difference between two sets and Γ(x) is the neighbourhood of x derived from {y | (x, y) ∈ E}.

The new distance between each pair (x, y) is computed in a graph Γ = (X, E, W), where W is the weight function W : X × X → [0,1].

where: with Y = Γ (x) Γ (y)

We then applied the PRODISTIN method, based on the new distance formula for weighted edges, to a list of 22819 binary weighted interactions. The obtained classification tree contained 3975 proteins. Since more than 40% of the tree proteins are of unknown function, only a small number of PRODISTIN classes were identified in this tree. Thus, for comparison purposes, we here considered sub-trees (based on topology only) instead of PRODISTIN classes when necessary.

GO annotation analysis

The Gene Ontology (version February 29th, 2006) Cellular Component annotations of each signalling module protein have been slimmed to a list of annotations comprising the following 4 terms: Membrane or Extracellular, Cytoplasm, Nucleus and Other. Similarly, the GO Molecular Function annotations have been simplified to a list of 4 terms: Receptor or Ligand, Kinase or Hydrolase, Transcription Factor and Others. Pie charts are shown in Figure 2.

Density calculation

The density of links Di within a class X1 and De between two classes X1 and X2 in a graph Γ = (X, E) are defined by the two formulas:

$$D_i(X_1) = \frac{|\{(x, y) \in E : x \in X_1,\ y \in X_1\}|}{|X_1|\,(|X_1| - 1)/2} \qquad D_e(X_1, X_2) = \frac{|\{(x, y) \in E : x \in X_1,\ y \in X_2\}|}{|X_1|\,|X_2|}$$

The density of links within a class Di corresponds to the number of observed links between the class' vertices divided by the number of possible links. The density of links between a set of more than 2 classes is calculated as the sum of densities between all pairs of classes divided by the number of possible pairs between classes.

Edge betweenness calculation and distribution study

The total number of shortest paths between all pairs of vertices in a graph that run through a given edge defines its edge-betweenness.
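The density and edge-betweenness measures used in this part of the Methods can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: function names are hypothetical, edges are assumed to be unique unordered pairs, classes are assumed to be disjoint, and edge-betweenness is accumulated with a standard breadth-first procedure in which the shortest paths joining one pair of vertices share a total weight of 1.

```python
from collections import deque
from itertools import combinations


def density_within(edges, cls):
    """Observed links among the members of a class / possible links."""
    cls = set(cls)
    observed = sum(1 for a, b in edges if a != b and a in cls and b in cls)
    possible = len(cls) * (len(cls) - 1) / 2
    return observed / possible if possible else 0.0


def density_between(edges, cls1, cls2):
    """Observed links joining two disjoint classes / possible links."""
    cls1, cls2 = set(cls1), set(cls2)
    observed = sum(1 for a, b in edges
                   if (a in cls1 and b in cls2) or (a in cls2 and b in cls1))
    possible = len(cls1) * len(cls2)
    return observed / possible if possible else 0.0


def density_among(edges, classes):
    """Mean of the between-class densities over all pairs of classes."""
    pairs = list(combinations(classes, 2))
    return sum(density_between(edges, c1, c2) for c1, c2 in pairs) / len(pairs)


def edge_betweenness(edges):
    """Weighted count of shortest paths running through each edge."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    eb = {frozenset(e): 0.0 for e in edges}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
        # Walk back from the farthest vertices, sharing path weight
        # among predecessors in proportion to their path counts.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in adj[w]:
                if dist[v] == dist[w] - 1:
                    share = sigma[v] / sigma[w] * (1 + delta[w])
                    eb[frozenset((v, w))] += share
                    delta[v] += share
    # Every unordered pair of vertices was visited from both of its ends.
    return {e: value / 2 for e, value in eb.items()}
```

For a network of a few thousand edges this direct approach is fast enough; a graph library would be a reasonable alternative for larger networks.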
If there is more than one shortest path between a pair of vertices, each path is given an equal weight such that the total weight of all paths is equal to 1 [21]. The EB value has been calculated for the 2061 edges of the interactome. Then, we considered 2 subsets of edges:

- the 188 edges linking the proteins belonging to the signalling network, hereafter called 'internal' to the signalling network;

- the other 1873 edges connecting either two proteins outside of the signalling network or one inside and one outside, hereafter called 'external' to the signalling network.

For each subset, the distribution of the corresponding EB values is represented as pie charts according to the interquartile intervals of the initial EB value distribution (Figure 4b). P-values are calculated using the hypergeometric distribution.

Authors' contributions

AB performed all bioinformatic analyses and participated in the design of the study and in the writing of the manuscript. JBA and AG elaborated the classification method for weighted graphs. BJ was involved in data acquisition, conceived of the study, and participated in its design. CB conceived of the study, participated in its design and coordination, and wrote the manuscript. All authors read and approved the final manuscript.

Additional file 1

Canonical signalling proteins. List of proteins selected as canonical proteins (see Methods) for each of the 10 signalling pathways. Protein names correspond to FlyBase gene symbols. (15K, xls)

Additional file 2

Signalling modules details. Class name and class members, class annotations and p-values for each annotation; annotation(s) of the larger class and number of proteins in the larger class when the signalling classes are nested into larger ones; p-value of the slimmed GO Cellular Component annotation which is over-represented in each class, computed with the GOStat statistical tool of GOToolBox [33].
(15K, xls)

Additional file 3

Interactions between canonical pathways. List of protein-protein interactions between canonical proteins from different signalling pathways belonging to the same signalling module. For each interaction, the literature source and a short description are given, when available. (21K, xls)

Additional file 4

Functional status and Gene Ontology annotations of the proteins classified with canonical proteins in each signalling module. The functional status of each protein found in signalling modules (except canonical proteins) is defined as R when the protein is already known to regulate the pathway; AO, when the protein is known to be involved in alternate or other pathways; P, when the protein is predicted to be involved in the pathway by this analysis. For each protein, all available GO annotations in the Biological Process, Molecular Function and Cellular Component ontologies are given. GO annotations are printed in blue if they correspond to a predicted annotation (Inferred from Sequence Similarity or Inferred from Electronic Annotation) and not to an experimentally proven function. (99K, xls)

Additional file 5

Protein-protein interactions between the different signalling modules. The 30 links connecting the signalling modules are detailed. The source column indicates whether the interaction has been identified using small scale approaches or large scale two-hybrid screens. In the latter case, the PMID and the reference of the paper have a purple background. (14K, xls)

Additional file 6

Comparison between the connectivity of signalling proteins vs. others. (43K, xls)

Additional file 7

Analysis of edge-betweenness distribution in the large Drosophila PPI network. (19K, xls)

Acknowledgements

We dedicate this work to the memory of our friend and colleague Florence Horn.
We thank Emmanuelle Becker, Carl Herrmann, Patrick Lemaire and Laurence Röder for their careful reading of the manuscript and helpful suggestions. AB and JBA were supported by fellowships from the French 'Ministère de l'Enseignement Supérieur et de la Recherche' and the 'Association pour la Recherche contre le Cancer'. This work was supported by an ACI IMPBio grant (EIDIPP project) to BJ.

References