Copyright © 2009 Andreopoulos et al; licensee BioMed Central Ltd.

Triangle network motifs predict complexes by complementing high-error interactomes with structural information

1 Biotechnology Center (BIOTEC), Technische Universität Dresden, 01307 Dresden, Germany
2 nanometis, Tatzberg 47-49, 01307 Dresden, Germany

Corresponding author. Bill Andreopoulos: williama/at/biotec.tu-dresden.de; Christof Winter: winter/at/biotec.tu-dresden.de; Dirk Labudde: dirk.labudde/at/biotec.tu-dresden.de; Michael Schroeder: ms/at/biotec.tu-dresden.de

Received January 14, 2009; Accepted June 27, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
Many high-throughput studies produce protein-protein interaction networks (PPINs) with many errors and missing information. Even for genome-wide approaches, there is often a low overlap between PPINs produced by different studies. Second-level neighbors separated by two protein-protein interactions (PPIs) were previously used for predicting protein function and finding complexes in high-error PPINs. We retrieve second-level neighbors in PPINs, and complement these with structural domain-domain interactions (SDDIs) representing binding evidence on proteins, forming PPI-SDDI-PPI triangles.

Results
We find low overlap between PPINs, SDDIs and known complexes, all well below 10%. We evaluate the overlap of PPI-SDDI-PPI triangles with known complexes from the Munich Information Center for Protein Sequences (MIPS). PPI-SDDI-PPI triangles have ~20 times higher overlap with MIPS complexes than second-level neighbors in PPINs without SDDIs. The biological interpretation for triangles is that a SDDI causes two proteins to be observed with common interaction partners in high-throughput experiments.
The relatively few SDDIs overlapping with PPINs are part of highly connected SDDI components, and are more likely to be detected in experimental studies. We demonstrate the utility of PPI-SDDI-PPI triangles by reconstructing myosin-actin processes in the nucleus, cytoplasm, and cytoskeleton, which were not obvious in the original PPIN. Using other complementary datatypes in place of SDDIs to form triangles, such as PubMed co-occurrences or threading information, results in a similar ability to find protein complexes.

Conclusion
Given high-error PPINs with missing information, triangles of mixed datatypes are a promising direction for finding protein complexes. Integrating PPINs with SDDIs improves finding complexes. Structural SDDIs partially explain the high functional similarity of second-level neighbors in PPINs. We estimate that relatively little structural information would be sufficient for finding complexes involving most of the proteins and interactions in a typical PPIN.

Background
Protein-protein interaction networks (PPINs) derived from high-throughput studies are known to have many errors [1,2]. Data from different studies usually exhibit low overlap; for instance, two large-scale human interactome screens [3,4] share only six interactions, while each has several thousand interactions [5-7]. In some PPINs, more than 50% of reported interactions are estimated to be false positives (FPs) or wrong interactions [8,9]. Moreover, current PPINs are incomplete with an estimated false negative (missing interactions) rate approaching 90% [10-12]. False positives often result when the matrix model, which fully connects the prey and bait proteins, is used for interpreting results of affinity purification followed by mass spectrometry experiments [13]. Not all interactions occur at the same place and time in all cellular states. This implies that representing a PPIN as a set of binary protein-protein interactions (PPIs) is often incomplete [14].
Instead, one wants to restructure protein complexes in PPINs, which are modular units of physical interactions occurring at the same time and cellular component [15,16]. For predicting complexes one wants to include complementary data, such as structural domain-domain interactions (SDDIs) representing binding evidence on proteins [17-22]. At the same time, one wants to leave the false positives out of predicted complexes [22-26]. It was proposed that triangle network motifs represent the basic building blocks of PPINs [27-32]. In this paper, we complement PPIs with SDDIs to form PPI-SDDI-PPI triangle network motifs. Triangle network motifs integrate high-throughput PPINs with complementary knowledge, such as structural data, to account for missing edges [25,33-38]. Our proposed paradigm of PPI-SDDI-PPI triangle network motifs integrates:
• PPINs from high-throughput experimental studies, which have considerable coverage but also errors, and

A theme encompasses several PPI-SDDI-PPI triangle network motifs with one SDDI edge as their common organizational principle.

Figure 1a
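The triangle and theme definitions above can be illustrated with a minimal Python sketch. It assumes that PPIs and SDDIs are available as undirected protein pairs; the function names and the toy protein identifiers (`P1`, `A`, and so on) are hypothetical and are not taken from the paper's software.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def find_themes(ppi_edges, sddi_edges):
    """Map each SDDI edge to the proteins that close a PPI-SDDI-PPI triangle.

    A protein closes a triangle with the SDDI pair (a, b) when it has a PPI
    with both a and b. The set of such proteins forms the theme of that SDDI.
    """
    adjacency = build_adjacency(ppi_edges)
    themes = {}
    for a, b in sddi_edges:
        common = (adjacency.get(a, set()) & adjacency.get(b, set())) - {a, b}
        if common:
            themes[frozenset((a, b))] = common
    return themes


# Toy data (hypothetical identifiers).
ppi = [("P1", "A"), ("P1", "B"), ("P2", "A"), ("P2", "B"), ("P3", "A")]
sddi = [("A", "B"), ("A", "C")]

themes = find_themes(ppi, sddi)
for pair, partners in themes.items():
    print(sorted(pair), "theme size:", len(partners), sorted(partners))
```

In this toy network the SDDI between `A` and `B` has two common interaction partners, so its theme contains two triangles, while the SDDI between `A` and `C` forms none.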
The purpose of PPI-SDDI-PPI triangles is to support revealing biological insights, such as finding complexes of physical interactions occurring at the same time and location [50-55]. Besides complementing PPINs with SDDIs, we additionally form triangle network motifs with other complementary datatypes (CD), such as threading results, and PubMed protein co-occurrence data, thus expanding to other PPI-CD-PPI triangles [56-59]. The complex prediction with other CD is comparable to SDDIs; this supports that the improved complex prediction results are due to a physical relation between proteins and not just coincidence [40,60,61].

A rationale for triangles and themes is the observation that proteins with common interaction partners are likely to have common functions [62-65]. Second-level neighbors in PPINs are functionally similar, and are useful for functional prediction [66-70]. By this "Guilt by Association of Common Interaction Partners" approach, themes can be tied to specific biological phenomena and processes [71-73]. For instance, it was shown for the E. coli and C. elegans transcriptional network that subgraphs matching two types of transcriptional regulatory circuit triangle – feed-forward and bi-fan – overlap with one another and form large clusters [28,74-76].

Another rationale for triangles and themes is that PPINs are "small-world" implying neighborhood clustering, where neighbors of a given node tend to interact with one another; this results in triangle network motifs of three-node interconnection patterns [77,78]. This led to the "transitive module" hypothesis that is used for predicting missing interactions, as shown in Figure 1a.

Extracting triangle network motifs and themes from high-throughput interaction networks

Figure 2
This paper is organized as follows. Next, we present related work on finding errors in PPINs via motifs of interconnection patterns. Then, we present the results on prediction of true positive complexes using triangles. We illustrate this with an example of myosin-actin related activities. Next, we explain the biological basis for triangles: a model for SDDIs that explains the functional similarity of second-level neighbors in PPINs. Finally, we conclude the paper with an outlook of using other data sources to complement interactomes. Related work Several papers aim to find errors in PPINs by completing them for missing edges or finding false positives [79-83]. Our approach differs from all of these approaches, since we integrate structural information with PPINs derived from high-throughput studies to find triangle network motifs and themes, which can be used to predict complexes. Moreover, we offer the biological basis for the ability of this structural-PPI hybrid method to predict complexes. A first category of work involves collecting ensembles of data, such as structural or literature information. Alber et al. (2007) [84] collect diverse high-quality data, and analyse the ensemble to produce a detailed architectural map of the nuclear pore complex. This work translates the data into spatial restraints, instead of using network motifs as in our approach. Ramirez et al. (2007) [22] assessed the quality and value of publicly available human protein network data, by comparing predicted datasets, high-throughput results from yeast two-hybrid screens, and literature-curated protein-protein interactions. This analysis revealed major differences between datasets. Rhodes et al. (2005) [85] demonstrate that a probabilistic analysis integrating model organism protein interactome data, structural domain data, genome-wide gene expression data and functional annotation data predicts nearly 40,000 interactions in humans. Bader et al. 
(2004) [19] perform an integrated analysis of proteomics data with data from genetics and gene expression. Combining temporal gene expression clustering with proteomics network topology provides an automated method for extracting biological subnetworks. Huang et al. (2004) [86] present POINT, the "prediction of interactome database". POINT integrates several publicly accessible databases, with emphasis placed on mouse, fruit fly, worm and yeast protein-protein interactions datasets from the Database of Interacting Proteins (DIP), followed by converting them into a predicted human interactome. POINT also incorporates correlated mRNA expression clusters obtained from cell cycle microarray databases and subcellular localization from Gene Ontology to pinpoint the likelihood of biological relevance of each predicted set of interacting proteins. Patil et al. (2005) [87] find that a combination of sequence, structure and annotation information is a good predictor of true interactions in large and noisy interactomes. Another large body of work attempted to predict the missing interactions or assign confidences to large noisy interactomes. Some of these use network topology and others use information on SDDIs, while others use Bayesian networks or probabilistic measures. Yu et al. (2006) [68] describe predicting missing PPIs, using only the PPIN topology as observed by a high-throughput experiment. The method searches the interactome for defective cliques, nearly complete complexes of pairwise interacting proteins, and predicts the interactions that complete them. Chen et al. (2008) [88] propose using triplets of observed PPIs to predict and validate interactions. Yeast is the only data set large enough to warrant application of this method. Singhal et al. (2007) [23] present DomainGA, a computational approach that uses information about SDDIs to predict PPIs. This method achieves good prediction for the positive and negative PPIs in yeast. Pitre et al. 
(2006) [89] present PIPE, a system for predicting PPIs for any target pair of the yeast proteins from their primary structure. Chen et al. (2005) [24] introduce a novel measure called IRAP, "interaction reliability by alternative path", for assessing the reliability of PPIs based on the underlying PPIN topology. IRAP measure is effective for discovering reliable PPIs in large noisy PPIN datasets. Ng et al. (2003) [90] propose an integrative approach that applies SDDIs to predict and validate PPIs. Chen et al. (2005) [24] introduce a SDDI-based random forest of decision trees to infer PPIs. This method is capable of exploring all possible SDDIs and making predictions based on all the protein domains. Wu et al. (2006) [91] propose using the similarity between two Gene Ontology (GO) terms for reconstructing and predicting a yeast PPIN based solely on knowledge of functional associations between the GO annotations. We have also experimented with using GO similarities in our approach. Chinnasamy et al. (2006) [92] present a probabilistic-based naive Bayesian network to predict PPIs using protein sequence information. This framework provides a confidence level for every predicted PPI. Jansen et al. (2003) [93] also developed an approach using Bayesian networks to predict PPIs in yeast. Han et al. (2004) [94] propose PreSPI, a domain combination based PPI prediction approach. PPIs are interpreted as the result of groups of multiple SDDIs. This approach also provides an interacting probability for PPIs. Recently, Vidal and colleagues [95] used reference sets to calculate the probability that a newly identified PPI is a true biophysical interaction, and assigned confidence scores to all PPIs in interactome networks. Yu et al. (2009) [96] assign confidence scores that reflect the reliability of each PPI, by using multiple independent sets of training positives to reduce the bias inherent in using a single training set. 
Another body of work has performed large scale analysis of networks, statistical network motif analysis or error estimation, which is of interest for our work as well. Jin et al. (2007) [32] use network motifs to solve the open question about 'party hubs' and 'date hubs' which was raised by previous studies. At the level of network motifs instead of individual proteins, they found two types of hubs, motif party hubs and motif date hubs, whose network motifs display distinct characteristics on biological functions. Zhang et al. (2005) [28] observed that different types of networks exhibit different triangle profiles, providing a means for network classification. They extended the network triangle concept to an integrated network of many interaction types. Mathivanan et al. (2006) [97] analyzed the major publically available databases that contain literature curated PPI information for human proteins, finding a large difference in their content. This included BIND, DIP, HPRD, IntAct, MINT, MIPS, PDZBase and Reactome databases [98]. Chiang et al. (2007) [1] assess the error statistics in all published large-scale datasets for S. cerevisiae. Vidal and colleagues [99,100] used an empirically-based approach to assess the quality and coverage of existing human interactomes. They found that high-throughput human interactomes are more precise than literature-curated PPIs from publications. Several papers used clustering or graph theoretic methods to predict complexes in PPINs. Bader et al. (2003) detected complexes as highly connected subgraphs [101]. Andreopoulos et al. (2007) detected complexes as groups of proteins with similar interaction partners [62]. Cakmak et al. (2007) [102] go beyond complexes to discover unknown pathways in organisms, using Gene Ontology (GO)-based functionalities of enzymes involved in metabolic pathways. 
Results and discussion

In our experiments, we employ three high-throughput PPINs, derived by affinity purification followed by mass spectrometry (AP/MS). Krogan06 is based on [103]. Gavin06MATRIX and Gavin06SPOKE are matrix and spoke model interpretations, respectively, of [104]. The matrix model of interpreting pull-down studies connects all prey proteins that were pulled out with a bait, while the spoke model connects only the preys with the bait. We focus on yeast PPINs, since yeast is a well-annotated organism with Gene Ontology terms. The Krogan06 and Gavin06SPOKE yeast PPINs have low overlap. To evaluate the success of our approach, we employ known complexes from the MIPS database [105,106]. We evaluate whether known MIPS complexes could be predicted using triangles and theme motifs, consisting of PPINs combined with complementary data such as SDDIs. For illustrative purposes, we use three manually curated networks of myosin-actin involvement in different cellular processes [see Additional files 1, 2, 3, and 4].

Low overlaps of PPINs with complexes

The biological motivation for our work includes low overlap of high-throughput PPINs with known complexes. We compared the overlaps of two high-throughput PPINs, the Gavin06MATRIX and Krogan06 networks, with the MIPS protein complexes dataset. Table 1 shows full results for the overlaps of Gavin06MATRIX and Krogan06 networks with the MIPS complexes. For protein pairs that appear in both PPINs and complexes, we evaluated the number of overlapping edges PPIN ∩ complexes. We found that Gavin06 ∩ MIPS has 305 overlapping edges and Krogan06 ∩ MIPS has 359 overlapping edges.
Gavin06MATRIX and Krogan06 had thousands of edges connecting these same proteins, which were not in MIPS.

Figure 3
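The matrix and spoke interpretations described above, and the edge overlap PPIN ∩ complexes, can be sketched as follows. This is a minimal illustration that assumes pull-down results are given as a mapping from each bait to its list of preys; the function names and toy identifiers are hypothetical, and the matrix model is taken here to fully connect the bait together with its preys.

```python
from itertools import combinations


def spoke_edges(pulldowns):
    """Spoke model: connect each bait only to the preys it pulled down."""
    edges = set()
    for bait, preys in pulldowns.items():
        for prey in preys:
            if prey != bait:
                edges.add(frozenset((bait, prey)))
    return edges


def matrix_edges(pulldowns):
    """Matrix model: fully connect the bait and all preys of one pull-down."""
    edges = set()
    for bait, preys in pulldowns.items():
        members = set(preys) | {bait}
        for a, b in combinations(sorted(members), 2):
            edges.add(frozenset((a, b)))
    return edges


def edge_overlap(ppin_edges, complex_edges):
    """Edges present in both the PPIN and the reference complexes."""
    return ppin_edges & complex_edges


# Toy data (hypothetical identifiers): one bait with three preys.
pulldowns = {"BAIT1": ["X", "Y", "Z"]}
reference = {frozenset(("X", "Y"))}  # a single known complex edge

spoke = spoke_edges(pulldowns)
matrix = matrix_edges(pulldowns)
print(len(spoke), len(matrix))
print(len(edge_overlap(spoke, reference)), len(edge_overlap(matrix, reference)))
```

The toy case shows why the matrix model yields many more edges than the spoke model for the same experiment: it recovers the prey-prey edge that the spoke model misses, at the cost of adding every other prey-prey pair as well.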
PPI-SDDI-PPI triangles predict complexes

Given the many false negatives (missing interactions) and false positives (wrong interactions) in protein-protein interaction networks (PPINs) derived from high-throughput experiments, we evaluated the success of triangle network motifs and themes in finding known MIPS complexes. With structural domain-domain interactions (SDDIs) representing binding evidence on proteins, PPI-SDDI-PPI triangle network motifs are likely to reflect true complexes. To evaluate this, we examined the overlap of triangles from Gavin06 and Krogan06 with MIPS complexes. For the common proteins we evaluated the interactions that are true positives (overlap) or false positives (no overlap) with MIPS. The first row of Table 2 shows the low overlap between PPIN second-level neighbors (without complementary data) and MIPS complexes; where all three proteins in an indirect relation occur in MIPS complexes (denominator), rarely both PPIs occur (numerator). Despite the observed functional similarity of second-level neighbors in PPINs [62-70], second-level neighbors have overlap with MIPS lower than 1%. The other rows show that integrating complementary datatypes (CD) in a PPIN to form PPI-CD-PPI triangle network motifs results in a higher overlap with MIPS complexes. In Table 2 the second row shows the PPI-SDDI-PPI triangle overlap with MIPS complexes as a true positive rate as high as 31%; the other triangle interactions are likely false positives. For Gavin06MATRIX the triangle true positive rate is lower than for Krogan06, since Gavin06MATRIX reflects the matrix model interpretation, which resulted in 93,881 edges including many false positives. Gavin06MATRIX has many errors when overlaid with the MIPS complex dataset. The success rate is higher for Gavin06SPOKE, since it has fewer false positives than Gavin06MATRIX.
Table 3 shows that with varying confidence thresholds for SDDIs, the true positive rate changes. This shows that it is preferable to use the highest-confidence SDDIs. It also shows the significance of using SDDIs for complex prediction.
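The true positive rate used in this evaluation can be sketched as follows, under one reading of the description above: a triangle enters the denominator when all three of its proteins occur in MIPS complexes, and the numerator when both of its PPI edges are also MIPS edges. The function name, the triangle representation as (partner, a, b) tuples, and the toy identifiers are assumptions for illustration only.

```python
def triangle_true_positive_rate(triangles, mips_edges):
    """Fraction of evaluable triangles whose two PPI edges both occur in MIPS.

    triangles  : iterable of (partner, a, b), where (a, b) is the
                 complementary-data edge and partner has a PPI with both.
    mips_edges : set of frozenset pairs from the reference complexes.
    Returns None when no triangle has all three proteins in MIPS.
    """
    mips_proteins = set().union(*mips_edges)
    evaluated = 0
    hits = 0
    for partner, a, b in triangles:
        if not {partner, a, b} <= mips_proteins:
            continue  # not evaluable: some protein is absent from MIPS
        evaluated += 1
        if (frozenset((partner, a)) in mips_edges
                and frozenset((partner, b)) in mips_edges):
            hits += 1
    return hits / evaluated if evaluated else None


# Toy data (hypothetical identifiers).
mips = {
    frozenset(("P1", "A")),
    frozenset(("P1", "B")),
    frozenset(("P2", "A")),
}
toy_triangles = [("P1", "A", "B"), ("P2", "A", "B"), ("P3", "A", "B")]

rate = triangle_true_positive_rate(toy_triangles, mips)
print(rate)
```

Here the triangle with `P3` is skipped because `P3` does not occur in the reference, the triangle with `P1` counts as a true positive, and the triangle with `P2` counts as a false positive because only one of its PPIs is a reference edge.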
Triangles with other complementary data We added to PPINs other complementary datatypes, besides structural SDDIs, to form triangles: PubMed literature co-occurrences of protein mentions, and Interpro Pfam domain co-occurrences in PPIs [107] (see methods section). Table 2 rows 3–4 show the MIPS complex overlaps with triangle network motifs using other complementary datatypes to form triangles. The triangles with other complementary datatypes exhibit little difference in their overlap with MIPS complexes. In the last row 5 where all datatypes are combined, the overlap with MIPS increases. Triangles that include SDDIs or other complementary data to match second-level neighbors have higher overlap with MIPS complexes than second-level neighbors without any complementary data. These results point to the direction of complementing the PPINs with other datatypes as triangle network motifs, rather than simple edges, for improved prediction of MIPS complexes. Table 4 shows the individual ability of various datatypes to predict the MIPS complexes, showing the edge overlap without forming triangles. As shown under the column "Edges in edge overlap", all datatypes have moderate edge overlap with MIPS. The individual datatypes have little difference in their ability to predict MIPS.
Example: reconstructing distinct myosin-actin biopathways via themes of PPI-SDDI-PPI triangle network motifs

Type I myosin motor proteins (MYO3 or MYO5) have distinct but overlapping functions in multiple cellular processes and locations [108].

Figure 4
MYO3 is one of two type I myosins, which utilize the cytoskeleton for movement, moving along microfilaments through interaction with actin. Deletion of MYO3 causes severe defects in growth and actin cytoskeleton organization [111]. Besides myosin, SHE4 is also important for the organization of the actin cytoskeleton. SHE4 is of special interest because it is involved in all of organization of the actin cytoskeleton, asymmetric mRNA localization, and endocytosis [112]. SHE4 has similar Gene Ontology annotations as myosin. Next, we explore whether triangle network motifs and themes in Gavin06MATRIX can help reconstruct distinct myosin-actin pathways for cellular localization of biomolecules. Cytoskeletal actin organization Figure Figure4b4b RSR1, BNI1, GEA1 play a role in cytoskeletal actin localization [113,114]. The correct localization of RSR1 has been shown to be critical for actin cytoskeleton organization. Localization of the Ras-like GTPase RSR1 and its regulators are required for selection of a specific growth site [115]. Regulators direct the correct localization of RSR1 in various organisms. In Figure Figure4b,4b What is of special interest in this example is the intersection of the neighborhoods of RSR1, ARF2, BNI1 comprising EF1A-RL3, which were previously observed to have a functional significance for F-actin localization [118]. In addition, BNI1 and GEA1 appear to be connected to the ARF2 complex via PYR1 intermediary. Thus, RSR1, GEA1 and BNI1 appear to be linked to one another via EF1A-RL3-PYR1, which are also common partners of ARF2. This suggests a role of EF1A-RL3-PYR1 as the regulators for the RSR1-GEA1-BNI1 complex localization in yeast cytoskeletal actin localization [119]. Overexpression of GEA1 or GEA2 was observed to bypass the requirement for profilin in actin cable formation [116]. Profilin is an actin-binding protein involved in cytoskeleton dynamics. 
Profilin enhances actin growth as follows: Profilin binds to monomeric actin on the plus end of the filament inducing a shape change of the actin subunit, allowing the G-actin to replace the ADP to which it is bound by ATP and form F-actin. The F-actin then forms a heterodimer which can bind to the plus end of an actin filament. In the process of binding to the actin monomers it also stereochemically inhibits addition to the minus end [120]. On the other hand, in a separate study it was observed that loss of the activity to bind EF1A-RL3 displayed an abnormal phenotype represented by dissociated localizations of F-actin, which were co-localized in wild-type cells [118]. This observation links the two studies, suggesting that the significance of EF1A-RL3 for F-actin localization may help explain why overexpression of GEA1 or GEA2 bypassed the requirement for profilin in actin cable formation.

Nuclear actin and myosin I required for RNA polymerase I, II, III transcription

Figure 4c

TBA1/RAP1 play a role in transcription in the nucleus from the RNA polymerase II promoter. TBA1/RAP1 is a DNA-binding protein involved in either activation or repression of transcription, depending on binding site context; it also binds telomere sequences and plays a role in telomeric position effect (silencing) and telomere structure.

mRNA localization: The SHE protein complex is required for cytoplasmic transport of mRNAs in yeast

Figure 4d

ARF2, EF1A, IMDH3 play a role in mRNA localization for translation. ARF2 is an ADP-ribosylation factor involved in regulation of coated vesicle formation in intracellular trafficking within the Golgi [130].

• EF1A: Translation elongation factors are responsible for two main processes during protein synthesis on the ribosome [131]. EF1A (or EF-Tu) is responsible for the selection and binding of the cognate aminoacyl-tRNA to the A-site (acceptor site) of the ribosome.
EF2 (or EF-G) is responsible for the translocation of the peptidyl-tRNA from the A-site to the P-site (peptidyl-tRNA site) of the ribosome, thereby freeing the A-site for the next aminoacyl-tRNA to bind. Elongation factors are responsible for achieving accuracy of translation and both EF1A and EF2 are remarkably conserved throughout evolution (InterPro annotation). • IMDH3: Involved in the amino acid biosynthesis pathway. Biological interpretation of PPI-SDDI-PPI triangles: A structural basis for functional similarity of second-level neighbors in PPINs In this section we propose an explanation for the observation that SDDIs can complement high-error PPINs to improve the finding of complexes. A structural SDDI between two proteins implies that they are likely to be observed with common groups of interaction partners in an experimental study. This especially holds in affinity purification experiments followed by mass spectrometry (AP/MS), since the bait-prey technologies used will cause structurally connected proteins to be detected as prey for similar bait protein(s). Of course this only holds for proteins that are detectable as prey [132]. A SDDI is the likely reason why two proteins are observed with common friends in PPINs from high-throughput AP/MS studies. Then, the SDDI's interaction partners are likely to be observed in different cellular components; Figure Figure55
Gene Ontology (GO) similarity in triangle PPI edges

Figure 6
Why are few SDDIs detected in high-throughput PPIN experiments?

Table 6 shows that few SDDIs overlap with PPINs, even when considering the highest-confidence SDDIs only.

Figure 7
Conclusion

How many SDDIs are needed to predict all complexes for an entire PPIN?

Figure 8
SDDIs and the PubMed co-occurrences relate to two different aspects. SDDIs are based on experimental results that are likely to imply a structural interaction. In the case of SDDIs, we can use all information found by mapping structural domains to proteins using BLAST sequence similarity and still get good prediction accuracy. On the other hand, for literature we have to apply a strict filtering, keeping only the top 1% of protein co-occurrences appearing in PubMed as complementary data. We observed that the literature co-occurrences appear to give slightly better results than using SDDIs as complementary data. The main limitation of SDDIs at present is the sparsity of known structural interactions. Since PubMed is expected to grow faster than structural knowledge, using literature co-occurrences might give even better prediction accuracy in the future, as long as a strict cut-off is set. Conclusion With the amount of PPINs from high-throughput experiments, structural data and literature-based interactions on the rise, we studied their combined ability to predict known complexes. We found a low overlap of PPINs derived from high-throughput studies with known complexes, as well as low overlap with structural domain-domain interactions. We proposed PPI-SDDI-PPI triangle network motifs as a model for analysing PPINs and predicting complexes. PPI-SDDI-PPI triangles have higher overlap with MIPS complexes than random second-level neighbors, indicating that structural SDDIs are useful for complementing PPINs in triangles to create a more complete picture of protein cellular involvement. We complemented PPINs with several other datatypes besides SDDIs to create triangle and theme motifs, resulting in similar overlaps with complexes. Themes of PPI-SDDI-PPI triangles helped us to reconstruct complexes in myosin-actin processes that were not detected by PPINs. 
Our approach is useful for finding true positives in PPINs, as structural knowledge on proteins increases in the future. SDDIs partially explain the high functional similarity of second-level neighbors in PPINs. A SDDI may cause a structurally connected pair of proteins to be observed with common interaction partners in high-throughput affinity purification experiments followed by mass spectrometry (AP/MS) that use bait-prey technologies. We examined why some SDDIs are detected in PPINs, and we found that SDDIs detected by PPINs are part of highly connected components/complexes, therefore they are more likely to be detected by experimental studies. Methods In this section we give an overview of the methods used in this study. Figure Figure22 PPI-CD-PPI triangle network motifs PPI-CD-PPI triangles contain three proteins connected by two PPIs and an edge of a complementary datatype (CD), such as a structural SDDI; in this case, we refer to PPI-SDDI-PPI triangles, as Figure Figure1a1a Let σSDDI denote the number of PPI-SDDI-PPI triangles a structural SDDI is involved in. A structural SDDI may be involved in σ ≥ 1 triangles, which we refer to as a theme. A theme is given by the σ common interaction partners (intersecting neighborhoods) of a SDDI's protein pair, and some PPIs in a theme may be False Positives. Complementary datatypes As structural information to complement PPINs, we used the SCOPPI database, which contains SDDIs observed in known protein complex structures [134]. To assign domains, we BLASTed the sequences of all proteins in the "Saccharomyces Genome Database" (which includes yeast PPINs) against all domains sequences of SCOPPI. We considered only BLAST hits with an E-value ≤ 0.01 and a sequence identity percentage s ≥ 30%. In addition, we required 75% of the domain to appear in the protein. Other complementary datatypes (CD) edges we used included The Genomic Threading Database (GTD) [141]. 
GTD contains yeast protein assignments to SCOP domain structural annotations and interacting structures. An assigned Confidence value gives an indication of the strength of a hit, ranging from "certain" to "guess", which is based on a P-value measure of significance.

The next CD dataset we used was PubMed literature co-occurrences of protein mentions. To extract these, we used the GoPubMed protein mention extraction algorithm to assign proteins to all PubMed documents [142]. Then, we used a version of the Blosum co-occurrence score to find if two proteins p1 and p2 co-occur frequently in PubMed documents. A cutoff of 10 was strict enough to filter out the majority of protein co-occurrences in PubMed, resulting in a network of 170,638 edges.

The last CD dataset we used was InterPro Pfam domain co-occurrences in PPIs. For this, we took all IntAct yeast PPIs and assigned to the proteins Pfam domains from InterPro [107]. Then, we used the Blosum co-occurrence score to find which Pfam domains co-occur frequently in the IntAct yeast PPIs. Based on the most frequently co-occurring Pfam domains, we built a network over the yeast PPIs.

High-throughput PPINs and known complexes

We use two yeast PPINs that we denote as Gavin06 [104] and Krogan06 [103]. For Gavin06 we used both the matrix and the spoke model to interpret it, which we refer to as Gavin06MATRIX and Gavin06SPOKE throughout the text. Gavin06MATRIX had 93,881 edges, while Gavin06SPOKE had 22,452 edges. Krogan06 had 14,292 edges, consisting of the binary interactions as provided by the publication. For validation, we used MIPS complexes [105,106]. For MIPS we used the SPOKE model for the interpretation of complexes, since otherwise the result would be biased to give a high overlap with the PPINs [see Additional files 5, 6]. The MIPS complexes had 2,099 edges.
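The exact form of the Blosum co-occurrence score used for the PubMed and Pfam networks above is not given in this text. As a hedged illustration only, a BLOSUM-style score is commonly a log-odds ratio of the observed pair frequency to the frequency expected if the two items occurred independently. The sketch below assumes that form, with base-2 logarithms and no scaling; the function name and toy documents are hypothetical, and the cutoff of 10 quoted above may refer to a differently scaled score.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_cooccurrence(documents):
    """Log-odds co-occurrence score for every protein pair seen together.

    documents : list of collections of protein names, one per document.
    score(a, b) = log2( P(a and b in a document) / (P(a) * P(b)) )
    """
    n_docs = len(documents)
    single = Counter()
    pair = Counter()
    for proteins in documents:
        unique = set(proteins)  # count each protein once per document
        single.update(unique)
        pair.update(frozenset(p) for p in combinations(sorted(unique), 2))

    scores = {}
    for key, n_pair in pair.items():
        a, b = tuple(key)
        observed = n_pair / n_docs
        expected = (single[a] / n_docs) * (single[b] / n_docs)
        scores[key] = math.log2(observed / expected)
    return scores


# Toy corpus (hypothetical): A and B always appear together.
docs = [{"A", "B"}, {"A", "B"}, {"C"}, {"C"}]
scores = log_odds_cooccurrence(docs)
print(scores[frozenset(("A", "B"))])
```

A positive score means the pair co-occurs more often than expected by chance, so keeping only pairs above a strict cutoff retains the most strongly associated proteins.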
Moreover, for our illustrations we manually curated three network examples from the literature, representing myosin-actin involvement in cytoskeleton organisation, nucleus transcription, and mRNA translocation. Developing these networks involved reading papers from the biomedical literature and recording any interaction(s) described in the articles. Gene Ontology similarity It is likely that a PPI is not physical, but a false positive, which may be detected by a GO similarity of zero. PPIs with a GO similarity of zero hint at false positives. For calculating the similarity based on Gene Ontology terms, we searched for GO terms in the current abstract and compared them to the set of GO terms assigned to each gene candidate. For each potential tuple taken from the two sets (text and gene annotation), we calculated a distance of the terms in the ontology tree. These distances yielded a similarity measure for two terms, even if they did not belong to the same sub-branch or were immediate parents/children of each other. The distance took into account the shortest path via the lowest common ancestors, as well as the depth of this lowest common ancestor in the overall hierarchy (comparable to Schlicker et al., 2006 [133]). The distances for the closest terms from each set then defined a similarity between the gene and the text [142]. Correlation We computed the correlation coefficient between A and B, where A and B are matrices or vectors of the same size. A matrix entry contains a measure of Gene Ontology similarity (0 – 1) for a protein pair involved in a PPI or SDDI. We used the matlab corr2 correlation coefficient:
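MATLAB's corr2 returns the two-dimensional correlation coefficient: the sum over all entries of (A − mean(A))·(B − mean(B)), divided by the square root of the product of the summed squared deviations of A and of B, where the means are taken over all entries. A pure-Python equivalent is sketched below for reference; the helper names and the toy GO-similarity values are hypothetical.

```python
import math


def _flatten(values):
    """Flatten a vector or a nested list (matrix) into a list of floats."""
    flat = []
    for item in values:
        if isinstance(item, (list, tuple)):
            flat.extend(_flatten(item))
        else:
            flat.append(float(item))
    return flat


def corr2(a, b):
    """Two-dimensional correlation coefficient of same-sized A and B."""
    fa, fb = _flatten(a), _flatten(b)
    if not fa or len(fa) != len(fb):
        raise ValueError("A and B must be non-empty and of the same size")
    mean_a = sum(fa) / len(fa)
    mean_b = sum(fb) / len(fb)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(fa, fb))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in fa) * sum((y - mean_b) ** 2 for y in fb)
    )
    return numerator / denominator


# Toy GO-similarity matrices with entries in the range 0 to 1 (hypothetical).
go_a = [[0.0, 0.5], [1.0, 0.25]]
go_b = [[0.0, 0.25], [0.5, 0.125]]   # go_a scaled by 0.5
go_c = [[1.0, 0.5], [0.0, 0.75]]     # 1 - go_a

print(corr2(go_a, go_b), corr2(go_a, go_c))
```

The coefficient is undefined when either input is constant, because the denominator is then zero; this sketch lets that case raise a ZeroDivisionError rather than returning a value.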
HIERDENC supplementary material

We implemented the HIERDENC online database, which contains all of the datasets we used. HIERDENC helps a user to visualize and find true positives in PPINs via triangles of high-throughput PPINs and complementary data. It is available at http://www.hierdenc.com/ or http://projects.biotec.tu-dresden.de/HIERDENC/

Authors' contributions

All authors read and approved the final manuscript. BA planned the paper, carried out most of the experiments, wrote the software, wrote the Python scripts for making the networks, built the manually curated networks, and wrote most of the paper. CW helped in conceptualising the paper with discussions and provided the complementary data. DL helped in formulating the paper with discussions. MS supervised the work and contributed discussions and ideas.

Additional file 1 (7.5K, xls): An Excel file with UniProt accession numbers for all protein names used in the text.

Additional file 2 (949 bytes, txt): Manually curated network example from the literature, representing myosin-actin involvement in cytoskeleton organisation.

Additional file 3 (147 bytes, txt): Manually curated network example from the literature, representing myosin-actin involvement in mRNA translocation.

Additional file 4 (1.0K, txt): Manually curated network example from the literature, representing myosin-actin involvement in nucleus transcription.

Additional file 6 (1021 bytes, zip): A script for converting MIPS complexes to a SPOKE model network.

Acknowledgements

This work was funded by the EU Sealife project. Joerg Hakenberg helped with Gene Ontology similarity. Rainer Winnenburg helped with discussions.

References