Nucleic Acids Res. Jan 2012; 40(D1): D862–D865.
Published online Nov 8, 2011. doi: 10.1093/nar/gkr967
PMCID: PMC3244997

PINA v2.0: mining interactome modules

Abstract

The Protein Interaction Network Analysis (PINA) platform is a comprehensive web resource, which includes a database of unified protein–protein interaction data integrated from six manually curated public databases, and a set of built-in tools for network construction, filtering, analysis and visualization. The second version of PINA enhances its utility for studies of protein interactions at a network level, by including multiple collections of interaction modules identified by different clustering approaches from the whole network of protein interactions (‘interactome’) for six model organisms. All identified modules are fully annotated by enriched Gene Ontology terms, KEGG pathways, Pfam domains and the chemical and genetic perturbations collection from MSigDB. Moreover, a new tool is provided for module enrichment analysis in addition to a simple query function. The interactome data are also available on the web site for further bioinformatics analysis. PINA is freely accessible at http://cbg.garvan.unsw.edu.au/pina/.

INTRODUCTION

Protein–protein interactions (PPIs) mediate biological function and play a pivotal role in many cellular processes. Different small- and large-scale experimental approaches generate ever-increasing amounts of publicly accessible data. Given the availability of vast amounts of PPI data, analysis of PPI networks has become a major challenge, and considerable efforts have been undertaken to address it.

A common type of analysis focuses on the whole network of protein interactions for a given species (‘interactome’) (1). A number of studies have shown that interactomes follow a power-law degree distribution, exhibit small world behavior and tend to be modular (2,3). Identification of sub-networks with special characteristics using graphical approaches can also lead to biologically relevant insights. It is well established that densely interconnected regions of a global PPI network often correspond to functionally related groups of genes/proteins that can be identified as modules (4). Understanding how these modules are organized can lead to a better understanding of how cellular processes are coordinated in normal cells and perturbed under pathological conditions. Several efforts have been undertaken to identify modules, which might represent protein complexes or signaling pathways, from interactome networks (5–10). However, there is no unified resource for biologists to interrogate these interactome modules extracted from a regularly updated PPI database with extensive functional annotations and advanced network analysis tools.

In PINA v2.0, we generated the interactome data for six model organisms based on the existing PINA PPI integration database and applied different clustering algorithms to identify collections of modules. To improve biological interpretation, the identified modules have been comprehensively annotated using different knowledge databases. Both modules and annotations were saved in the PINA v2.0 database, and an advanced tool was developed for module enrichment analysis in addition to a simple query form. These new data and tools have been seamlessly integrated and can co-operate with the existing resources in PINA, which together provide a unique portal for biologists to better understand their genes of interest in the context of a PPI network.

INTERACTOME DATASET

The first version of PINA (11) has established a non-redundant PPI database, updated quarterly based on the integration of protein interaction data from six publicly available, manually curated databases: IntAct (12), MINT (13), BioGRID (14), DIP (15), HPRD (16) and MIPS MPact (17). We exported interactome data from the PINA PPI integration database in PSI-MI (Proteomics Standards Initiative Molecular Interactions) tab-delimited data exchange format (18) for six model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae). Each exported interactome file includes self and binary interactions with one interaction per row. The UniProt accession number was used as the protein identifier. These files will be updated concurrently with each new release of the PINA integration database and can be freely downloaded from the PINA website for further bioinformatics analysis.
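As an illustration of how such an exported file might be consumed downstream, the following minimal sketch reads PSI-MI TAB rows, extracts the UniProt accession from the first two columns and separates self interactions from binary ones. It assumes the standard layout in which the interactor identifiers occupy columns one and two in the form `uniprotkb:ACCESSION`; the function names and the sample rows (with placeholder identifiers and only two columns) are hypothetical and are not taken from the PINA downloads.

```python
from collections import defaultdict

# Made-up rows in PSI-MI TAB layout; the identifiers are placeholders, not PINA data.
# A real file carries further columns (detection method, publication, ...).
SAMPLE = (
    "#ID(s) interactor A\tID(s) interactor B\n"
    "uniprotkb:X00001\tuniprotkb:X00002\n"
    "uniprotkb:X00002\tuniprotkb:X00003\n"
    "uniprotkb:X00003\tuniprotkb:X00003\n"
)


def accession(field):
    """Return the bare accession from a field such as 'uniprotkb:X00001'."""
    first = field.split("|")[0]        # keep the first of several '|'-separated ids
    return first.split(":", 1)[-1]     # drop the 'database:' prefix if present


def read_pairs(lines):
    """Yield (interactor A, interactor B) accession pairs, one per data row."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):   # skip blank and header lines
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        yield accession(columns[0]), accession(columns[1])


def build_network(pairs):
    """Return (undirected adjacency dict, set of self-interacting proteins)."""
    neighbours = defaultdict(set)
    self_interacting = set()
    for a, b in pairs:
        if a == b:
            self_interacting.add(a)
            neighbours.setdefault(a, set())    # keep the node even without partners
        else:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours), self_interacting


network, loops = build_network(read_pairs(SAMPLE.splitlines()))
```

Keeping self interactions in a separate set leaves the adjacency structure free of loops, which is usually what clustering algorithms expect.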

INTERACTOME MODULES

Module identification from interactomes

Several algorithms have been developed to identify highly interconnected groups of nodes within a network (5–10). These algorithms are mostly either agglomerative (‘bottom-up’) or divisive (‘top-down’). We selected the molecular complex detection (MCODE) method (6) and the Markov clustering (MCL) method (5) as representatives from each category (19), and applied them to the interactomes from each of the six species. We selected a range of parameter settings (Supplementary Table S1) to control the properties of the resulting modules, from small and densely interconnected (protein-complex-like), to large and loosely interconnected (pathway-like). From a total of 30 analysis runs, we detected approximately 2400 modules containing at least five proteins. Modules identified from each run were saved as a module collection in the PINA module database. End users can select which collection to use in the query or the enrichment analysis, depending on whether they are looking for protein-complex-like modules or pathway-like modules, with advice given on the PINA website. As the body of PPI data accrues over time, the interactomes will become more complete, and thus some modules identified from an interactome may change. To facilitate historical comparisons, we will timestamp each module collection and retain the last five releases.
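To make the clustering step more concrete, the following is a minimal, dense-matrix sketch of the Markov clustering idea: alternate expansion (squaring a column-stochastic matrix) and inflation (raising entries to a power and renormalizing) until the matrix stops changing, then read modules from the rows of attractor nodes. It is not the MCL program used here and would not scale to a full interactome; the function names, the default inflation value and the tolerances are assumptions chosen for illustration and are not the settings of Supplementary Table S1.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to one (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain matrix product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(n, edges, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph on nodes 0..n-1; returns a set of frozensets."""
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0                      # self-loops, as is customary for MCL
    for a, b in edges:
        matrix[a][b] = matrix[b][a] = 1.0
    matrix = normalise_columns(matrix)

    for _ in range(max_iter):
        expanded = multiply(matrix, matrix)     # expansion: two-step random walks
        inflated = normalise_columns(           # inflation: favour strong flows
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = inflated
        if change < tol:
            break

    clusters = set()
    for i in range(n):
        if matrix[i][i] > 1e-6:                 # node i is an attractor
            clusters.add(frozenset(j for j in range(n) if matrix[i][j] > 1e-6))
    return clusters                             # overlapping clusters are not merged


def keep_large_modules(clusters, min_size=5):
    """Keep only modules with at least min_size members."""
    return [c for c in clusters if len(c) >= min_size]


# Two disconnected triangles should come out as two separate modules.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
modules = markov_cluster(6, triangles)
```

A larger inflation value tends to yield smaller, tighter modules, which is one way a parameter range can span complex-like to pathway-like results.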

Module annotation and visualization

Following module identification, we annotated each module by looking for enriched terms from multiple functional databases including Gene Ontology (20), KEGG pathways (21), Pfam domains (22) and the chemical and genetic perturbations collection from MSigDB (23). Since modules often show strong functional coherence (24), the diverse set of annotations provides a complementary overview of module function. The back-end module annotation tool uses a hypergeometric test to identify the overrepresented terms, with a correction for multiple testing using the false discovery rate (25). For each module, we stored at least the 10 most significant terms, and any other significant terms (adjusted P-value < 0.05) in the PINA annotation database. Based on approximately 25.7 million comparisons, there are approximately 270,000 significant terms saved in the PINA annotation database.
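The statistical procedure described above can be sketched as follows. The back-end itself runs in R; this Python version is only an illustration. It assumes a one-sided (upper-tail) hypergeometric test and the Benjamini–Yekutieli adjustment of reference 25, which may differ in detail from the production code, and all function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term.

    Here n is the module size and k the number of module members with the term.
    """
    top = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return favourable / comb(N, n)


def benjamini_yekutieli(pvalues):
    """Return FDR-adjusted P-values (valid under dependency), in input order."""
    m = len(pvalues)
    harmonic = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # walk from largest to smallest P
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m * harmonic / rank)
        adjusted[index] = running_min
    return adjusted


def select_terms(adjusted_by_term, top_n=10, alpha=0.05):
    """Keep the top_n most significant terms plus any other term below alpha."""
    ranked = sorted(adjusted_by_term.items(), key=lambda item: item[1])
    return [term for rank, (term, p) in enumerate(ranked)
            if rank < top_n or p < alpha]
```

The selection rule mirrors the storage policy in the text: the ten best terms are always kept, so a module with no significant term still has an annotation summary.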

A thumbnail image is available for each module (Figure 1b), which offers a quick impression of the module's topology. Users can also launch our previously developed visualization tool to interactively visualize and manipulate the selected module. Since each module can be treated as a network, other existing PINA tools can be applied to filter and analyze the selected module, through web pages or the visualization tool.

Figure 1.
An example of the module enrichment result, showing the top two modules. (a) The top link is to the page showing the complete list of functional annotations, while the bottom link is to the page showing the list of protein interactions in the module. ...

Module query and enrichment analysis tool

There are two ways to make use of the interactome modules in PINA. Users can either perform a simple search to find modules that contain at least one of their query proteins, or use the newly developed module enrichment tool to identify statistically enriched modules. The module enrichment tool compares a list of proteins from a user query with all the modules in a specified collection, by using a hypergeometric test to identify modules that are overrepresented in query proteins relative to the background frequency in either the interactome or the whole proteome. Figure 1 shows the module enrichment result for a set of proteins that contain non-synonymous coding single nucleotide variations (SNVs) in two primary pancreatic adenocarcinoma tumors (APGI-1959 and APGI-1992) and one pancreatic cancer cell line (CRL-2557 Panc-05.04). These mutated genes (Supplementary Table S2) were detected by next-generation sequencing and downloaded from the International Cancer Genome Consortium (ICGC) (26) data portal (http://dcc.icgc.org; Pancreatic cancer AU project). The annotation summary indicates that the top module may play an important role in cancer through its influence on the cell's transcriptional machinery.
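The difference between the two query modes, and the effect of the chosen background, can be illustrated with a small sketch. It assumes that query and module proteins outside the background are ignored and that the test is one-sided; the function names and the toy protein identifiers are hypothetical and do not reflect the PINA implementation or its data.

```python
from math import comb


def simple_search(query, modules):
    """Names of modules that contain at least one query protein."""
    return sorted(name for name, members in modules.items() if members & query)


def enriched_modules(query, modules, background):
    """Rank modules by the hypergeometric P-value of their overlap with the query.

    background is the set of proteins counted as the universe, for example all
    proteins in the interactome or all proteins in the proteome.
    """
    query = query & background
    universe, drawn = len(background), len(query)
    results = []
    for name, members in modules.items():
        members = members & background
        size, overlap = len(members), len(members & query)
        tail = sum(comb(size, i) * comb(universe - size, drawn - i)
                   for i in range(overlap, min(size, drawn) + 1))
        results.append((tail / comb(universe, drawn), name, overlap))
    return sorted(results)                      # smallest P-value first


# Toy example: 10 proteins in the interactome, 20 in the proteome.
interactome = {f"p{i}" for i in range(10)}
proteome = {f"p{i}" for i in range(20)}
toy_modules = {"M1": {"p0", "p1", "p2"}, "M2": {"p3", "p4", "p5", "p6"}}
toy_query = {"p0", "p1", "p7"}

hits = simple_search(toy_query, toy_modules)
ranked_interactome = enriched_modules(toy_query, toy_modules, interactome)
ranked_proteome = enriched_modules(toy_query, toy_modules, proteome)
```

In this toy case the same overlap of two proteins is more significant against the larger proteome background than against the interactome, so the background should be chosen to match how the query list was generated.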

IMPROVED USER ACCESS

In the first version of PINA, protein annotations were fetched on the fly through the UniProt web service, which was slow for construction of a large PPI network consisting of hundreds of proteins. In PINA v2.0, we have saved the UniProt annotations into the PINA annotation database, which has significantly improved the query speed for large networks. The PINA web services were also updated for easier use and quicker response, by adopting a lightweight RESTful web service in place of the previously used SOAP service. In addition, PINA has been wrapped as a component of the Anduril framework (27), which is a component-based workflow framework for large-scale biological data analysis. The PINA component can be executed either standalone or as one step of a complex workflow analyzing high-throughput screening data, such as SNP, gene expression or exon microarray data; such a workflow can run from preprocessing and normalization of raw data through to functional annotation of the identified genes/proteins.

IMPLEMENTATION

AllegroMCODE 1.0 (http://www.allegroviva.com/allegromcode), which is a GPU-enabled Cytoscape 2.8.1 (28) plug-in for running MCODE (6), and MCL v10-201 (5) were used for identifying interactome modules. The specified parameters are listed in Supplementary Table S1. The output files of each tool were parsed using custom R scripts, and the functional enrichment analyses were performed on an SGE cluster using GOstats (29) and a custom extension to the Category package from the Bioconductor project v2.8, using R v2.13.1. The mappings from genes to KEGG, GO and Pfam were from the AnnotationDbi package, and the c2.cgp.v3.0.symbols.gmt gene set collection was from MSigDB (23). Module thumbnails were generated using igraph (30). The RESTful web services were implemented using the Java library Jersey, and example code for a Java client is available on the PINA web site.

FUTURE DIRECTION

In PINA v2.0, we seamlessly integrate interactome modules and the associated functional annotations with the existing PINA resource, including the PPI integration database and a set of network-based tools, providing significant new functionalities for researchers looking to analyze PPI data at a network level. We intend to continue this effort and plan to integrate built-in network alignment tools, which will allow the comparison of two networks either generated by user queries or selected from the interactome modules. In addition, another important model organism, Arabidopsis thaliana, will be added to PINA in the near future.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1 and 2.

FUNDING

Cancer Council New South Wales, Australia (grant SRP11-01, ICGC 09-01); National Health and Medical Research Council, Australia (grant 631701); Cancer Institute New South Wales, Australia (grant 10/CRF/1-01 to A.V.B); Academy of Finland (grant 125826); Avner Nahmani Pancreatic Cancer Foundation; R. T. Hall Trust. Funding for open access charge: Cancer Council New South Wales, Australia (grant ICGC 09-01).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Dr Warren Kaplan and Derrick Lin for their support with high performance computing infrastructure.

REFERENCES

1. Cusick ME, Klitgord N, Vidal M, Hill DE. Interactome: gateway into systems biology. Hum. Mol. Genet. 2005;14 Spec No. 2:R171–R181. [PubMed]
2. Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002;296:910–913. [PubMed]
3. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U. Network motifs: simple building blocks of complex networks. Science. 2002;298:824–827. [PubMed]
4. Rives AW, Galitski T. Modular organization of cellular networks. Proc. Natl Acad. Sci. USA. 2003;100:1128–1133. [PMC free article] [PubMed]
5. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. [PMC free article] [PubMed]
6. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2. [PMC free article] [PubMed]
7. Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22:1021–1023. [PubMed]
8. Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ. A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics. 2007;23:i577–i586. [PubMed]
9. Jiang P, Singh M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics. 2010;26:1105–1111. [PMC free article] [PubMed]
10. Rhrissorrakrai K, Gunsalus KC. MINE: module identification in networks. BMC Bioinformatics. 2011;12:192. [PMC free article] [PubMed]
11. Wu J, Vallenius T, Ovaska K, Westermarck J, Makela TP, Hautaniemi S. Integrated network analysis platform for protein-protein interactions. Nat. Methods. 2009;6:75–77. [PubMed]
12. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al. The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010;38:D525–D531. [PMC free article] [PubMed]
13. Ceol A, Chatr Aryamontri A, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res. 2010;38:D532–D539. [PMC free article] [PubMed]
14. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, Nixon J, Van Auken K, Wang X, Shi X, et al. The BioGRID interaction database: 2011 update. Nucleic Acids Res. 2011;39:D698–D704. [PMC free article] [PubMed]
15. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–D451. [PMC free article] [PubMed]
16. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database–2009 update. Nucleic Acids Res. 2009;37:D767–D772. [PMC free article] [PubMed]
17. Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006;34:D436–D441. [PMC free article] [PubMed]
18. Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod N, Bader GD, Xenarios I, Wojcik J, Sherman D, et al. Broadening the horizon–level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 2007;5:44. [PMC free article] [PubMed]
19. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7:488. [PMC free article] [PubMed]
20. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. [PMC free article] [PubMed]
21. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38:D355–D360. [PMC free article] [PubMed]
22. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222. [PMC free article] [PubMed]
23. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740. [PMC free article] [PubMed]
24. Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M. Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis. BMC Bioinformatics. 2011;12:203. [PMC free article] [PubMed]
25. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001;29:1165–1188.
26. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F, Eerola I, Gerhard DS, et al. International network of cancer genome projects. Nature. 2010;464:993–998. [PMC free article] [PubMed]
27. Ovaska K, Laakso M, Haapa-Paananen S, Louhimo R, Chen P, Aittomaki V, Valo E, Nunez-Fontarnau J, Rantanen V, Karinen S, et al. Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med. 2010;2:65. [PMC free article] [PubMed]
28. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011;27:431–432. [PMC free article] [PubMed]
29. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23:257–258. [PubMed]
30. Csardi G, Nepusz T. The igraph software package for complex network research. InterJ. Complex Syst. 2006;1695:1–9.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
