Copyright © 2009 Chen et al; licensee BioMed Central Ltd.

HAPPI: an online database of comprehensive human annotated and predicted protein interactions

1 School of Informatics, Indiana University – Purdue University, Indianapolis, IN, USA
2 Department of Computer & Information Science, Purdue University, Indianapolis, IN, USA
3 Indiana Center for Systems Biology and Personalized Medicine, Indianapolis, IN, USA
4 School of Life Sciences, Shandong University, PR China

Corresponding author. Jake Yue Chen: jakechen/at/iupui.edu; SudhaRani Mamidipalli: nsudhara/at/iupui.edu; Tianxiao Huan: huant/at/iupui.edu

Supplement: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). Youping Deng, Mary Qu Yang, Hamid R Arabnia, and Jack Y Yang. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM). http://www.biomedcentral.com/content/pdf/1471-2164-10-S1-info.pdf

Conference: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08), 14–17 July 2008, Las Vegas, NV, USA

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
Human protein-protein interaction (PPI) data are the foundation for understanding molecular signalling networks and the functional roles of biomolecules. Several human PPI databases have become available; however, comparisons of these datasets have suggested limited data coverage and poor data quality. Ongoing collection and integration of human PPIs from different sources, both experimental and computational, can enable disease-specific network biology modelling in translational bioinformatics studies.
Results
We developed a new web-based resource, the Human Annotated and Predicted Protein Interaction (HAPPI) database, located at http://bio.informatics.iupui.edu/HAPPI/. The HAPPI database was created by extracting and integrating publicly available protein interaction databases, including HPRD, BIND, MINT, STRING, and OPHID, using database integration techniques. We designed a unified entity-relationship data model to resolve semantic level differences of diverse concepts involved in PPI data integration. We applied a unified scoring model to give each PPI a measure of its reliability that can place each PPI at one of the five star rank levels from 1 to 5. We assessed the quality of PPIs contained in the new HAPPI database, using evolutionarily conserved co-expression pairs called "MetaGene" pairs to measure the extent of MetaGene pair and PPI pair overlaps. While the overall quality of the HAPPI database across all star ranks is comparable to the overall qualities of HPRD or IntNetDB, the subset of the HAPPI database with star ranks between 3 and 5 has a much higher average quality than all other human PPI databases. As of summer 2008, the database contains 142,956 non-redundant, medium to high-confidence level human protein interaction pairs among 10,592 human proteins. The HAPPI database web application also provides hyperlinked information of genes, pathways, protein domains, protein structure displays, and sequence feature maps for interactive exploration of PPI data in the database.

Conclusion
HAPPI is by far the most comprehensive public compilation of human protein interaction information. It enables its users to fully explore PPI data with quality measures and annotated information necessary for emerging network biology studies.
Background
Protein-protein interactions (PPIs) are an important foundation for understanding how biological processes take place in cells, how cellular signals are modulated, and how molecules act in concert in response to external environmental stimuli [1]. High-throughput projects that map protein-protein interactions in model organisms were first initiated less than a decade ago, including those for Saccharomyces cerevisiae (resulting in the detection of 957 putative interactions involving 1,004 proteins) [2], Drosophila melanogaster (20,405 interactions from 7,048 proteins), Caenorhabditis elegans (~5,500 interactions), and Mus musculus [3-5]. In 2003, Chen et al. first reported the generation of 13,656 high-throughput human protein interactions in homogenized human brain using a random yeast two-hybrid platform [6]; in 2005, Stelzl et al. identified 3,186 mostly novel interactions among 1,705 human proteins [7]; then, Rual et al. reported the mapping of ~2,800 proteins in a human protein-protein interaction network [8]; in 2007, Ewing et al. reported a large-scale study of protein-protein interactions in human cells using a mass spectrometry-based approach, producing a data set of 6,463 interactions among 2,235 distinct human proteins [9].

These high-throughput experimental determinations of PPIs have led to an influx of PPI experimental data. By early 2008, BioGrid reported a comprehensive collection of 198,000 protein and genetic interactions from major organisms, including S. cerevisiae, S. pombe, D. melanogaster, C. elegans, M. musculus, and H. sapiens [10]. However, the coverage of data directly captured from experimental platforms in human is still quite poor. In the most recent release 7 of the Human Protein Reference Database (HPRD) [11], there are only 38,167 protein interactions reported, an average of only 1.5 interactions for each of the 25,661 human proteins included in HPRD.
While it remains an open question how many measurable human protein interactions there are, the use of PPI data in building disease-relevant molecular interaction network models has already emerged as a major theme for "translational bioinformatics", studies that aim to facilitate the transformation of bioinformatics discoveries from "Omics" experiments into biomedical applications via bi-directional information exchange [12,13]. Recent research studies have shown that, by building comprehensive disease-relevant PPI sub-networks, researchers can generate and validate biological hypotheses that could lead to novel biomarkers or therapeutic developments for many complex diseases such as Huntington's disease, Alzheimer's disease, breast cancer, Fanconi anemia, and ovarian cancer [14-18]. These studies, however, were primarily based on available human PPIs in existing PPI database repositories with limited coverage and/or uncertain quality. It is expected that new comprehensive database collections of human PPIs, with expanded data coverage and quantifiable reliability measures, could significantly enhance the impact of future network modeling research.

Several human PPI databases have begun to expand experimental human PPI data coverage, which is bottlenecked by experimental data throughput and cost. There are four common approaches to PPI data expansion: 1) manual curation from the biomedical literature by experts; 2) automated PPI data extraction from biomedical literature with text mining methods; 3) computational inference based on interacting protein domains or co-regulation relationships, often derived from data in model organisms; and 4) data integration from various experimental or computational sources.
Partly due to the difficulty of evaluating the quality of PPI data, a majority of widely-used PPI databases, including DIP, BIND, MINT, HPRD, and IntAct [11,19-22], take a "conservative approach" to PPI data expansion by adding only manually curated interactions. Therefore, the coverage of the protein interactome developed using this approach is poor. In the second approach, literature mining, computer software replaces database curators to extract protein interaction (or association) data from large volumes of biomedical literature [23]. Due to the complexity of the natural language processing techniques involved, however, this approach often generates a large number of false-positive protein "associations" that are not truly biologically significant "interactions". The advantages of computational inference are attributable to the various biological models that can be used to expand data coverage. For example, the HPID database was developed from existing structural and experimental data by homology searching [24]; OPHID was constructed by mapping interacting proteins from model organisms to their human protein orthologs [25]. In an integrative approach, PPI data from different sources are evaluated and combined, thus providing maximal likelihood for quality and coverage. For example, the STRING database (version 7) [26] has now integrated known and predicted interactions from a variety of sources, and covers all domains of life (prokaryotes to higher eukaryotes). Xia et al. applied a probabilistic model and integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict human PPI networks [27]. UniHI and IntNetDB are both based on several major interaction maps derived by computational and experimental methods [27,28]. The challenge for the integrative approach is how to balance quality with coverage.
In particular, different databases may contain much redundant PPI information derived from the same sources, while the overlaps between independently derived PPI data sets are quite low [29,30].

In this work, we describe a new PPI web database resource, the Human Annotated and Predicted Protein Interaction (HAPPI) database, located at http://bio.informatics.iupui.edu/HAPPI/. As of early 2008, HAPPI (version 1.1) contains 142,956 non-redundant, medium to high-confidence human protein interaction pairs among 10,592 human proteins identified by UniProt protein names. The HAPPI database aims to become the most comprehensive public compilation of human protein interaction information. The protein interactions are integrated from multiple data sources including both experimental and computationally derived PPIs. Each protein interaction in HAPPI is assigned a PPI confidence grade of 1, 2, 3, 4, or 5 to help users evaluate the reliability and confidence of reported interactions. Each interaction is computationally annotated with information including biological pathways, gene functions, protein families, protein structures, sequence features, and literature sources. These database capabilities will enable both biomedical researchers and network biology users to evaluate the biological significance of specific protein interactions, from which they can build network models for future translational bioinformatics research.

Methods
Human protein interaction data were collected, extracted, and integrated from the HPRD [11], BIND [20], MINT [21], STRING [26], and OPHID [25] databases, using data warehousing techniques. These databases were chosen primarily because they are relatively complementary to each other and representative of PPIs derived from a variety of methods, including high-throughput experimental PPIs (from HPRD and BIND), literature-curated PPIs (from BIND), text-mined PPIs (from STRING), and computationally predicted PPIs (from STRING and OPHID).
An overview of the data integration process that involves several of these existing public-domain PPI databases is shown in Figure 1.
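As a rough illustration of what "non-redundant" integration involves, the sketch below merges interaction records from several sources by treating each protein pair as unordered and collecting the sources that report it. The record layout, the helper names `canonical_pair` and `merge_sources`, and the example records are hypothetical; the text does not describe the actual data warehouse procedures at this level of detail.

```python
from collections import defaultdict


def canonical_pair(a, b):
    """Return an order-independent key so (A, B) and (B, A) collapse together."""
    return (a, b) if a <= b else (b, a)


def merge_sources(records):
    """Merge (protein_a, protein_b, source) records into non-redundant pairs.

    Returns a dict mapping each canonical pair to the set of sources that
    report it. Hypothetical helper, not the HAPPI pipeline itself.
    """
    merged = defaultdict(set)
    for a, b, source in records:
        merged[canonical_pair(a, b)].add(source)
    return dict(merged)


# Illustrative records only; identifiers and source assignments are made up.
records = [
    ("P04637", "Q00987", "HPRD"),
    ("Q00987", "P04637", "BIND"),    # same pair, reversed order
    ("P04637", "P38398", "STRING"),
]
merged = merge_sources(records)
```

Keeping the set of reporting sources per pair, rather than discarding duplicates, preserves the information a scoring step would need when several sources support the same interaction.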
Data model
We represented the semantic relationships among different concepts involved in protein interactions as an Entity-Relationship (ER) data model, shown in Figure 2.
Interaction ranking model
We developed a unified scoring scheme to assess the reliability of integrated human protein-protein interactions from the public domain. First, an interaction scoring system for each individual data source is either preserved (e.g., adoption of the "combined_score" from STRING) or created (e.g., for OPHID). In the latter case, we assigned a heuristic confidence score Si (between 0 and 1) to each interaction pair, based on the type of its experimental/computational derivation method and the database source. Si provided an estimate of the reliability of, and hence user confidence in, the interaction data. Therefore, the more trustworthy the experimental or computational protocols were, the higher the confidence score (Si) was. Second, to combine the individual confidence scores from different sources into a final hscore for the interaction, we used the following formula:
where N represented the count of different data sources and conditions, for each of which an independent assessment of protein interaction reliability score, Si, exists. The hscore ranges in value between 0 and 1. Third, to convert hscore to ranks, we used a ranking method that works in principle by clustering the interactions with closely related hscore values for all interactions managed in the HAPPI database (see supplemental material for details). Then, a five-star ranking model was developed to set the cut-off thresholds at the hscore distribution cluster boundaries. The results are summarized in Table 1. Because the hscore values for both high-throughput experimental data (default is 0.75) and curated experimental data from BIND, HPRD, and MINT (default is 0.80) are at or above 0.75, we selected a combined score of hscore >= 0.75, or a final star rank of 4 or 5, as the minimum criterion for reporting interactions and their statistics for HAPPI. A complete initial scoring scheme to assess the reliability of human protein-protein interactions is shown in Additional file 1.
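The combination formula is not reproduced in this text. As a sketch only, the code below uses a noisy-OR rule, hscore = 1 - prod(1 - Si), which is one common way to combine independent confidence scores and is consistent with the stated properties (N independent scores Si in [0, 1], result in [0, 1]). This form is an assumption and may differ from the authors' formula. The cut-offs in `STAR_CUTOFFS` are likewise hypothetical placeholders, because the actual boundaries come from the clustering summarized in Table 1; only the 0.75 boundary for star rank 4 and above is taken from the text.

```python
import math
from bisect import bisect_right


def hscore(scores):
    """Combine per-source confidence scores S_i (each in [0, 1]) with a
    noisy-OR rule: 1 - prod(1 - S_i). Assumed form, for illustration only."""
    if not scores:
        raise ValueError("at least one source score is required")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    return 1.0 - math.prod(1.0 - s for s in scores)


# Hypothetical lower bounds for 2, 3, 4 and 5 stars. Only the 0.75 boundary
# (star rank 4 and above) is mentioned in the text.
STAR_CUTOFFS = [0.25, 0.50, 0.75, 0.90]


def star_rank(h):
    """Map an hscore to a star rank from 1 to 5 using the cut-offs above."""
    return 1 + bisect_right(STAR_CUTOFFS, h)
```

Under these assumptions, an interaction scored 0.75 by one source and 0.80 by another combines to about 0.95, so agreement between sources raises the rank above what either source alone would give.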
Data annotation
All interacting proteins in the HAPPI database were annotated with gene function, pathway, protein domain, protein structure, and sequence feature map data. The data were separately imported into the Oracle 10g data warehouse from the UniProt [32], GenBank [35], HUGO Nomenclature [36], Ensembl [33], PubMed [37], PDB [38], Pfam [34], and KEGG [39] databases. Altogether, we organized inside the data warehouse 70,829 curated human proteins and their descriptions, of which 13,601 proteins contain protein interaction information in the HAPPI database. We kept 361,975 literature abstract IDs where human gene/protein co-occurrence was detected by the STRING database, 52,186 protein domains/families from Pfam, 715 pathways from KEGG, 2,282 protein 3-D structures from PDB, and 76,797 annotated human gene features from GenBank. All the information was linked to the original source databases on the HAPPI web site, so that HAPPI users can navigate to database sources to determine the reliability of queried PPIs.

Quality assessment
In this study, we chose to apply evolutionarily conserved co-expression pairs to the assessment and comparison of PPI data quality for different sources, including the HAPPI database. High-quality conserved gene co-expression profiles were used to assess protein interaction quality. Many protein interaction data sets have been cross-validated with human gene co-expression profiles (e.g., [40]). While interacting proteins may share highly similar gene expression profiles, it has often been suggested that the expected correlation between protein interactions and gene expression is quite weak in human and in transient protein interactions. Furthermore, comprehensive expression profiles are difficult to compile for all cellular conditions.
To improve the development of a co-expression based confidence measure for interacting proteins, Tirosh and Barkai showed that a method using co-expression of orthologs of interacting partners performed quite well [41]. Their method was based on the assumption that a conserved co-expression relationship preserved true protein interactions that required the presence of both interacting proteins through evolution. Therefore, it is more sensitive overall than using information purely from the organism itself, e.g., simple co-expression, cellular co-localization, and similarity in the Gene Ontology functional annotations of genes. In a similar study, Bhardwaj and Lu also verified that reliable predictions of interactions from heterogeneous data sources could be strengthened by evolutionarily conserved gene co-expression measurements [42].

Our computational method was based on the degree of overlap between protein interactions and an evolutionarily conserved co-expressed gene data set called MetaGene. MetaGene consists of 22,163 evolutionarily conserved co-expression relationships from humans, flies, worms, and yeast, based on the analysis of over 3,182 published DNA microarray experiments by Stuart et al. [43]. It is a comprehensive compilation of evolutionarily conserved gene co-expression pairs from a diverse set of DNA microarray experiments that were obtained from four different organisms: 1,202 DNA microarrays from H. sapiens, 979 from C. elegans, 155 from D. melanogaster, and 643 from S. cerevisiae. The relative quality of each PPI database, including HAPPI, OPHID [25], IntNetDB [27], ProNet [44], UniHI [28], and HPRD [11], was estimated as the count of overlaps between protein interactions in the PPI database of interest and MetaGene conserved co-expressed gene pairs. The human subset of MetaGene data involves 6,591 human genes and 22,154 MetaGene co-expression gene pairs.
Of the 22,154 human MetaGene co-expression gene pairs, 6,297 can be found in the union (U0 set) of all the known human PPI databases, including HAPPI, OPHID, IntNetDB, ProNet, UniHI, and HPRD; furthermore, 6,145 of the 6,297 MetaGene pairs form a large connected MetaGene co-expression association network that showed the scale-free property commonly observed in most molecular interaction networks. Therefore, we regarded the 6,145 MetaGene pairs (M0 set) as the most relevant high-quality subset of U0, which could be used as a gold standard for evaluating unknown PPIs from large databases. To facilitate comparisons of overlaps for different databases with MetaGene, we also developed an artificially synthesized protein-protein "random interaction" set (R0 set) of 37,000 PPIs (comparable to the size of all PPIs in HPRD), by randomly reconnecting proteins observed in U0. Therefore, the lower bound of quality for any protein interaction data set derived from U0 could be given by counting the overlap between R0 and M0. To adapt to the different sizes of PPI databases, we took a random sample of 1,000 PPIs each time from each database in the comparison (including R0), and repeated this random sampling process 1,000 times to obtain a distribution of normalized overlap counts with M0.

Results
HAPPI was developed as a web-based PPI database application and is freely accessible to the public at http://bio.informatics.iupui.edu/HAPPI/. In the current release, HAPPI contains 13,601 proteins and 1,209,463 PPIs integrated from five databases collected with both experimental and computational methods as described in the previous section. Users of the HAPPI web application software can search for PPIs using common protein identifiers. Typical web query results display all HAPPI PPIs at a default quality grade (star rank 3 and above). Users can drill down to explore annotations of the protein interaction or proteins involved.
Assessing data quality
While there are several methods for validating PPI data, including those based on interacting domains, gene co-expression profiles, or gene ontology (GO) annotation semantic distances [42,45-49], we assessed the quality of the new HAPPI database by comparing the extent of overlap between PPIs and MetaGene pairs, using the new computational approach described earlier in the Methods section (Figures 3A and 3B). We also analyzed PPI overlaps between HAPPI database subsets of different quality grades and two reference PPI databases (Figure 4A).
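The sampling procedure described in the Methods (repeatedly drawing 1,000 PPIs from each database and counting overlaps with the M0 gold-standard pairs, with a randomly rewired R0 set as the lower bound) can be sketched as follows. The function names, parameters, and toy data are illustrative assumptions; the text does not specify how the comparison was implemented.

```python
import random


def canonical(pair):
    """Order-independent form of a protein pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def overlap_distribution(ppis, gold_pairs, sample_size=1000, repeats=1000, seed=0):
    """Return one overlap count per random sample drawn from `ppis`."""
    rng = random.Random(seed)
    gold = {canonical(p) for p in gold_pairs}
    pool = sorted({canonical(p) for p in ppis})
    k = min(sample_size, len(pool))  # small databases are sampled whole
    counts = []
    for _ in range(repeats):
        sample = rng.sample(pool, k)
        counts.append(sum(1 for p in sample if p in gold))
    return counts


def random_rewired_set(proteins, n_pairs, seed=0):
    """Build an R0-style set by pairing observed proteins at random."""
    rng = random.Random(seed)
    proteins = sorted(set(proteins))
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough proteins for the requested number of pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pairs.add(canonical((a, b)))
    return pairs


# Toy data, for illustration only.
proteins = [f"P{i}" for i in range(50)]
gold = {("P0", "P1"), ("P2", "P3")}
db = [("P1", "P0"), ("P2", "P3"), ("P4", "P5")]
counts = overlap_distribution(db, gold, sample_size=2, repeats=100)
r0 = random_rewired_set(proteins, 20)
```

Fixing the sample size across databases is what makes the overlap counts comparable: without it, a larger database would show more overlaps simply because it contains more pairs.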
Querying the database
HAPPI enables users to retrieve human PPI data through multiple types of protein identifiers, such as UniProt IDs, Swiss-Prot accession numbers, RefSeq IDs, or IPI accession numbers, at its query home page. Query results that contain protein interaction data and quality rank are shown in a single web page as a data table. The query result is available for download either in a Molecular Interaction (MI) format recommended by the Proteomics Standards Initiative (PSI) or in a Graph Markup Language (GML) format recommended by the International Molecular Exchange Consortium. Additional annotation details of the protein or protein interaction can be queried and retrieved online by selecting the hyperlinks in the protein interaction result page.

Viewing and exploration of results
HAPPI users can retrieve a list of protein interactions showing the following fields in a table: the query protein, a relationship symbol (currently implemented as bi-directional binding, represented as "<=>"), the data source of the interaction, and a confidence rating of 1 to 5 stars (Figure 5).
We created two interactive components in the protein interaction details page: one to explore interacting protein 3D structures and the other to explore interacting protein feature alignments (Figure 6A).
Conclusion
HAPPI is by far the most comprehensive public compilation of human protein interaction data that comes with a unified framework of interaction data reliability scores. In its current release, the HAPPI database contains 13,601 proteins and 1,209,463 PPIs integrated from several databases derived either experimentally or computationally. By comparing the degree of overlap between PPIs of varying quality grades and evolutionarily conserved co-expressed gene pairs, we assessed the quality of HAPPI. While the overall quality of HAPPI is comparable to that of the HPRD database, HAPPI PPIs with 3-5 star rank levels have a higher average quality than all other human PPI databases considered in this study, which include ProNet, UniHI, IntNetDB, OPHID, HPRD, and BioGrid.

For future HAPPI database releases, we have three plans. First, we wish to continue integrating and linking valuable annotation data into the HAPPI database. Protein interaction data from high-precision text mining projects could be used to improve the validation of high-quality protein interactions as "re-discovered" compared to the findings reported in past literature. Gene co-expression and Gene Ontology data are also candidates for the next data import, since both can help define the common functional context in which protein interactions may take place. Second, we plan to apply database customization techniques to improve the user querying experience with HAPPI. For example, we will add control buttons for users to customize interaction data quality filter thresholds, and to select a subset of retrieved protein interactions for downloading into spreadsheet programs. Third, we wish to improve existing PPI data investigation features. For example, we hope to run molecular docking programs and show computationally predicted protein binding constants and binding sites between two proteins.
We also plan to improve the interplay between the JMOL and Safmap Java Applets so that sequence segments highlighted in one program are also highlighted in the other. With these improvements, we expect the database to play essential roles for biomedical researchers to retrieve trustworthy information on plausible human protein interaction data and for bioinformatics scientists to conduct network biology modeling studies.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
JYC conceived the initial idea, designed the method for the database construction, and drafted the manuscript. SM implemented the design, developed the database from integrated data sets, and implemented the web-based database interface. TH performed database comparisons and evaluations of the database. All authors were involved in the revisions of the manuscript.

Additional file 1
A unified scoring model to assess the reliability of human protein-protein interactions integrated from public protein interaction databases.

Acknowledgements
The HAPPI database was developed in part with research funding from the Research and Sponsored Programs of Indiana University – Purdue University Indianapolis awarded to Dr. Jake Chen. We thank Stephanie Burks of the University Information Technology and Services at Indiana University for providing generous support in Oracle 10g database administration, Jason Sisk from Indiana University School of Informatics for configuring the Web server for the project, Dr. Sudipto Saha from Indiana University School of Informatics for helping improve the web application user interface and the initial draft of the manuscript, and Basil George for assisting in the development of viewing PDB structures in the web interface. We are particularly grateful for the generous and timely help from Michael Grobe of Indiana University in proofreading the manuscript before it went to press.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.