Database (Oxford). 2010; 2010: baq023.
Published online Oct 11, 2010. doi:  10.1093/database/baq023
PMCID: PMC2963317

iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence

Abstract

We present iRefWeb, a web interface to protein interaction data consolidated from 10 public databases: BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MINT, MPact, MPPI and OPHID. iRefWeb enables users to examine aggregated interactions for a protein of interest, and presents various statistical summaries of the data across databases, such as the number of organism-specific interactions, proteins and cited publications. Through links to source databases and supporting evidence, researchers may gauge the reliability of an interaction using simple criteria, such as the detection methods, the scale of the study (high- or low-throughput) or the number of cited publications. Furthermore, iRefWeb compares the information extracted from the same publication by different databases, and offers means to follow up on possible inconsistencies. We provide an overview of the consolidated protein–protein interaction landscape and show how it can be automatically cropped to aid the generation of meaningful organism-specific interactomes. iRefWeb can be accessed at: http://wodaklab.org/iRefWeb.

Database URL: http://wodaklab.org/iRefWeb/

Introduction

Most cellular processes are carried out by groups of physically interacting proteins, or complexes (1, 2) and anomalies in protein interactions often lead to disease phenotypes (3). The experimental detection of protein–protein interactions (PPIs) has therefore become a major focus of research in molecular biology with promising applications in medicine (4, 5).

Thanks to recent technological advances, the detection of PPIs can be performed on the genome scale, with individual studies generating vast amounts of data on both interactions and multi-protein complexes. But such high-throughput studies are still limited to a few model organisms including yeast (6–9), fly (10) and worm (11), and more recently bacteria (12) and human (13–15). The same advances in experimental techniques have also fueled a proliferation of hypothesis-driven low-throughput studies, with results reported in a fast expanding body of scientific literature.

Recognizing the importance of keeping systematic records of the proliferating PPI data, various databases have been created for curating and archiving these data and making them available to the scientific community (16–26). These databases represent independent annotation efforts based on a range of research interests, resulting in complementary as well as redundant information.

Thus, anyone wishing to retrieve information on PPIs and complexes for a particular organism of interest has a choice between several databases. But most often, obtaining an up-to-date description of the full compendium of the PPIs in an organism—its interactome—requires the consolidation of PPI records from multiple databases.

A major factor facilitating consolidation has been the adoption of the Proteomics Standards Initiative—Molecular Interaction (PSI-MI) format (27) and the related IMEx initiative (28). The more uniform representation of PPI data, which was afforded by adhering to these standards, laid the foundation for several recent efforts that aggregate information from multiple PPI databases and present a unified data collection to the user (29–33).

Due to the endemic problems of cross-referencing genes and proteins across biological databases (34), as well as to other more specific issues related to PPI literature curation and to the accuracy of the curated data itself (35–37), researchers should have access to key information about the aggregated data. They need to readily verify how each PPI record was consolidated, or which databases contributed to a given record. Furthermore, it is crucial to know the techniques that were used to detect an interaction, because different techniques probe different kinds of interactions at varying levels of accuracy.

Ideally, one would want to have a reliability score associated with each consolidated interaction. However, deriving such scores on an objective basis remains a major challenge (38–40), especially for literature-curated PPIs. Attempts made so far involve ad hoc heuristic scoring schemes. Some take into account various aspects of the supporting evidence, including the detection method and the scale of the study (low- or high-throughput) (41). More elaborate scores incorporate quality measures based on extraneous data such as gene expression, co-occurrence in the same cellular pathways, paralogy relationships and domain composition (25, 42). But these measures, and the methods for combining them into a single score, tend to vary with the authors and the organism considered, which reflects the inherent difficulty of generalizing this approach.

Meanwhile, obtaining answers to some simple questions should be very helpful in gauging the reliability of a PPI record. For example: How was the interaction detected? Was the interaction cited by more than one original publication? When the same publication has been curated by different databases, are the curations consistent with one another? If not, which of the databases reflects the published report more closely?

To help address these and similar questions we created iRefWeb (http://wodaklab.org/iRefWeb), a web interface to the latest build of the Interaction Reference Index (iRefIndex) repository (32). This latest build consolidates interaction records from 10 different databases: BIND (16), BioGRID (17), CORUM (18), DIP (26), IntAct (19), HPRD (20), MINT (21), MPact (22), MPPI (23) and OPHID (24). For completeness, we consolidated both the standard BIND distribution available as tab-delimited text files, and BIND Translation, a set of interactions from the BIND archives recently recast into PSI-MI 2.5 XML format (see ‘Materials and methods’ section for details).

The underlying iRefIndex data aggregation is a rigorously documented procedure that not only combines equivalent protein identifiers from multiple databases but also maps different protein splice isoforms of the same gene to their canonical representation. This allows it to effectively combine records that use different protein representations to support the same protein interaction or multi-protein complex. Furthermore, iRefIndex enables backtracking of the links used to establish the identity of all interacting proteins to their original source database records.

Thanks to this consolidation process, iRefWeb affords a global view of the consolidated data, and enables the exploration of the known interaction partners for a protein of interest, regardless of the database(s) that contain the original information. In addition, it offers several innovative features. First, easy means are provided for examining the publications cited for each interaction record. Annotated interactions contributed by individual databases can be compared, highlighting any possible discrepancies between them. Secondly, iRefWeb allows the retrieval of consolidated interactions that match various user-defined criteria, such as the number of supporting publications or low- versus high-throughput studies. Options to filter by PSI-MI vocabulary terms such as ‘interaction type’ or ‘interaction detection method’ are also provided. The former is intended to describe the nature of the association between the proteins, for instance if it is a physical association, or a phenotypic association (43–45), whereas the latter informs on the actual experimental methods used to detect an association.

Here, we present an overview of the consolidated PPI landscape available through iRefWeb, and describe how the resource can be used to document this landscape along the lines described above. We also illustrate the automatic retrieval of organism-specific interactomes with a specified level of support and discuss the current limitations of such retrieval.

Results

The consolidated information

We consolidated PPI annotations from the 10 public databases listed above, which curate predominantly physical PPIs. The Interaction Reference Index method (32) (iRefIndex; http://irefindex.uio.no) was used to consolidate the data, while also mapping all proteins to their canonical isoforms whenever possible (see ‘Materials and methods’ section). The consolidated dataset provides a thorough coverage of the existing PPI data, and establishes the basis for building customized interactomes for a wide variety of organisms.

The latest version of this dataset (version 7.0) comprises a total of 404 384 distinct interactions, derived by consolidating 1 119 604 original records from the source databases. The iRefIndex consolidation process involved the identification of original records that contained only proteins (or genes) as interactors; the mapping of such entities and interactions into the same representation system; the elimination of redundant representations; and further consolidation of splice variants through protein isoform canonicalization [see ‘Materials and methods’ section and (32)].

The original interactions were detected by a broad variety of experimental techniques that probe different types of interactions. For example, binary physical interactions are identified by various yeast two-hybrid (Y2H) screens (7, 8), protein-fragment complementation assays (46) or by biochemical and structural analyses. On the other hand, groups of proteins that physically associate with one another—often referred to as protein complexes—may be detected via a spectrum of purification methods, which include immunoprecipitation (47) and affinity purification coupled with mass spectrometry (6, 9). Other methods such as fluorescent tagging identify proteins that co-localize to the same cellular compartment (48). The consolidated dataset also includes the so-called genetic interactions, which are curated by some of the databases. These interactions are not physical in nature, but represent unexpected phenotype alterations produced by the deletion or mutation of one gene in the background of a mutation (or deletion) of another gene (43–45).

The information on the interaction type and detection method is captured by the source databases using the PSI-MI (27) controlled vocabulary, and associated with each interaction record. The iRefIndex/iRefWeb system aggregates this information as part of the consolidation process and makes it available. This often involves resolving ambiguities in the captured descriptions, likely resulting from different interpretations of the PSI-MI specifications (see ‘Materials and methods’ section).

The annotation of protein complexes often differs across databases. Some databases record complexes as groups of proteins, whereas others use the so-called spoke expansion, which represents complexes as sets of binary interactions between a designated ‘bait’ protein and all other proteins in the complex (49). The latter case may be distinguished from experimentally detected binary interactions by examining the PSI-MI interaction type: binary interactions derived from complexes are usually annotated as ‘physical association’ (rather than ‘direct interaction’).
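The spoke expansion described above can be illustrated with a minimal sketch. The function name and the placeholder protein labels are assumptions made for illustration only; they are not part of iRefIndex or of any source database.

```python
from typing import List, Tuple


def spoke_expansion(bait: str, complex_members: List[str]) -> List[Tuple[str, str]]:
    """Represent a complex as binary (bait, prey) pairs, one per non-bait member."""
    return [(bait, prey) for prey in complex_members if prey != bait]


# Hypothetical four-subunit complex with subunit "A" designated as the bait.
example_complex = ["A", "B", "C", "D"]
example_pairs = spoke_expansion("A", example_complex)
print(example_pairs)  # [('A', 'B'), ('A', 'C'), ('A', 'D')]
```

Note that a complex of n proteins yields only n - 1 binary records under this model, and that no prey-prey pair is recorded, which is why such records cannot be read as evidence of direct binary contact.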

The PPI landscape

Here we focus on the landscape of all types of experimentally detected physical associations between proteins, comprising direct binary interactions and participation in the same complex. The consolidated dataset was therefore filtered to exclude genetic interactions (see ‘Materials and methods’ section), as well as interactions predicted on the basis of computational methods recorded by the OPHID database (24). Interactions of proteins with nucleic acids and small molecules curated by BIND were not consolidated.

Following the above filtering, the aggregated dataset of physical PPIs comprises 263 479 distinct interactions involving 66 701 proteins, mapping to 1448 different organism taxonomy identifiers. iRefWeb offers an extensive set of visual quantitative summaries of this landscape in its Statistics page, as highlighted in Figure 1. For example, it provides overviews of the number of interactions and proteins contributed by each source database, as well as the number of interactions and proteins that are unique to a given database.

Figure 1.
Summaries of the relative contribution of each database to the consolidated set. (a) Contribution of physical PPIs by different databases. The number of interactions that do not appear in any other database (i.e. unique contribution) is represented in ...

Most major databases record interactions in different organisms such as human, mouse, yeast, fly and worm. But organism coverage varies among databases. Some focus entirely on interactions in human (HPRD), in mammalian organisms (CORUM, MPPI) or yeast (MPact). Organism-specific summaries produce overviews of this information as illustrated in Figure 2 and Tables 1 and 2. These include the number of interactions and proteins for a given organism in the full consolidated dataset, the total and unique contribution of individual databases to these data, and the data shared between pairs of databases.

Figure 2.
Organism-specific summaries for the consolidated PPI landscape. The number of publications, interactions and proteins in the consolidated dataset annotated in different organisms (colored bars). The data for specific organisms are sorted by the number ...
Table 1.
Interactions contributed by individual databases
Table 2.
Proteins contributed by individual databases

The breakdown by organism shows that the majority (59%) of the consolidated physical interactions are from the yeast Saccharomyces cerevisiae (30%) and human (29%); 13% are from fly (Drosophila melanogaster) and 7% are from various strains of Escherichia coli; whereas those from the worm Caenorhabditis elegans, mouse and rat each represent <4%. Interactions from over 1400 additional organisms (mostly microbes and plants) collectively make up the remaining ~12%.

Closer analysis of the data reveals that although the number of unique interactions contributed by individual databases may span a wide range (Table 1), assembling a complete set of PPIs for a given organism requires data consolidation from all the databases. For example, almost half (~45%) of all the consolidated human PPIs represent unique interactions contributed by HPRD, IntAct and BioGRID. However, an additional 12% of the human PPIs are unique interactions contributed by the remaining six databases. The remaining 43% of human PPIs are each contributed by two or more databases. The same situation occurs for the yeast S. cerevisiae, where BioGRID and IntAct contribute the lion’s share of the unique interactions. However, the remaining databases typically list thousands of unique interactions each, representing a valuable complement. IntAct, DIP, MINT and BIND contribute a significant number of interactions in various additional organisms grouped under the category ‘other’ in Figure 2.
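The notion of a unique contribution used above can be made concrete with a short sketch. It assumes that each database has already been reduced to a set of consolidated interaction keys; the database labels and keys below are placeholders, not real data.

```python
from collections import Counter
from typing import Dict, Set


def unique_contributions(db_interactions: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, for each database, the interaction keys found in no other database."""
    # Each database holds a set, so a key is counted at most once per database.
    occurrences = Counter(key for keys in db_interactions.values() for key in keys)
    return {
        db: {key for key in keys if occurrences[key] == 1}
        for db, keys in db_interactions.items()
    }


# Placeholder data: three hypothetical databases sharing interaction "i2".
toy_databases = {
    "DB1": {"i1", "i2"},
    "DB2": {"i2", "i3"},
    "DB3": {"i2"},
}
toy_unique = unique_contributions(toy_databases)
print({db: sorted(keys) for db, keys in toy_unique.items()})
```

In this toy case DB1 and DB2 each contribute one unique interaction, whereas DB3 contributes none, since its only record is shared.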

Not unexpectedly, the different databases share a very large fraction of their organism-specific proteins, especially in human and yeast (Table 2). The sharing of proteins is much more limited, however, in organisms such as E. coli, mouse and rat, where interactions have been less extensively studied. For these organisms in particular, consolidating PPIs from multiple databases affords much better coverage of the proteins whose interactions have been reported.

Interrogation of the supporting evidence

The wealth of information contained in the consolidated iRefIndex data can be explored interactively via the iRefWeb interface. This interface provides multiple and flexible views of the data, including the composition of binary interactions and multi-subunit complexes, the identity of the interacting proteins, their many aliases, the organisms and the experimental methods used to detect the interactions. It also provides a graphical display of the interaction neighborhood for any annotated protein as well as the details on the consolidation of the source database records, with links to the original annotation records (Figure 3). The data may be searched for particular combinations of genes, proteins, PubMed IDs or by any string query, e.g. ‘chromatin cancer’.

Figure 3.
The detailed graphical view of an interaction record in iRefWeb. The interaction summary of the Rev1 protein [REV1 homolog (S. cerevisiae)] in mouse is returned by iRefWeb Search (a), and is expanded to reveal a graphical representation of its interaction ...

A series of tools are provided for analyzing the rich supporting evidence consolidated for each interaction record, which help to assess the reliability of an interaction. This is best illustrated by the following simple questions that researchers can address using the resource.

How many publications reported a given interaction?

It has been pointed out that PPIs identified in several different publications are in general more likely to be biologically relevant. Requiring that a PPI be supported by several publications has been a common approach for scoring PPI data in public databases (25, 38), as it is easy to interpret and the bias toward well-studied interactions is immediately evident. iRefWeb displays the number of supporting publications (NP) as well as their PubMed IDs for each consolidated interaction (Figure 3). It also provides links to the original annotation records in the source databases for easy verification. Furthermore, the iRefWeb Search option allows filtering the data using a range of attributes. Figure 4 illustrates such filtering to retrieve physical PPIs in the yeast S. cerevisiae, where the user may instantly see how many of the retrieved interactions are supported by one, two, or more publications.

Figure 4.
Filtering interactions on the basis of the supporting evidence. Portion of an iRefWeb Search page is shown, with different panels corresponding to filtering options based on the supporting evidence. Each panel displays the different attribute values or ...

Was the interaction detected in any low-throughput studies?

It is generally believed that interactions detected in carefully crafted low-throughput studies are more accurate than those detected in large-scale analyses (50), although this assertion has been recently challenged (39, 40, 51). To enable identification of PPIs supported by either type of study, each consolidated interaction is assigned a so-called Lowest PubMed Re-use (LPR) metric (32). The LPR metric is defined as the lowest number of PPIs reported by any of the publications that cite the considered interaction. For instance, when an interaction is curated from both a low-throughput study detecting only three PPIs and a high-throughput study reporting over a thousand PPIs, its LPR = 3. The Search page displays the interaction counts for several LPR values and allows users to restrict their search to a particular LPR range (Figure 4). The LPR metric may be used to rank interactions and to derive interaction subsets on the basis of this rank, as done in other consolidation efforts (25) or databases (21).
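The LPR definition can be sketched in a few lines. This is an illustration of the definition given above, not the iRefIndex implementation; the input layout (pairs of publication identifier and interaction identifier) and all identifiers are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def lowest_pubmed_reuse(citations: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Compute LPR per interaction from (publication_id, interaction_id) pairs."""
    interactions_per_pub = defaultdict(set)
    pubs_per_interaction = defaultdict(set)
    for pub_id, interaction_id in citations:
        interactions_per_pub[pub_id].add(interaction_id)
        pubs_per_interaction[interaction_id].add(pub_id)
    # LPR = smallest number of interactions reported by any citing publication.
    return {
        interaction_id: min(len(interactions_per_pub[p]) for p in pubs)
        for interaction_id, pubs in pubs_per_interaction.items()
    }


# Placeholder data mirroring the example in the text: a low-throughput paper
# with three interactions and a high-throughput paper with over a thousand.
low = [("pub_low", f"i{k}") for k in (1, 2, 3)]
high = [("pub_high", f"i{k}") for k in [1] + list(range(4, 1005))]
lpr = lowest_pubmed_reuse(low + high)
print(lpr["i1"])  # 3
```

The number of supporting publications (NP) for an interaction is simply the size of the corresponding set in `pubs_per_interaction`, so both metrics can be derived from the same pass over the data.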

Is the information extracted from the same publication consistent across databases?

The iRefWeb PubMed Detail feature enables in-depth comparison of the information extracted by different databases for each publication that supports a given PPI (Figure 5). Analysis of this information for all the publications that were curated by more than one database revealed that differences between the original curations of the same publication are rather frequent (Turinsky, A.L. et al., 2010, Database, in press). While such differences can be attributed to many factors, they do in some cases point out inherent difficulties in interpreting the published information (Turinsky, A.L. et al., 2010, Database, in press). iRefWeb enables the user to directly consult the original publications in order to determine the possible origins of the detected differences. One can also use the PubMed Reports feature to identify differences in the curated data from many publications at once, or for an entire source database.

Figure 5.
Divergent annotations of a yeast complex by five databases. The iRefWeb PubMed Detail summary displays the different annotations of the same paper (PMID 9210376), which describes a six-subunit actin-related complex in yeast. Each line indicates the presence/absence ...

Extracting meaningful interactome descriptions

The filtering capabilities of iRefWeb (Figure 4) can be readily exploited to extract organism-specific interactomes from the consolidated data subject to specified constraints. For instance, to derive the PPI network for the yeast S. cerevisiae, the first step is to activate the search filters ‘Saccharomyces cerevisiae’ and ‘Physical interaction’, and to include only interactions between proteins from the same organism (Figure 4). This query filters out any genetic interactions and returns 70 182 distinct interaction records in the consolidated dataset that involve exclusively S. cerevisiae proteins. Based on the current state of knowledge about the S. cerevisiae proteome, however, one may surmise that this rather large number probably includes a sizable fraction of low-confidence interactions that may not be biologically relevant.

To limit the number of potentially spurious interactions, the user can apply additional filters to select only interactions supported by two or more studies (‘Number of Supporting PubMeds’ panel). These filters can be further combined with the selection of PPIs reported in either high- or low-throughput studies using the LPR criterion (‘Lowest PubMed Re-use’ panel), and interactions detected by specific methods, e.g. tandem affinity purification, affinity chromatography, etc. (‘Detection Type’ panel). The descriptions of the interaction-detection method and interaction type are based on the corresponding PSI-MI controlled vocabulary terms (27).
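The combination of filters described above can be mimicked offline on a downloaded interaction set. The sketch below is only an illustration: the record type, its field names and the LPR cut-off are hypothetical, and it does not reproduce the iRefWeb interface or the PSI-MITAB column layout.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Interaction:
    # Hypothetical record layout, chosen for illustration.
    interaction_id: str
    num_pubmeds: int          # number of supporting publications (NP)
    lpr: int                  # Lowest PubMed Re-use
    methods: FrozenSet[str]   # detection method terms


def filter_interactions(
    records: Iterable[Interaction],
    min_pubmeds: int = 2,
    max_lpr: Optional[int] = None,
    methods: Optional[FrozenSet[str]] = None,
) -> List[Interaction]:
    """Keep records meeting every active criterion; None disables a criterion."""
    kept = []
    for record in records:
        if record.num_pubmeds < min_pubmeds:
            continue
        if max_lpr is not None and record.lpr > max_lpr:
            continue
        if methods is not None and not (record.methods & methods):
            continue
        kept.append(record)
    return kept


# Placeholder records; the cut-off max_lpr=20 is an arbitrary assumption.
sample_records = [
    Interaction("i1", 3, 2, frozenset({"tandem affinity purification"})),
    Interaction("i2", 1, 2, frozenset({"two hybrid"})),
    Interaction("i3", 2, 900, frozenset({"affinity chromatography"})),
]
selected = filter_interactions(sample_records, min_pubmeds=2, max_lpr=20)
print([r.interaction_id for r in selected])  # ['i1']
```

Here 'i2' is dropped for having a single supporting publication and 'i3' because every publication citing it is large-scale under the assumed cut-off.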

At any time, the detailed records of all the retrieved PPIs may be visually inspected for features of interest, and the entire retrieved collection may be downloaded in PSI-MITAB data format, using the ‘Download Interactome’ option.

However, such automated data extraction is only the first step in building a high-confidence interactome. To complete the task, further manual re-curation of the data is necessary. The most obvious cases warranting re-curation are those in which the number of interactions archived by a source database either significantly exceeds that reported in the cited publication or is close to the number of reported low-confidence interactions. Table 3 lists several examples where, depending on the choice of the database that annotated the same high-throughput yeast study, an unsuspecting user may retrieve substantially different sets of interactions. Although most of the differences are minor, no two databases record the same number of interactions for a given study, and none matches the number described by the authors of the original publication. Prominent discrepancies are typically a result of a decision to curate the high-confidence filtered subset of interactions versus the full unfiltered set; or the decision to record additional data from the authors’ supplementary materials and resources; or the failure to annotate the interaction types properly in the downloadable data distribution (e.g. missing quality attributes or missing PSI-MI codes to indicate genetic interactions). In such cases, the original records from individual databases may have to be manually flagged and excluded entirely from the retrieved PPI set. iRefWeb greatly facilitates this manual process, by helping both researchers and curators identify publications for which the PPI counts differ significantly between the annotating databases, or for which the interactions recorded by one database are not supported by any other.
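One simple way to shortlist publications for such manual verification is to compare the per-database interaction counts for each publication. The sketch below is a hypothetical heuristic, not the method used by iRefWeb; the ratio threshold, the function name and the placeholder identifiers and counts are all assumptions.

```python
from typing import Dict, List


def flag_discrepant_publications(
    counts: Dict[str, Dict[str, int]], min_ratio: float = 2.0
) -> List[str]:
    """Flag publications whose largest per-database count is at least
    min_ratio times the smallest (or whose smallest count is zero)."""
    flagged = []
    for pub_id, per_db in counts.items():
        if len(per_db) < 2:
            continue  # a single curation cannot be compared with anything
        lowest, highest = min(per_db.values()), max(per_db.values())
        if lowest == 0 or highest / lowest >= min_ratio:
            flagged.append(pub_id)
    return sorted(flagged)


# Placeholder counts; these are not taken from Table 3.
toy_counts = {
    "pub_A": {"DB1": 1000, "DB2": 4200},
    "pub_B": {"DB1": 50, "DB2": 52},
    "pub_C": {"DB1": 10},
}
to_review = flag_discrepant_publications(toy_counts)
print(to_review)  # ['pub_A']
```

A count ratio only signals that the curations differ in scale; deciding which curation reflects the publication more closely still requires consulting the original paper.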

Table 3.
Examples of high-throughput annotations that require manual verification

Discussion

We believe that the iRefIndex/iRefWeb system represents a significant step forward in integrating information on protein interactions from public databases, and enabling researchers to seamlessly interrogate this information.

The versatile iRefWeb search filters enable the retrieval of organism-specific interactomes from the consolidated data, subject to specified criteria formulated on the basis of the supporting evidence. These interactomes can be pruned to reduce the number of low confidence interactions likely to be spurious, for example, by requiring that a retrieved interaction be supported by two or more publications, of which at least one publication is a low-throughput study. Further automated filtering options on the basis of the interaction type (physical association, direct interaction, covalent binding, etc.) and the experimental detection method are also offered. But these filters are unfortunately not as reliable as one would want them to be, because the expected information is often either missing or not properly mapped onto the corresponding terms of the PSI-MI ontology, by the source database.

Another issue is the representation of multi-protein complexes and associations across the source databases. iRefWeb allows filtering interactions by specifying a threshold (say, three or more) for the ‘Number of Interacting Proteins’ in the consolidated interaction record. However, this criterion will not result in the retrieval of all multi-protein complexes in the consolidated dataset, because some databases annotate a multi-subunit complex as a set of binary interactions using the so-called spoke expansion (Figure 5). To include these cases as well, filtering on the interaction type (such as ‘physical association’) should be applied in addition.

Many of these problems can be traced back to legacy data curated prior to the existence of the PSI-MI standard, or the IMEx consortium, and most should be resolved, as more of the source databases adhere to the agreed upon standards and unify their annotation practices and policies. In particular, it would be very useful if the databases applied identical policies for the annotation of low confidence raw PPI data made available by some studies (7, 55), which ideally should be flagged as such. Furthermore, MIMIx (minimum information required for reporting a molecular interaction experiment) guidelines were recently proposed to facilitate the standardized description of interaction data in public databases (56).

In the meantime, some level of manual re-curation is needed to retrieve interactomes that are biologically relevant. The capabilities offered by iRefWeb, notably the various automated options to filter out interactions likely to be spurious, greatly increase the efficiency of this process. Ultimately, however, such filtering should rely on more quantitative scoring schemes that are specific for distinct experimental methods (40) and can be generalized across different organisms.

A very important, and so far unique, feature in iRefWeb, is that it gives users the ability to readily compare how different databases interpret the same published information, and in case of clear differences, to verify these interpretations directly by examining the original publication.

Two rather typical examples of such differences are illustrated in Figure 6. One example highlights the discrepancies on both the organism and the number of interactions recorded by three databases from the same study by Ito et al. (57). The second example illustrates the disagreements across three databases on the proteins recorded from the article by Enenkel et al. (58).

Figure 6.
Examples of citation differences. Each of these examples can be viewed by querying the PubMed tab of the iRefWeb interface using the PubMed identifier (PMID). (a) Discrepancies in both organisms and the number of interactions recorded from PMID 11483497 ...

A systematic analysis of such differences yields unique insights into the challenges of curating the PPI literature (Turinsky, A.L. et al., 2010, Database, in press). The ability to query the original information at the level of individual publications should also be valuable to both the consumers of the PPI data and to database curators wishing to prioritize or validate their curation efforts.

Finally, we have shown how the rich graphical and numerical summaries of the consolidated data provide a valuable snapshot of the known PPI landscape across different organisms and databases. Analysis of this information reveals that most databases contribute a significant number of unique PPIs (often in the thousands for well-studied organisms such as human or yeast), which makes data consolidation a necessity.

Materials and Methods

Databases

The following versions of the source databases were used in this study: BIND (including the standard BIND distribution dated 25 May 2005 and BIND Translation dated 8 January 2010), BioGRID (31 January 2010, Version 2.0.61), CORUM (02 December 2009), DIP (30 December 2009), HPRD (06 July 2009, Release 8), IntAct (22 January 2010), MINT (11 November 2009), MPact (10 January 2008), MPPI (06 January 2004) and OPHID (18 July 2006). The corresponding 7.0 release of the iRefIndex PSI-MITAB files is available at http://irefindex.uio.no. The BIND Translation files are a pre-release version of archived BIND records recently recast into PSI-MI 2.5 XML format by Gary Bader (http://baderlab.org/BINDTranslation) and kindly made available to this consolidation effort. Compared to BIND, they contain many additional annotation details, for example, references to 5346 additional publications, of which 1982 support PPIs in the fruit fly D. melanogaster and 1742 support human PPIs. Overall they contribute 2225 unique interactions to iRefWeb, including 1088 human PPIs. The BIND Translation files will be publicly available in future releases of iRefIndex (see http://irefindex.uio.no/wiki/Sources_iRefIndex_7.0).

Data consolidation

Data consolidation was performed using the iRefIndex procedure [http://irefindex.uio.no, (32)]. This procedure collects PPI annotations from the source databases in PSI-MI format (27), in which genes and proteins may be specified using a variety of systems (NCBI Entrez Gene or RefSeq, UniProt). It then assigns identical keys to PPI records from multiple sources if they all represent the same interaction involving identical protein partners. Proteins are considered identical if their identifiers refer to the exact same amino acid sequence from the same organism.

Briefly, for each protein referred to by the source database, a protein sequence is retrieved, and assigned a hash code called a SEGUID by using the Secure Hash Algorithm (SHA-1). The protein is then given a unique key called a ROGID (redundant object group identifier), consisting of a concatenation of the SEGUID and the NCBI taxonomy ID. Each ‘interaction’ is also assigned a unique ID by ordering and concatenating the keys for the protein interactors and then creating a new SHA-1 key for the resulting string. Records with identical keys are defined as a redundant group. Two interaction records have identical keys if they refer to the same set of identical protein sequences and taxonomy identifiers.
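The key construction outlined above can be sketched as follows. The SEGUID step follows the published SEGUID convention (SHA-1 of the upper-cased sequence, base64-encoded, with trailing padding removed). The exact string encoding that iRefIndex uses for the interaction key is not stated here, so the base64 step in `interaction_key` is an assumption, as are the function names and the toy sequences.

```python
import base64
import hashlib
from typing import Iterable


def seguid(sequence: str) -> str:
    """SHA-1 digest of the upper-cased sequence, base64-encoded without '=' padding."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def rogid(sequence: str, taxonomy_id: int) -> str:
    """Protein key: SEGUID concatenated with the NCBI taxonomy identifier."""
    return seguid(sequence) + str(taxonomy_id)


def interaction_key(rogids: Iterable[str]) -> str:
    """Order the protein keys, concatenate them and hash the resulting string.
    The base64 encoding of the digest is an assumption made for this sketch."""
    joined = "".join(sorted(rogids))
    digest = hashlib.sha1(joined.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Toy sequences (not real proteins); 9606 is the NCBI taxonomy ID for human.
protein_a = rogid("MKTAYIAKQR", 9606)
protein_b = rogid("MSDNELLQKW", 9606)
key_ab = interaction_key([protein_a, protein_b])
key_ba = interaction_key([protein_b, protein_a])
print(key_ab == key_ba)  # True
```

Sorting before concatenation makes the key independent of the order in which a source database lists the interactors, and including the taxonomy identifier keeps identical sequences from different organisms distinct.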

Mapping proteins to canonical isoforms and genes

An additional isoform consolidation step, recently introduced into the iRefIndex procedure (since Version 6), maps every protein to the canonical splice isoform of the corresponding gene whenever possible (see http://irefindex.uio.no/wiki/Canonicalization). This mapping was performed because it is not uncommon that a particular isoform is annotated as the interacting protein, even when the interaction is not specific to that isoform. This additional step enables further consolidation and more reliable comparison of the data across the source databases. It involved the following procedure.

EntrezGene records are associated with a list of protein products (as defined by their corresponding ROGIDs described above). EntrezGene identifiers were clustered into related gene groups (RGGs) if they share at least one identical protein product. As a result, each RGG has an initial list of distinct protein products encoded by at least one of its member genes and represented by a set of RefSeq protein records. This initial list was expanded to include (i) distinct proteins from UniProt that are isoforms related to one of the proteins already in this list and/or (ii) UniProt proteins that cross-reference one of the EntrezGene identifiers in the RGG. From this expanded list of proteins, one specific protein was chosen as the canonical isoform for the entire list. If one of the proteins was an annotated canonical UniProt sequence (see http://www.uniprot.org/faq/30), then it was chosen as the canonical form. If two or more such proteins were annotated, the one with the longest sequence was chosen. If no canonical UniProt sequences existed, the longest protein sequence associated with the RGG was chosen.

All ROGIDs (interactors) were mapped to canonical ROGIDs in this manner. The net effect of the process was to minimize the number of canonical proteins by utilizing information from both UniProt and EntrezGene. Much of this reduction occurs for interactions and proteins from human and other mammals.
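The two core steps of this procedure, grouping genes that share a protein product and picking one canonical protein per group, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the iRefIndex implementation. The Protein record, the function names, and the toy identifiers are hypothetical. Treating gene grouping as transitive (connected components) is an assumption, and the text does not say how ties in sequence length are broken, so here the first candidate encountered wins. The expansion of the protein list with UniProt isoforms and cross-references is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Protein:
    rogid: str               # protein key
    length: int              # sequence length in residues
    uniprot_canonical: bool  # annotated as a canonical UniProt sequence


def cluster_genes(gene_products: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group gene IDs that share at least one protein product (by ROGID)."""
    parent = {gene: gene for gene in gene_products}

    def find(gene: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    first_gene_for_product: Dict[str, str] = {}
    for gene, products in gene_products.items():
        for product in products:
            # Merge this gene with the first gene seen for the same product.
            other = first_gene_for_product.setdefault(product, gene)
            parent[find(gene)] = find(other)

    groups: Dict[str, Set[str]] = {}
    for gene in gene_products:
        groups.setdefault(find(gene), set()).add(gene)
    return list(groups.values())


def choose_canonical(proteins: List[Protein]) -> Protein:
    """Prefer annotated canonical UniProt sequences, then the longest sequence."""
    annotated = [p for p in proteins if p.uniprot_canonical]
    candidates = annotated or proteins
    return max(candidates, key=lambda p: p.length)
```

With gene_products = {"g1": {"A", "B"}, "g2": {"B"}, "g3": {"C"}}, the genes g1 and g2 end up in one group because they share product B, while g3 forms its own group. In choose_canonical, a shorter protein annotated as canonical is preferred over a longer unannotated one, and sequence length decides only among annotated candidates or when none is annotated.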

It should be noted, however, that although our procedure maps the different splice isoforms to a single group, the original information curated by the different databases is completely preserved and can be directly queried for each consolidated record. This is a reasonable compromise until new data on the effect of splice isoforms on detected interactions become available, and standards are derived by PSI-MI for recording the information. In the meantime, individual databases follow their own policies in this regard, with some like IntAct using isoform-specific UniProt accessions as opposed to canonical accessions to annotate interactors. Links to the original records curated by the different databases ensure that this information is scrupulously passed on to the user.

Filtering on the basis of PSI-MI interaction attributes

Several interaction-attribute filters were enabled by processing the standard terms of the PSI-MI ontology (27), recorded in the source database annotations. Most relevant to this report is the processing of terms in the Interaction Type and Interaction Detection Type categories. The former describe the nature of the association between the proteins, whereas the latter specify the experimental method used to detect this type of association.

Interactions were considered ‘genetic’ (representing phenotypic alterations) if their ‘interaction type’ in the PSI-MI XML 2.5 record was described by the Molecular Interaction Ontology term MI:0208 ‘genetic interaction’ (http://www.ebi.ac.uk/ontology-lookup) or by any of its child terms. Such interactions were omitted from this analysis and may be filtered out interactively using the iRefWeb site search capabilities. Interactions of all types nevertheless remain available in iRefIndex.
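The test for a ‘genetic’ interaction amounts to checking whether a term is MI:0208 or one of its descendants in the ontology. The sketch below shows one way to do this, assuming the ontology has been loaded into a mapping from each term to its parent terms. Only MI:0208 comes from the text. The function names, the record layout, and the EX: identifiers in the usage note are hypothetical placeholders, not real PSI-MI terms.

```python
from typing import Dict, Iterable, List

GENETIC_INTERACTION = "MI:0208"  # 'genetic interaction', as named in the text


def is_genetic(term_id: str, parents_of: Dict[str, List[str]]) -> bool:
    """True if term_id is MI:0208 or has MI:0208 among its ancestors."""
    stack = [term_id]
    seen = set()
    while stack:
        current = stack.pop()
        if current == GENETIC_INTERACTION:
            return True
        if current in seen:
            continue  # already explored; also guards against cycles
        seen.add(current)
        stack.extend(parents_of.get(current, []))
    return False


def drop_genetic(records: Iterable[dict], parents_of: Dict[str, List[str]]) -> List[dict]:
    """Keep only records whose 'interaction_type' is not a genetic term."""
    return [r for r in records if not is_genetic(r["interaction_type"], parents_of)]
```

As a usage example with a toy hierarchy, take parents_of = {"EX:0001": ["MI:0208"], "EX:0002": ["EX:0001"], "EX:0100": []}. Records typed MI:0208, EX:0001 or EX:0002 are then dropped, and a record typed EX:0100 is kept. Terms may have several parents in the ontology, which is why the mapping holds a list of parents per term.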

In general, whenever an MI term identifier was not listed but an interaction-type term was provided, the term was manually mapped to the closest MI term. Often MI terms for the interaction-detection method were (inappropriately) listed instead of those for the interaction type, in which case they were mapped to the interaction type expected for that detection method. These mappings are available at http://donaldson.uio.no/wiki/Mapping_of_terms_to_MI_term_ids_-_iRefIndex_6.0.

Interaction types in the HPRD source database were not processed, because they are systematically described in a non-standard way as in vivo or in vitro. It was assumed that all HPRD records describe physical interactions.

iRefWeb design and architecture

iRefWeb is implemented using mainly open source software tools (Figure 7). Its technology stack consists of three major components: (i) a MySQL relational database (http://mysql.com/) for persistent data storage; (ii) the Apache Solr enterprise search server (http://lucene.apache.org/solr/), which wraps the Lucene Java search library (http://lucene.apache.org/); and (iii) a standard MVC (model, view, controller) web layer implemented using the Grails web application framework (http://grails.org/). The Grails web layer provides Grails’ object relational mapping (GORM), and is built on top of the Spring platform for enterprise Java applications (http://www.springsource.org/) and the Hibernate library for mapping an object-oriented domain model to a traditional relational database (http://www.hibernate.org/).

Figure 7.
The iRefWeb architecture. The iRefWeb architecture comprises a MySQL relational database, a Solr enterprise search server and a web layer implemented using Grails web application framework. The Grails web layer provides GORM, and is built on top of Spring ...

The decision to use the Solr search layer was motivated by the fact that MySQL, although robust and versatile, is a relational database not originally designed for full-text or faceted search. Solr provides a convenient and easy way to index the interaction data, as well as fast and focused retrieval of search results across search terms and interaction evidence filters (facets). The Grails framework utilizing Spring and Hibernate gave us all the advantages of a full J2EE application but without the typical code and configuration bloat, since Grails supports the ‘convention over configuration’ software design paradigm in which only the unconventional aspects of the application need to be specified. Furthermore, since Grails is built on Groovy, an agile and dynamic language for the Java Virtual Machine (http://groovy.codehaus.org/), it provided us with a rapid path from prototype to production.
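To illustrate what a faceted search against such an index might look like, the sketch below assembles a Solr select request from a search term and a set of evidence filters. The request parameters q, fq, rows, wt, facet and facet.field are standard Solr parameters. Everything else is an assumption made for illustration: the base URL, the field names organism and detection_method, and the function name build_facet_query do not come from the iRefWeb schema or code.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_facet_query(base_url: str, text: str, filters: Dict[str, str],
                      facet_fields: List[str], rows: int = 10) -> str:
    """Build a Solr select URL with filter queries and facet counts."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    for field, value in filters.items():
        # Each filter narrows the result set without affecting relevance ranking.
        params.append(("fq", f'{field}:"{value}"'))
    for field in facet_fields:
        # Request per-value counts for this field alongside the results.
        params.append(("facet.field", field))
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/select",
    "TP53",
    {"organism": "Homo sapiens"},
    ["detection_method"],
)
```

The response to such a request would contain both the matching documents and, for each requested facet field, the number of matches per field value. A relational database would need separate grouped queries to produce the same counts.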

Author Contribution

B.T. designed and implemented iRefWeb. S.R. performed iRefIndex data consolidation. A.L.T. analyzed the consolidated landscape. J.V. investigated interactome retrieval. E.K.C., E.C. and K.M. contributed to the implementation of iRefWeb. I.D. supervised the iRefIndex project. S.J.W. supervised the iRefWeb project. A.L.T. and S.J.W. drafted the manuscript.

Funding

Canadian Institutes of Health Research (MOP #82940); the SickKids Foundation; the Ontario Research Fund. S.J.W. is Canada Research Chair, Tier 1. Funding for open access charge: Canadian Institutes of Health Research (MOP #82940).

Conflict of interest statement. None declared.

Acknowledgements

The authors wish to thank Sandra Orchard for helpful suggestions regarding several iRefWeb features.

References

1. Alberts B. The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell. 1998;92:291–294. [PubMed]
2. Gavin AC, Superti-Furga G. Protein complexes and proteome organization from yeast to man. Curr. Opin. Chem. Biol. 2003;7:21–27. [PubMed]
3. Oti M, Brunner HG. The modular nature of genetic diseases. Clin. Genet. 2007;71:1–11. [PubMed]
4. Lim J, Hao T, Shaw C, et al. A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell. 2006;125:801–814. [PubMed]
5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc. Natl Acad. Sci. USA. 2007;104:8685–8690. [PMC free article] [PubMed]
6. Gavin AC, Aloy P, Grandi P, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440:631–636. [PubMed]
7. Ito T, Chiba T, Ozawa R, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA. 2001;98:4569–4574. [PMC free article] [PubMed]
8. Uetz P, Giot L, Cagney G, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000;403:623–627. [PubMed]
9. Krogan NJ, Cagney G, Yu H, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440:637–643. [PubMed]
10. Giot L, Bader JS, Brouwer C, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302:1727–1736. [PubMed]
11. Li S, Armstrong CM, Bertin N, et al. A map of the interactome network of the metazoan C. elegans. Science. 2004;303:540–543. [PMC free article] [PubMed]
12. Butland G, Peregrin-Alvarez JM, Li J, et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 2005;433:531–537. [PubMed]
13. Ewing RM, Chu P, Elisma F, et al. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 2007;3:89. [PMC free article] [PubMed]
14. Rual JF, Venkatesan K, Hao T, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437:1173–1178. [PubMed]
15. Stelzl U, Worm U, Lalowski M, et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005;122:957–968. [PubMed]
16. Bader GD, Donaldson I, Wolting C, et al. BIND–The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001;29:242–245. [PMC free article] [PubMed]
17. Stark C, Breitkreutz BJ, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–D539. [PMC free article] [PubMed]
18. Ruepp A, Brauner B, Dunger-Kaltenbach I, et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008;36:D646–D650. [PMC free article] [PubMed]
19. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004;32:D452–D455. [PMC free article] [PubMed]
20. Peri S, Navarro JD, Amanchy R, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13:2363–2371. [PMC free article] [PubMed]
21. Chatr-aryamontri A, Ceol A, Palazzi LM, et al. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35:D572–D574. [PMC free article] [PubMed]
22. Guldener U, Munsterkotter M, Oesterheld M, et al. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006;34:D436–D441. [PMC free article] [PubMed]
23. Pagel P, Kovac S, Oesterheld M, et al. The MIPS mammalian protein-protein interaction database. Bioinformatics. 2005;21:832–834. [PubMed]
24. Brown KR, Jurisica I. Online predicted human interaction database. Bioinformatics. 2005;21:2076–2082. [PubMed]
25. Jensen LJ, Kuhn M, Stark M, et al. STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009;37:D412–D416. [PMC free article] [PubMed]
26. Salwinski L, Miller CS, Smith AJ, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–D451. [PMC free article] [PubMed]
27. Kerrien S, Orchard S, Montecchi-Palazzi L, et al. Broadening the horizon–level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 2007;5:44. [PMC free article] [PubMed]
28. Orchard S, Kerrien S, Jones P, et al. Submit your interaction data the IMEx way: a step by step guide to trouble-free deposition. Proteomics. 2007;7(Suppl. 1):28–34. [PubMed]
29. Prieto C, De Las Rivas J. APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids Res. 2006;34:W298–W302. [PMC free article] [PubMed]
30. Tarcea VG, Weymouth T, Ade A, et al. Michigan molecular interactions r2: from interacting proteins to pathways. Nucleic Acids Res. 2009;37:D642–D646. [PMC free article] [PubMed]
31. Kamburov A, Wierling C, Lehrach H, et al. ConsensusPathDB–a database for integrating human functional interaction networks. Nucleic Acids Res. 2009;37:D623–D628. [PMC free article] [PubMed]
32. Razick S, Magklaras G, Donaldson IM. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 2008;9:405. [PMC free article] [PubMed]
33. Chaurasia G, Malhotra S, Russ J, et al. UniHI 4: new tools for query, analysis and visualization of the human protein-protein interactome. Nucleic Acids Res. 2009;37:D657–D660. [PMC free article] [PubMed]
34. Cote RG, Jones P, Martens L, et al. The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics. 2007;8:401. [PMC free article] [PubMed]
35. Cusick ME, Yu H, Smolyar A, et al. Literature-curated protein interaction datasets. Nat. Methods. 2009;6:39–46. [PMC free article] [PubMed]
36. Salwinski L, Licata L, Winter A, et al. Recurated protein interaction datasets. Nat. Methods. 2009;6:860–861. [PubMed]
37. Cusick ME, Yu H, Smolyar A, et al. Addendum: Literature-curated protein interaction datasets. Nat. Methods. 2009;6:934–935. [PMC free article] [PubMed]
38. Suthram S, Shlomi T, Ruppin E, et al. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics. 2006;7:360. [PMC free article] [PubMed]
39. Braun P, Tasan M, Dreze M, et al. An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods. 2009;6:91–97. [PMC free article] [PubMed]
40. Yu J, Finley RL., Jr Combining multiple positive training sets to generate confidence scores for protein-protein interactions. Bioinformatics. 2009;25:105–111. [PMC free article] [PubMed]
41. Ceol A, Chatr Aryamontri A, Licata L, et al. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res. 2010;38:D532–D539. [PMC free article] [PubMed]
42. Chen JY, Mamidipalli S, Huan T. HAPPI: an online database of comprehensive human annotated and predicted protein interactions. BMC Genomics. 2009;10(Suppl. 1):S16. [PMC free article] [PubMed]
43. Tong AH, Lesage G, Bader GD, et al. Global mapping of the yeast genetic interaction network. Science. 2004;303:808–813. [PubMed]
44. Lehner B, Crombie C, Tischler J, et al. Systematic mapping of genetic interactions in Caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat. Genet. 2006;38:896–903. [PubMed]
45. Collins SR, Miller KM, Maas NL, et al. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature. 2007;446:806–810. [PubMed]
46. Remy I, Michnick SW. Clonal selection and in vivo quantitation of protein interactions with protein-fragment complementation assays. Proc. Natl Acad. Sci. USA. 1999;96:5394–5399. [PMC free article] [PubMed]
47. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiol. Rev. 1995;59:94–123. [PMC free article] [PubMed]
48. Huh WK, Falvo JV, Gerke LC, et al. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. [PubMed]
49. Bader GD, Hogue CW. Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol. 2002;20:991–997. [PubMed]
50. Chen Y, Xu D. Computational analyses of high-throughput protein-protein interaction data. Curr. Protein Pept. Sci. 2003;4:159–181. [PubMed]
51. Collins SR, Kemmeren P, Zhao XC, et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell Proteomics. 2007;6:439–450. [PubMed]
52. Lopes CT, Franz M, Kazi F, et al. Cytoscape Web: an interactive web-based network browser. Bioinformatics. 2010;26:2347–2348. [PMC free article] [PubMed]
53. Tarassov K, Messier V, Landry CR, et al. An in vivo map of the yeast protein interactome. Science. 2008;320:1465–1470. [PubMed]
54. Tong AH, Evangelista M, Parsons AB, et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 2001;294:2364–2368. [PubMed]
55. Miller JP, Lo RS, Ben-Hur A, et al. Large-scale identification of yeast integral membrane protein interactions. Proc. Natl Acad. Sci. USA. 2005;102:12123–12128. [PMC free article] [PubMed]
56. Orchard S, Salwinski L, Kerrien S, et al. The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat. Biotechnol. 2007;25:894–898. [PubMed]
57. Ito T, Matsui Y, Ago T, et al. Novel modular domain PB1 recognizes PC motif to mediate functional protein-protein interactions. EMBO J. 2001;20:3938–3946. [PMC free article] [PubMed]
58. Enenkel C, Blobel G, Rexach M. Identification of a yeast karyopherin heterodimer that targets import substrate to mammalian nuclear pore complexes. J. Biol. Chem. 1995;270:16499–16502. [PubMed]
59. Belanger KD, Kenna MA, Wei S, et al. Genetic and physical interactions between Srp1p and nuclear pore complex proteins Nup1p and Nup2p. J. Cell Biol. 1994;126:619–630. [PMC free article] [PubMed]

Articles from Database: The Journal of Biological Databases and Curation are provided here courtesy of Oxford University Press