PLoS Comput Biol. Sep 2008; 4(9): e1000165.
Published online Sep 26, 2008. doi:  10.1371/journal.pcbi.1000165
PMCID: PMC2527685

A Genomewide Functional Network for the Laboratory Mouse

Andrey Rzhetsky, Editor

Abstract

Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.

Author Summary

Functionally related proteins interact in diverse ways to carry out biological processes, and each protein often participates in multiple pathways. Proteins are therefore organized into a complex network through which different functions of the cell are carried out. An accurate description of such a network is invaluable to our understanding of both the system-level features of a cell and those of an individual biological process. In this study, we used a probabilistic model to combine information from diverse genome-scale studies as well as individual investigations to generate a global functional network for mouse. Our analysis of the global topology of this network reveals biologically relevant systems-level characteristics of the mouse proteome, including conservation of functional neighborhoods and network features characteristic of known disease genes and key transcriptional regulators. We have made this network publicly available for search and dynamic exploration by researchers in the community. Our Web interface enables users to easily generate hypotheses regarding potential functional roles of uncharacterized proteins, investigate possible links between their proteins of interest and disease, and identify new players in specific biological processes.

Introduction

Establishing a functional network is invaluable to furthering our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. The availability of diverse genome-scale data enables the prediction of networks encompassing all or at least most of the proteins in an organism. In Saccharomyces cerevisiae, probabilistic models have been used to predict the genomewide protein–protein functional interactions by integrating diverse data types [1]–[6]. Such probabilistic approaches have also been used in mammals to predict physical interactions [7],[8] and to generate expression networks [9]–[13]. In human, functional relationship networks have also been generated by integrating diverse interaction data [14]. However, it is still challenging to predict functional relationships through integrating diverse genomic data in mammalian model systems, due to the intrinsic complexity of these genomes and functional biases in individual datasets. Yet recent accumulation of both traditional targeted experiments, including protein physical interactions [15]–[17], gene-disease/phenotypic associations [18] and genome-scale data including gene expression and tissue localization [19]–[21], phylogenetic and phenotypic profiles [22],[23], as well as data retrieved based on homology [2],[24] provides the basis for establishing a global functional relationship network in the laboratory mouse [25].

We describe here a functional network in mouse generated by integrating a wide range of data types. In contrast to interactomes that represent physical interactions, our functional network predicts the probability that two proteins are involved in the same biological process and thus represents a more comprehensive combination of physical, genetic and regulatory linkages (Figure 1A). We demonstrate the utility of our network to predict gene functions and pathway components by both computational and experimental approaches. Further, we demonstrate how it can be used to further our understanding of the systems-level features of the mouse functional network. Our global functional network for the laboratory mouse is a valuable resource for analysis and annotation of the mouse proteome and can be used as a means of generating biological hypotheses for subsequent experimental validation, especially through the interactive public web interface available at http://mouseNET.princeton.edu.

Figure 1
Strategy for processing and integration of diverse genomic data.

Results

A Probabilistic Model To Predict Functional Relationships by Integrating Diverse Data Types

Bayesian networks have been used successfully for integrating diverse data sources in many biological settings, including protein function prediction [3],[6], prediction of genetic interactions [26], physical interactions [4],[7] and most relevant to this work, prediction of functional networks in S. cerevisiae [2],[5],[6] and human [14]. The Bayesian approach is especially well-suited to our problem, where many genome-scale data have missing values and collections of individual investigations may not be a complete representation of genome profiles. Based on a Bayesian framework, we designed a method that combines redundant datasets, processes continuous data, minimizes over-fitting and finally, integrates all experimental evidence (Table 1) in a confidence-based manner to estimate the genomewide pair-wise probabilities of functional linkage (Figure 1A). The resulting mouse interactome includes 20,581 genes, with edges representing the probability of functional relationship between each pair (Figure 1B). As demonstrated below, creation of this functional network through integrating diverse data sources can facilitate identification of novel pathway components and represents a powerful resource for understanding genetic diseases and network evolution.
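To make the idea of confidence-based integration concrete, the following is a minimal sketch of a naive Bayes combination of evidence for one gene pair, in which datasets with no observation for the pair are simply skipped. The prior, the likelihood tables, and the dataset names are hypothetical placeholders chosen for illustration; they are not the values or the exact network structure used in this study.

```python
from math import prod

def posterior_linkage(evidence, likelihoods, prior=0.01):
    """Posterior probability that a gene pair is functionally linked.

    evidence:    {dataset: observed bin, or None when the pair is missing}
    likelihoods: {dataset: {bin: (P(bin | linked), P(bin | unlinked))}}
    prior:       assumed prior probability of linkage (hypothetical value)
    """
    # A missing observation contributes a likelihood ratio of 1 (no evidence).
    ratios = [
        likelihoods[d][b][0] / likelihoods[d][b][1]
        for d, b in evidence.items()
        if b is not None
    ]
    odds = prior / (1.0 - prior) * prod(ratios)
    return odds / (1.0 + odds)

# Hypothetical likelihood tables for two binned evidence sources.
likelihoods = {
    "physical_interaction": {1: (0.20, 0.001), 0: (0.80, 0.999)},
    "coexpression": {"high": (0.30, 0.05), "low": (0.70, 0.95)},
}

# This pair has a reported physical interaction but no expression data.
pair = {"physical_interaction": 1, "coexpression": None}
print(round(posterior_linkage(pair, likelihoods), 3))
```

Treating missing data as uninformative is one simple way to handle incomplete genome-scale datasets; the naive independence assumption is also why redundant datasets need to be combined before integration.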

Table 1
Data sources used for functional interactome integration.

MouseNET Recovers Functional Relationships

A key application of a functional network prediction is to uncover novel pathway components. We first evaluated the accuracy of our predicted network through cross-validation analysis on known functional linkages (co-annotations of proteins to specific Gene Ontology [27] terms), which is the standard for unbiased computational evaluation. In short, cross-validation can be used to assess the accuracy of predictions by evaluating the system's accuracy in recovering subsets of known annotations withheld during the training process. Our integrated network is substantially more successful in predicting known functional linkages than any of the individual datasets, making more correct predictions (demonstrating higher precision) at every confidence cutoff (Figure 2A). This result is robust to using a different annotation standard, i.e., co-annotation to the same Kyoto Encyclopedia of Genes and Genomes [28] (KEGG) pathways (Figure 2B). Notably, although the relative performance of datasets varies with different standards, the consistently good performance of our results suggests that the integrated predictions are robust to variations in the annotation standard.
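The evaluation described above can be sketched as a precision calculation over held-out gene pairs at several confidence cutoffs. The function name, gene identifiers, and scores below are hypothetical, and the sketch assumes the held-out positives are available as a set of co-annotated pairs.

```python
def precision_at_cutoffs(predictions, positives, cutoffs):
    """Precision of predicted gene pairs at each confidence cutoff.

    predictions: {(gene_a, gene_b): confidence} for held-out pairs
    positives:   set of held-out pairs known to be co-annotated
    """
    result = {}
    for cutoff in cutoffs:
        called = [pair for pair, conf in predictions.items() if conf >= cutoff]
        if not called:
            result[cutoff] = None  # no predictions at this cutoff
            continue
        true_pos = sum(1 for pair in called if pair in positives)
        result[cutoff] = true_pos / len(called)
    return result

# Toy held-out pairs (hypothetical identifiers and scores).
predictions = {
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.7,
    ("g2", "g4"): 0.4,
    ("g3", "g4"): 0.2,
}
positives = {("g1", "g2"), ("g2", "g4")}
print(precision_at_cutoffs(predictions, positives, [0.3, 0.6, 0.8]))
```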

Figure 2
Computational performance analysis of the integrated network to predict functional relationships and the relative performance of different datasets.

A common pitfall of many global integration schemes is the tendency to make precise predictions over only a limited set of biological processes [29]. Thus we evaluated the functional composition of our integrated results using KEGG, which is an accurate representation of our current knowledge of different pathways. The integrated network exhibits a balanced representation of a large group of pathways, even though many individual datasets have significant functional biases (Figure S1; the complete statistics of this functional composition analysis are included in Dataset S1). For instance, the protein–protein interaction data obtained from the Biomolecular Interaction Network Database (BIND) [15] are significantly skewed towards the processes of focal adhesion. In contrast, given the broad functional coverage of the integrated network, we expect our approach will be useful in further characterization of a variety of pathways.

MouseNET Predicts Novel Pathway Components and Gene Functions

The high accuracy in predicting co-annotation to KEGG pathways (Figure 2B) by our network and its broad functional coverage (Figure S1) suggest that mouseNET can accurately capture pathway-based functional linkages for a variety of processes. We thus focused specifically on the predicted functional network for the major conserved signaling pathways related to development, including Hedgehog, Wnt, MAPK, TGF-β, Notch, and Toll-like receptor signaling pathways. We find that in addition to recovering known pathway components (Figure S2), these networks include a number of proteins not previously annotated to the pathway. Many of these novel predictions have reasonable experimental support in the literature. For example, in the 40 most tightly connected nodes surrounding known MAPK pathway proteins (Figure 3), 14 of them are annotated as the canonical pathway components in KEGG (p<10^-10, hypergeometric distribution). Furthermore, two of the other nodes (Kit, MGI:96677 and Shh, MGI:98297) are not annotated to the MAPK pathway in KEGG but are annotated in the Gene Ontology [27] to be MAPK-related. Another nine unannotated predictions in the cluster of 40 have been suggested in literature to be involved in the MAPK pathway (Table S2 and Text S1). Thus, our system not only recovers well-established knowledge but also implicates novel pathway components, and therefore could be a powerful tool for generating hypotheses for experimental approaches.
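As an illustration of the hypergeometric enrichment test mentioned above, the sketch below computes the upper-tail probability of seeing at least k annotated genes among n selected genes. The counts 14 and 40 and the network size of 20,581 genes come from the text; the number of genes annotated to the pathway (K = 250) is a hypothetical placeholder, since the background used in the analysis is not stated here.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K are annotated."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# 14 of the 40 neighbors are annotated (from the text);
# K = 250 annotated genes is a hypothetical pathway size.
p = hypergeom_tail(k=14, n=40, K=250, N=20581)
print(f"p = {p:.3g}")
```

Exact integer arithmetic with `math.comb` avoids the underflow that a naive floating-point product would hit for tail probabilities this small.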

Figure 3
Analysis of MAPK pathway predictions based on the integrated functional network.

Our genomewide prediction of protein function based on the integrated network produced 689 novel annotations with an estimated 80% precision. A subset of these new predictions was evaluated through examination of the literature by MGD curators and the precision estimate was confirmed (Dataset S2). Of these, 17 predictions were confirmed based on literature evidence at the level sufficient for annotation in MGI, and another six were found to have some support in the literature, but at a level not yet sufficient for GO annotation. For example, Retn (MGI:1888506), which does not have a GO biological process or KEGG pathway annotation, was predicted with high confidence (over 0.8) to be involved in glucose homeostasis (GO:0042593). The loss of Retn was indeed found to improve glucose homeostasis in leptin deficiency [30], confirming the prediction. This evaluation demonstrates that through integrating information from diverse sources, the system is capable of making accurate novel predictions on genes not previously annotated in GO or KEGG.

Experimental Validation by Nanog Down-Regulation Induced Cell Differentiation

To further validate novel functional relationships predicted by our integrative network, we investigated proteins predicted to cluster around the homeobox transcription factor Nanog (MGI:1919200), which is an essential gene responsible for maintaining embryonic cell fate. Specifically, we experimentally down-regulated the expression of Nanog, and observed the nuclear protein expression changes of the top functional interactors in our predicted network by mass spectrometry. Five of the top 10 Nanog interactors predicted by mouseNET (Figure 4A) were detected in the nuclei and thus, we could evaluate their expression following Nanog down-regulation. We observed that after Nanog down-regulation, expression levels of four of them either significantly increased (DNA (cytosine-5-)-methyltransferase 3-like, Dnmt3l, MGI:1859287 and DNA methyltransferase 3B, Dnmt3b, MGI:1261819) or decreased (transformation related protein 53, Trp53, MGI:98834 and POU domain, class 5, transcription factor 1, Pou5f1, MGI:101893) (p<0.1 when compared to the overall distribution of the nucleus-detected proteins, Figure S8). Of those, Pou5f1 has also been previously shown to be involved in ES cell regulation [31],[32] and it has significant overlap in genomic binding targets with Nanog [33],[34]. Furthermore, the change in expression for these four proteins is consistent for different time points after Nanog knock-down, and increases consistently over the time course (Figure 4B). This experimental verification demonstrates that our system is a powerful tool which can aid researchers in generating accurate hypotheses for discovery of proteins involved in a specific cellular process.

Figure 4
Validation by Nanog down-regulation experiment.

Our functional network can also highlight information about physical interactions and transcriptional binding sites. For example, the 17 physical interactions with Nanog identified by Wang et al. were highly enriched in pairs of high functional relationship confidence (Mann-Whitney U test p = 0.00069). In addition, at the transcriptional level, the genes associated with Nanog binding loci [34] were also highly enriched in high-confidence functional interactors of Nanog predicted by our network (U test p = 3.98E-18). Therefore, by integrating a diverse collection of data, mouseNET enables users to explore various types of functional associations, including physical interactions and transcriptional-level regulation.

Topological Analysis Reveals Distinct Characteristics of Modulators of Diverse Processes

MouseNET provides a valuable resource to characterize the systems-level features of a model organism, which is a critical issue in understanding the organization and dynamics of the proteome. In the mouseNET network, the majority of proteins have only a small number of connections (Figure 5A), yet the presence of a few highly connected nodes (Figure 1B) implies central modifiers of the proteome. These ‘hub’ genes (at confidence cutoff 0.6) are enriched in regulation of response to stress, DNA metabolic process and cell cycle (Bonferroni-corrected p<1.0E-9) (Table 2). Additionally, these hubs were significantly enriched (Bonferroni-corrected p = 8.3E-10) for ‘chromosome organization and biogenesis’, which is in agreement with a previous study in C. elegans that identified a class of genetic interaction hubs, all six of which were chromatin regulators [35].

Figure 5
Topological properties of the functional network.
Table 2
GO SLIM (Biological Process) enrichment of potential modulators of several pathways (Ci<0.15, N≥10) and highly connected genes (N≥10).

We further analyzed the topology of the functional network surrounding these hubs and found distinct characteristics that correlate with their role in the cell. Proteins with high connectivity may appear in densely connected modules, or alternatively, they could be linkers of multiple functional modules and participate in several pathways [36]. To investigate these two classes, for each gene we computed the clustering coefficient, C, which gives the probability that its interactors are connected to each other. We found that low clustering coefficients, when controlled for node degree, are critical indicators of proteins participating in more biological pathways (Figure 5B). This trend is robust against different confidence cutoff levels for the interactions (Figure S3). For example, both nucleolar protein 1 (Nol1, MGI:107891) and paxillin (Pxn, MGI:108295) have 50 functional linkages with more than 0.6 confidence in interactions (Figure 5C and 5D). However, the former, which has a C of 0.44, is involved in only the rRNA processing pathway, while the latter, with a C of 0.06, is known to be involved in multiple biological processes, including activation of MAPK activity, branching morphogenesis of a tube, cell adhesion and protein folding. Furthermore, we found that the set of proteins with low clustering coefficients, but not the set of all proteins with only high node degree, is highly enriched for ‘signal transduction’ (Table 2), probably because proteins involved in signal transduction are central to cross-talk among multiple pathways and the cell's diverse response to various stimuli. Thus, the topology of the functional network contains important clues to the global organization of the proteome; and in addition to connectivity, we demonstrate that the clustering coefficient is a critical factor characterizing modifiers of multiple biological pathways.
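The clustering coefficient described above can be sketched as follows for a network thresholded at a confidence cutoff. Treating links as binary after thresholding at 0.6 is an assumption made for this illustration, and the toy gene names and confidences are hypothetical.

```python
from itertools import combinations

def clustering_coefficient(gene, edges, cutoff=0.6):
    """Fraction of a gene's neighbor pairs that are themselves linked.

    edges: {frozenset({a, b}): confidence}; links below cutoff are ignored.
    """
    kept = {pair for pair, conf in edges.items() if conf >= cutoff}
    neighbors = {g for pair in kept if gene in pair for g in pair if g != gene}
    if len(neighbors) < 2:
        return 0.0  # undefined for fewer than two neighbors; reported as 0 here
    linked = sum(
        1 for a, b in combinations(sorted(neighbors), 2)
        if frozenset((a, b)) in kept
    )
    possible = len(neighbors) * (len(neighbors) - 1) // 2
    return linked / possible

# Toy network: "hub" has three neighbors, only one pair of which is linked.
edges = {
    frozenset(("hub", "a")): 0.9,
    frozenset(("hub", "b")): 0.8,
    frozenset(("hub", "c")): 0.7,
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.3,  # below the cutoff, so not counted
}
print(clustering_coefficient("hub", edges))
```

A gene whose neighbors fall into several mutually unconnected groups, as a linker between modules would, scores low under this measure even when its degree is high.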

Phenotypic and Disease Effects in Relation to Topology and Functional Participation

Global modeling of functional linkages provides a general framework to analyze the relationship between local network properties and functional consequences of individual gene perturbations. For example, previous studies have predicted that the network connectivity is correlated with the propensity of a protein to be essential [37],[38]. Recently, however, there has been debate over whether this relationship is indeed true in yeast or human [39],[40], the main issue being whether high connectivity is truly a property of the underlying network or simply an effect of intense study of the essential gene set (i.e., annotation or investigational bias).

To address this question in the mouse functional network and control for investigation bias, we constructed two networks: one including all input data except knock-out phenotype information, and one including only whole-genome datasets. To avoid the caveat that not all gene knock-outs have been constructed, only genes that have been knocked out or targeted were included in all statistical analyses. For the first functional network, essential genes or disease-associated genes are significantly more connected than average (p<10^-18 for perinatal lethality, p<10^-9 for postnatal lethality, and p<10^-6 for disease-associated genes, Mann-Whitney U test) (Figure S4A). However, in the functional network based on only whole-genome datasets, the difference between essential and non-essential sets was not significant, nor was that between the disease-related set and the genome average (Figure 6A), suggesting the observed relationships between essentiality and network connectivity are likely to be explained by investigational biases in our case. This result is consistent with a previous study [41] which suggested that the vast majority of disease genes show no tendency to encode physical interaction hubs in human data. We further considered whether connectivity and local topology in our functional network relate to other perturbation phenotypes. Although most phenotype-responsible gene groups (Table S1) have a higher than average connectivity based on all available input data (Figure S4B), only proteins involved in tumorigenesis and embryogenesis still have significantly higher connectivity than average (p<0.05) on the whole-genome-data-only network (Figure 6B). This result highlights that the variation in intensity of study for genes can cause significant biases in the conclusions reached when comparing the connectivity of different groups of genes.
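For readers unfamiliar with the test used in these comparisons, the following is a rough sketch of a two-sided Mann-Whitney U test using the normal approximation. It assigns average ranks to ties but applies no tie correction to the variance, so it is only indicative for large samples with few ties; the connectivity values are hypothetical.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for j in range(start, end + 1):
            ranks[pooled[j][1]] = avg_rank
        start = end + 1
    n1, n2 = len(x), len(y)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u1 - mean) / sd
    return u1, erfc(abs(z) / sqrt(2))

# Hypothetical node degrees for essential and non-essential genes.
essential = [12, 15, 9, 20, 18]
non_essential = [5, 7, 11, 6, 8]
u, p_value = mann_whitney_u(essential, non_essential)
print(u, round(p_value, 4))
```

Because the test compares ranks rather than raw values, it suits the heavily skewed degree distributions typical of such networks.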

Figure 6
Relationship between phenotypic effects and local network configuration.

We observed that all groups of phenotype-associated genes have a lower clustering coefficient than average, and most participate in more biological pathways (Figure 6C). This conclusion holds true when controlling for investigational biases. For example, Trp53, with very high connectivity (Figure 1B) and particularly low clustering coefficient (0.02252), is essential during both embryonic perinatal and postnatal stages and plays a role in tumorigenesis, the reproductive system, and has ten other high level phenotypes (Table S1) according to the Mouse Genome Informatics (MGI) database [18]. This result implies that hubs with low clustering coefficient and participating in multiple pathways are important buffers of the genome, and that mutations or other disruptions of these genes are likely to be related to detrimental phenotypes and, likely, disease.

Comparison of Yeast and Mouse Functional Networks

Genome evolution on the sequence level has been studied intensively during the past decades. Studies of functional evolution on the genome-scale, on the other hand, require comprehensive profiling of proteins, which is difficult due to largely incomplete annotation of protein function in most organisms. Here, we demonstrate that mouseNET is a valuable resource for cross-species functional evolution studies by comparing it to the S. cerevisiae network [2]. To avoid circularity caused by integration of sequence similarity information, we generated a functional network that excludes all orthology-based input data. Given these mouse and yeast networks, we first investigated whether functional linkages are conserved between pairs of orthologs as identified through InParanoid [23]. Our results indicate that high-confidence functional linkages in S. cerevisiae are strongly predictive of functional linkages between orthologous gene pairs in mouse (Figure 7A for statistical analysis).
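One simple way to quantify the conservation described above is the share of high-confidence yeast links whose mouse orthologs are also linked. The sketch below assumes a one-to-one ortholog table and a common confidence cutoff of 0.6 for both networks; both are simplifications made for this illustration, and all identifiers and scores are hypothetical.

```python
def conserved_fraction(yeast_edges, mouse_edges, orthologs, cutoff=0.6):
    """Share of high-confidence yeast links also seen between mouse orthologs.

    yeast_edges, mouse_edges: {frozenset({a, b}): confidence}
    orthologs: {yeast_gene: mouse_gene}, assumed one-to-one for simplicity
    """
    mapped = conserved = 0
    for pair, conf in yeast_edges.items():
        if conf < cutoff or not all(g in orthologs for g in pair):
            continue  # keep only confident links with both orthologs known
        mouse_pair = frozenset(orthologs[g] for g in pair)
        if len(mouse_pair) < 2:
            continue  # both yeast genes map to the same mouse gene
        mapped += 1
        if mouse_edges.get(mouse_pair, 0.0) >= cutoff:
            conserved += 1
    return conserved / mapped if mapped else None

# Toy data with hypothetical identifiers.
yeast_edges = {
    frozenset(("y1", "y2")): 0.9,
    frozenset(("y1", "y3")): 0.8,
    frozenset(("y2", "y3")): 0.2,
    frozenset(("y3", "y4")): 0.95,  # y4 has no mouse ortholog
}
mouse_edges = {frozenset(("m1", "m2")): 0.7, frozenset(("m1", "m3")): 0.1}
orthologs = {"y1": "m1", "y2": "m2", "y3": "m3"}
print(conserved_fraction(yeast_edges, mouse_edges, orthologs))
```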

Figure 7
Comparison of yeast and mouse interactome and identification of mouse-specific functional linkages.

We also investigated the conservation of functional neighborhoods in the mouse and yeast networks. To make the datasets comparable, we included only orthologous pairs in the conservation statistical analysis. We found that the two networks vary from a high degree of conservation to almost no conservation (Figure 7B and 7C). Functional linkages between proteins involved in response to stress, response to endogenous stimulus, catabolic process, DNA metabolism, cell cycle, and other core biological processes and components were highly conserved between yeast and mouse (Table 3), e.g., the ribosomal protein L15 (Rpl15, MGI:1913730; Figure 7B and 7C). In contrast, functional relationships in processes specific to higher organisms, including behavior, embryonic development, multicellular organismal development and anatomical structure morphogenesis, were limited to the mouse network (Table 4). For example, the HtrA serine peptidase 1 (Htra1, MGI:1929076) plays a role in BMP signaling pathway [42], but its ortholog in yeast, YNL123W (Nma111, SGD: S000005067) is involved in apoptosis and lipid metabolic process [43],[44] (Figure 7B and 7C). The newly generated interactions for these mouse-specific functional networks originated through a combination of orthologous pairs in yeast and novel connections with existing genes or genes that have no ortholog in yeast (Figure 7B and 7C). Interestingly, ion transport was among the list of enriched processes for both conserved and unconserved subgraphs. We found that in conserved subgraphs, these genes were enriched in energy-coupled proton transport, which is conserved from yeast to mammals. In contrast, in the unconserved subgraphs, this enrichment of ion transport was due to genes involved in metal-ion or chloride transport, probably because of their involvement in the neural system. Details regarding the enrichment statistics are available in Dataset S3.

Table 3
Conservation between yeast and mouse functional relationships.
Table 4
Divergence between yeast and mouse functional relationships.

Comparative analysis of interactomes between species, such as that presented above, is no doubt a promising approach for answering a number of fundamental biological questions [45]. Previous studies, e.g., [40], have demonstrated the sparsity of our current knowledge of physical interactions in many organisms, which has led to a very limited set of identified conserved interactions. As demonstrated here, the comparison of higher-coverage functional networks based on probabilistic models for integrating diverse genomic data provides an alternative solution for studying the evolution of functional linkages between proteins.

Example Application of the MouseNET Web Interface

Generating hypotheses for biological functions for a protein of interest based on integrating diverse data sources

An important application of the network analysis is to identify, for a protein of interest, which biological processes and pathways it participates in. Here, we use the mouseNET online query system to identify two different biological processes involving Ace (angiotensin I converting enzyme 1, MGI:87874). Ace is currently only annotated to metabolic process (GO:0008125) and proteolysis (GO:0006508) biological process terms in the Gene Ontology. Ace has a well-established central role in blood pressure regulation, evidenced by knock-out phenotypes [46], but it currently lacks annotation to the corresponding GO term. When mouseNET is queried with ‘Ace’, the system indeed suggests that the local network is highly enriched in blood pressure regulation (GO:0008217, p = 8.17E-4), including four proteins annotated directly to this term (Agtr1a, Agtr1b, Ren1, and Agt) (Figure S5A). The functional links between Ace and these four genes cannot be confidently surmised from any single input dataset; instead, they are supported by a combination of data from InParanoid [47], phenotype [48], OMIM [24], SAGE [19], and Zhang [21] expression data, indicating the importance of data integration for suggesting accurate functional roles for proteins.

In the Ace predicted functional network, we also found enrichment for another unrelated process: menstrual cycle phase (GO:0022601), which currently is synonymous with estrous cycle in mouse GO annotation. Three of the top 40 interactors (Stat5a, Nos3 and Agt) were annotated to this term (p = 3.73E-2), with support from InParanoid [23], phenotype [48], OMIM, SAGE [19], Su [20], and Zhang [21] expression data. Indeed, the expression cycle of Ace shown by immunohistochemistry is correlated with menstrual cycle in human [49], suggesting that mouseNET's prediction of Ace participation in the estrous cycle phase process is likely correct. This annotation is missing from existing annotation databases, and such a prediction would not be made based on genome-scale pair-wise physical interaction studies. Because our system integrates diverse data sources and presents them in a network context, it can quickly allow biology researchers to reveal multiple independent roles of a single gene. mouseNET can thus serve both as a source of functional information for genes that have been previously investigated, but not yet annotated in public databases, and as a method for directing experiments by hypothesizing novel roles for previously uncharacterized proteins.

Identifying disease-related genes through multiple queries of the mouseNET network

Because genes responsible for the same disease are often involved in related pathways, mouseNET provides a valuable resource for identifying novel disease gene candidates through its multiple-query feature. For example, by searching mouseNET with a set of genes (Mapt, Sncaip, Tbp, Drd4, Ndufv2 and Nr4a2) already known to be involved in Parkinson's disease, we are able to extract other genes annotated to this disease and some novel candidates (Figure S5B). The top three interactors returned by mouseNET (Uchl1, Dbh and Snca) are already labeled with Parkinson's disease in OMIM, indicating the ability of our system to accurately identify other disease genes given some known ones. The fourth gene, Msx1 (Homeo box, msh-like 1), is not yet annotated to Parkinson's disease. However, its connection to several query genes (Tbp and Mapt) and to several proteins functionally related to the query set (Mdm2, Fyn, Psen1, Apoe, Uchl1, and Dbh) in mouseNET suggests its potential role in Parkinson's disease. Interestingly, Msx1 was found to act as an intrinsic dopamine-neuron determinant during development, and therefore is very likely to be a candidate involved in Parkinson's disease, which leads to mesencephalic dopamine neuron degeneration. In addition, among the top three interactors, experiments using transgenic mice show that a Uchl1 mutant could lead to dopaminergic neuronal loss [50]; Dbh is a critical gene involved in dopamine biosynthesis; and Snca has been suggested to be an essential regulator of dopamine neurotransmission [51]. Notably, a query of Tbp alone results in a list of transcription-related genes with no significant association with this particular disease. The novel candidate Msx1 is only identified with multiple disease gene queries and a network including both direct and indirect neighbors. This illustrates the ability of mouseNET to identify novel disease gene candidates based on its multiple-query feature, which cannot be achieved with existing databases or readily extracted from any single genome-scale dataset.
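The idea behind a multiple-gene query can be sketched as ranking every non-query gene by its combined connection to the query set. Summing link confidences over direct neighbors is an illustrative scoring choice and is not claimed to be the scoring used by the mouseNET interface, which also considers indirect neighbors; the gene names and confidences below are hypothetical.

```python
def rank_candidates(query, edges, top_n=5):
    """Rank non-query genes by summed link confidence to the query set.

    edges: {frozenset({a, b}): confidence}
    """
    query = set(query)
    scores = {}
    for pair, conf in edges.items():
        inside = pair & query
        outside = pair - query
        # Count only links joining exactly one query gene to one candidate.
        if len(inside) == 1 and len(outside) == 1:
            (candidate,) = outside
            scores[candidate] = scores.get(candidate, 0.0) + conf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy network: q1 and q2 are query genes, c1 and c2 are candidates.
edges = {
    frozenset(("q1", "c1")): 0.5,
    frozenset(("q2", "c1")): 0.4,
    frozenset(("q1", "c2")): 0.8,
    frozenset(("q1", "q2")): 0.9,  # link within the query set, ignored
    frozenset(("c1", "c2")): 0.7,  # link between candidates, ignored
}
print(rank_candidates(["q1", "q2"], edges))
```

In this toy case c1 outranks c2 only because it is linked to both query genes, which mirrors how a candidate can surface from a multi-gene query but not from any single-gene query.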

Discussion

In this study, we combined diverse genetic and genomic data using a probabilistic framework to generate a functional network for the laboratory mouse. Our network accurately predicts functional linkages between mouse genes and covers a broad range of biological processes. We expect this view of the mouse proteome will be an invaluable resource in identifying novel pathway components and understanding system-level organization.

We have demonstrated several applications of our network in this study. First, we characterized the topology of the network and demonstrated that local network topology correlates with biological functions. Also, we used this genomewide view of functional linkages to investigate the relationship between diverse phenotypes and the local configuration of subnetworks. Finally, although network comparison across several species is limited by the sparsity of our current knowledge of physical interactions [40], generation of a functional network based on diverse data types also allowed us to examine the conservation of subnetworks on a global system level.

We provide a searchable interface for the exploration of the mouse functional network (http://mouseNET.princeton.edu). The interface also presents a full analysis of the functional enrichment of networks surrounding the gene(s) of interest and the disease genes in the local network. Through our interface, users can identify the original evidence supporting specific functional linkages. The website includes integration results generated for the purpose of topological studies (controlled for investigational biases) and of cross-species network alignment studies (by excluding homology data) (http://mouseNET.princeton.edu/supplement/supplemental_data.htm). In the future, new publicly available genome-scale data will be added to our system, which will provide up-to-date support for hypothesis generation for questions ranging from individual protein function prediction to characterization of diverse system-level features.

In this study, we focused on the generation of a global functional network for the mouse and demonstrated its wide applicability. The availability of tissue-specific datasets should allow us to generate tissue-, cell-, and developmental stage-specific network predictions using similar probabilistic frameworks. These tissue- or developmental stage-specific networks will be more targeted and should be invaluable to researchers in individual fields of study.

Materials and Methods

Functional Genomic Data Retrieval and Preprocessing

To build a functional network of proteins, we have collected a diverse set of evidence from several databases (Table 1). In order to predict pair-wise protein–protein relationships, all data were preprocessed, as described below, into pair-wise scores, reflecting the similarity between protein pairs. The databases included in our analysis are:

  1. Physical interaction data from the Biomolecular Interaction Network Database (BIND) [15], the Database of Interacting Proteins (DIP) [17] and the General Repository for Interaction Datasets (GRID) [16]. We also mapped the interactions in the Online Predicted Human Interaction Database (OPHID) [24] to mouse orthologs via InParanoid [23]. In this process, members of the interactions that have more than one ortholog in mouse were mapped for each of their orthologs. Because physical interaction data are pair-wise and binary (representing the presence or absence of evidence for a physical interaction between a pair of proteins), these datasets were in the format of pair-wise binary scores and were ready to be input into the Bayesian network.
  2. Phenotype and disease data from MGI [18] and the Online Mendelian Inheritance in Man (OMIM). The disease association data were mapped to mouse using InParanoid [23]. Based on independence analysis (see below), we found that different phenotypes are highly conditionally dependent on each other, and that the phenotype data and disease data are dependent on each other as well. Thus treating phenotype and disease data as separate evidence nodes in a naïve Bayesian network would cause significant over-estimation of functional relationships between gene pairs that affect the same multiple phenotypes/diseases. As a result, phenotype and disease data were treated as a single evidence node in our Bayesian network, of which the score for the protein pair j,k will be:
    equation image
    (1)
    Where ai(j) = 1 if protein j has phenotype i and ai(j) = 0 otherwise, and Ni is the number of proteins involved in this phenotype/disease; n is the total number of phenotypes and diseases. In this way, co-occurrence of rare phenotypes or diseases will be given more weight than common ones. Such calculation allows the transformation from original phenotype/disease profiles to pair-wise scores that reflect the similarity level between a pair of proteins.
  3. Homologous functional relationship predictions in yeast from the bioPIXIE system. bioPIXIE is a previously established genomewide prediction of S. cerevisiae functional network, which is based on integration of diverse yeast genome-scale datasets [2]. This integrated dataset was used as an input in our mouse interactome by mapping orthologous genes between S. cerevisiae and laboratory mouse using InParanoid [23]. The average was taken in the case that orthology mapping results in multiple mapped pair-wise scores in yeast for a single pair in mouse.
  4. Expression and Tissue localization datasets from Su et al., 2004, Zhang et al., 2004, and the SAGE database [19]. We chose these three datasets because they represent expression profiles of a wide range of tissue and developmental stages. In total, they included 333 conditions. To make the data suitable as an input to our Bayesian network, we applied the Pearson correlation coefficient ρ, to assess levels of co-expression between pairs of genes:
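    $$\rho = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{\sigma_x}\right)\left(\frac{y_i-\bar{y}}{\sigma_y}\right) \tag{2}$$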
    where x and y are expression-level data vectors of length n for two genes, x̄ and ȳ are their means, and σx and σy are their standard deviations. The correlation coefficients were further Fisher z-transformed to ensure comparable, normal distributions [52].
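
As an illustration of the two continuous pair-wise scores described in the list above (items 2 and 4), the following minimal Python sketch computes a Fisher z-transformed Pearson correlation and a rarity-weighted phenotype overlap. All function names are hypothetical. The 1/N_i weighting in `phenotype_score` is an assumption, chosen only because it gives rare phenotypes more weight than common ones; it is not necessarily the weighting used in Equation 1.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(rho, eps=1e-12):
    """Fisher z-transform, z = atanh(rho); rho is clipped away from +/-1."""
    rho = max(min(rho, 1.0 - eps), -1.0 + eps)
    return math.atanh(rho)


def phenotype_score(phenotypes_j, phenotypes_k, n_proteins):
    """Similarity of two proteins from their phenotype/disease sets.

    ASSUMPTION: each shared phenotype i contributes 1 / N_i, where N_i
    (n_proteins[i]) is the number of proteins annotated to phenotype i.
    """
    shared = set(phenotypes_j) & set(phenotypes_k)
    return sum(1.0 / n_proteins[i] for i in shared)
```

In this sketch a pair of genes sharing one rare phenotype (small N_i) can outscore a pair sharing several common ones, which is the qualitative behavior described for the phenotype/disease evidence node.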

Filtering Redundant Datasets

In the following section, we applied a naïve Bayes network to integrate all data sources and to predict pair-wise functional relationships. However, the application of a naïve Bayesian framework requires a non-trivial assumption of independence between individual evidence sources, which correspond to different evidence nodes in the naïve Bayes network. To address this issue, we evaluated the conditional independence between datasets and those with significant dependence were merged into a single evidence node. To determine whether two datasets should be merged, we calculated the likelihood ratio of each combination of datasets with and without the assumption of independence.

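$$LR_{\text{independent}} = \frac{P(E_i \mid FR_Y)\,P(E_j \mid FR_Y)}{P(E_i \mid FR_N)\,P(E_j \mid FR_N)} \tag{3}$$
$$LR_{\text{joint}} = \frac{P(E_i, E_j \mid FR_Y)}{P(E_i, E_j \mid FR_N)} \tag{4}$$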

where E is the score of the protein pair in dataset i or j, FRY denotes a positive functional relationship (FR = 1) in the gold standard, and FRN denotes a negative functional relationship (FR = 0).

Two conditionally independent datasets will have similar likelihood ratios calculated by the above two approaches (Figure S6A). In contrast, highly dependent datasets tend to have erroneously high likelihood ratios (Figure S6B) when they are treated as independent ones. After a complete analysis of the independence properties between every dataset pair, we found that phenotype data from MGI and disease data from OMIM are highly dependent on each other. As a result, we treated these phenotype and disease data as a single evidence node in the Bayesian network, and each of the remaining datasets as an individual evidence node.
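
The comparison described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout (one tuple of two discretized scores and a gold-standard label per protein pair), and the add-one smoothing controlled by `pseudo` are all assumptions made for the example.

```python
def likelihood_ratios(pairs, pseudo=1.0):
    """Compare joint and independence-based likelihood ratios.

    pairs: iterable of (e_i, e_j, fr), where e_i and e_j are discretized
    scores of a protein pair in two datasets and fr is 1 (functionally
    related in the gold standard) or 0 (not related).
    Returns {(e_i, e_j): (joint_lr, independent_lr)}.
    """
    pairs = list(pairs)
    values_i = sorted({a for a, _, _ in pairs})
    values_j = sorted({b for _, b, _ in pairs})
    by_class = {fr: [(a, b) for a, b, label in pairs if label == fr]
                for fr in (1, 0)}

    def prob(count, total, n_values):
        # Smoothed frequency estimate (smoothing is an assumption).
        return (count + pseudo) / (total + pseudo * n_values)

    result = {}
    for a in values_i:
        for b in values_j:
            joint, indep = [], []
            for fr in (1, 0):
                obs = by_class[fr]
                total = len(obs)
                c_ab = sum(1 for x, y in obs if x == a and y == b)
                c_a = sum(1 for x, _ in obs if x == a)
                c_b = sum(1 for _, y in obs if y == b)
                joint.append(prob(c_ab, total, len(values_i) * len(values_j)))
                indep.append(prob(c_a, total, len(values_i))
                             * prob(c_b, total, len(values_j)))
            result[(a, b)] = (joint[0] / joint[1], indep[0] / indep[1])
    return result
```

For two datasets that carry the same information, the independence-based ratio comes out larger than the joint ratio, which is the over-estimation that motivates merging such datasets into a single evidence node.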

Bayesian Network Construction

As data sources are different in their accuracy of measurement as well as relevance for predicting protein functions, creating an accurate network for functional linkages requires a systematic approach that weights and integrates information from individual datasets. We applied a Bayesian network to integrate diverse data and make the final functional linkage predictions (Figure 1A). Specifically, we computed the posterior probability of a functional relationship given all available evidence as follows:

$$P(FR \mid E_1, E_2, \ldots, E_n) = \frac{1}{Z}\,P(FR)\prod_{i=1}^{n} P(E_i \mid FR) \tag{5}$$

where FR represents functional relationship, Ei represents the score of the pair in each dataset i and Z is a normalization factor. Intuitively, this probability FRij for two proteins i and j represents how likely it is, given existing data and accuracy and coverage of each input dataset, that proteins i and j participate in the same biological process.
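
For a binary FR variable, the naïve Bayes posterior of Equation 5 can be computed as in the short sketch below. The function name and argument layout are hypothetical; the conditional probabilities would in practice be learned from the gold standard, and the numbers in the usage note are invented for illustration.

```python
def posterior_fr(prior, likelihoods):
    """Posterior probability of a functional relationship (naive Bayes).

    prior: P(FR = 1).
    likelihoods: list of (P(E_i | FR = 1), P(E_i | FR = 0)), one entry
    per evidence node, evaluated at the observed score of the pair.
    """
    related = prior
    unrelated = 1.0 - prior
    for p_given_related, p_given_unrelated in likelihoods:
        related *= p_given_related
        unrelated *= p_given_unrelated
    # The sum of the two unnormalized terms plays the role of Z.
    return related / (related + unrelated)
```

For example, with a prior of 0.1 and two evidence nodes with likelihood pairs (0.6, 0.1) and (0.3, 0.3), the second node is uninformative and the posterior is 0.4.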

To learn the parameters in this Bayesian framework, we established a gold standard that approximates a true set of functionally related proteins. Mouse Genome Informatics (MGI) maintains curated annotations of Gene Ontology (GO) for mouse [53]. The sources of these annotations include (1) hand annotation from primary literature, (2) electronic annotation based on gene names and symbols, (3) annotation from SwissProt keywords, and (4) Enzyme Commission (EC) numbers. These annotation sources are reasonably accurate for our analysis. We defined positives as pairs of proteins that are co-annotated to a specific Biological Process GO term (one with fewer than two hundred annotated genes) and negatives as pairs in which both members have specific annotations but do not share any of them.
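
The gold-standard definition above can be sketched as follows. This is a simplified illustration under stated assumptions: the input is a hypothetical mapping from GO term to annotated genes, the function name is invented, and propagation of annotations through the GO hierarchy is ignored.

```python
from itertools import combinations


def gold_standard(term_to_genes, max_term_size=200):
    """Split gene pairs into gold-standard positives and negatives.

    Only terms with fewer than max_term_size annotated genes count as
    specific. Genes without any specific annotation are left out.
    """
    specific = {term: set(genes) for term, genes in term_to_genes.items()
                if len(set(genes)) < max_term_size}
    gene_terms = {}
    for term, genes in specific.items():
        for gene in genes:
            gene_terms.setdefault(gene, set()).add(term)

    positives, negatives = set(), set()
    for a, b in combinations(sorted(gene_terms), 2):
        if gene_terms[a] & gene_terms[b]:
            positives.add((a, b))   # share at least one specific term
        else:
            negatives.add((a, b))   # both annotated, nothing shared
    return positives, negatives
```

Pairs in which either gene lacks a specific annotation fall into neither set, so they do not influence parameter learning.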

To model the posterior distribution given a set of data, we grouped the pair-wise values from each dataset into discrete groups. For binary datasets, such as physical interactions, it is easy to separate the two categories: 0 means that there is no interaction between the pair, and 1 means that the interaction exists. Continuous pair-wise scores (e.g., expression profiles and phenotype/disease data) require a binning approach for discretization. We observed that for each dataset, the posteriors generally decrease, with small fluctuations, as the pair-wise score decreases (Figure S7). Thus, to avoid over-fitting to noise in the datasets, discretization was done so as to force the posteriors of the discretized bins to decrease as the average pair-wise score of those bins decreases.
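
One way to obtain bins with monotonically decreasing posteriors is to merge adjacent bins whenever the ordering is violated. The sketch below uses a pool-adjacent-violators style merge; this specific procedure is an assumption for illustration and is not stated in the text. The fraction of gold-standard positives in a bin stands in for its posterior, and the function names are hypothetical.

```python
def _positive_fraction(bin_counts):
    """Fraction of gold-standard positives in a bin of [n_pos, n_neg]."""
    return bin_counts[0] / (bin_counts[0] + bin_counts[1])


def monotone_bins(bins):
    """Merge adjacent bins until the positive fraction never increases.

    bins: list of (n_pos, n_neg) counts, ordered from the highest to the
    lowest pair-wise score; every bin must contain at least one pair.
    """
    merged = []
    for n_pos, n_neg in bins:
        merged.append([n_pos, n_neg])
        # Pool the newest bin into its predecessor while it breaks the
        # required non-increasing order.
        while (len(merged) > 1 and
               _positive_fraction(merged[-1]) > _positive_fraction(merged[-2])):
            last_pos, last_neg = merged.pop()
            merged[-1][0] += last_pos
            merged[-1][1] += last_neg
    return [tuple(b) for b in merged]
```

Merging absorbs the small fluctuations seen in the raw posteriors while keeping the overall downward trend.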

Network-Based Pathway Component Prediction

An important application of such a functional network is to predict novel pathway components. We therefore applied our network to predict pathway components in KEGG [28]. For a specific pathway, during each iteration, 10 known genes were seeded into the weighted network and the rest of the genes were treated as unknowns. For every other gene, we computed its adjacency to the 10 seeds. This process was repeated three hundred times with random samplings of the seed set. Finally, we calculated the average adjacency for each gene:

$$w_i = \frac{1}{n_i}\sum_{k}\sum_{j} w_{ijk} \tag{6}$$

where wi represents the weight of each gene and j represents the seed genes, and wijk represents the confidence, as estimated by our integration, of the functional relationship between protein i and j in iteration k. ni is the number of times gene i was not one of the seed genes. The top components and recovery curves were generated based on the ranking of wi.
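
The seeding procedure can be sketched as below. This is a minimal illustration, not the authors' code: the function name and the representation of the network as a dictionary keyed by unordered gene pairs are hypothetical, and taking a gene's adjacency in one iteration to be the sum of its edge confidences to the seeds is an assumption.

```python
import random


def average_adjacency(weights, pathway_genes, all_genes,
                      n_seeds=10, n_iter=300, rng=None):
    """Average adjacency of each gene to randomly sampled pathway seeds.

    weights: dict mapping frozenset({gene_a, gene_b}) to the confidence
    of the functional relationship; missing pairs count as 0.
    """
    rng = rng or random.Random(0)
    pathway = sorted(pathway_genes)
    totals = {g: 0.0 for g in all_genes}
    counts = {g: 0 for g in all_genes}   # n_i: iterations where g is not a seed
    for _ in range(n_iter):
        seeds = set(rng.sample(pathway, n_seeds))
        for g in all_genes:
            if g in seeds:
                continue
            totals[g] += sum(weights.get(frozenset((g, s)), 0.0) for s in seeds)
            counts[g] += 1
    return {g: totals[g] / counts[g] for g in all_genes if counts[g] > 0}
```

Sorting the returned scores in decreasing order gives the ranking from which top candidates and recovery curves can be derived; held-out pathway genes that rank highly indicate that the network recovers known components.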

Topological Characterization of the Functional Interactome

To characterize the topology of the functional network, we calculated the connectivity and clustering coefficient C of all proteins. The clustering coefficient of a protein gives the probability that its neighbors are connected to each other. In a densely connected module or clique, C is close to one. C for each of the proteins was calculated as follows [54]:

$$C = \frac{2n}{k(k-1)} \tag{7}$$

where n denotes the number of links between k direct interactors.
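
A minimal sketch of this calculation on an unweighted graph is given below. It assumes the weighted functional network has already been thresholded into an adjacency structure (for example at a confidence cutoff); the function name and the dictionary-of-sets representation are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Clustering coefficient C = 2n / (k(k - 1)) of one node.

    adjacency: dict mapping each node to the set of its direct
    interactors in an undirected graph.
    """
    neighbors = adjacency[node] - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0   # C is undefined for k < 2; 0 is used by convention here
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))
```

A node whose neighbors form a clique has C = 1, whereas the center of a star, whose neighbors are not linked to each other, has C = 0.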

Functional Enrichment

We obtained GO annotations [27] from the Mouse Genome Informatics (MGI) [18] on Jan 18, 2007. The enrichment of each GO term was found using a hypergeometric distribution. The most enriched GO terms were represented by the lowest Bonferroni-corrected p value [55].
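
The enrichment test can be sketched with the standard library alone. The function names and input layout are hypothetical, and multiplying each p-value by the number of terms tested is one common form of the Bonferroni correction, assumed here for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favorable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favorable / comb(N, n)


def enriched_terms(query, term_to_genes, background):
    """Bonferroni-corrected enrichment p-value of every term, best first."""
    background = set(background)
    query = set(query) & background
    n, N = len(query), len(background)
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        k = len(query & annotated)
        p = hypergeom_pvalue(k, n, len(annotated), N)
        results.append((term, min(1.0, p * len(term_to_genes))))
    return sorted(results, key=lambda item: item[1])
```

The first entry of the returned list is the most enriched term, that is, the one with the lowest corrected p-value.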

Implementation, Publicly Available Interface, and Network-Based Gene Function Predictions

To facilitate wide access to the integrated functional network by the biology community, we implemented a web interface (http://mouseNET.princeton.edu) that allows the users to browse our predictions based on single or multiple protein queries. We have implemented a probabilistic algorithm that searches the direct or indirect neighbors with the largest adjacency to the query set [2]. GO term enrichment was calculated for the top neighbors, which facilitates fast discovery of unknown gene function.

We also provide the community with a list of gene function predictions based on our network for proteins with no currently known function. Specifically, we calculated the GO term enrichment of the top 40 nearest neighbors of each gene using the hypergeometric distribution. The enrichment of each function among a gene's top neighbors is reported as a Bonferroni-corrected p-value, from which the gene's putative function is inferred.

Experimental Verification

The Nanog-controllable embryonic stem cell lines were set up and tested by Natalia Ivanova, and were cultured as described [56]. The feeder cells, primary mouse embryonic fibroblasts, were removed before use. To down-regulate Nanog, we withdrew the doxycycline (1 μg ml−1) from the media, but still supplied the cells with all the routine ES cell nutrients (DMEM with 15% FBS (Hyclone), 100 mM MEM non-essential amino acids, 0.1 mM 2-mercaptoethanol, 1 mM L-glutamine (Invitrogen), and 10^3 U ml−1 of LIF (Chemicon)). For the nuclear protein measurement, nuclear protein samples were prepared with a nuclear/cytosol fractionation kit (BioVision, catalog number: K266-100). The samples from four different time points were labeled with different isotopes (iTRAQ) and then analyzed in a single mass spectrometry run. We used the ProQUANT and ProGROUP software (Applied Biosystems) to identify proteins. The experiment was repeated three times. Proteins detected more than twice were included in the analysis, and the average values were used.

Supporting Information

Dataset S1

Functional composition and biases of each data source and the integrated result.

(0.41 MB XLS)

Dataset S2

Expert curation for a selected set of gene function predictions based on the network.

(0.07 MB XLS)

Dataset S3

Functional biases of conserved and non-conserved sub-networks.

(0.09 MB XLS)

Figure S1

The functional composition of the integrated results and individual datasets.

(0.14 MB TIF)

Figure S2

Performance of the integrated interactome in predicting the components of six major pathways in development.

(1.57 MB TIF)

Figure S3

Connectivity (at 0.3 cutoff in confidence) versus clustering coefficient.

(0.43 MB TIF)

Figure S4

Connectivity and phenotypic effects in networks integrated using both individual experimental evidence and large-scale genomic data.

(0.67 MB TIF)

Figure S5

Illustration of the mouseNET interface.

(1.20 MB TIF)

Figure S6

Example of a conditionally independent pair of datasets and a conditionally dependent dataset pair.

(0.24 MB TIF)

Figure S7

The general trend of posteriors for continuous datasets.

(0.13 MB TIF)

Figure S8

The distribution of Log2 changes in protein expression level on the fifth day after Nanog knock-down for 1148 proteins detected in the nucleus.

(0.11 MB TIF)

Table S1

Mapping of phenotypes and MP index in MGI.

(0.05 MB DOC)

Table S2

Literature evidence for novel components (not currently annotated to MAPK in KEGG or GO) predicted to be involved in MAPK pathway.

(0.09 MB DOC)

Text S1

Supplementary figure and tables.

(5.07 MB DOC)

Acknowledgments

The authors would like to thank Dr. Judith Blake and her curatorial staff for manual review of a subset of our functional predictions. We thank David Hess for his suggestions on homologous comparison between mouse and yeast. We also thank Matthew Hibbs and Curtis Huttenhower for insightful comments and discussion.

Footnotes

The authors have declared that no competing interests exist.

National Science Foundation (NSF) CAREER award DBI-0546275, National Institutes of Health grant R01 GM071966, NSF grant IIS-0513552, and National Institute of General Medical Sciences Center of Excellence grant P50 GM071508. OGT is an Alfred P. Sloan Research Fellow.

References

1. Jiang T, Keating AE. AVID: an integrative framework for discovering functional relationships among proteins. BMC Bioinformatics. 2005;6:136.
2. Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, et al. Discovery of biological networks from diverse functional genomic data. Genome Biol. 2005;6:R114.
3. Chen Y, Xu D. Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res. 2004;32:6414–6424.
4. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 2003;302:449–453.
5. Lee I, Date SV, Adai AT, Marcotte EM. A probabilistic functional network of yeast genes. Science. 2004;306:1555–1558.
6. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003;100:8348–8353.
7. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 2005;23:951–959.
8. Xia K, Dong D, Han JD. IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model. BMC Bioinformatics. 2006;7:508.
9. Ala U, Piro RM, Grassi E, Damasco C, Silengo L, et al. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput Biol. 2008;4:e1000043. doi:10.1371/journal.pcbi.1000043.
10. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255.
11. Novershtern N, Itzhaki Z, Manor O, Friedman N, Kaminski N. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol. 2008;38:324–336.
12. Schadt EE, Molony C, Chudin E, Hao K, Yang X, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 2008;6:e107. doi:10.1371/journal.pbio.0060107.
13. Tsaparas P, Marino-Ramirez L, Bodenreider O, Koonin EV, Jordan IK. Global similarity and local divergence in human and mouse gene co-expression networks. BMC Evol Biol. 2006;6:70.
14. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006;78:1011–1025.
15. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, et al. The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005;33:D418–D424.
16. Breitkreutz BJ, Stark C, Tyers M. The GRID: the General Repository for Interaction Datasets. Genome Biol. 2003;4:R23.
17. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–D451.
18. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, et al. The Mouse Genome Database (MGD): from genes to mice—a community resource for mouse biology. Nucleic Acids Res. 2005;33:D471–D475.
19. Siddiqui AS, Khattra J, Delaney AD, Zhao Y, Astell C, et al. A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proc Natl Acad Sci U S A. 2005;102:18485–18490.
20. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004;101:6062–6067.
21. Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, et al. The functional landscape of mouse gene expression. J Biol. 2004;3:21.
22. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21:3439–3440.
23. O'Brien KP, Remm M, Sonnhammer EL. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005;33:D476–D480.
24. Brown KR, Jurisica I. Online predicted human interaction database. Bioinformatics. 2005;21:2076–2082.
25. Peña-Castillo L, Tasan M, Myers C, Lee H, Joshi T, et al. A critical assessment of M. musculus gene function prediction using integrated genomic evidence. Genome Biol. 2008;9(Suppl 1):S2.
26. Zhong W, Sternberg PW. Genome-wide prediction of C. elegans genetic interactions. Science. 2006;311:1481–1484.
27. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29.
28. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–D357.
29. Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. Finding function: evaluation methods for functional genomic data. BMC Genomics. 2006;7:187.
30. Qi Y, Nie Z, Lee YS, Singhal NS, Scherer PE, et al. Loss of resistin improves glucose homeostasis in leptin deficiency. Diabetes. 2006;55:3083–3090.
31. Chambers I, Colby D, Robertson M, Nichols J, Lee S, et al. Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell. 2003;113:643–655.
32. Mitsui K, Tokuzawa Y, Itoh H, Segawa K, Murakami M, et al. The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell. 2003;113:631–642.
33. Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122:947–956.
34. Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet. 2006;38:431–440.
35. Lehner B, Crombie C, Tischler J, Fortunato A, Fraser AG. Systematic mapping of genetic interactions in Caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat Genet. 2006;38:896–903.
36. Snel B, Bork P, Huynen MA. The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci U S A. 2002;99:5890–5895.
37. Lee I, Lehner B, Crombie C, Wong W, Fraser AG, et al. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet. 2008;40:181–188.
38. Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411:41–42.
39. Coulomb S, Bauer M, Bernard D, Marsolier-Kergoat MC. Gene essentiality and the topology of protein interaction networks. Proc Biol Sci. 2005;272:1721–1725.
40. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, et al. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet. 2006;38:285–293.
41. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al. The human disease network. Proc Natl Acad Sci U S A. 2007;104:8685–8690.
42. Oka C, Tsujimoto R, Kajikawa M, Koshiba-Takeuchi K, Ina J, et al. HtrA1 serine protease inhibits signaling mediated by Tgfbeta family proteins. Development. 2004;131:1041–1053.
43. Fahrenkrog B, Sauder U, Aebi U. The S. cerevisiae HtrA-like protein Nma111p is a nuclear serine protease that mediates yeast apoptosis. J Cell Sci. 2004;117:115–126.
44. Tong F, Black PN, Bivins L, Quackenbush S, Ctrnacta V, et al. Direct interaction of Saccharomyces cerevisiae Faa1p with the Omi/HtrA protease orthologue Ynm3p alters lipid homeostasis. Mol Genet Genomics. 2006;275:330–343.
45. Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat Biotechnol. 2006;24:427–433.
46. Bernstein KE, Xiao HD, Frenzel K, Li P, Shen XZ, et al. Six truisms concerning ACE and the renin-angiotensin system educed from the genetic analysis of mice. Circ Res. 2005;96:1135–1144.
47. Bandyopadhyay S, Sharan R, Ideker T. Systematic identification of functional orthologs based on protein network comparison. Genome Res. 2006;16:428–435.
48. Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE. The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Res. 2007;35:D630–D637.
49. Harata T, Ando H, Iwase A, Nagasaka T, Mizutani S, et al. Localization of angiotensin II, the AT1 receptor, angiotensin-converting enzyme, aminopeptidase A, adipocyte-derived leucine aminopeptidase, and vascular endothelial growth factor in the human ovary throughout the menstrual cycle. Fertil Steril. 2006;86:433–439.
50. Setsuie R, Wang YL, Mochizuki H, Osaka H, Hayakawa H, et al. Dopaminergic neuronal loss in transgenic mice expressing the Parkinson's disease-associated UCH-L1 I93M mutant. Neurochem Int. 2007;50:119–129.
51. Abeliovich A, Schmitz Y, Farinas I, Choi-Lundberg D, Ho WH, et al. Mice lacking α-synuclein display functional deficits in the nigrostriatal dopamine system. Neuron. 2000;25:239–252.
52. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915:507–521.
53. Hill DP, Davis AP, Richardson JE, Corradi JP, Ringwald M, et al. Program description: strategies for biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics. 2001;74:121–128.
54. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442.
55. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20:3710–3715.
56. Ivanova N, Dobrin R, Lu R, Kotenko I, Levorse J, et al. Dissecting self-renewal in stem cells with RNA interference. Nature. 2006;442:533–538.

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science