Bioinformatics. Jul 15, 2010; 26(14): 1759–1765.
Published online May 27, 2010. doi: 10.1093/bioinformatics/btq262
PMCID: PMC2894508

Fast integration of heterogeneous data sources for predicting gene function with limited annotation

Abstract

Motivation: Many algorithms that integrate multiple functional association networks for predicting gene function construct a composite network as a weighted sum of the individual networks and then use the composite network to predict gene function. The weight assigned to an individual network represents the usefulness of that network in predicting a given gene function. However, because many categories of gene function have a small number of annotations, the process of assigning these network weights is prone to overfitting.

Results: Here, we address this problem by proposing a novel approach to combining multiple functional association networks. In particular, we present a method where network weights are simultaneously optimized on sets of related function categories. The method is simpler and faster than existing approaches. Further, we show that it produces composite networks with improved function prediction accuracy using five example species (yeast, mouse, fly, Escherichia coli and human).

Availability: Networks and code are available from: http://morrislab.med.utoronto.ca/~sara/SW

Contact: smostafavi@cs.toronto.edu; quaid.morris@utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The past decade has seen a dramatic increase in the quantity and variety of publicly available genomic and proteomic data, and a parallel increase in the number of computational methods to integrate these heterogeneous data in generating predictions about protein and gene function [see Noble and Ben-Hur (2007) for a review]. Many of these methods, often called gene (or protein) function prediction algorithms, use the same basic framework: first, they generate so-called functional association networks that capture information about shared gene (or protein) function implicit in each dataset, then they integrate these networks to generate a single composite network which they input, along with a set of labels that describe gene function, to a kernel- or network-based classification algorithm (e.g. Lanckriet et al., 2004; Marcotte et al., 1999; Mostafavi et al., 2008; Myers and Troyanskaya, 2007; Tsuda et al., 2005). Once trained, these classification algorithms assign discriminant values to each gene that can then be thresholded to generate hypotheses about the function of unlabeled genes.

The functional association network is a natural and widely used representation for capturing information about shared gene function from high-throughput data sources. In this representation, nodes correspond to genes or proteins and the edges are weighted according to the evidence implied by a given data source for shared function of the connected nodes. These edge weights are calculated using a similarity metric matched to a given data type; for example, the Pearson's correlation coefficient (PCC) is often used to measure pairwise similarities between gene expression profiles. Once calculated, it is relatively easy to translate these networks into kernels for kernel-based learning methods [e.g. by using a diffusion kernel (Kondor and Lafferty, 2002; Qi et al., 2008)].
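To make this concrete, the following minimal Python sketch (not the authors' released code) builds a co-expression network from an expression matrix using the PCC; the function name, the top_k parameter and the choice to clip negative correlations at zero are illustrative assumptions, and the top-K sparsification anticipates Section 4.2.

    import numpy as np

    def coexpression_network(expr, top_k=50):
        """Co-expression network with PCC edge weights.

        expr: (n_genes, n_conditions) array of expression profiles.
        Only the top_k strongest positive correlations per gene are kept
        (illustrative sparsification; see Section 4.2).
        """
        pcc = np.corrcoef(expr)               # gene-by-gene Pearson correlations
        np.fill_diagonal(pcc, 0.0)            # no self-edges
        w = np.maximum(pcc, 0.0)              # keep positive evidence only (assumption)
        for i in range(w.shape[0]):
            weak = np.argsort(w[i])[:-top_k]  # all but the top_k neighbours
            w[i, weak] = 0.0
        return np.maximum(w, w.T)             # symmetrize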

An important step in predicting gene function is the construction of a composite network from multiple functional association networks. A common approach is to construct a function-specific composite network as a weighted sum of the individual networks such that the weight of each network is determined based on the network's predictiveness of a set of positively labeled genes that are deemed to have the same specific function (Lanckriet et al., 2004; Mostafavi et al., 2008; Tsuda et al., 2005). The positive gene labels are derived from online databases such as Gene Ontology (GO; Ashburner et al., 2000), KEGG (Kanehisa and Goto, 2000) and Enzyme Commission (EC; Bairoch, 2000). These databases provide a controlled vocabulary describing categories of gene function and curated lists of genes annotated to these functions.

There are two challenges in constructing function-specific composite networks. First, because many functional categories have only a few annotations, it is difficult to assign network weights without overfitting. Second, for an algorithm to be widely applicable, it must be fast and scalable enough to combine dozens of networks, each with over 10 000 nodes (genes).

Here, we investigate a number of network weighting schemes that avoid overfitting. In particular, we propose a new approach that we refer to as Simultaneous Weights (SW). SW is based on our previous algorithm, GeneMANIA (Mostafavi et al., 2008), which constructs a function-specific composite network by solving a constrained linear regression problem. However, instead of assigning function-specific network weights, we simultaneously optimize the weights on a group of related function categories by solving a single constrained linear regression problem. We evaluate the impact of several regularization schemes, such as LASSO (Tibshirani, 1996), elastic net (Zou and Hastie, 2005) and ridge regularization, on our previous weighting scheme (Mostafavi et al., 2008) and on SW. Compared with other state-of-the-art methods in gene function prediction, SW results in a substantial improvement in performance while reducing the computation time of gene function prediction on five example species (yeast, fly, mouse, human and Escherichia coli).

2 RELATED WORK

A large number of algorithms extend simple guilt-by-association to predict gene function from a single network (e.g. Karaoz et al., 2003; Nabieva et al., 2005; Vazquez et al., 2003). The approaches closest to those presented in this article are methods for integrating multiple functional association networks into one composite network with the goal of predicting gene function from the composite network. In the seminal work of Marcotte et al. (1999), a composite network is constructed with an edge between two genes if the two genes are linked together in the majority of the underlying functional association networks. Similarly, in Pavlidis et al. (2002), a composite network is constructed as an unweighted sum of several functional association networks, each derived from a different data source. More recently, in Lanckriet et al. (2004), Tsuda et al. (2005) and Mostafavi et al. (2008), function-specific composite networks are constructed as a weighted sum in which the weight of each network is determined based on the function being predicted. In Lanckriet et al. (2004) and Tsuda et al. (2005), the network weights are assigned to optimize the performance of a support vector machine (SVM) and of Gaussian random fields (GRFs), respectively, which use the composite network to predict gene function. In Mostafavi et al. (2008), we use linear regression to optimize an objective function inspired by the kernel target alignment (Cristianini et al., 2002) between the composite network and the class labels. Another method for combining multiple association networks was presented in Myers and Troyanskaya (2007), where a combined network was constructed using a naive Bayes classifier.

The new approach that we present here, SW, extends the GeneMANIA algorithm (Mostafavi et al., 2008), which was previously shown to have state-of-the-art performance on yeast and mouse benchmark datasets (Mostafavi et al., 2008; Pena-Castillo et al., 2008). However, achieving good performance with the GeneMANIA algorithm on categories with a small number of annotations required a time-consuming regularization procedure. Here, we investigate how to improve performance for function categories with few annotations without increasing computation time.

3 ALGORITHM

Following the framework of Mostafavi et al. (2008), our approach for predicting gene function from multiple networks consists of two steps: (i) it constructs a composite network from multiple functional association networks and (ii) it predicts gene function from a single composite network. Below, we first review the constrained linear regression problem solved by the GeneMANIA algorithm for assigning network weights; next we describe SW, our new approach for assigning network weights using related categories of gene function. Finally, we briefly review how gene function is predicted from a single composite network.

3.1 Combining networks with linear regression

We assume that we are given as input m networks, indexed by d, with W_d ∈ R^{n×n}, W_d = W_d^T and (i, j)-th element w_ij^d ≥ 0 for all i and j. We interpret w_ij^d as the strength of the evidence of co-functionality between genes i and j derived from dataset d. Using annotation databases such as GO, for each GO term that describes a given category of gene function c, positive genes are defined as genes annotated to c and we consider all other genes as negatives: that is, we define a label vector y ∈ {+1, −1}^n, where positive and negative genes are labeled +1 and −1, respectively. Our goal is to construct a composite network as a weighted sum of the m networks, W* = ∑_{d=1}^{m} μ_d W_d, where μ_d is the weight assigned to network d, such that W* can be used to predict other positive genes.
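As a minimal illustration (not the GeneMANIA implementation), the composite network W* = ∑_d μ_d W_d is simply a weighted sum of matrices; the function and argument names below are assumptions.

    import numpy as np

    def composite_network(networks, mu):
        """Return W* = sum_d mu_d * W_d for a list of symmetric,
        non-negative (n, n) association networks and weights mu."""
        w_star = np.zeros_like(networks[0], dtype=float)
        for w_d, mu_d in zip(networks, mu):
            w_star += mu_d * w_d
        return w_star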

To assign the network weights, GeneMANIA solves a constrained linear regression problem by minimizing the least squares error between the composite network and the target network T which represents the pairwise functional relationships implied by the label vector:

    min_{μ ≥ 0}  ‖ T − ∑_{d=1}^{m} μ_d W_d ‖_F²                                (1)

where the target network T has elements T_ij taking one of two values: a positive value when genes i and j are both positive, and a negative value when genes i and j have opposite signs; both values are functions of n_+ and n_−, the numbers of positives and negatives in y. Since the negative–negative pairs of genes typically do not form a coherent class, it is more appropriate to solve a one-class problem: in GeneMANIA this is addressed by removing the entries in T and in each network W_d that correspond to negative pairs of genes (i.e. T_ij and w_ij^d with i and j both negative).

The non-negativity constraint in Equation (1) ensures that the Laplacian matrix L derived from W*, L = D − W* (where D is a diagonal matrix of the row sums of W*, i.e. D_ii = ∑_{j=1}^{n} w*_ij), is positive semi-definite. As we will show later, we need this condition to use W* for making predictions about gene function.

By using the fact that trace(WT) = vec(W)^T vec(T), where vec(W) is the operator that stacks the columns of matrix W on top of one another, we can write (1) as a non-negative linear regression problem:

    min_{μ ≥ 0}  ‖ t − Ω μ ‖²                                                  (2)

where t = vec(T), Ω = [vec(W_1),…, vec(W_m)] and μ = [μ_1,…, μ_m]^T. In practice, we include a column of ones in Ω and calculate a bias μ_0 that we discard when constructing W*. Unlike the other values of μ_d, μ_0 is not constrained to be non-negative. Solving Equation (2) requires at most m iterations (though in practice the number of iterations is much smaller); each iteration involves solving a system of linear equations with m variables and a matrix–vector product. As m (the number of networks) is typically smaller than 100, we can compute the network weights very quickly (e.g. in seconds on a standard computer).
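A minimal sketch of Equation (2) using non-negative least squares is given below. It is not the GeneMANIA implementation: for brevity it keeps all gene pairs (rather than dropping negative–negative pairs) and constrains the bias μ_0 to be non-negative as well, which the paper does not.

    import numpy as np
    from scipy.optimize import nnls

    def fit_network_weights(networks, t):
        """Solve Equation (2): min_{mu >= 0} ||t - Omega mu||^2.

        networks: list of m (n, n) association matrices.
        t:        target vector vec(T) of length n*n.
        """
        # Omega = [1, vec(W_1), ..., vec(W_m)]; the leading column of ones
        # plays the role of the bias mu_0 (constrained >= 0 here for simplicity).
        omega = np.column_stack([np.ones(t.size)] +
                                [w.reshape(-1, order='F') for w in networks])
        mu, _ = nnls(omega, t)
        return mu[1:]  # discard the bias when constructing W*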

3.2 Combining networks with SWs

Although the above approach is fast, it often performs poorly for categories that have a small number of annotations (as discussed in Section 5). In Mostafavi et al. (2008), it was shown that ℓ2-norm regularization (also known as ridge regression) toward a mean weight prior improves performance in such categories (Section 5.1.1). However, estimating this prior requires solving several regression problems. Here, we describe a simple modification that improves performance without increasing the computation time, and show that it performs better than previous approaches. In particular, in SW, instead of assigning network weights for each category separately, we fit the network weights to a set of related function categories. To do so, we assign the network weights by solving the following problem:

    min_{μ ≥ 0}  ∑_{c=1}^{h} ‖ t_c − Ω_c μ ‖²
where t_c for c = 1,…, h are constructed from h positively labeled gene sets (categories) that are related to each other, and Ω_c denotes Ω with the entries corresponding to negative–negative pairs in category c removed. Once we obtain the fitted weight vector, we construct W* and use it to predict all h categories.

If we include all entries of Ω (i.e. we do not exclude the negative–negative pairs of genes for each category c as described above), we can then write the above problem as:

    min_{μ ≥ 0}  ‖ t̄ − Ω μ ‖²
where t̄ = (1/h) ∑_{c=1}^{h} t_c is the average of the h target vectors, and so we only need to solve the regression problem once to obtain the SWs. For each category, T_ij^c takes on one of three possible values, depending on whether genes i and j are both negative, both positive, or have opposite signs; these values are functions of n_+^c (n_−^c), the number of positives (negatives) in category c. As we will show, including the negative–negative pairs of genes in t_c and Ω does not degrade the performance of the constructed composite network.

In our experiments, computing the SWs takes <5 s with h = 1000 on a standard computer (2.4 GHz Intel Core 2 Duo, 4 GB RAM). Further, for a given set of networks, we can always precompute Ω^T Ω, and thus only need to calculate Ω^T t̄ for a given group of categories of interest. As we will show in Section 5, combining networks with SWs improves the performance of the composite networks in predicting the relevant h gene categories while reducing the computation time (as we only need to solve for the network weights once when predicting h categories).
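Because Ω is shared across the h categories, the summed objective is minimized by regressing on the average target vector, which the sketch below exploits; the data layout is an assumption rather than the released code.

    import numpy as np
    from scipy.optimize import nnls

    def fit_simultaneous_weights(networks, targets):
        """Fit one weight vector to h related categories at once.

        targets: list of h target vectors t_c, each of length n*n.
        Minimizing sum_c ||t_c - Omega mu||^2 over mu >= 0 is equivalent
        to regressing the mean target t_bar on the shared Omega.
        """
        t_bar = np.mean(np.column_stack(targets), axis=1)
        omega = np.column_stack([np.ones(t_bar.size)] +
                                [w.reshape(-1, order='F') for w in networks])
        mu, _ = nnls(omega, t_bar)
        return mu[1:]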

3.3 Predicting protein function from a single network

We evaluate a composite network, W*, by its ability to predict a given gene function. As in Mostafavi et al. (2008), we use the GRF algorithm (Zhou et al., 2004; Zhu et al., 2003) to predict gene function from a single composite network. In particular, given a label vector y = [y_1,…, y_n]^T, where y_i represents the prior evidence for gene i having the function of interest, the GRF algorithm assigns a discriminant score f_i ∈ [−1, 1] to each node (gene) i in the network, which we can then threshold to classify the genes. Specifically, y_i ∈ {−1, k, +1}: known negative and positive genes are assigned −1 and +1, respectively, and the unlabeled genes (i.e. the possibility set) are assigned a value −1 ≤ k ≤ +1; for example, k can be adjusted based on a gene's annotations in GO (Mostafavi and Morris, 2009).
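For illustration only, the label vector can be assembled as below; the gene indexing and the default bias are assumptions.

    import numpy as np

    def grf_labels(n_genes, positives, negatives, k=0.0):
        """Label vector y: +1 for known positives, -1 for known negatives,
        and the bias value k for unlabeled genes (the possibility set)."""
        y = np.full(n_genes, float(k))
        y[np.asarray(list(positives), dtype=int)] = 1.0
        y[np.asarray(list(negatives), dtype=int)] = -1.0
        return y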

We can write the GRFs algorithm in the following general form:

    f = argmin_f  (f − y)^T Σ (f − y) + f^T L f                                 (3)

where σ = [σ_1,…, σ_n]^T are model parameters, Σ is a diagonal matrix with Σ_ii = σ_i, L = D − W* is the graph Laplacian and D is a diagonal matrix with D_ii = ∑_j w*_ij. The above objective ensures that the discriminant scores remain close to their initial labels [first term in (3)] and that the discriminant scores of genes likely to share a function (as indicated by a high w*_ij) are similar to each other [second term in (3)]. As in Mostafavi et al. (2008), we set the initial label of the unlabeled genes, k, to the mean of the labels of the labeled nodes, (n_+ − n_−)/(n_+ + n_−); this modification results in a considerable performance improvement in unbalanced classification problems such as gene function prediction.

Setting σ_i > 0 ensures that Σ + L is invertible because Σ + L is diagonally dominant; in our experiments, we set Σ = I (the identity matrix). To solve for f, we only need to solve the linear system of equations M f = y, where M = (I + L), which we can do with various existing fast iterative solvers (Nocedal and Wright, 2006). We use conjugate gradient (CG). CG is guaranteed to terminate in n steps, and the most time-consuming operation at each step is a matrix–vector product with computational complexity proportional to the number of non-zero elements in L; in our setting, L is very sparse, with O(n) non-zero elements, and CG terminates in fewer than 20 iterations.
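The propagation step can be sketched with SciPy's sparse conjugate gradient solver as below (an illustration with Σ = I, not the authors' code).

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    def grf_discriminants(w_star, y):
        """Solve (I + L) f = y, where L = D - W* is the graph Laplacian
        of the composite network; f holds the discriminant scores."""
        w_star = sp.csr_matrix(w_star)
        degrees = np.asarray(w_star.sum(axis=1)).ravel()
        laplacian = sp.diags(degrees) - w_star
        system = sp.identity(w_star.shape[0], format='csr') + laplacian
        f, info = cg(system, np.asarray(y, dtype=float))  # fast on sparse L
        if info != 0:
            raise RuntimeError("conjugate gradient did not converge")
        return f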

4 METHODS

In this section, we describe our benchmark datasets, how we construct functional association networks, our evaluation criterion and how we group function categories in SW (see the Supplementary Material for more detailed information).

4.1 Yeast, fly, mouse, human and E.coli datasets

We evaluate our methodology on benchmark networks in five species: yeast, fly, mouse, human and E.coli. For yeast, we constructed 44 networks that include interactions derived from gene expression, protein and genetic interactions [downloaded from BioGRID (Stark et al., 2006)] and protein localization. For mouse, we use the MouseFunc benchmark (Pena-Castillo et al., 2008), which consists of 10 networks and covers 21 603 mouse genes. For fly, we constructed 38 networks from various gene expression data [downloaded from GEO (Edgar et al., 2002)], protein interactions (downloaded from BioGRID) and domain composition [downloaded from BioMart (Kasprzyk et al., 2004)], covering 13 562 fly genes. For E.coli, we use seven networks from Hu et al. (2009) that include co-inheritance and protein interactions for 4175 E.coli genes. Similarly, our human benchmark consists of eight networks constructed from various gene expression, protein interaction, domain composition and phenotype data, and covers 13 281 human genes obtained from HPRD (Prasad et al., 2006).

4.2 Functional association networks

We construct networks from each profile-based high-throughput data source using the PCC. For network-based data (e.g. protein interaction), we use both a direct interaction network and a correlation-based network obtained by applying the PCC to the frequency-corrected data [as done in Mostafavi et al. (2008)]. For efficiency, we sparsify our correlation-based networks by keeping the top K interactions for each gene and setting the rest to zero (see the Supplementary Material for more details). We then normalize each network as W_d ← D_d^{−1/2} W_d D_d^{−1/2}, where D_d is the diagonal row-sum matrix of W_d. Similarly, we also normalize the combined network W*.
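Assuming the normalization is the symmetric form W ← D^{-1/2} W D^{-1/2} (a standard choice in this setting), a sketch is:

    import numpy as np

    def normalize_network(w):
        """Symmetric normalization W <- D^{-1/2} W D^{-1/2}, where D is the
        diagonal matrix of row sums of W; zero-degree genes are left untouched."""
        degrees = w.sum(axis=1)
        inv_sqrt = np.where(degrees > 0,
                            1.0 / np.sqrt(np.maximum(degrees, 1e-12)), 0.0)
        return w * inv_sqrt[:, None] * inv_sqrt[None, :]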

4.3 Evaluation

To evaluate gene function prediction, we use the GO biological process (BP) function categories (Ashburner et al., 2000) for Saccharomyces cerevisiae (June 2006), Mus musculus [downloaded from MouseFunc data (Pena-Castillo et al., 2008)], Drosophila melanogaster (July 2009), Homo sapiens (July 2009) and E.coli (April 2010). Following common practice, we have removed Inferred from Electronic Annotation (IEA) annotations. These annotations, which constitute the majority of GO annotations, are not reviewed by a curator and, as such, are believed to be less accurate. Furthermore, doing so helps us to avoid circularity because IEAs are themselves computationally predicted using some of the data that we make available to our algorithms.

We evaluate each method's composite networks by using them as input to the GRF algorithm (the second step in GeneMANIA). We report the performance in terms of both average area under the receiver operating characteristic (ROC) curve (AUC of ROC) and average precision at 10% recall over all BP GO categories with 3–300 annotations using 3-fold cross-validation (CV). We focus on BP categories because they make up the majority of functions in the GO hierarchy. Our results for cellular component (CC) and molecular function (MF) categories are similar and are described in the Supplementary Material.
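For reference, the evaluation protocol (AUC of ROC and precision at 10% recall under 3-fold CV) can be sketched as follows; scikit-learn is assumed, and score_fn is a hypothetical callback that hides the held-out labels and returns GRF discriminant scores for the test genes.

    import numpy as np
    from sklearn.metrics import roc_auc_score, precision_recall_curve
    from sklearn.model_selection import StratifiedKFold

    def evaluate_category(score_fn, labels, n_splits=3, recall_cutoff=0.10):
        """Return mean AUC of ROC and mean precision at `recall_cutoff` recall."""
        aucs, precs = [], []
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in folds.split(np.zeros(labels.size), labels):
            scores = score_fn(train_idx, test_idx)   # discriminants for test genes
            y_true = labels[test_idx]
            aucs.append(roc_auc_score(y_true, scores))
            precision, recall, _ = precision_recall_curve(y_true, scores)
            precs.append(precision[recall >= recall_cutoff].max())
        return float(np.mean(aucs)), float(np.mean(precs))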

4.4 Grouping GO categories for simultaneous weights

We examined several methods for grouping GO categories when assigning SWs, including grouping by (i) GO hierarchy (i.e. BP, CC and MF), (ii) GO hierarchy and number of annotations, (iii) clustering of GO categories based on annotations and (iv) ancestor and descendant terms, with ancestors having a maximum category size (300 annotations). We only report results for grouping GO categories by GO hierarchy and number of annotations (e.g. all categories with fewer than 300 annotations), as we found it to have the best performance (Supplementary Material).

5 RESULTS

Here, we first evaluate SW and compare its performance to several other approaches: various regularized linear regression methods, the TSS algorithm and a simpler correlation-based method (described below), using the yeast benchmark networks. We then show analogous results using mouse, fly, human and E.coli benchmark data.

5.1 Performance on yeast networks

5.1.1 Comparison of performance of SW with various function-specific linear regression methods

We first extensively compare the performance of SW in predicting gene function in yeast with that of GeneMANIA. In particular, as discussed in Section 4, one way to improve the performance of the function-specific constrained linear regression in GeneMANIA is to use regularization; in fact, Mostafavi et al. (2008) showed that ridge regression (i.e. ℓ2-norm regularization) toward a mean weight prior, where the mean weights refer to the average weight assigned to each network over a large number of function predictions, considerably improves performance, with the drawback of increasing the computation time needed to estimate the mean weights. Here, we investigate the effect of several forms of regularization on the performance of the GeneMANIA algorithm, where we find the network weights by solving the following problem:

    min_{μ ≥ 0}  ‖ t − Ω μ ‖² + J(μ)
where J(μ) ≥ 0 is the regularization function. In particular, we investigated the performance of four different regularizers: (i) ridge with uniform prior, (ii) ridge with mean prior, (iii) LASSO and (iv) elastic net. In LASSO (Tibshirani, 1996), J(μ) = α_1 ∑_{d=1}^{m} |μ_d|, whereas in standard ridge regression J(μ) = α_2 ∑_{d=1}^{m} μ_d², where α_1 and α_2 are regularization constants that determine the strength of the regularization. The elastic net regularization (Zou and Hastie, 2005) combines the ℓ1- and ℓ2-norm penalties: J(μ) = α_1 ∑_{d=1}^{m} |μ_d| + α_2 ∑_{d=1}^{m} μ_d². In Zou and Hastie (2005), it was shown that the elastic net yields a sparse solution and often performs better than the LASSO.

For ridge with a prior, we define J(μ) = ∑_{d=1}^{m} s_d (μ_d − v_d)², where v = [v_1,…, v_m]^T is a prior weight vector and s_d determines the strength of the regularization on μ_d. In Mostafavi et al. (2008), the mean weight prior was obtained as the average weight assigned to each network (using unregularized regression) in predicting all categories in the same GO hierarchy (we will refer to this method as ridge with mean prior). In addition, if we set v_d = 1, the network weights are shrunk toward a uniform value; we call this second method ridge with uniform prior. In our experiments, we set s_d = 1/∑_ij w_ij^d; thus, the strength of the regularizer is higher for sparser networks.
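Ridge regression toward a prior can be folded into the same non-negative least-squares machinery by augmenting the design matrix, as in the sketch below (an assumed implementation, not the authors'); each appended row penalizes the deviation of μ_d from its prior v_d with strength s_d, and the bias column is omitted for brevity.

    import numpy as np
    from scipy.optimize import nnls

    def ridge_prior_weights(omega, t, prior, strength):
        """min_{mu >= 0} ||t - Omega mu||^2 + sum_d s_d (mu_d - v_d)^2,
        solved by appending sqrt(s_d)-scaled rows to the least-squares system."""
        s_sqrt = np.sqrt(np.asarray(strength, dtype=float))
        omega_aug = np.vstack([omega, np.diag(s_sqrt)])
        t_aug = np.concatenate([t, s_sqrt * np.asarray(prior, dtype=float)])
        mu, _ = nnls(omega_aug, t_aug)
        return mu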

Figure 1a summarizes the performance of each method in five evaluation categories: predicting gene functions that have [3–10], [11–30], [31–100] or [101–300] positive annotations, and [3–300] (i.e. overall) positive annotations. In ridge with mean prior, we set the prior on each network's weight to the average weight that network received in predicting all 1188 GO BP categories with 3–300 annotations. We used the LARS algorithm (Efron et al., 2004) to solve for the LASSO and elastic net solutions; we set the number of positive coefficients using F-statistics (Hastie et al., 2001). For the elastic net, we set α_2 = 1e−6 using CV.1 For SW, we used all 1188 BP GO categories to fit the network weights. In Uniform, the network weights are all set to 1/m, where m is the number of networks.

Fig. 1.
(a) Comparison of performance of LASSO, elastic net (ElasticNet), unregularized linear regression (Unregularized), ridge with uniform prior [Ridge (Uniform)], ridge with mean prior [Ridge (mean)], SW and a network combination with uniform weights (Uniform) ...

Figure 1a shows that SW significantly outperforms ridge regression with mean prior overall in terms of ROC (P = 4.368 × 10⁻²³, Wilcoxon signed rank test) and slightly improves performance in terms of precision (P = 0.0437, Wilcoxon signed rank test), with the advantage that it only requires solving one linear regression problem to predict all 1188 GO categories (instead of 1188 problems for the function-specific network weighting methods). In addition, this figure shows that unregularized linear regression performs as well as or better than LASSO, ridge or elastic net regularization, whereas ridge with a prior performs better in all evaluation categories. However, as expected, the performance of unregularized regression improves with an increasing number of positives, and thus function-specific weighting is more appropriate in such instances.

One explanation for the trend observed in Figure 1a is that regularization methods that shrink the network weights toward zero are too selective and often identify only a few relevant networks. For example, on average 45% (20/44), 54% (24/44), 95% (42/44) and 97% (43/44) of the networks are assigned a non-zero weight by LASSO, unregularized linear regression, ridge with mean prior and SW, respectively (see Supplementary Fig. S1). Note that the best performing individual networks are significantly worse on their own than the combined data (Fig. 1b). SW thus provides a better measure of network relevance: with the currently available genomics and proteomics datasets, one integrated composite functional association network can accurately predict a broad range of functional relationships.

5.1.2 Comparison of SW with TSS and correlation-based network weights

We also compared the performance of SW with two other methods: the TSS algorithm (Tsuda et al., 2005) and a simpler correlation-based network weighting method (Fig. 2). In correlation-based weighting, each network is assigned a weight inspired by the kernel target alignment score: we set μ_d proportional to vec(W_d)^T vec(T). Unlike the linear regression methods, correlation-based weighting does not account for redundancy between the networks. The TSS algorithm (Tsuda et al., 2005) assigns the network weights by optimizing the performance of the GRF algorithm with the resulting composite network. In our experiments, we set the regularization parameters of the TSS algorithm by CV to c_0 = 0.5 and c = 1. As done in the code provided by Tsuda et al. (2005), we also set a lower bound of 0.01 on μ_d; we note that removing this lower bound decreases the performance of the TSS algorithm. As shown in Figure 2, SW significantly outperforms correlation-based network weights and TSS in all evaluation categories.
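For comparison, the correlation-based weighting can be sketched as below; treating each weight as proportional to the alignment vec(W_d)^T vec(T), clipped at zero, is an assumption about the exact normalization.

    import numpy as np

    def correlation_weights(networks, t):
        """Weight each network by its alignment with the target vector;
        redundancy between networks is ignored."""
        mu = np.array([w.reshape(-1, order='F') @ t for w in networks])
        return np.maximum(mu, 0.0)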

Fig. 2.
Performance of SW, TSS and correlation in predicting gene function in yeast according to BP categories.

To further understand the differences between these approaches, we compared the network weights assigned to individual networks. As shown in Figure 3, the TSS algorithm tends to be very selective, often assigning large weights to a few networks and a very low weight (the lower bound) to the rest. The correlation weights are similar to the linear regression weights; however, redundancy between the networks is not accounted for. For example, protein interaction networks (shown in cyan and green) drawn from separate publications tend to contain similar information, and the average weight assigned to these networks by correlation weighting is higher than that assigned by linear regression. As expected, the mean weight assigned by linear regression to an individual network is similar to the SW weight for that network. In general, consistent with previous studies (Marcotte et al., 1999), we observed that all methods assign a high proportion of the network weight to networks derived from gene expression datasets and to the protein localization dataset.

Fig. 3.
Each colored bar represents the average weight assigned to each network while predicting 1188 gene functions. Networks are divided into four types (i) co-localization (network 1), (ii) gene expressions (networks 2–7), (iii) protein interaction ...

5.2 Performance on fly, mouse, human and E.coli benchmarks

We also investigated the performance of unregularized linear regression, SW and uniform network weights on the fly, mouse, human and E.coli networks for all GO categories with between 3 and 100 annotations (2195 for fly, 1626 for mouse, 1952 for human and 809 for E.coli). Figure 4 summarizes the performance in terms of AUC of the ROC curve and precision at 10% recall for the four species. As shown, SW is significantly better than unregularized linear regression in the overall category in fly, mouse, human and E.coli in terms of AUC of ROC. As well, SW is significantly better than uniform weights and unregularized linear regression in terms of precision in fly and human, and in mouse SW significantly outperforms unregularized linear regression in terms of precision. We note that the human networks are sparser than those of the other organisms, which makes it harder to assign accurate network weights (the mean number of interactions is 391 240 in the human networks compared with 1 011 400 in mouse); this may explain the smaller (but significant) improvement of SW over uniform weights. We also note that the performance of uniform weights tends to degrade as the number of networks increases, because of the abundance of gene expression datasets and thus of co-expression networks; for example, 32 of the 38 fly networks are co-expression networks. By not accounting for this redundancy, uniform weighting performs significantly worse than SW.

Fig. 4.
Comparison of performance of unregularized linear regression (Unreg), SW and a fixed uniform combination of networks in predicting gene function in fly (a and e), mouse (b and f), human (c and g) and E.coli (d and h). The bars show average performance ...

6 CONCLUSION

We have introduced a new network weighting scheme for combining multiple networks derived from genomic and proteomic data in order to construct a composite network that is predictive of gene function. We have shown that by fitting network weights simultaneously on a group of functions from the same branch of GO, we greatly improve prediction performance, and that we can obtain these SWs by solving a single constrained linear regression problem. In our experiments, the SW method results in a significant improvement in predicting gene function in yeast, mouse and fly. In human and E.coli, SW performs only slightly better than a uniform network combination; this is because these networks tend to be sparser than those of the other species, making it harder to assign accurate network weights.

In our experiments, we have observed that adding a small amount of ridge regularization to SW results in a slight performance improvement; the regularization parameter can be set using CV, and alternatively we have observed good performance by setting it to 0.1% of the total number of observations (i.e. 0.001 times the length of t̄). Our results show that fitting the SWs to GO categories in the same hierarchy with a broad range of specificities (those with [3–300] annotations) outperforms more specific groupings of the GO categories. Note that, because we adjust each target vector t_c to balance the number of positives and negatives in category c, the larger GO categories contribute more to the overall target vector t̄; on the other hand, there are many more categories with [3–10] annotations.

In summary, we have demonstrated the feasibility and utility of constructing a single composite network with SWs for predicting a wide range of GO categories. Unlike a fixed network combination with uniform weights, SWs account for noisy and redundant networks, and constructing a single composite network in turn speeds up gene function prediction from multiple networks.

Supplementary Material

Supplementary data are available at Bioinformatics online.

ACKNOWLEDGMENTS

We would like to thank Khalid Zuberi, Quentin Shao and Javier Diaz for their help with collecting the networks.

Funding: Ontario Graduate Scholarship (partially funded to S.M.), Genome Canada Technology Development Project administered through Ontario Genomics Institute and a Natural Science and Engineering Operating grant (to Q.M.).

Conflict of Interest: none declared.

Footnotes

1We picked α2 from the set [1e-8, 1e-6, 1e-4, 1e-2, 1e-1, 1] by examining the mean ROC using 3-fold CV.

REFERENCES

  • Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29.
  • Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–305.
  • Cristianini N, et al. On kernel target alignment. In: Proceedings of the Fourteenth Conference on Advances in Neural Information Processing Systems. Vancouver, BC, Canada; 2002. pp. 367–373.
  • Edgar R, et al. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–210.
  • Efron B, et al. Least angle regression. Ann. Stat. 2004;32:407–499.
  • Hastie T, et al. The Elements of Statistical Learning. NY, USA: Springer; 2001.
  • Hu P, et al. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 2009;7:e96.
  • Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30.
  • Karaoz U, et al. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl Acad. Sci. USA. 2003;101:2888–2893.
  • Kasprzyk A, et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004;14:160–169.
  • Kondor R, Lafferty J. Diffusion kernels on graphs and other discrete structures. In: Proceedings of the International Conference on Machine Learning (ICML). 2002. pp. 463–475.
  • Lanckriet G, et al. A statistical framework for genomic data fusion. Bioinformatics. 2004;20:2626–2635.
  • Marcotte E, et al. A combined algorithm for genome-wide prediction of protein function. Nature. 1999;402:83–86.
  • Mostafavi S, Morris Q. Using the Gene Ontology hierarchy when predicting gene function. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. Montreal, QC, Canada; 2009.
  • Mostafavi S, et al. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl. 1):S4.
  • Myers C, Troyanskaya O. Context-sensitive data integration and prediction of biological networks. Bioinformatics. 2007;23:2322–2330.
  • Nabieva E, et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005;21(Suppl. 1).
  • Noble W, Ben-Hur A. Integrating information for protein function prediction. In: Lengauer T, editor. Bioinformatics—From Genomes to Therapies. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA; 2007.
  • Nocedal J, Wright S. Numerical Optimization. NY, USA: Springer; 2006.
  • Pavlidis P, et al. Learning gene functional classification from multiple data types. J. Comput. Biol. 2002;9:401–411.
  • Pena-Castillo L, et al. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol. 2008;9(Suppl. 1):S2.
  • Prasad TK, et al. Human Protein Reference Database—2006 update. Nucleic Acids Res. 2006;34:D411–D414.
  • Qi Y, et al. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome Res. 2008;18:1991–2004.
  • Stark C, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–D539.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B. 1996;58:267–288.
  • Tsuda K, et al. Fast protein classification with multiple networks. Bioinformatics. 2005;21(Suppl. 2):ii59–ii65.
  • Vazquez A, et al. Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol. 2003;21:697–700.
  • Zhou D, et al. Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 2004;16:321–328.
  • Zhu X, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, USA; 2003.
  • Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B. 2005;67:301–320.
