Copyright © 2005, Cold Spring Harbor Laboratory Press

Assessing the limits of genomic data integration for predicting protein networks

1 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
2 Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
3 Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
4 Corresponding author. E-mail Mark.Gerstein@yale.edu; fax (360) 838-7861.

Received December 22, 2004; Accepted May 2, 2005.

Abstract

Genomic data integration, the process of statistically combining diverse sources of information from functional genomics experiments to make large-scale predictions, is becoming increasingly prevalent. One might expect that this process should become progressively more powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past: the prediction of protein–protein interactions in yeast. We start by using a simple Naive Bayes classifier for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. We expand the number of features considered for prediction to 16, significantly more than in previous studies. Overall, we observe a small but measurable improvement in prediction performance over previous benchmarks that were based on four strong features. This allows us to identify new yeast interactions with high confidence. It also allows us to quantitatively assess the inter-relations amongst different genomic features. It is known that subtle correlations and dependencies between features can confound the strength of interaction predictions.
We investigate this issue in detail through calculating mutual information. To our surprise, we find no appreciable statistical dependence between the many possible pairs of features. We further explore feature dependencies by comparing the performance of our simple Naive Bayes classifier with a boosted version of the same classifier, which is fairly resistant to feature dependence. We find that boosting does not improve performance, indicating that, at least for prediction purposes, our genomic features are essentially independent. In summary, by integrating a few (i.e., four) good features, we approach the maximal predictive power of current genomic data integration; moreover, this limitation does not reflect (potentially removable) inter-relationships between the features.

A major challenge in post-genomic biology is systematically mapping the interactome, the set of all protein–protein interactions within an organism. Since proteins carry out their functions by interacting with one another and with other biomolecules, reconstructing the interactome of a cell is the important first step toward understanding protein function and cell behavior (Hartwell et al. 1999; Eisenberg et al. 2000). Recently, several large-scale protein-interaction maps have been experimentally determined in the model organism Saccharomyces cerevisiae (Uetz et al. 2000; Ito et al. 2001; Gavin et al. 2002; Ho et al. 2002). These studies have drastically improved our knowledge of protein interactions. Unfortunately, the data sets generated from these studies are often noisy and incomplete (von Mering et al. 2002).

In addition to experimentally determined interaction data sets, there exists a large amount of biological information in the expanding functional genomic data sets, such as sequence, structure, functional annotation, and expression-level databases.
It is thus desirable to computationally predict protein–protein interactions by exploiting the interaction evidence contained in these data sets. Such predictions can serve as a valuable complement to the current experimental efforts. Several studies have been carried out to search for individual features contained in the genomic data sets that are useful for interaction prediction. For example, two proteins are likely to interact if they have homologs in another genome that are fused into a single protein, or if their mRNA expression patterns are correlated (Marcotte et al. 1999a,b; Ideker et al. 2001; Jansen et al. 2002a). Detailed reviews of these individual methods can be found elsewhere (Valencia and Pazos 2002; Xia et al. 2004). Each genomic feature, by itself, is only a weak predictor of protein interactions. However, predictions can be improved by integrating different genomic features (Marcotte et al. 1999b). There are two main reasons for this. First, predicting a protein–protein interaction with confidence depends on how much evidence supports it. When multiple distinct features all support a predicted interaction, our confidence in the prediction increases. Second, different features may cover different subsets of the interactome, and feature integration can increase the coverage. Feature integration can be accomplished via simple rules, such as intersection, union, or majority vote. To achieve optimal predictive power, however, different genomic features need to be properly integrated into a single probabilistic framework (Gerstein et al. 2002). Many machine learning methods can be used for feature integration, such as Bayesian approaches (Troyanskaya et al. 2001; Jansen et al. 2003; Friedman 2004), decision trees (Lin et al. 2004; Zhang et al. 2004), and support vector machines (Brown et al. 2000). 
In particular, Bayesian approaches can be roughly divided into two broad groups: (1) learning to infer the causal structure of cellular networks from quantitative measurements (Friedman 2004); and (2) classification based on a set of probabilistic rules. Here, we focus on the second, classification aspect of Bayesian approaches. Beyond protein–protein interaction prediction, feature integration is also essential for other prediction problems in genomics, such as localization prediction (Drawid et al. 2000), function prediction (Troyanskaya et al. 2001; Lee et al. 2004), and genetic interaction prediction (Wong et al. 2004).

One might expect genomic data integration to become increasingly powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past: the prediction of protein–protein interactions in yeast.

Previously, we developed a Naive Bayesian classification approach to predict protein–protein interactions in yeast by integrating four genomic features (functional similarity based on MIPS and GO annotations, mRNA expression correlation, and coessentiality) (Jansen et al. 2003). By definition, two proteins interact if they belong to the same complex. The parameters in the Naive Bayes classifier were trained using a collection of protein pairs known to be interacting or noninteracting. The advantages of Naive Bayes classifiers are two-fold. First, the models constructed by Naive Bayes classifiers are readily interpretable; they represent conditional probabilities among features and class labels (interaction vs. noninteraction). Second, Naive Bayes classifiers are very flexible for the highly heterogeneous genomic features.
Numerical features and categorical features can be easily combined, and missing data can be readily handled. In this study, we expand the list of genomic features to include 16 diverse features that are plausible indicators for protein interactions. These 16 features are assembled based on both protein pair features and single protein features, and they are derived from a wide range of physical, genetic, contextual, and evolutionary properties of yeast genes. We believe that such “feature-richness” is an essential property of genomic data sets; therefore, we would like to test whether protein-interaction predictions can be further improved by exploiting the diversity of the features, and if so, by how much. Naive Bayes classifiers assume conditional independence between features (see Methods). In the following text, when we say (in)dependent, we mean conditionally (in)dependent. We would expect that there exists a high dependence between a number of genomic features, and that this would become increasingly likely as we try to integrate more features. In this case, Naive Bayes may no longer be the optimal approach, as the dependence among features needs to be taken into account. In this study, we apply boosting to Naive Bayes classifiers as an automated and efficient way for handling dependent features. Boosting (Schapire 1990)—in particular, AdaBoost (Freund and Schapire 1996)—is a recent development in the field of machine learning. The process combines the performances of several weak classifiers to form strong predictions via a weighted majority vote. In our case, the weak classifiers can be either individual features or simple Naive Bayes classifiers. Boosting approximately finds the best linear combination of all possible weak classifiers via maximum likelihood on a logistic scale (Friedman et al. 2000), thereby solving potential feature redundancy and statistical dependence problems. 
By comparing the performance of a simple Naive Bayes classifier with a boosted Naive Bayes classifier on our collection of features, we will be able to address whether or not the dependence among our collection of features (if any) decreases the Naive Bayes classifier's predictive power. In other words, does the Naive Bayes approach perform sufficiently well at the current level of feature dependence? This comparison will also be done on a set of highly dependent features as a control.

Results and Discussion

A list of features useful for predicting protein interactions

In addition to the four features in Jansen et al. (2003), we consider 12 more features as listed in Figure 1.
Predictive power of individual features

We use ROC curves (see Methods) to illustrate the predictive power of each individual feature.

[Figure 2]
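As a rough illustration of how a single feature can be scored this way, the sketch below builds the ROC points for one feature and measures the area under them. It is a minimal sketch, not the authors' code: the function names `roc_points` and `area_under_curve`, the convention that a higher feature value is stronger evidence of interaction, and the trapezoidal area estimate are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by progressively lowering the threshold.

    labels: 1 for a gold-standard positive pair, 0 for a gold-standard negative pair.
    Assumes (for illustration) that higher scores indicate interaction.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every pair tied at this score before emitting one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Features could then be ranked by `area_under_curve(roc_points(scores, labels))`, or compared at one acceptable FPR by reading the TPR off the corresponding point.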
A good feature, i.e., one with high predictive power, simultaneously has a large number of true positives and a small number of false positives. In this case, the ROC curve climbs rapidly away from the origin (the lower left-hand corner of the graph). How quickly the ROC curve rises away from the origin can be quantified by measuring the area under the curve. The larger the area, the better the feature. Ranking the features by the area they cover in the ROC curves (easily seen in Fig. 3A) ...

Another point we need to pay attention to is that we should not take the performance of a feature against the gold-standard data sets (GSTDs) as indicative of the accuracy or usefulness of the feature in its original context. This is because the performance of a feature against the GSTDs only measures its usefulness in relation to a specific task (i.e., predicting complex membership), which is probably not what the feature was originally designed to do. For example, the multimeric threading method is designed for predicting physical interactions between two proteins. However, because of the way the GSTDs are constructed, the majority of protein pairs in the GSTDs are simply in the same molecular complex without direct contacts. Therefore, when predicting physical interactions, these GSTDs are not a good means of judging the accuracy or usefulness of the multimeric threading method.

Quite often, only the TPR for a specific FPR is valued. For example, COE outperforms MIP until the FPR reaches 5%, even though MIP covers more area over the whole range of FPR. Thus, the features can also be ranked and selected according to the acceptable FPR in prediction.

Feature selection and improvement of performance

Because of the varying quality and predictive power of genomic features, incorporating all features without selection will likely decrease the predictive power by introducing noise rather than improving the results.
Therefore, we select only those new features with high predictive power based on the performance of individual features. Another factor we need to take into account is the coverage of features. It is obvious that there is a distinct difference between the features to the left and right of the divider in Figure 2.

The performance of combining new features is presented in Figure 4A.
Because of the dominant performance of the two functional similarity features (MIP and GOF), the improvement accomplished by incorporating new features may not seem obvious. We thus exclude these two functional features and show the improvement obtained by incorporating three additional features over the remaining two original features (i.e., COE and ESS). Including three additional features shows a significant improvement over the original two features (Fig. 4B).

Another benefit of genomic data integration is the improvement in coverage: two predictors with similar ROC curve performance may cover different parts of the system to varying degrees. Note that coverage here concerns not only the labeled pairs (GSTDs) but also the unlabeled (unseen) pairs. So far, our assessments have been done for labeled pairs only; however, if additional features allow the predictor to have a more extensive view of the system despite no significant improvement in the ROC curve, they should probably be considered beneficial, because in this case the coverage of unlabeled pairs is improved. Here, we find that the coverage is slightly improved by integrating more features. Of all 21,658,071 possible protein pairs (6582 ORFs from MIPS), the four original features cover 18,527,741 pairs (85.5%), whereas the seven most populous features cover 18,880,102 (87.2%).

Correlations and statistical dependence between features

In this section, we investigate whether or not the marginality of the improvement is confounded by correlations and dependencies between features. We first calculate the Pearson correlation coefficients (CCs) between each pair of features. Such correlations between features can often generate useful biological insights. The five highest absolute values are highlighted in bold in Table 1A. None of the feature pairs exhibits significant correlation. In addition, we calculate mutual information between genomic features as an alternative to CCs.
Whereas the CC only measures linear relationships, mutual information is a more general measure of correlation. The results agree with the CCs: the five pairs containing the most mutual information are exactly the same as those with the highest CCs. These correlations between some of the features, albeit not strong, are expected. For example, the correlation between the two functional features (MIP and GOF) is the highest among feature pairs. It is also expected that absolute mRNA expression (EXP) and absolute protein abundance (APA) are somewhat correlated.

We next investigate the conditional dependence between features given the positive or negative GSTD by calculating mutual information. In other words, we calculate the mutual information between pairs of features by taking into account only protein pairs that occur in both features and in either set of GSTDs. The small amount of mutual information, given either set of GSTDs, indicates that the features we integrated by the Naive Bayes classifier are largely conditionally independent (Table 1B).

Simple Naive Bayes classifier vs. boosted Naive Bayes classifier on data sets with or without high dependence

Even though the conditional dependence between our features is not strong, it is possible that the combined weak dependence can still significantly decrease the predictive power of a Naive Bayes classifier. In this section, we address this question by comparing the performance of a simple Naive Bayes classifier (SNB) with that of a boosted Naive Bayes classifier (BNB). Since a BNB is fairly resistant to feature dependence, a significantly worse performance by an SNB on the same data set means that the feature dependence does affect the predictive power of the SNB.

We first conduct a control experiment with highly dependent features to verify the resistance of the BNB to feature dependence. To obtain a highly dependent set of features, we used mRNA expression data from microarray experiments conducted by Cho et al.
(1998) under eight different conditions. Such expression data are highly dependent, as shown by their high CCs: the minimum CC between any pair of conditions is 0.904, and the maximum CC is 0.970. Treating these eight sets of expression data as if they were eight features, we integrate them with the original four features. When evaluated on this highly dependent data set, the BNB significantly outperforms the SNB (Figure 5).

We then compare an SNB with a BNB on our data set, which has only weak conditional dependence: the original four features plus only one, instead of eight, sets of expression data. If the BNB significantly outperformed the SNB, it would indicate that the SNB is affected by feature dependence, even though that dependence is not strong. The results show that the SNB performs as well as the BNB on this weakly dependent data set (Fig. 5). The results in Figure 5 ...

Conclusions

In this study, we quantitatively address the question of how far genomic data integration can be improved by integrating more and more features. We use an SNB for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. By integrating three more strong features, a marginal improvement in both accuracy and coverage can be achieved. The calculations of correlation coefficients, mutual information, and boosting all suggest that the marginality of the improvement in prediction from incorporating more features is unlikely to result from the weak feature dependencies. It is also unlikely to result from an excess of parameters relative to data points (resulting in overfitting), because our Naive Bayes approach involves simple models with only small numbers of free parameters that are fitted against a large number of data points. Rather, this suggests that by integrating a few good features, we approach the maximal predictive power, or limit, of current genomic data integration.
Furthermore, this limitation does not reflect (potentially removable) inter-relationships between the features. Unless we obtain features that are stronger in predictive power than MIP and GOF and simultaneously possess a reasonable coverage, it is unlikely that the prediction will be significantly improved by integrating a few more features. It is also possible that a higher coverage of our examined 16 features may allow better predictive power in the future.

Our discovery that no strong dependence exists between features is an interesting finding in and of itself. Among as many as seven populous features, one might expect some dependence high enough to significantly decrease the SNB's predictive power. However, our calculations of correlation coefficients and mutual information, as well as our boosting results, suggest otherwise. One possibility is that the observed lack of dependence among different features may result from differences in coverage, since all of these data sets are essentially incomplete. Specifically, the overlap of proteins or protein pairs represented among the different features is likely to increase with extended coverage, possibly resulting in higher feature dependence. In this case, the BNB can be used as an alternative solution.

Finally, the SNB is chosen in this study because of its simplicity, as well as the ability to compare with an existing benchmark study using the same technique (Jansen et al. 2003). Furthermore, we use the BNB to specifically address the SNB's well-known limitation relating to high feature dependency. Other machine-learning techniques could potentially have been used in this study. However, most alternative techniques have issues in their own right, such as suffering from missing-value problems or being prohibitively time-consuming. Such problems prevent them from being applied to this problem as readily as an SNB.
In addition, since the BNB does not improve on the SNB for our collection of features, it is probably not the case that the conclusions made here would be significantly different if other machine-learning techniques were used, though, of course, we cannot definitely say this without a comprehensive test.

Methods

Naive Bayesian formalism

Inferring protein–protein interactions from genomic features can be formulated as a classification problem, in which we classify a pair of proteins into two classes ($C_1$ = interact, $C_0$ = not interact), given an n-dimensional vector of genomic features $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ (see footnote 5). The Bayesian Decision Rule states that in order to minimize the average probability of a classification error, one must choose the class with the highest posterior probability, i.e., assign a feature vector $\mathbf{x}$ to the class $C_k$ such that

$$C_k = \arg\max_{C_i} P(C_i \mid \mathbf{x}),$$

where $C_i$ ranges over the set of classes (see, for example, Bishop 1995; Duda et al. 2001). $C_k$ is known as the maximum a posteriori (MAP) estimate. Using Bayes' theorem, the posterior probability can be rewritten as

$$P(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{p(\mathbf{x})}.$$

The idea behind Naive Bayes is to make the simplifying assumption that the attribute values are conditionally independent, given the target values. The computation of each $p(\mathbf{x} \mid C_i)$ is thus made efficient by approximating it as a product of conditional probabilities:

$$p(\mathbf{x} \mid C_i) \approx \prod_{j=1}^{n} p(x_j \mid C_i).$$
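In the case of stochastic independence, the covariance between two features is zero. Thus, the covariance between features is a measure of the deviation from the condition of stochastic independence and is indicative of the amount of approximation introduced by the Naive Bayes assumption. For this reason, the next section shall present an analysis of the covariance between the various features, given the class.

Alternatively, the Bayesian Decision Rule for two classes can be stated as follows: assign $\mathbf{x}$ to $C_1$ if

$$\frac{P(C_1 \mid \mathbf{x})}{P(C_0 \mid \mathbf{x})} = \frac{p(\mathbf{x} \mid C_1)\, P(C_1)}{p(\mathbf{x} \mid C_0)\, P(C_0)} > 1, \tag{2}$$

and to $C_0$ otherwise.

If we then introduce the Naive Bayes approximation, we can rewrite equation 2 as:

$$\frac{P(C_1)}{P(C_0)} \prod_{j=1}^{n} \frac{p(x_j \mid C_1)}{p(x_j \mid C_0)} > 1.$$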
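The two-class Naive Bayes rule can be sketched in code as a product of per-feature likelihood ratios multiplied by the prior odds. This is only an illustrative sketch under stated assumptions: features are taken to be already discretized into bins, a missing value is represented by `None` and simply skipped (one way to realize the statement that missing data can be readily handled), and the `pseudocount` smoothing and the function names `fit_likelihood_ratios` and `posterior_odds` are hypothetical choices that do not come from the text.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence

Row = Sequence[Optional[Hashable]]


def fit_likelihood_ratios(rows: Sequence[Row], labels: Sequence[int],
                          pseudocount: float = 1.0) -> List[Dict[Hashable, float]]:
    """For each feature j, estimate p(x_j = v | C1) / p(x_j = v | C0) per bin v.

    labels: 1 for interacting (C1), 0 for noninteracting (C0) training pairs.
    pseudocount is an assumed smoothing term that avoids division by zero.
    """
    tables: List[Dict[Hashable, float]] = []
    for j in range(len(rows[0])):
        pos = Counter(r[j] for r, c in zip(rows, labels) if c == 1 and r[j] is not None)
        neg = Counter(r[j] for r, c in zip(rows, labels) if c == 0 and r[j] is not None)
        values = set(pos) | set(neg)
        n_pos = sum(pos.values()) + pseudocount * len(values)
        n_neg = sum(neg.values()) + pseudocount * len(values)
        tables.append({
            v: ((pos[v] + pseudocount) / n_pos) / ((neg[v] + pseudocount) / n_neg)
            for v in values
        })
    return tables


def posterior_odds(x: Row, tables: Sequence[Dict[Hashable, float]],
                   prior_odds: float) -> float:
    """Prior odds P(C1)/P(C0) times the product of per-feature likelihood ratios."""
    odds = prior_odds
    for value, table in zip(x, tables):
        # A missing or unseen value contributes a neutral factor of 1.
        if value is not None and value in table:
            odds *= table[value]
    return odds
```

Under the decision rule, a protein pair would be assigned to the interacting class when `posterior_odds(...)` exceeds 1.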
ROC (receiver operating characteristic) curve

In a two-class classification problem, with classes $C_1$ (or positive) and $C_0$ (or negative), there are four possible outcomes for each prediction. The true positives (TP) and the true negatives (TN) are correct classifications. Wrong classifications can be of two types. For a false positive (FP), the outcome is incorrectly predicted as belonging to $C_1$ when in fact it belongs to $C_0$; for a false negative (FN), the outcome is incorrectly predicted as belonging to $C_0$ when it belongs to $C_1$. Our earlier discussion on Naive Bayes was motivated by the goal of minimizing the average probability of a classification error; it was aimed at reducing the total number of wrong predictions, regardless of the type of error that was made. This amounts to saying that we were maximizing the number of correct classifications (TP + TN).

An ROC curve graphically depicts the performance of a classification method for different costs. It consists of a set of points, each computed for a different setting of the cost, connected by lines. For each point, the vertical coordinate is the true positive rate (TPR), given by the ratio of the number of true positives to the total number of positives (i.e., TP/[TP+FN]), while the horizontal coordinate is the false positive rate (FPR), given by the ratio of the number of false positives to the total number of negatives (i.e., FP/[FP+TN]). Note that the TPR is equivalent to the commonly used term sensitivity, while the FPR is equivalent to 1 − specificity. Clearly, the ROC curve for a good classifier will be as close as possible to the upper-left corner of the chart; that is where we have the highest number of true positives and at the same time the smallest number of false positives.

Mutual information

Given two random variables, X and Y (in this study, X and Y are either feature values or class labels), the mutual information I(X; Y) between X and Y measures how much information one variable conveys about the other.
It is defined as the relative entropy (or Kullback-Leibler distance) between the joint distribution and the product distribution of X and Y, that is,

$$I(X; Y) = \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}.$$

Boosting

Boosting is a general method that can be used for improving the performance of any classifier. The idea behind boosting is to combine the outputs of many different "weak" classifiers to produce a powerful "committee." We have used one of the most popular boosting algorithms, AdaBoost (Freund and Schapire 1999), which we shall briefly describe here. For more information on this and other boosting algorithms, refer to Friedman et al. (2000).

AdaBoost consists of sequentially applying a weak classification algorithm to modified versions of the data, producing a sequence of weak classifiers. The predictions from these classifiers are then combined through a weighted majority vote. The data are modified by applying a weight to each of the training observations. At each iteration, a weak learner is trained on the weighted set of data and the weights are updated. This operation is repeated until the desired performance on the training data is achieved. The updating rule for these weights is such that training pairs that were misclassified in the previous step have their weights increased, while those that were correctly classified have their weights decreased. At each iteration, then, training pairs that are more difficult to classify have more influence, and classifiers are forced to focus on pairs overlooked by previous classifiers.

Given a data set of N training pairs $(\mathbf{x}_i, y_i)$, $i = 1 \ldots N$, where $\mathbf{x}_i$ is an input vector of features and $y_i \in \{-1, 1\}$ is the target value representing classes $C_0$ and $C_1$, respectively, let us denote the weight associated with training pair i at time t as $D_t(i)$, and the weak classifier produced at time t as $h_t$. The AdaBoost algorithm to iterate T times is as follows:
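A minimal sketch of the standard AdaBoost iteration in this notation (weights $D_t(i)$, weak classifiers $h_t$, $T$ rounds) is given below. It follows the widely used Freund and Schapire formulation, which is assumed rather than confirmed to match the listing intended here. The weak learner `train_stump` (a single-feature threshold rule) is a hypothetical stand-in chosen to keep the example self-contained; it is not the Naive Bayes weak learner used in this study, and the names `adaboost` and `predict` are likewise illustrative.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Classifier = Callable[[Vector], int]


def train_stump(X: Sequence[Vector], y: Sequence[int], D: Sequence[float]) -> Classifier:
    """Hypothetical weak learner: the single-feature threshold rule with the
    lowest weighted error under the current weights D."""
    best = None  # (weighted error, feature index, threshold, sign)
    for j in range(len(X[0])):
        for threshold in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(d for x, t, d in zip(X, y, D)
                          if (sign if x[j] >= threshold else -sign) != t)
                if best is None or err < best[0]:
                    best = (err, j, threshold, sign)
    _, j, threshold, sign = best
    return lambda x: sign if x[j] >= threshold else -sign


def adaboost(X: Sequence[Vector], y: Sequence[int], T: int,
             train_weak: Callable[..., Classifier] = train_stump
             ) -> List[Tuple[float, Classifier]]:
    """Run up to T boosting rounds; labels y must be -1 or +1."""
    n = len(X)
    D = [1.0 / n] * n  # D_1(i): uniform initial weights
    ensemble: List[Tuple[float, Classifier]] = []
    for _ in range(T):
        h = train_weak(X, y, D)
        eps = sum(d for x, t, d in zip(X, y, D) if h(x) != t)  # weighted error
        if eps >= 0.5:
            break  # the weak learner is no better than chance
        perfect = eps <= 1e-12
        eps = max(eps, 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, h))
        if perfect:
            break
        # Raise the weights of misclassified pairs, lower the others, renormalize.
        D = [d * math.exp(-alpha * t * h(x)) for x, t, d in zip(X, y, D)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble


def predict(ensemble: Sequence[Tuple[float, Classifier]], x: Vector) -> int:
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```

On a toy one-dimensional data set whose positive examples lie on both sides of the negatives, no single threshold rule classifies every pair correctly, but the weighted vote of three rounds does, which illustrates how reweighting forces later classifiers to focus on previously misclassified pairs.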
Training and testing data sets

The details of construction of the training and testing data sets are described in Figure 1.

Acknowledgments

We thank Drs. Ronald Jansen, Valery Trifonov, and Haoxin Lu for stimulating discussions and proofreading of this manuscript. Y.X. is a Fellow of the Jane Coffin Childs Memorial Fund for Medical Research. This work is supported by a grant from NIH/NIGMS for work in the PSI.

Notes

[All genomic feature data used in this study can be downloaded at http://networks.gersteinlab.org/intint/.]

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3610305.

Footnotes

5 Bold letters denote vectors; P(·) denotes probabilities; p(·) denotes probability density functions.