Genome Res. Jul 2005; 15(7): 945–953.
PMCID: PMC1172038

Assessing the limits of genomic data integration for predicting protein networks

Abstract

Genomic data integration—the process of statistically combining diverse sources of information from functional genomics experiments to make large-scale predictions—is becoming increasingly prevalent. One might expect that this process should become progressively more powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past—the prediction of protein–protein interactions in yeast. We start by using a simple Naive Bayes classifier for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. We expand the number of features considered for prediction to 16, significantly more than previous studies. Overall, we observe a small, but measurable improvement in prediction performance over previous benchmarks, based on four strong features. This allows us to identify new yeast interactions with high confidence. It also allows us to quantitatively assess the inter-relations amongst different genomic features. It is known that subtle correlations and dependencies between features can confound the strength of interaction predictions. We investigate this issue in detail through calculating mutual information. To our surprise, we find no appreciable statistical dependence between the many possible pairs of features. We further explore feature dependencies by comparing the performance of our simple Naive Bayes classifier with a boosted version of the same classifier, which is fairly resistant to feature dependence. We find that boosting does not improve performance, indicating that, at least for prediction purposes, our genomic features are essentially independent. In summary, by integrating a few (i.e., four) good features, we approach the maximal predictive power of current genomic data integration; moreover, this limitation does not reflect (potentially removable) inter-relationships between the features.

A major challenge in post-genomic biology is systematically mapping the interactome, the set of all protein–protein interactions within an organism. Since proteins carry out their functions by interacting with one another and with other biomolecules, reconstructing the interactome of a cell is the important first step toward understanding protein function and cell behavior (Hartwell et al. 1999; Eisenberg et al. 2000). Recently, several large-scale protein-interaction maps have been experimentally determined in the model organism Saccharomyces cerevisiae (Uetz et al. 2000; Ito et al. 2001; Gavin et al. 2002; Ho et al. 2002). These studies have drastically improved our knowledge of protein interactions. Unfortunately, the data sets generated from these studies are often noisy and incomplete (von Mering et al. 2002). In addition to experimentally determined interaction data sets, there exists a large amount of biological information in the expanding functional genomic data sets, such as sequence, structure, functional annotation, and expression-level databases. It is thus desirable to computationally predict protein–protein interactions by exploiting the interaction evidence contained in these data sets. Such predictions can serve as a valuable complement to the current experimental efforts. Several studies have been carried out to search for individual features contained in the genomic data sets that are useful for interaction prediction. For example, two proteins are likely to interact if they have homologs in another genome that are fused into a single protein, or if their mRNA expression patterns are correlated (Marcotte et al. 1999a,b; Ideker et al. 2001; Jansen et al. 2002a). Detailed reviews of these individual methods can be found elsewhere (Valencia and Pazos 2002; Xia et al. 2004).

Each genomic feature, by itself, is only a weak predictor of protein interactions. However, predictions can be improved by integrating different genomic features (Marcotte et al. 1999b). There are two main reasons for this. First, predicting a protein–protein interaction with confidence depends on how much evidence supports it. When multiple distinct features all support a predicted interaction, our confidence in the prediction increases. Second, different features may cover different subsets of the interactome, and feature integration can increase the coverage. Feature integration can be accomplished via simple rules, such as intersection, union, or majority vote. To achieve optimal predictive power, however, different genomic features need to be properly integrated into a single probabilistic framework (Gerstein et al. 2002). Many machine learning methods can be used for feature integration, such as Bayesian approaches (Troyanskaya et al. 2001; Jansen et al. 2003; Friedman 2004), decision trees (Lin et al. 2004; Zhang et al. 2004), and support vector machines (Brown et al. 2000). In particular, Bayesian approaches can be roughly divided into two broad groups: (1) learning to infer the causal structure of cellular networks from quantitative measurements (Friedman 2004); and (2) classification based on a set of probabilistic rules. Here, we focus on the second, classification aspect of Bayesian approaches. In addition to protein–protein interaction prediction, feature integration is also essential for other prediction problems in genomics, such as localization prediction (Drawid et al. 2000), function prediction (Troyanskaya et al. 2001; Lee et al. 2004), and genetic interaction prediction (Wong et al. 2004).

One might expect genomic data integration to become increasingly powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past: the prediction of protein–protein interactions in yeast. Previously, we developed a Naive Bayesian classification approach to predict protein–protein interactions in yeast by integrating four genomic features (functional similarity based on MIPS and GO annotations, mRNA expression correlation, and coessentiality) (Jansen et al. 2003). By definition, two proteins interact if they belong to the same complex. The parameters in the Naive Bayes classifier were trained using a collection of protein pairs known to be interacting or noninteracting. The advantages of Naive Bayes classifiers are twofold. First, the models constructed by Naive Bayes classifiers are readily interpretable; they represent conditional probabilities among features and class labels (interaction vs. noninteraction). Second, Naive Bayes classifiers are very flexible in handling highly heterogeneous genomic features. Numerical features and categorical features can be easily combined, and missing data can be readily handled.

In this study, we expand the list of genomic features to include 16 diverse features that are plausible indicators for protein interactions. These 16 features are assembled based on both protein pair features and single protein features, and they are derived from a wide range of physical, genetic, contextual, and evolutionary properties of yeast genes. We believe that such “feature-richness” is an essential property of genomic data sets; therefore, we would like to test whether protein-interaction predictions can be further improved by exploiting the diversity of the features, and if so, by how much.

Naive Bayes classifiers assume conditional independence between features (see Methods). In the following text, when we say (in)dependent, we mean conditionally (in)dependent. One would expect high dependence among some genomic features, and such dependence becomes increasingly likely as more features are integrated. In this case, Naive Bayes may no longer be the optimal approach, as the dependence among features needs to be taken into account.

In this study, we apply boosting to Naive Bayes classifiers as an automated and efficient way of handling dependent features. Boosting (Schapire 1990)—in particular, AdaBoost (Freund and Schapire 1996)—is a recent development in the field of machine learning. The process combines the performances of several weak classifiers to form strong predictions via a weighted majority vote. In our case, the weak classifiers can be either individual features or simple Naive Bayes classifiers. Boosting approximately finds the best linear combination of all possible weak classifiers via maximum likelihood on a logistic scale (Friedman et al. 2000), thereby mitigating potential problems of feature redundancy and statistical dependence. By comparing the performance of a simple Naive Bayes classifier with that of a boosted Naive Bayes classifier on our collection of features, we will be able to address whether the dependence among our collection of features—if any—decreases the Naive Bayes classifier's predictive power. In other words, does the Naive Bayes approach perform sufficiently well at the current level of feature dependence? This comparison will also be done on a set of highly dependent features as a control.

Results and Discussion

A list of features useful for predicting protein interactions

In addition to the four features in Jansen et al. (2003), we consider 12 more features, as listed in Figure 1. These features are divided into four categories; each is assigned a three-character identification code for convenient reference. Also included in Figure 1 are two gold-standard data sets (GSTDs, positive and negative sets) that will be used to evaluate features in subsequent sections. These GSTDs have various degrees of overlap with the 16 features. In Figure 1, we present the four categories of features in descending order of their overlap with the GSTDs (Fig. 2). For each feature, we describe its biological meaning and the rationale for using it; the reference for the data source appears in parentheses after the feature's name.

Figure 1.
Useful genomic features in prediction of protein interactions.
Figure 2.
Overlaps between features and GSTDs. The blank and shaded columns represent the size of the overlaps between the 16 features and the GSTD+ and GSTD–, respectively. The total numbers of protein pairs in the GSTD+ (8,250) and GSTD– (2,708,622) ...

Predictive power of individual features

We use ROC curves (see Methods) to illustrate the predictive power of each individual feature. Figure 2 shows that there is a distinct difference between the features to the left and right of the divider in terms of overlap with the GSTDs (note that Fig. 2 is on a log scale). For this reason, and in the interest of a clear presentation, we plot the ROC curves in two panels, with the seven most populous features in one group and the remaining features in the other (Fig. 3).

Figure 3.
Predictive power of individual features illustrated by ROC curves. We plot ROC curves for individual features in two panels; the seven most populous features in A, and the remaining nine features in B. The acronyms signify the following: (TPR) True positive ...

A good feature, i.e., one with high predictive power, simultaneously has a large number of true positives and a small number of false positives. In this case, the ROC curve climbs rapidly away from the origin (lower left-hand corner of the graph). How quickly the ROC curve rises away from the origin can be quantified by measuring the area under the curve. The larger the area, the better the feature. Ranking the features by the area they cover in the ROC curves (easily seen in Fig. 3A), the best feature in the first group is MIP, followed by GOF, COE, EXP, ESS, MES, and APA. All of these features show strong predictive power (i.e., well above the diagonal). The best feature in the second group is INT, followed by PGP, GNN, REG, ROS, and THR, while SYL shows very little predictive power. EVL and GNC are not shown here because they each have only two overlaps with the positive GSTD and are thus unsuitable for this test. Because of the low coverage of these group-two features, the results in Figure 3B may be misleading without careful interpretation. For example, SYL covers only 887 protein pairs in the GSTDs; it is thus unreliable to estimate its overall predictive power from this 0.04% of the GSTDs, given that its coverage is likely to increase in the future (Fig. 3B).
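To make the area-under-the-curve ranking concrete, the following is a minimal sketch of how per-feature AUCs can be computed and used to order features. The scores, labels, and feature names below are synthetic stand-ins, not the data sets analyzed in this study.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.

    scores: higher values should indicate the positive class; ties are ignored
    (fine for continuous scores). labels: 1 = positive pair, 0 = negative pair.
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Synthetic stand-ins: one informative feature, one uninformative one.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
features = {
    "MIP-like": labels + rng.normal(0, 0.8, size=1000),
    "SYL-like": rng.normal(size=1000),
}
for name in sorted(features, key=lambda k: -roc_auc(features[k], labels)):
    print(f"{name}: AUC = {roc_auc(features[name], labels):.3f}")
```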

Another point to bear in mind is that the performance of a feature against the GSTDs should not be taken as indicative of the accuracy or usefulness of the feature in its original context. This is because the performance of a feature against the GSTDs only measures its usefulness for a specific task—i.e., predicting complex membership—which is probably not what the feature was originally designed to do. For example, the multimeric threading method is designed for predicting physical interactions between two proteins. However, because of the way the GSTDs are constructed, the majority of protein pairs in the GSTDs are simply in the same molecular complex without direct contacts. Therefore, when predicting physical interactions, these GSTDs are not a good means of judging the accuracy or usefulness of the multimeric threading method.

Quite often, only the TPR at a specific FPR is of interest. For example, COE outperforms MIP until the FPR reaches 5%, even though MIP covers more area over the whole range of FPR. Thus, features can also be ranked and selected according to the FPR acceptable for a given prediction task.

Feature selection and improvement of performance

Because of the varying quality and predictive power of genomic features, incorporating all features without selection would likely introduce noise and decrease predictive power rather than improve the results. Therefore, we select only those new features with high predictive power, based on the performance of individual features. Another factor we need to take into account is the coverage of features. There is a distinct difference between the features to the left and right of the divider in Figure 2; each of the first seven features covers at least half a million (~20%) ORF pairs in the GSTDs, while the next most populous feature (REG) covers only 2%. Even though some of the features with very low coverage show strong predictive power, it is uncertain whether that predictive power will persist once their coverage increases in the future. Therefore, at the current stage, only the first seven features (i.e., F1–F7) are considered in the following calculations. The new features are EXP, MES, and APA.

The performance of combining the new features is presented as a ROC curve in Figure 4A. By integrating the three additional features, we obtain better predictive power (higher TPR at any given FPR) than by integrating the four original features, across the full range of FPR values. However, the improvement is marginal; although each of the three new features shows fairly strong predictive power on its own, the increase in TPR at any value of FPR is no more than 3%.

Figure 4.
Integration of three additional features versus: (A) Four original features. Integration of three additional features (EXP, MES, APA) shows an improvement over the original four features at all range of FPRs. (B) Two original features. By excluding the ...

Because of the dominant performance of the two functional similarity features (MIP and GOF), the improvement accomplished by incorporating new features may not seem obvious. We thus exclude these two functional features, showing the improvement by incorporating three additional features over the remaining two original features (i.e., COE and ESS). Including three additional features shows a significant improvement over the original two features (Fig. 4B).

Another benefit of genomic data integration is the improvement in coverage; by incorporating more features, two predictors with similar ROC-curve performance may cover different parts of the system to varying degrees. Note that coverage concerns not only the labeled pairs (those in the GSTDs), but also unlabeled (unseen) pairs. So far, our assessments have considered labeled pairs only; however, if additional features give the predictor a more extensive view of the system, they should probably be considered beneficial even without a significant improvement in the ROC curve, because the coverage of unlabeled pairs is improved. Here, we find that coverage is slightly improved by integrating more features. Of all 21,658,071 possible protein pairs (6582 ORFs from MIPS), the four original features cover 18,527,741 pairs (85.5%), whereas the seven most populous features cover 18,880,102 (87.2%).

Correlations and statistical dependence between features

In this section, we investigate whether the marginal improvement is confounded by correlations and dependencies between features.

We first calculate the Pearson correlation coefficients (CCs) between each pair of features. Such correlations between features can often generate useful biological insights. The five highest absolute values are highlighted in bold in Table 1A. None of the feature pairs exhibit significant correlation.

In addition, we calculate the mutual information between genomic features as an alternative to CCs. Whereas the CC measures only linear relationships, mutual information is a more general measure of dependence. The results agree with the CCs: the five pairs with the highest mutual information are exactly the same as the five with the highest CCs. These correlations between some of the features, albeit not strong, are expected. For example, the correlation between the two functional features (MIP and GOF) is the highest among feature pairs. It is also expected that absolute mRNA expression (EXP) and absolute protein abundance (APA) are somewhat correlated.
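As an illustration of these two measures, the sketch below computes the Pearson CC and a histogram-based mutual-information estimate for a pair of hypothetical feature vectors; the variables exp and apa are synthetic stand-ins for features such as EXP and APA, and the binning scheme is our assumption.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate I(X;Y) in bits from a 2D histogram of two feature vectors."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0  # skip empty cells (0 * log 0 = 0 by convention)
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

# Hypothetical feature vectors over the same set of protein pairs:
rng = np.random.default_rng(1)
exp = rng.normal(size=5000)              # stand-in for absolute mRNA expression
apa = 0.3 * exp + rng.normal(size=5000)  # weakly correlated stand-in partner
cc = np.corrcoef(exp, apa)[0, 1]         # Pearson correlation coefficient
print(f"CC = {cc:.3f}, MI = {mutual_information(exp, apa):.3f} bits")
```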

We next investigate the conditional dependence between features given the positive or negative GSTD by calculating mutual information. In other words, we calculate the mutual information between pairs of features by taking into account only protein pairs that occur in both features and in either set of GSTDs. The small amount of mutual information, given either set of GSTDs, indicates that the features we integrated by Naive Bayes classifier are largely conditionally independent (Table 1B).
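Continuing the sketch above (and reusing its mutual_information helper and synthetic exp and apa vectors), the conditional dependence given the gold standards can be probed by computing the same estimate separately within the positive and negative subsets; the class labels here are again synthetic stand-ins for GSTD membership.

```python
import numpy as np

rng = np.random.default_rng(2)
is_positive = rng.integers(0, 2, size=exp.size).astype(bool)  # stand-in GSTD labels
for name, mask in [("given GSTD+", is_positive), ("given GSTD-", ~is_positive)]:
    print(name, f"I = {mutual_information(exp[mask], apa[mask]):.3f} bits")
```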

Simple Naive Bayes classifier vs. boosted Naive Bayes classifier on data sets with or without high dependence

Even though the conditional dependence between our features is not strong, it is possible that the combined weak dependence can still significantly decrease the predictive power of a Naive Bayes classifier. In this section, we address this question by comparing the performance of a simple Naive Bayes classifier (SNB) with that of a boosted Naive Bayes classifier (BNB). Since a BNB is fairly resistant to feature dependence, a significantly worse performance by a SNB on the same data set means that the feature dependence does affect the predictive power of the SNB.

We first conduct a control experiment with highly dependent features to verify the resistance of BNB to feature dependence. To obtain a highly dependent set of features, we used mRNA expression data from microarray experiments conducted by Cho et al. (1998) under eight different conditions. These expression data are highly dependent, as reflected by high CCs: the minimum CC between any pair of conditions is 0.904, and the maximum is 0.970. Treating these eight sets of expression data as if they were eight features, we integrate them with the original four features. When evaluated on this highly dependent data set, the BNB significantly outperforms the SNB. Figure 5 shows the robustness of the BNB on this highly dependent data set.

Figure 5.
A SNB versus a BNB over sets of genomic features with or without high dependence. TPR, FPR, TP, FP, P, and N are the same as in Figure 3.

We then compare a SNB with a BNB on our data set, with only weak conditional dependence; the original four features plus only one instead of eight sets of expression data. If the BNB significantly outperforms the SNB, it indicates that the SNB is affected by feature dependence, even though it is not strong. The results show that the SNB performs as well as the BNB on this weakly dependent data set (Fig. 5). Clearly, the SNB is hardly affected by this weak feature dependence.

The results in Figure 5 also suggest that the SNB performs sufficiently well on our collection of genomic features, while the BNB may be useful to analyze the potential problem of highly dependent features as more features are considered in the future.

Conclusions

In this study, we quantitatively address the question of how far genomic data integration can be improved by integrating more and more features. We use a SNB for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. By integrating three more strong features, a marginal improvement in both accuracy and coverage can be achieved.

The calculations of correlation coefficients, mutual information, and boosting all suggest that the marginality of the improvement on prediction by incorporating more features is unlikely to result from the weak feature dependencies. It is also unlikely to result from an excess of parameters, relative to data points (resulting in overfitting), because our Naive Bayes approach involves simple models with only small numbers of free parameters that are fitted against a large number of data points. Rather, this suggests that by integrating a few good features, we approach the maximal predictive power, or limit, of current genomic data integration. Furthermore, this limitation does not reflect (potentially removable) inter-relationships between the features. Unless we obtain features that are stronger in predictive power than MIP and GOF and simultaneously possess a reasonable coverage, it is unlikely that the prediction will be significantly improved by integrating a few more features. It is also possible that a higher coverage of our examined 16 features may allow better predictive power in the future.

Our discovery that no strong dependence exists between features is an interesting finding in and of itself. Among as many as seven populous features, one might expect some dependence high enough to significantly decrease the SNB's predictive power. However, our calculations of correlation coefficients and mutual information, as well as our boosting results, suggest otherwise. One possibility is that the observed lack of dependence among different features results from differences in coverage, since all of these data sets are essentially incomplete. Specifically, the overlap of proteins or protein pairs represented among the different features is likely to increase with extended coverage, possibly resulting in higher feature dependence. In this case, the BNB can be used as an alternative solution.

Finally, the SNB is chosen in this study because of its simplicity, as well as to allow comparison with an existing benchmark study that used the same technique (Jansen et al. 2003). Furthermore, we use the BNB to specifically address the SNB's well-known limitation with respect to high feature dependence.

Other machine-learning techniques could potentially have been used in this study. However, most alternative techniques have issues in their own right, such as suffering from missing-value problems or being prohibitively time-consuming. Such problems prevent them from being applied to this problem as readily as a SNB. In addition, since the BNB does not improve on the SNB for our collection of features, it is unlikely that the conclusions made here would differ significantly if other machine-learning techniques were used—though, of course, we cannot say this definitively without a comprehensive test.

Methods

Naive Bayesian formalism

Inferring protein–protein interactions from genomic features can be formulated as a classification problem, in which we classify a pair of proteins into two classes (C1 = interact, C0 = not interact), given an n-dimensional vector of genomic features x = (x1, x2, ..., xn) (see footnote 5).

The Bayesian Decision Rule states that in order to minimize the average probability of a classification error, one must choose the class with the highest posterior probability, i.e., assign a feature vector x to the class Ck such that Ck = argmax_{Ci} P(Ci | x), where Ci ranges over the set of classes (see, for example, Bishop 1995; Duda et al. 2001). Ck is known as the maximum a posteriori (MAP) estimate.

Using Bayes' theorem, the posterior probability can be rewritten as

$$P(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\,P(C_i)}{p(\mathbf{x})}$$

Notice that the unconditional density p(x) in the denominator does not depend on the class label; therefore, it does not affect the classification decision and can be omitted when computing Ck = argmax_{Ci} P(Ci | x). Each of the priors, P(Ci), can be easily estimated by computing the frequency with which each class occurs in the data. However, the evaluation of p(x | Ci) cannot generally be accomplished in the same way, especially if the number of features is high; it would require a set of data large enough to contain many instances for each possible combination of feature values in order to obtain reliable estimates.

The idea behind Naive Bayes is to make the simplifying assumption that the attribute values are conditionally independent, given the target values. The computation of each p(x | Ci) is thus made efficient by approximating it as a product of conditional probabilities

$$p(\mathbf{x} \mid C_i) \approx \prod_{j=1}^{n} p(x_j \mid C_i) \qquad (1)$$

Learning in Naive Bayes consists of estimating the various P(Ci) and various p(xj | Ci) using equation 1, based on their frequencies over the training data. Clearly, the approximation in equation 1 becomes exact only in the event of stochastic independence between the various features, given the class. In spite of its simple way of approximating the posterior distributions, Naive Bayes has, in practice, yielded quite good results for several types of problems; for example, it is among the best methods for text classification (Joachims 1997; McCallum and Nigam 1998).
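As a concrete illustration of this learning step, here is a minimal sketch that estimates the class priors P(Ci) and binned conditional probabilities p(xj | Ci) by frequency counting over training data. The equal-frequency binning and Laplace smoothing are assumptions made for the sketch, not necessarily the discretization used in the original study.

```python
import numpy as np

def train_naive_bayes(X, y, bins=5):
    """Estimate class priors and per-feature conditional bin probabilities.

    X: (n_pairs, n_features) array of continuous feature values (NaN = missing).
    y: class labels, 1 = interacting (C1), 0 = noninteracting (C0).
    """
    priors = np.array([(y == 0).mean(), (y == 1).mean()])
    edges, tables = [], []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Equal-frequency bin edges, ignoring missing values.
        e = np.nanquantile(col, np.linspace(0, 1, bins + 1))
        table = np.zeros((2, bins))
        for c in (0, 1):
            vals = col[(y == c) & ~np.isnan(col)]
            counts, _ = np.histogram(vals, bins=e)
            table[c] = (counts + 1) / (counts.sum() + bins)  # Laplace smoothing
        edges.append(e)
        tables.append(table)
    return priors, edges, tables
```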

In the case of stochastic independence, the covariance between two features is zero. Thus, the covariance between features is a measure of the deviation from the condition of stochastic independence and is indicative of the amount of approximation introduced by the Naive Bayes assumption. For this reason, the next section shall present an analysis of the covariance between the various features, given the class.

Alternatively, the Bayesian Decision Rule for two classes can be stated as follows:

  • Choose class C1 if

    $$\frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_0)} > \frac{P(C_0)}{P(C_1)} \qquad (2)$$

  • Otherwise, choose class C0.

If we then introduce the Naive Bayes approximation, we can rewrite equation 2 as:

$$\prod_{i=1}^{n} L_i > \frac{P(C_0)}{P(C_1)} \qquad (3)$$

where

$$L_i = \frac{p(x_i \mid C_1)}{p(x_i \mid C_0)}$$

is called the likelihood ratio for feature i. Notice that for a given feature, a likelihood ratio different than 1 indicates that the feature conveys information about the class. In other words, there is a correlation between the feature and the target. For this reason, in the next section we shall look at the likelihood ratios of the various features and the correlation between such features and the class labels.
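Continuing the training sketch above, the per-feature likelihood ratios can be combined into the decision rule of equation 3. Note how missing features (NaN) simply drop out of the product, which is one way a Naive Bayes classifier can tolerate incomplete genomic data sets; this handling is our assumption for the sketch.

```python
import numpy as np

def likelihood_ratio(x, edges, tables):
    """Combined likelihood ratio L(x) = prod_i p(x_i|C1)/p(x_i|C0) (equation 3)."""
    L = 1.0
    for xi, e, t in zip(x, edges, tables):
        if np.isnan(xi):
            continue  # missing feature: contributes nothing to the product
        # Map the value to its bin, clamping values outside the training range.
        b = int(np.clip(np.searchsorted(e, xi, side="right") - 1, 0, t.shape[1] - 1))
        L *= t[1, b] / t[0, b]
    return L

# Predict an interaction when L(x) exceeds the prior odds P(C0)/P(C1):
# priors, edges, tables = train_naive_bayes(X, y)
# interacts = likelihood_ratio(x_new, edges, tables) > priors[0] / priors[1]
```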

ROC (receiver operating characteristic) curve

In a two-class classification problem, with classes C1 (or positive) and C0 (or negative), for each prediction there are four possible outcomes. The true positives (TP) and the true negatives (TN) are correct classifications. Wrong classifications can be of two types. For a false positive (FP), the outcome is incorrectly predicted as belonging to C1, when in fact it belongs to C0; for a false negative (FN), the outcome is incorrectly predicted as belonging to C0, when it belongs to C1.

Our earlier discussion on Naive Bayes was motivated by the goal of minimizing the average probability of a classification error; it was aimed at reducing the total number of wrong predictions, regardless of the type of error that was made. This amounts to saying that we were maximizing the number of correct classifications, TP + TN.
In general, however, the two different types of errors will have different costs, just as the two different types of correct classification will have different benefits. Taking such costs into account amounts to multiplying the right hand side of equation 3 by a cost factor. In practice, these costs are rarely known with accuracy. Thus, to evaluate a classification method, it is useful to look at its ROC curve.

A ROC curve graphically depicts the performance of a classification method for different costs. It consists of a set of points, each computed for a different setting of the cost, connected by lines. For each point, the vertical coordinate is a true positive rate (TPR) given by the ratio of the number of true positives to the total number of positives (i.e., TP/[TP+FN]), while the horizontal coordinate is a false positive rate (FPR) given by the ratio of the number of false positives to the total number of negatives (i.e., FP/[FP+TN]). Note that the TPR is equivalent to the commonly used term sensitivity, while FPR is equivalent to 1 − specificity. Clearly, the ROC curve for a good classifier will be as close as possible to the upper-left corner of the chart; that is where we have the highest number of true positives and at the same time the smallest number of false positives.
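A minimal sketch of how such a curve can be traced, by sweeping a decision threshold over classifier scores (the scores and labels below are toy values, not this study's predictions):

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    order = np.argsort(-scores)              # strictest threshold first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                   # true positives at each cutoff
    fp = np.cumsum(1 - labels)               # false positives at each cutoff
    tpr = tp / labels.sum()                  # TP / (TP + FN)
    fpr = fp / (len(labels) - labels.sum())  # FP / (FP + TN)
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0])
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr, tpr)))  # points tracing the curve away from the origin
```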

Mutual information

Given two random variables, X and Y (in this study, X and Y are either feature values or class labels), the mutual information I(X;Y) between X and Y measures how much information one variable conveys about the other. It is defined as the relative entropy (or Kullback-Leibler distance) between the joint distribution and the product distribution of X and Y, that is,

$$I(X;Y) = \sum_{x,y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}$$

where P(x,y) indicates the joint distribution of X and Y, and P(x) and P(y) their marginal distributions. It is easy to prove that I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = I(Y;X), where H(X) and H(Y) are the entropies of X and Y, and H(X|Y) and H(Y|X) are the conditional entropies of X given Y and Y given X, respectively. This states that the information Y conveys about X is the reduction in uncertainty about X, due to knowledge of Y (and vice versa).
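A small worked example, using the equivalent identity I(X;Y) = H(X) + H(Y) − H(X,Y), which follows from the relations above; the joint distribution here is an arbitrary toy example:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

# A toy joint distribution P(x, y) over two binary variables:
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Direct definition: relative entropy between P(x,y) and P(x)P(y).
mi = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))

# Equivalent identity: I(X;Y) = H(X) + H(Y) - H(X,Y).
mi_check = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
print(mi, mi_check)  # both ~0.278 bits: X and Y share information
```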

Boosting

Boosting is a general method that can be used for improving the performance of any classifier. The idea behind boosting is to combine the outputs of many different “weak” classifiers to produce a powerful “committee.” We have used one of the most popular boosting algorithms, AdaBoost (Freund and Schapire 1999), which we shall briefly describe here. For more information on this and other boosting algorithms, refer to Friedman et al. (2000).

AdaBoost consists of sequentially applying a weak classification algorithm to modified versions of the data, producing a sequence of weak classifiers. The predictions from the classifiers are then combined through a weighted majority vote. The data are modified by applying weights to each of the training observations. At each iteration, a weak learner is trained on the weighted data and the weights are updated. This operation is repeated until the desired performance on the training data is achieved. The updating rule for these weights is such that training pairs that were misclassified in the previous step have their weights increased, while those that were correctly classified have their weights decreased. At each iteration, then, training pairs that are more difficult to classify have more influence, and classifiers are forced to focus on pairs overlooked by previous classifiers.

Given a data set of N training pairs (xi, yi), i = 1, ..., N, where xi is an input vector of features and yi ∈ {−1, 1} is the target value representing classes C0 and C1, respectively, let us denote the weight associated with training pair i at time t as Dt(i), and the weak classification algorithm used at time t as ht. The AdaBoost algorithm to iterate T times is as follows:

  • Initialize the observation weights uniformly for each pair:

    $$D_1(i) = \frac{1}{N}, \quad i = 1, \ldots, N$$

  • For t = 1, ..., T do:
    1. Train ht using the training pairs weighted by Dt
    2. Compute Et, the global error of ht, as:

       $$E_t = \sum_{i:\,h_t(\mathbf{x}_i) \neq y_i} D_t(i)$$

    3. Set the classifier weight:

       $$\alpha_t = \frac{1}{2}\ln\frac{1 - E_t}{E_t}$$

    4. Update the observation weights:

       $$D_{t+1}(i) = \frac{D_t(i)\,\exp\!\left(-\alpha_t\,y_i\,h_t(\mathbf{x}_i)\right)}{Z_t}$$

       where Zt is a normalization factor such that

       $$\sum_{i=1}^{N} D_{t+1}(i) = 1$$

  • The output of the final classifier is:

    $$H(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\,h_t(\mathbf{x})\right)$$
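The listed steps translate directly into code. Below is a compact sketch of AdaBoost with single-feature decision stumps as an example weak learner; the stump learner is an illustrative assumption (in this study, the weak classifiers were individual features or simple Naive Bayes classifiers).

```python
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """AdaBoost following the steps above; y must take values in {-1, +1}.

    weak_learner(X, y, D) must return a function h with h(X) in {-1, +1}.
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                 # D_1(i) = 1/N
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)           # step 1: train on weighted pairs
        E = D[h(X) != y].sum()              # step 2: weighted error E_t
        if E >= 0.5:                        # no better than chance: stop early
            break
        E = max(E, 1e-12)                   # guard against a perfect learner
        alpha = 0.5 * np.log((1 - E) / E)   # step 3: classifier weight
        D = D * np.exp(-alpha * y * h(X))   # step 4: upweight misclassified pairs...
        D /= D.sum()                        # ...and normalize (divide by Z_t)
        hs.append(h)
        alphas.append(alpha)
    # Final classifier: sign of the weighted majority vote.
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, hs)))

def stump_learner(X, y, D):
    """Example weak learner: best single-feature threshold under weights D."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    j, thr, sign = best
    return lambda X: sign * np.where(X[:, j] > thr, 1, -1)
```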

Training and testing data sets

The details of construction of the training and testing data sets are described in Figure 1.

Acknowledgments

We thank Drs. Ronald Jansen, Valery Trifonov, and Haoxin Lu for stimulating discussions and proofreading of this manuscript. Y.X. is a Fellow of the Jane Coffin Childs Memorial Fund for Medical Research. This work is supported by a grant from NIH/NIGMS for work in the PSI.

Notes

[All genomic feature data used in this study can be downloaded at http://networks.gersteinlab.org/intint/.]

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3610305.

Footnotes

5. Bold letters denote vectors; P(·) denotes probabilities; p(·) denotes probability density functions.

References

  • Alberts, B. 2002. Molecular biology of the cell. Garland Science, New York.
  • Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
  • Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93-96.
  • Berger, J.M., Gamblin, S.J., Harrison, S.C., and Wang, J.C. 1996. Structure and mechanism of DNA topoisomerase II. Nature 379: 225-232.
  • Bishop, C.M. 1995. Neural networks for pattern recognition. Clarendon Press, Oxford University Press, Oxford, UK.
  • Bowers, P.M., Pellegrini, M., Thompson, M.J., Fierro, J., Yeates, T.O., and Eisenberg, D. 2004. Prolinks: A database of protein functional linkages derived from coevolution. Genome Biol. 5: R35.
  • Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares Jr., M., and Haussler, D. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97: 262-267.
  • Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2: 65-73.
  • Drawid, A., Jansen, R., and Gerstein, M. 2000. Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet. 16: 426-430.
  • Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern classification. Wiley, New York; Chichester, UK.
  • Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823-826.
  • Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. In Proceedings of the thirteenth conference on machine learning, pp. 148-156.
  • ———. 1999. A short introduction to boosting. J. Japanese Soc. Artificial Intell. 14: 771-780.
  • Friedman, N. 2004. Inferring cellular networks using probabilistic graphical models. Science 303: 799-805.
  • Friedman, J., Hastie, T., and Tibshirani, R. 2000. Additive logistic regression: A statistical view of boosting. Ann. Stat. 28: 337-374.
  • Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141-147.
  • Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29: 482-486.
  • Gerstein, M., Lan, N., and Jansen, R. 2002. Proteomics. Integrating interactomes. Science 295: 284-287.
  • Goh, C.S. and Cohen, F.E. 2002. Co-evolutionary analysis reveals insights into protein–protein interactions. J. Mol. Biol. 324: 177-192.
  • Goh, C.S., Bogan, A.A., Joachimiak, M., Walther, D., and Cohen, F.E. 2000. Co-evolution of proteins with their interaction partners. J. Mol. Biol. 299: 283-293.
  • Greenbaum, D., Jansen, R., and Gerstein, M. 2002. Analysis of mRNA expression and protein abundance data: An approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18: 585-596.
  • Greenbaum, D., Colangelo, C., Williams, K., and Gerstein, M. 2003. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 4: 117.
  • Hartwell, L.H., Hopfield, J.J., Leibler, S., and Murray, A.W. 1999. From molecular to modular cell biology. Nature 402: C47-C52.
  • Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.
  • Horak, C.E. and Snyder, M. 2002. ChIP-chip: A genomic approach for identifying transcription factor binding sites. Methods Enzymol. 350: 469-483.
  • Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R., Goodlett, D.R., Aerbersold, R., and Hood, L. 2001. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292: 929-934.
  • Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98: 4569-4574.
  • Jansen, R., Greenbaum, D., and Gerstein, M. 2002a. Relating whole-genome expression data with protein–protein interactions. Genome Res. 12: 37-46.
  • Jansen, R., Lan, N., Qian, J., and Gerstein, M. 2002b. Integration of genomic datasets to predict protein complexes in yeast. J. Struct. Funct. Genomics 2: 71-81.
  • Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302: 449-453.
  • Joachims, T. 1997. A probabilistic analysis of the Rocchio Algorithm with TFIDF for text categorization. In 14th International Conference on Machine Learning.
  • Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., and Holstege, F.C. 2002. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell 9: 1133-1143.
  • Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, T., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
  • Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes. Science 306: 1555-1558.
  • Letovsky, S. and Kasif, S. 2003. Predicting protein function from protein/protein interaction data: A probabilistic approach. Bioinformatics 19: 197-204.
  • Lin, N., Wu, B., Jansen, R., Gerstein, M., and Zhao, H. 2004. Information assessment on predicting protein–protein interactions. BMC Bioinformatics 5: 154.
  • Lu, L., Lu, H., and Skolnick, J. 2002. MULTIPROSPECTOR: An algorithm for the prediction of protein–protein interactions by multimeric threading. Proteins 49: 350-364.
  • Lu, L., Arakaki, A.K., Lu, H., and Skolnick, J. 2003. Multimeric threading-based prediction of protein–protein interactions on a genomic scale: Application to the Saccharomyces cerevisiae proteome. Genome Res. 13: 1146-1154.
  • Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999a. Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751-753.
  • Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., and Eisenberg, D. 1999b. A combined algorithm for genome-wide prediction of protein function. Nature 402: 83-86.
  • Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P., Gerstein, M., et al. 2003. Distribution of NF-κB-binding sites across human chromosome 22. Proc. Natl. Acad. Sci. 100: 12247-12252.
  • McCallum, A. and Nigam, K. 1998. A comparison of event models for Naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
  • Mewes, H.W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morganstern, B., Munsterkotter, M., Rudd, S., and Weil, B. 2002. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. 30: 31-34.
  • Pazos, F. and Valencia, A. 2002. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 47: 219-227.
  • Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96: 4285-4288.
  • Schapire, R.E. 1990. The strength of weak learnability. Machine Learning 5: 197-227.
  • Schwikowski, B., Uetz, P., and Fields, S. 2000. A network of protein–protein interactions in yeast. Nat. Biotechnol. 18: 1257-1261.
  • Skolnick, J. and Kolinski, A. 2002. In Computational methods for protein folding (ed. R.A. Friesner), Vol. 120, pp. 131-192. John Wiley & Sons, New York.
  • Strong, M., Mallick, P., Pellegrini, M., Thompson, M.J., and Eisenberg, D. 2003. Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: A combined computational approach. Genome Biol. 4: R59.
  • Tamames, J., Casari, G., Ouzounis, C., and Valencia, A. 1997. Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol. 44: 66-73.
  • Thatcher, J.W., Shaw, J.M., and Dickinson, W.J. 1998. Marginal fitness contributions of nonessential genes in yeast. Proc. Natl. Acad. Sci. 95: 253-257.
  • Tong, A.H., Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D., Page, N., Robinson, M., Raghibizadeh, S., Hogue, C.W., Bussey, H., et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294: 2364-2368.
  • Tong, A.H., Lesage, G., Bader, G.D., Ding, H., Xu, H., Xin, X., Young, J., Berriz, G.F., Brost, R.L., Chang, M., et al. 2004. Global mapping of the yeast genetic interaction network. Science 303: 808-813.
  • Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520-525.
  • Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. 2000. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627.
  • Valencia, A. and Pazos, F. 2002. Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. 12: 368-373.
  • Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A. 2003. Global protein function prediction from protein–protein interaction networks. Nat. Biotechnol. 21: 697-700.
  • von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417: 399-403.
  • Wong, S.L., Zhang, L.V., Tong, A.H., Li, Z., Goldberg, D.S., King, O.D., Lesage, G., Vidal, M., Andrews, B., Bussey, H., et al. 2004. Combining biological networks to predict genetic interactions. Proc. Natl. Acad. Sci. 101: 15682-15687.
  • Xia, Y., Yu, H., Jansen, R., Seringhaus, M., Baxter, S., Greenbaum, D., Zhao, H., and Gerstein, M. 2004. Analyzing cellular biochemistry in terms of molecular networks. Annu. Rev. Biochem. 73: 1051-1087.
  • Yu, H., Luscombe, N.M., Qian, J., and Gerstein, M. 2003. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19: 422-427.
  • Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X., and Gerstein, M. 2004a. Genomic analysis of essentiality within protein networks. Trends Genet. 20: 227-231.
  • Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.D., Bertin, N., Chung, S., Vidal, M., and Gerstein, M. 2004b. Annotation transfer between genomes: Protein–protein interologs and protein–DNA regulogs. Genome Res. 14: 1107-1118.
  • Zhang, L.V., Wong, S.L., King, O.D., and Roth, F.P. 2004. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5: 38.
