Plant Physiol. Sep 2009; 151(1): 34–46.
PMCID: PMC2735983

Computational Identification of Potential Molecular Interactions in Arabidopsis


Knowledge of the protein interaction network is useful to assist molecular mechanism studies. Several major repositories have been established to collect and organize reported protein interactions. Many interactions have been reported in several model organisms, yet a very limited number of plant interactions can thus far be found in these major databases. Computational identification of potential plant interactions, therefore, is desired to facilitate relevant research. In this work, we constructed a support vector machine model to predict potential Arabidopsis (Arabidopsis thaliana) protein interactions based on a variety of indirect evidence. In a 100-iteration bootstrap evaluation, the confidence of our predicted interactions was estimated to be 48.67%, and these interactions were expected to cover 29.02% of the entire interactome. The sensitivity of our model was validated with an independent evaluation data set consisting of newly reported interactions that did not overlap with the examples used in model training and testing. Results showed that our model successfully recognized 28.91% of the new interactions, similar to its expected sensitivity (29.02%). Applying this model to all possible Arabidopsis protein pairs resulted in 224,206 potential interactions, which is the largest and most accurate set of predicted Arabidopsis interactions at present. In order to facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource, with detailed annotations and more specific per interaction confidence measurements. This database and related documents are freely accessible at http://www.cls.zju.edu.cn/pair/.

The complex cellular functions of an organism rely on physical interactions between proteins. Deciphering the protein-protein interaction network to understand higher level phenotypes and their regulations is always a major focus of both experimental biologists and computational biologists. A number of high-throughput (HTP) assays have been developed to identify in vitro protein interactions from several model organisms (Uetz et al., 2000; Giot et al., 2003; Li et al., 2004). A number of initiatives, such as IntAct (Kerrien et al., 2006), Molecular INTeraction database (Chatr-aryamontri et al., 2007), the Database of Interacting Proteins (Salwinski et al., 2004), Biomolecular Interaction Network Database (BIND; Alfarano et al., 2005), and BioGRID (Stark et al., 2006), have been established to systematically collect and organize the interaction data reported by both proteome-scale HTP experiments and traditional low-throughput studies focusing on individual proteins or pathways.

Arabidopsis (Arabidopsis thaliana) has long been studied as a model organism to investigate the physiology, biochemistry, growth, development, and metabolism of a flowering plant at the molecular level. Studies of the molecular mechanisms underlying various phenotypes and their regulation in Arabidopsis may be facilitated by a comprehensive reference protein interaction network, from which working hypotheses could be formulated with more guidance and confidence. However, due to technological limitations, most experimentally reported protein interactions in available databases were from other organisms. A very limited number of plant interactions could be found in these databases. Therefore, an accurate prediction of the Arabidopsis interactome would be valuable to assist relevant research.

Studies on the computational identification of potential interactions started along with the advent of HTP interaction-detection technologies, which often produced a large number of false positives (Deane et al., 2002). Indirect evidence of protein interaction (e.g. protein colocalization and relevance in function) was hence introduced to boost the confidence of HTP results (Jansen et al., 2003). Further investigations demonstrated that direct inference of protein interactions from such indirect evidence alone was possible (Scott and Barton, 2007). The accuracy and effectiveness of using indirect evidence to predict interactions have also been thoroughly assessed (Qi et al., 2006; Suthram et al., 2006). These works offered valuable insights into how protein interactions may be predicted accurately on a proteomic scale. In other organisms such as Homo sapiens, the prediction of an entire interactome has already been proven applicable and useful (Rhodes et al., 2005).

On the other hand, several efforts have been made to collect and organize a comprehensive map of Arabidopsis molecular interactions. For instance, around 20,000 interactions were inferred by homology to known interactions in other organisms (Geisler-Lee et al., 2007). Another work predicted 23,396 interactions based on multiple indirect data and curated 4,666 interactions from the literature and enzyme complexes (Cui et al., 2008). The Arabidopsis reactome database was established describing the functions of 2,195 proteins with 8,269 reactions in 318 superpathways (Tsesmetzis et al., 2008). In addition, a general interaction database, IntAct (Kerrien et al., 2006), had allocated a special unit actively curating all plant protein interactions from literature and submitted data sets, which now contains 2,649 Arabidopsis interactions. However, in yeast, approximately 18,000 protein-protein interactions had been estimated for approximately 6,000 genes (Yu et al., 2008). Assuming the same rate of interaction, approximately 200,000 protein interactions would be expected for approximately 20,000 Arabidopsis genes. Therefore, the current collection of Arabidopsis interactions is still significantly limited. Moreover, most previous prediction works did not provide rigorous confidence measurements for their predicted interactions, which further limited their scope of applications.

Recent advances in statistical learning presented a powerful algorithm, support vector machine (SVM), which may be used to predict interactions based on multiple indirect data. Although the basis of SVM had been laid in the 1960s, the idea of SVM was only officially proposed in the 1990s by Vapnik (1998, 2000). Then, research on its theoretical and application aspects thrived. It has been applied in a wide range of problems, including text categorization (de Vel et al., 2001; Kim et al., 2001), image classification and object detection (Ben-Yacoub et al., 1999; Karlsen et al., 2000), flood stage forecasting (Liong and Sivapragasam, 2002), microarray gene expression data analysis (Brown et al., 2000), drug design (Zhao et al., 2006a, 2006b), protein solvent accessibility prediction (Yuan et al., 2002), and protein fold prediction (Ding and Dubchak, 2001; Hua and Sun, 2001). Many studies have demonstrated that SVM was consistently superior to other supervised learning methods (Brown et al., 2000; Burbidge et al., 2001; Cai et al., 2003).

In this work, with careful preparation of example data and selection of indirect evidence, we constructed an SVM model to predict potential Arabidopsis interactions. False positives were tightly controlled. With the high-confidence model, we identified altogether 224,206 potential interactions, which were expected to be 48.67% accurate and to cover 29.02% of the entire Arabidopsis interactome. More specific confidence measurements were also assigned on a per interaction basis. To facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource (PAIR; http://www.cls.zju.edu.cn/pair/), featuring detailed annotations and a friendly user interface.


Example Data Set

Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) were required to train an SVM prediction model. As detailed in “Materials and Methods,” we assembled a gold-standard example set consisting of 4,139 positive and 4,139 negative examples (Table I). In order to verify whether the proteins in our example interactions were a good sample of the entire Arabidopsis proteome, we used four categorizing systems to compute and compare the protein distributions in our example interactions and those in the entire Arabidopsis proteome. The Arabidopsis Information Resource (TAIR) database (Rhee et al., 2003) provided three classification systems based on Gene Ontology (GO) annotation (i.e. the molecular function-based system, the cellular component-based system, and the biological process-based system, which are collectively known as the GO Slim classification systems). In most cases, a GO Slim category in each classification system corresponds to an existing term in GO. Gene products that are annotated to the term itself, or to any of the children of this term, are included in the corresponding GO Slim category. In some cases, a GO Slim category may also encompass more than one GO term. In the GO Slim classification systems, categories were carefully chosen and were intended to be nonoverlapping. In our analysis, proteins classified into the “unknown categories” (e.g. “molecular function unknown,” “biological process unknown,” and “cellular component unknown”) were ignored, while the proteins classified into the “other categories” (e.g. “other molecular functions,” “other biological processes,” and “other cellular components”) were counted.

Table I.
Composition of gold-standard interaction examples

As shown in Figure 1A, proteins in our interaction examples and proteins in the entire Arabidopsis proteome showed similar histograms according to the GO Slim Biological Process categories. The Pearson's correlation coefficient (r) between these two distributions was 0.95. Using Fisher's z transformation, this coefficient corresponds to a statistical significance of P = 3.17 × 10−7. Similarly, according to the GO Slim Molecular Function categories, we observed a correlation coefficient of 0.52 (P = 2.8 × 10−2; Fig. 1B), and according to the GO Slim Cellular Component categories, we observed a correlation coefficient of 0.68 (P = 2.9 × 10−3; Fig. 1C). These significant correlations indicated that proteins in our example data set were indeed a good sample of the entire proteome. Besides the annotation-based categories, sequence-based protein families also served as a good classification system to compare protein distributions. According to the Pfam database (Coggill et al., 2008), all Arabidopsis proteins were classified into 2,854 families. As shown in Figure 1D, the frequency distributions of example proteins and of all Arabidopsis proteins were roughly consistent, showing a correlation coefficient of 0.75 (P < 1 × 10−10). All of these data suggested that the proteins in our interaction examples represented a good sample of the Arabidopsis proteome.
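The significance values quoted for these correlations can be illustrated with a minimal Python sketch of Fisher's z transformation. The function names are hypothetical, the test is assumed to be two-sided against a null correlation of zero, and the number of categories n behind each reported P value is not stated in the text, so the sketch is not expected to reproduce the published figures exactly.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_pvalue(r, n):
    """Two-sided P value for H0: rho = 0 (assumption: two-sided test).

    Under H0, atanh(r) is approximately normal with standard deviation
    1 / sqrt(n - 3), so z below is approximately standard normal.
    Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-category fractions for two protein sets.
examples = [0.10, 0.25, 0.05, 0.30, 0.20, 0.10]
proteome = [0.12, 0.22, 0.08, 0.28, 0.18, 0.12]
r = pearson_r(examples, proteome)
print(r, fisher_z_pvalue(r, len(examples)))
```

With few categories the normal approximation is rough, which is one reason the same r can map to quite different P values across the four classification systems.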

Figure 1.
Protein distribution in our interaction examples and in the entire Arabidopsis proteome. Blue bars show the fraction of interacting proteins in each category, and red bars show the fraction of all Arabidopsis proteins in each category. A, Protein distribution ...

Strength of the Indirect Evidence

As detailed in “Materials and Methods,” potential interactions were predicted based on 14 features derived from four types of indirect evidence (Table II). Therefore, it is of interest to evaluate the extent to which these features carried information about protein interaction. One way to assess the discrimination power of a feature is to compare its value distribution in interaction examples against its value distribution in noninteraction examples. Informative features are expected to show apparent differences.

Table II.
Composition of features

In this way, we divided the value range of each feature into 20 equal bins. The fraction of positive examples and the fraction of negative examples that fell into each bin were calculated to make the corresponding distribution histograms. We plotted these distribution histograms and the associated likelihood ratio (LR) curve in Figure 2. The LR for each bin is defined as the fraction of interaction examples above this bin divided by the fraction of noninteraction examples above this bin. Above a bin means that the feature values are expected to better indicate an interaction than all values in this bin. For features describing coexpression, colocalization, and domain interaction, this means that the feature values are greater than the biggest number in this bin. But for features describing shared annotation, above a bin means the contrary. For convenience in comparison, we wanted all LR curves to increase with the horizontal axes. Therefore, for features describing shared annotation, the horizontal axes were plotted as 1 – the feature value. Figure 2 shows several typical feature strength diagrams. A complete set of these diagrams is provided in Supplemental Text S1. It was clear that in these diagrams, no sharp differences existed between the value distributions in interaction examples and those in noninteraction examples. Pearson's correlation between the interaction and noninteraction distributions ranged from 0.324 to 1. The Kolmogorov-Smirnov test also indicated that these interaction and noninteraction distributions were not significantly different, displaying distances from 0.1 to 0.4, corresponding to a P value range of 1.0 to 0.313 (no significance to reject the hypothesis that they were the same; Supplemental Text S1). However, it was also obvious that there were always noticeable deviations between the interaction and noninteraction distributions for each feature. 
In the best scenarios (at the right-most point), these features always led to significant LRs (i.e. >10-fold). Hence, they should be considered weak but relevant. A delicate scheme to exploit their combined discriminative power was necessary for accurate prediction.
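The per-bin LR described above can be sketched in Python as follows. This is an illustrative reading of the definition, not the authors' code: the function name and the toy values are hypothetical, "above a bin" is taken to mean strictly greater than the bin's upper edge, and bins with no noninteraction examples above them are reported as undefined (None). For shared annotation features, one would pass 1 minus the feature value so that larger values indicate interaction.

```python
def lr_curve(pos_values, neg_values, lo, hi, n_bins=20):
    """Likelihood ratio per bin for one feature.

    For each of n_bins equal-width bins on [lo, hi], the LR is the fraction
    of positive examples above the bin divided by the fraction of negative
    examples above the bin. Returns None where the denominator is zero.
    """
    width = (hi - lo) / n_bins
    curve = []
    for i in range(n_bins):
        upper = lo + (i + 1) * width  # upper edge of bin i
        pos_above = sum(v > upper for v in pos_values) / len(pos_values)
        neg_above = sum(v > upper for v in neg_values) / len(neg_values)
        curve.append(pos_above / neg_above if neg_above > 0 else None)
    return curve

# Hypothetical feature values for four positive and four negative examples.
pos = [0.9, 0.8, 0.7, 0.2]
neg = [0.1, 0.2, 0.3, 0.8]
print(lr_curve(pos, neg, lo=0.0, hi=1.0, n_bins=4))
```

Under this reading the last bin has nothing above it when hi is the maximum feature value, so how the right-most point of the published curves was computed remains an open detail.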

Figure 2.
Comparison of the feature value distribution in interaction examples and the feature value distribution in noninteraction examples. For each feature, its value range is divided into 20 equal bins. Blue curves and red curves show the fraction of interaction ...

The above set of feature LRs also offered an interesting possibility to predict potential interactions based on the total LR of a protein pair being an interaction rather than a noninteraction, which was computed as the product of the 14 single feature LRs (Rhodes et al., 2005; Cui et al., 2008). While we regarded this approach as a useful supplement to assign per interaction prediction confidence, using it as an interactome prediction method may not be optimal in our case, for two reasons. First, the features we used were not independent. Strong correlations existed among features describing the same type of evidence (e.g. the Pearson's correlation between the nine domain interaction features ranged from 0.0049 to 0.918). And features describing different indirect evidence may also correlate, such as the colocalization feature and the shared GO cellular component annotation feature (r = 0.0638, P < 1 × 10−10). Therefore, the product of LRs would be dominated by the information coded in multiple correlated features. Useful information carried by other independent features may be obscured, which may lead to significant bias in prediction. Second, as shown in Figure 2, the LR curves were not monotonically increasing, which meant that using a single cutoff value to compute LR was too simple. Finer structures of the feature value distributions could be exploited for better prediction accuracy. Consequently, prediction systems that are more complicated and tolerate correlated features were desired. Many previous machine learning studies had suggested that the SVM algorithm excelled in this scenario.

Accuracy of Our Prediction Models

We constructed an SVM model to predict Arabidopsis interactions. Its parameters, C and γ, were optimized by a grid search. Model performances during parameter optimization were estimated by 10-fold cross-validation with the F1 measure. Accuracy of the resulting model was evaluated by a more precise 100-iteration bootstrap experiment. In each iteration, one-tenth of the example data were randomly left out from training, and these data were used to test the model and compute the confusion matrix. After 100 iterations, the mean and sd of the confusion matrix (Table III), together with other derivative accuracy measurements, were computed. The optimal model produced 84.52% ± 0.20% sensitivity, 90.58% ± 0.25% specificity, and 87.44% ± 0.14% F1 measure. Its overall accuracy was estimated to be 87.55% ± 0.14%.
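The accuracy measurements named above all derive from the confusion matrix. A minimal Python sketch follows, with hypothetical function names and toy counts; it does not reproduce the SVM training itself, only the bookkeeping applied to each iteration's confusion matrix and the mean and sd across iterations.

```python
import statistics

def confusion_metrics(tp, fn, tn, fp):
    """Derive standard accuracy measurements from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true interactions recovered
    specificity = tn / (tn + fp)   # fraction of noninteractions rejected
    precision = tp / (tp + fp)     # fraction of predictions that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }

def mean_sd(values):
    """Mean and sample sd of one measurement across iterations."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical counts from three held-out test sets.
runs = [
    confusion_metrics(tp=80, fn=20, tn=90, fp=10),
    confusion_metrics(tp=82, fn=18, tn=91, fp=9),
    confusion_metrics(tp=79, fn=21, tn=89, fp=11),
]
print(mean_sd([m["sensitivity"] for m in runs]))
```

On a test set with equal numbers of positive and negative examples, overall accuracy is the mean of sensitivity and specificity, which is consistent with the reported values: (84.52% + 90.58%) / 2 = 87.55%.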

Table III.
The confusion matrix of our prediction models

In addition, a receiver operating characteristic (ROC) curve analysis was performed to verify whether the SVM algorithm was effective in interaction prediction. In the ROC curve, sensitivity was plotted against 1 − specificity for SVM models trained across the full range of parameters. Therefore, the ROC curve analysis is a global and unbiased evaluation of the effectiveness of an algorithm, since it does not depend on any specific parameter used to build a particular model (Hanley and McNeil, 1982). The bigger the area under the curve, the more effective a prediction algorithm would be. We obtained an area under the curve of 91.86% for the SVM approach, which was slightly better than those for several other widely used prediction methods (Supplemental Fig. S1). These results demonstrated that the SVM approach was suitable to exploit the information carried in the weak evidence and to make accurate predictions.

Although the above prediction model worked well on our example set consisting of equal numbers of positive and negative examples, its application in interactome prediction was still limited. Assuming that the ratio of interacting protein pairs out of random pairs in Arabidopsis was similar to that in yeast (i.e. approximately 1:775; Yu et al., 2008), its 90.58% specificity would translate to a large number of false positives. And the precision of its predicted interactions would drop to approximately 1.15%, worthless for expert inspection. Based on our past experiences (Cai et al., 2003; Xue et al., 2004; Xu et al., 2007), the SVM algorithm usually displayed an increased accuracy for the type of examples with either larger number or greater diversity. Thus, one way to increase specificity without significantly compromising sensitivity was to expand the negative examples in the training data. With this strategy, we added 409,761 randomly generated negative examples to the training set, raising the positive-to-negative ratio to 1:100. With the same protocol, this enlarged example set produced an optimal SVM model showing 29.02% ± 0.13% sensitivity, 99.95% ± 0.0013% specificity, 85.15% ± 0.33% precision, and 43.29% ± 0.17% F1 measure. Applying this model to all possible Arabidopsis protein pairs, 224,206 potential interactions were identified.
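The drop in precision under class imbalance follows directly from sensitivity, specificity, and the fraction of true interactions among all pairs. A short Python sketch of that arithmetic is given below; the function name is hypothetical, and the ratio 1:775 is read here as one interacting pair per 775 random pairs, an assumption that reproduces the quoted figure.

```python
def expected_precision(sensitivity, specificity, prevalence):
    """Expected precision when a classifier is applied to a population
    in which a fraction `prevalence` of all pairs truly interact."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Model trained on equal numbers of positive and negative examples,
# applied at a yeast-like interaction rate (assumed prevalence 1/775).
p = expected_precision(0.8452, 0.9058, 1 / 775)
print(f"{100 * p:.2f}%")  # about 1.15%
```

Because false positives scale with the very large number of noninteracting pairs, even a small loss of specificity dominates the result, which motivates training with many more negative examples.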

These 224,206 predicted interactions also implied an estimation of the Arabidopsis interactome size. The equation “true positives + false positives = all predicted positives” can be rewritten as:

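$$N_{\mathrm{int}} \times \mathrm{Sen} + (N_{\mathrm{all}} - N_{\mathrm{int}}) \times (1 - \mathrm{Spe}) = N_{\mathrm{pred}}$$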

where Nint represents the interactome size implied by our model, Nall is the number of all possible Arabidopsis protein pairs (2.29 × 108), Npred is the number of predicted interactions (224,206), and Sen and Spe are the sensitivity and specificity of our model. Solving this equation gave an estimated Arabidopsis interactome size of 3.78 × 105, corresponding to an interaction rate of 1:606, similar to that estimated in yeast (approximately 1:775; Yu et al., 2008). With this estimation, this model was expected to have a precision of 48.67% in interactome prediction. And the first model trained with an equal number of positive and negative examples was expected to display a precision of 1.46% in interactome prediction.
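As a worked check, the relation "true positives + false positives = all predicted positives" can be rearranged to Nint = (Npred − Nall × (1 − Spe)) / (Sen − (1 − Spe)) and evaluated with the rounded values quoted in the text. The Python sketch below uses a hypothetical function name; because the published sensitivity and specificity are rounded, the output is expected to agree with the reported estimate only approximately.

```python
def interactome_size(n_all, n_pred, sen, spe):
    """Solve N_int * Sen + (N_all - N_int) * (1 - Spe) = N_pred for N_int."""
    fpr = 1 - spe  # false positive rate
    return (n_pred - n_all * fpr) / (sen - fpr)

n_all = 2.29e8    # all possible Arabidopsis protein pairs
n_pred = 224_206  # interactions predicted by the 1:100 model
n_int = interactome_size(n_all, n_pred, sen=0.2902, spe=0.9995)

print(round(n_int))          # on the order of 3.8 x 10^5
print(round(n_all / n_int))  # pairs per interaction, close to 600
```

The estimate is sensitive to the specificity value: the expected false positives, Nall × (1 − Spe), account for roughly half of all predictions, so small changes in Spe shift Nint noticeably.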

Comparing the above two models, the one trained with an equal number of positive and negative examples had a high sensitivity (84.52%); that is, it would be less likely to miss a real interaction. For this reason, we call it the high-coverage model, or the 1:1 model. However, this good coverage came at a cost. The confidence of its predicted interactions (precision) was estimated to be as low as 1.46%. In contrast, the model trained with 100 times more negative examples showed a high precision (48.67%); that is, it would be less likely to wrongly predict an interaction. Thus, we call this model the high-confidence model, or the 1:100 model. Nonetheless, its predicted interactions were expected to cover only 29.02% of the entire interactome. While the interactions predicted by the high-coverage model could be more useful to reconfirm interactions identified by other low-confidence approaches (e.g. some HTP experiments that tend to report many false positives), the interactions predicted by the high-confidence model would be more interesting for direct expert inspection.

In addition, it was noted that our high-coverage predictions included the entire set of high-confidence predictions, suggesting the self-consistency of our methods.

Rediscovery of Newly Reported Interactions

The sensitivity of our prediction models in interactome prediction was further validated with one independent evaluation data set containing newly reported interactions that were not included in the major interaction databases at the time when our positive examples were assembled. This data set was extracted from the latest update of BioGRID (Stark et al., 2006), containing 166 novel interactions. As shown in Supplemental Table S1, 122 (73.49%) novel interactions could be successfully recognized by the high-coverage model. This rate of sensitivity was slightly lower than the expected one (84.52%). Therefore, we considered this rate (73.49%) as the likely sensitivity of our high-coverage model when generalized to interactome prediction. On the other hand, the high-confidence model was able to identify 48 (28.91%) novel interactions. This sensitivity was comparable to its expected one (29.02%). In contrast, by chance alone, our high-coverage model and high-confidence model would be expected to have approximately 70.27 ± 8.37 and approximately 4.96 ± 1.94 overlaps, respectively, with this independent evaluation data set. Consequently, the significance of recognizing 122 interactions by our high-coverage model was 3.17 × 10−10. And the significance of recognizing 48 interactions by our high-confidence model was less than 1 × 10−10.

Comparison with Other Predicted Interactomes

Before this work, there had been several efforts to predict Arabidopsis interactions. As mentioned in the introduction, without using complex statistical models, Geisler-Lee et al. (2007) predicted interactions by homology to reported interactions in other organisms (herein referred to as the Interologs data set). Also, Cui et al. (2008) identified 23,396 potential interactions based on multiple indirect evidence with a total likelihood ratio cutoff and published as the AtPID database (the AtPID data set). In addition to these predicted interactomes, Obayashi et al. (2009) computed a coexpression network from 58 microarray experiments, 1,388 slides, which was available as the ATTED-II database (the ATTED-II data set). These data sets offered an interesting opportunity to compare with our results.

Table IV shows the overlaps between our predictions and the above data sets, together with the significance values. It was shown that all data sets exhibited significant overlaps that were not likely produced by chance, indicating their consistency in general. The AtPID set showed the largest proportion of overlap with our predictions, most likely as a result of using a similar set of indirect evidence (e.g. gene coexpression and GO biological process similarity). In contrast, the ATTED-II set had relatively lower overlaps with all data sets, including the gold-standard interaction examples. This was not surprising, as coexpression does not equal interaction. Although gene coexpression and protein interaction are well correlated in prokaryotes, the correlation is not as strong in eukaryotes (Bhardwaj and Lu, 2005), probably because the layer of regulatory mechanisms is different between coexpression and protein interaction (Obayashi et al., 2009). This was the reason why, in this work, coexpression was employed as one feature. Besides it, several other indirect data were also considered to make more reliable predictions.

Table IV.
Comparison with other predicted interactomes

However, although coexpression is weakly correlated with interaction and not enough to indicate an interaction by itself, we indeed observed an increased level of expression correlation in our predicted interactions. The ATTED-II database used two measurements to gauge the correlations between expression profiles: the Pearson's correlation (PC) and mutual rank (MR). A gene pair with higher PC and lower MR was more likely to interact. Results showed that our predicted interactions had an average PC of 0.1341, higher than that of all protein pairs (0.0075; P < 1 × 10−10), and the mean MR of our predicted interactions was 7,454.6, lower than that of all protein pairs (11,286.6; P < 1 × 10−10). This increase of expression correlation was similarly observed in our gold-standard interactions, which displayed an average PC of 0.1224 (P < 1 × 10−10) and a mean MR of 7,994.5 (P < 1 × 10−10). Interestingly, the Wilcoxon rank sum test indicated that the average PC of our predicted interactions (0.1341) was slightly higher than that of the gold-standard interactions (0.1224; P = 0.002), while the mean MR of our predicted interactions (7,454.6) was largely comparable to that of the gold-standard interactions (7,994.5; P = 0.077, no significance).

It was also noted that our predicted interactions had a significant overlap with the potential interactions identified by homology to known interactions in other species. Existence of homologous interactions was one piece of indirect evidence that we did not use in our prediction scheme. Therefore, if predictions by the homology-based approach were independent of ours, the interactions predicted by both approaches should be more reliable. Indeed, for this set of interactions, we had an average SVM confidence score of 1.586 (detailed in “Data Access” below), which was higher than the average of all our predicted interactions (0.761; P < 1 × 10−10 by Wilcoxon rank sum test). On the other hand, Geisler-Lee et al. (2007) computed a confidence value to indicate the confidence of each homology-predicted interaction. For the interactions predicted by both approaches, we observed an average confidence value of 18.507, higher than that of all the homology-predicted interactions (4.208; P < 1 × 10−10). These results suggested the opportunity to derive a set of predicted interactions of extra high confidence. For this purpose, we assembled an updated set of known interactions consisting of 198,537 low-throughput interactions from Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and H. sapiens. These interactions were then mapped onto 275,403 potential Arabidopsis interactions based on the latest homologous gene group assignments as in the Inparanoid database (O'Brien et al., 2005). Among them, 7,927 interactions were also predicted by our high-confidence model. In this set, the average number of homologous interactions for one predicted interaction was approximately 1.79, larger than that in all the homology-based predictions (1.39; P < 2.2 × 10−16). In other words, our SVM model was able to pick out those homology-based predictions that had more support from reported interactions in other species.

Analysis of the Meiotic Recombination-Related Interaction Network

Usage of our predicted interactome was illustrated by an analysis of the meiotic recombination-related interactions. Meiosis is essential for sexual reproduction, in which the step of recombination is critical for normal meiosis. Understanding its underlying molecular mechanisms and regulation has attracted constant attention. The current model of meiotic recombination, known as the double-strand break repair mechanism, was mostly established through analysis of the budding yeast, S. cerevisiae. According to this model (Supplemental Fig. S2), recombination is initiated by the introduction of a double-strand break into the recipient chromatid by a double-strand endonuclease (step 1). This cut is enlarged to a gap, and 3′ single-stranded termini are produced by the action of exonucleases (step 2). One of the free 3′ ends then invades a homologous region of the donor duplex, displacing a small D loop (step 3). The D loop is then enlarged by repair synthesis until the other 3′ end can anneal to complementary single-stranded sequences. Repair synthesis from the second 3′ end completes the process of gap repair, and branch migration results in the formation of two Holliday junctions (step 4). Resolution of two junctions by cutting either inner or outer strands leads to two possible noncrossover and two possible crossover configurations (step 5; Szostak et al., 1983). The Arabidopsis proteins known to function in recombination were reviewed by Wijeratne and Ma (2007).

In our predicted high-confidence interaction network, 11 proteins mentioned in this review were identified as seed proteins (Table V) and extracted with their first neighbors to create a subnetwork for in-depth analysis. This subnetwork contained 122 proteins and 884 interactions (Fig. 3). Among them, eight interactions had been experimentally confirmed and 40 interactions were supported by homologous interactions in other species (Fig. 3; Supplemental Tables S2 and S3). For example, BRCA2A and BRCA2B, orthologs of the human breast cancer susceptibility protein 2, had been reported to physically interact with DMC1 and RAD51 in Arabidopsis (Siaud et al., 2004; Dray et al., 2006), which were also predicted by our SVM model. Another example of predicted interactions supported by known homologue interactions could be found among the Structural Maintenance of Chromosomes family proteins (i.e. SMC1, SMC2, and SMC3) and the Sister Chromatid Cohesion family proteins (i.e. SYN2, SYN3, and SYN4). These proteins are part of the evolutionarily conserved cohesin complex that mediates chromosome cohesion and enables accurate chromosome segregation in both mitosis and meiosis (Revenkova and Jessberger, 2005). A clustering analysis was also performed with the Markov Cluster algorithm (Enright et al., 2002) to reveal the internal structure of our predicted subnetwork. Seven clusters were identified, with six of them connected (Fig. 3).
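Extracting seed proteins together with their first neighbors can be sketched in Python as below. The function name and the protein labels are hypothetical, and the sketch assumes that interactions between two neighbors are retained in the subnetwork, a detail the text does not state explicitly.

```python
def first_neighbor_subnetwork(edges, seeds):
    """Return (nodes, edges) of the subnetwork induced by the seed proteins
    and every protein that interacts directly with at least one seed.

    `edges` is an iterable of (protein_a, protein_b) pairs, treated as
    undirected. Self-interactions are dropped.
    """
    edges = list(edges)
    seeds = set(seeds)
    nodes = set(seeds)
    for a, b in edges:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)
    # Keep every interaction whose two endpoints are both in the node set
    # (assumption: neighbor-neighbor interactions are included).
    sub_edges = {
        frozenset((a, b))
        for a, b in edges
        if a in nodes and b in nodes and a != b
    }
    return nodes, sub_edges

# Hypothetical toy network with one seed protein "S".
toy_edges = [("S", "A"), ("S", "B"), ("A", "B"), ("B", "C")]
nodes, sub_edges = first_neighbor_subnetwork(toy_edges, seeds=["S"])
print(sorted(nodes), len(sub_edges))
```

The resulting edge set could then be passed to a clustering tool such as an implementation of the Markov Cluster algorithm.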

Table V.
Arabidopsis proteins that function in meiotic recombination

Figure 3.
The meiotic recombination-related proteins in our predicted interactome. Eleven seed proteins reviewed by Wijeratne and Ma (2007) and their first neighbors were extracted from our predicted interactions. Interactions between them can be grouped into seven ...

Interestingly, the seed proteins that function in different recombination steps were distributed in nonoverlapping clusters (Fig. 3; Supplemental Table S4), suggesting that our subnetwork structure was functionally relevant. A closer look at the GO biological process annotations of these proteins indicated that proteins in the same cluster were often involved in similar or related biological processes. For example, for the first step of recombination that creates a DNA double-strand break, there were two seed proteins, meiotic recombination proteins SPO11-1 (AtSPO11-1) and SPO11-2 (AtSPO11-2; Grelon et al., 2001; Stacey et al., 2006). Both of them were partitioned into cluster 1, together with three other proteins. All of them were involved in the process of “DNA topological change.” Similarly, the four seed proteins for the third step of recombination (i.e. DNA repair protein RAD51 homolog 1 [RAD51] and homolog 3 [RAD51C], DNA repair protein XRCC3 homolog, and meiotic recombination protein DMC1 homolog) were all partitioned into cluster 4, together with seven other proteins. These proteins were all annotated with similar or related biological processes, such as “base-excision repair,” “DNA repair,” and “meiosis.” The seed proteins for step 4 made two clusters, cluster 5 and cluster 6. Both of them were seemingly homogenous. Cluster 5 had the seed protein Rock-N-Rollers (RCK), a DNA helicase required for interference-sensitive crossover events (Chen et al., 2005). It also contained 21 other helicases, including the RecQ helicase family (i.e. RecQL2, RecQL3, RecQL4A, RecQL1, and RecQL4B), required for genome stability maintenance (Hartung et al., 2000). RecQL2 and RecQL4A are able to suppress homologous recombination (Hartung et al., 2007; Kobbe et al., 2008), while RecQL4B is specifically required to promote crossovers (Hartung et al., 2007). 
These opposing effects suggested that a delicate regulation of homologous crossover might be achieved by this cluster of proteins. The other cluster, cluster 6, was around MSH4, a meiosis-specific member of the MutS homolog family (Higgins et al., 2004). This cluster consisted of proteins involved in “mismatch repair” (e.g. MutS homolog proteins MSH2, MSH3, MSH4, and MSH7) and “regulation of transcription” (e.g. Arabidopsis trithorax-like proteins ATX1, ATX2, ATX3, ATX4, and ATX5). It was known that MSH2, a DNA mismatch repair gene MutS homolog, was able to repress recombination of mismatched heteroduplexes (Emmanuel et al., 2006). And one of the trithorax-like proteins, ATX1, was reported to be an epigenetic regulator with histone H3 Lys-4 methyltransferase activity (Alvarez-Venegas et al., 2003), which is critical in yeast for the formation of programmed DNA double-strand breaks that initiate homologous recombination during meiosis (Borde et al., 2009). These reports suggested the functional relevance of cluster 6 proteins.

There were also cases of proteins in one cluster that displayed more heterogeneous annotations. For example, although one of the seed proteins for the second step of recombination, DNA repair protein RAD50, formed a relatively homogenous cluster (cluster 3; with proteins from the SMC family and the Sister Chromatid Cohesion family), the other seed protein, double-strand break repair protein MRE11, was part of cluster 2, with 32 other proteins annotated with a diversity of biological processes ranging from development to stress responses. However, it was noticed that, despite the fact that they belonged to a number of biological processes, many of the cluster 2 proteins had the capability to bind small molecules, such as calcium, FK506, and cyclosporine A. It was reported that calmodulin antagonists and cAMP inhibited the enhancement of double-strand break repair by ionizing radiation in human cells (Wang et al., 2000). Our data implied the possibility that MRE11 might mediate this suppression. It would be interesting to experimentally demonstrate whether MRE11 actually participates in the regulation of double-strand break repair by calmodulin antagonists and cAMP as well as in the regulation by other small molecules suggested by cluster 2 proteins.

Data Access

In order to facilitate the use of our predicted interactions, we present PAIR, which is freely accessible at http://www.cls.zju.edu.cn/pair/. The current version of PAIR supports searches by protein, by interaction, and by homologous interaction in S. cerevisiae, D. melanogaster, C. elegans, and H. sapiens. To search for a protein, users can specify its name (or alias), locus, or National Center for Biotechnology Information accession number. Interactions can be searched by specifying their component proteins or by specifying the component proteins in their homologous interactions. All queries support wildcard characters such as “*” (representing a string of any length) and “” (representing a single character). If multiple hits are found, a list is presented from which a specific protein or interaction can be chosen and shown in detail.

The protein detail page provides annotations of a protein taken from other databases, including its accession numbers, description, aliases, domain composition, GO annotations, and cellular localizations. Predicted interactions involving this protein are also listed at the bottom. The interaction detail page shows concise descriptions of the two component proteins and the evidence supporting this interaction, along with its homologous interactions in other species. Two types of per-interaction confidence measurements are presented for each interaction. One is the LRs for features and the total LR for each interaction (see “Strength of the Indirect Evidence” above). These LRs are provided along with their percentile rankings in interaction examples to indicate their significance. The other type of per-interaction confidence measurement is an SVM-based confidence score. It had been proposed that the absolute value of the decision function output was correlated with the prediction accuracy (Hua and Sun, 2001). Based on this idea, we computed this value for each predicted interaction as its confidence score. Similarly, these confidence scores are provided along with their percentile rankings in interaction examples to indicate their significance. Whenever a list of interactions is presented, filters based on the total LR and the SVM-based confidence score are available for selection of more reliable predictions. However, because these per-interaction confidence measurements were not rigorously assessed and cannot be directly translated into a probability of true prediction, they should be used with caution.
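The SVM-based confidence score and its percentile ranking can be illustrated with a minimal sketch. The function names are hypothetical, and the percentile convention used here (the share of reference scores less than or equal to the query score) is an assumption, since the text does not specify one.

```python
import bisect

def confidence_score(decision_value):
    """Confidence taken as the absolute value of the SVM decision function output."""
    return abs(decision_value)

def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are <= score (assumed convention)."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    return 100.0 * bisect.bisect_right(ordered, score) / len(ordered)

# Illustrative values only, not taken from PAIR.
reference = [confidence_score(v) for v in (0.2, -0.9, 1.4, 2.1)]
rank = percentile_rank(confidence_score(-1.4), reference)
```

Here the reference scores would be those of the known interaction examples, so a high rank indicates a prediction that lies far from the decision boundary compared with known interactions.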

Users interested in pathway analysis are able to register with PAIR. After login, the database allows users to add specific interactions into a personalized storage place called “My Collection.” Once pathway mining is done, the interactions stored in My Collection can be exported in the HUPO-PSI format (Kerrien et al., 2007), which is readily usable by a variety of mainstream pathway analysis software tools, such as Cytoscape (Shannon et al., 2003).

Future Prospects

As discussed above, the accuracy and coverage of our predicted interactions may seem to reach a meaningful level to assist plant biologists. However, there are still margins for improvement. For example, our training examples were far from adequate. Our interaction examples covered approximately 1% of the total interactome. Rapid advances in plant molecular biology will certainly provide more interaction examples that are able to better characterize their shared characteristics and produce more accurate prediction models. Also, when the feature strength diagrams were compared among features describing the same type of indirect evidence, those based on experimental data or annotation tended to be more discriminative than those based on predicted data (e.g. the domain interaction features derived from Protein Data Bank complex entries versus the domain interaction features derived from predicted data). Yet, unfortunately, for a significant fraction of protein pairs, many experimental data-based features were missing. In other words, these features saw no differences in a great number of protein pairs, which considerably limited their strength and hence required prediction-based features to complement. Therefore, more comprehensive indirect data may also help to increase prediction accuracy. In addition, novel approaches for feature extraction (Ramani and Marcotte, 2003), kernel computation (Xu et al., 2008), and SVM optimization (Shen et al., 2007) may offer new opportunities to build a better prediction model.

For these reasons, regular updates of PAIR are necessary to take advantage of the increasing amount of interaction information and indirect evidence as well as the quickly advancing computational techniques in feature extraction and model construction. The PAIR database was established as part of the informatics infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China. Two updates per year are scheduled. Its long-term running cost is covered by a maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry from the Chinese Ministry of Science and Technology.


The PAIR database was constructed with careful consideration of the lessons learned in previous interaction prediction studies. It offers a high-confidence interaction set with an expected accuracy of approximately 48.7% and a high-coverage interaction set that was estimated to include approximately 73.5% of the entire interactome. These data currently represent the most accurate and comprehensive set of predicted Arabidopsis interactions. Detailed confidence measurements for each predicted interaction are also available. PAIR can be accessed freely online. As part of the infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China, PAIR will be updated and upgraded twice a year. It is our hope that this resource is able to complement the existing informatics framework for Arabidopsis research and offer further help in the study of important molecular mechanisms in this model plant.



SVM is a statistical learning algorithm that learns from a set of training examples and builds a mathematical model to classify new examples. In typical settings, the training set contains two types (classes) of examples, positive examples (e.g. interactions) and negative examples (e.g. noninteractions). Each example is represented by a vector of real numbers (also known as features) describing its characteristics (e.g. indirect evidence of interaction). The SVM algorithm learns the relationship between the class labels and feature values in the training examples and encodes this relationship into a mathematical model. This model can then be used to predict the class label of any given example based on its feature values. A brief account of the SVM algorithm is provided in Supplemental Text S1.

Preparation of the Example Data Set

Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) are required to train an SVM prediction model. To assemble an accurate positive example set, we retrieved and integrated experimentally determined Arabidopsis (Arabidopsis thaliana) physical interactions from IntAct (Kerrien et al., 2006), BIND (Alfarano et al., 2005), BioGRID (Stark et al., 2006), and TAIR (Rhee et al., 2003). Because HTP experiments tend to suffer from high false-positive rates (Deane et al., 2002; von Mering et al., 2002), we excluded interactions only supported by HTP experiments. As shown in Table I, this resulted in 4,139 interactions involving 1,679 proteins.
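The filtering step described above can be sketched as follows. This is only an illustration: the record layout (two protein identifiers plus a flag marking HTP evidence) and the function name are assumptions, not the actual schema of IntAct, BIND, BioGRID, or TAIR.

```python
def curate_positive_set(records):
    """Merge interaction records and keep pairs with at least one non-HTP report.

    records: iterable of (protein_a, protein_b, is_high_throughput) tuples
    (a hypothetical, simplified layout).
    """
    has_low_throughput_support = {}
    for protein_a, protein_b, is_htp in records:
        pair = frozenset((protein_a, protein_b))  # unordered, so A-B equals B-A
        previous = has_low_throughput_support.get(pair, False)
        has_low_throughput_support[pair] = previous or not is_htp
    return {pair for pair, ok in has_low_throughput_support.items() if ok}
```

A pair reported by both an HTP screen and a focused study is retained, whereas a pair seen only in HTP screens is dropped.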

Unlike interacting proteins, it is rare to find reported noninteracting proteins. Several studies have attempted to choose negative examples as pairs of proteins with different cellular localizations. However, this approach did not seem to improve model accuracy. Furthermore, it was argued that this would lead to biased estimations of accuracy, because the constraints placed on the cellular distribution of the negative examples could make the prediction task easier (Ben-Hur and Noble, 2006). Therefore, we followed the approach described by Zhang et al. (2004) to select negative examples as random protein pairs that did not overlap with the positive examples. The resulting negative examples may contain several potential interactions. Yet, given the ratio of interacting protein pairs versus noninteracting protein pairs in yeast (approximately 1:775; Yu et al., 2008), this level of contamination was likely acceptable. In this way, we generated the same number of negative examples. The gold-standard data set consisted of a total of 8,278 protein pairs, half positive, half negative.
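A minimal sketch of this random negative sampling is given below. The function name and the fixed random seed are illustrative, and excluding self-pairs is an assumption that the text does not state.

```python
import random

def sample_negative_pairs(proteins, positive_pairs, n, seed=0):
    """Draw n distinct random protein pairs that are not known interactions."""
    rng = random.Random(seed)
    protein_set = set(proteins)
    positives = {frozenset(p) for p in positive_pairs}
    # Upper bound on available negatives, to avoid looping forever.
    usable_positives = sum(1 for p in positives if len(p) == 2 and p <= protein_set)
    max_negatives = len(protein_set) * (len(protein_set) - 1) // 2 - usable_positives
    if n > max_negatives:
        raise ValueError("not enough candidate pairs for the requested sample")
    candidates = sorted(protein_set)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(candidates, 2)  # two different proteins (assumption)
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)
```

Calling the function with n equal to the number of positive examples yields the balanced design described above (4,139 positive and 4,139 negative pairs).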


Four types of informative indirect evidence were chosen for interaction prediction based on previous studies on the strength of various indirect evidence (Qi et al., 2006; Suthram et al., 2006). Fourteen features were computed by different mathematical characterizations of this evidence. Therefore, a 14-dimension feature vector was constructed to represent each protein pair.

The first type of indirect evidence was gene coexpression. Interacting proteins are often coexpressed. A large set of Arabidopsis gene expression profiles were available at the TAIR database (ftp://ftp.arabidopsis.org/home/tair/Microarrays/), which contained 1,436 experiments normalized by a robust multiarray average method (Rhee et al., 2003). Using this data set, we calculated the Pearson's correlation coefficient for the expression profiles of each protein pair as one feature.
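The coexpression feature can be sketched with a plain implementation of Pearson's correlation coefficient. Returning 0.0 for a constant profile, where the coefficient is undefined, is an assumption made here for illustration only.

```python
import math

def pearson(profile_x, profile_y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(profile_x)
    if n != len(profile_y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(profile_x) / n
    mean_y = sum(profile_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(profile_x, profile_y))
    var_x = sum((x - mean_x) ** 2 for x in profile_x)
    var_y = sum((y - mean_y) ** 2 for y in profile_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant profile; neutral value assumed
    return cov / math.sqrt(var_x * var_y)
```

In the setting described above, each profile would hold the 1,436 normalized expression values of one gene.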

The second type of indirect evidence was domain interaction. Protein interactions involve physical interactions between their domains. It has been proposed that novel protein interactions can be inferred by known domain interactions. Here, we annotated the domain composition of each Arabidopsis protein according to the Pfam database (Coggill et al., 2008), which assigned one or more of the 2,854 distinctive domains to one or more of the 20,183 Arabidopsis proteins. Known interacting domains were retrieved from the DOMINE database (Raghavachari et al., 2008), which contained two data sets derived from the Protein Data Bank molecular complex entries and seven data sets predicted by computational approaches. According to each data set, we counted the number of interacting domains in a pair of proteins as one feature. This resulted in nine different features.
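One way to compute a domain interaction feature is sketched below. It assumes that the count refers to distinct domain pairs, one domain from each protein, that appear in a given set of known interacting domains; the text leaves the exact counting rule open, and the domain identifiers in the example are made up.

```python
def domain_interaction_count(domains_a, domains_b, interacting_domain_pairs):
    """Count distinct interacting domain pairs between two proteins."""
    known = {frozenset(pair) for pair in interacting_domain_pairs}
    hits = set()
    for domain_a in set(domains_a):
        for domain_b in set(domains_b):
            pair = frozenset((domain_a, domain_b))
            if pair in known:
                hits.add(pair)
    return len(hits)

# Illustrative identifiers only.
known_pairs = [("D1", "D3"), ("D2", "D9"), ("D4", "D4")]
count = domain_interaction_count(["D1", "D2"], ["D3", "D4"], known_pairs)
```

Repeating the calculation with each of the nine DOMINE data sets as `interacting_domain_pairs` would give the nine features.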

The third type of indirect evidence was shared annotation. Interacting proteins were expected to share similar molecular functions, to be involved in the same biological processes, and to localize in the same cellular components. GO (Raghavachari et al., 2008) provided a standardized way to annotate proteins from these aspects and therefore offered an excellent chance to compare their annotations. In the GO term tree, strongly related terms were expected to have a shared parent term close to the leaf terms and narrow in concept. The fraction of proteins annotated to this shared parent term and all of its child terms reflects the unexpectedness that two proteins share this particular parent annotation term. Consequently, this fraction could be a suitable score to measure the similarity between two annotation terms. The smaller this score is, the more similar the two terms are. Extending the concept of term similarity, we defined the protein annotation similarity score as the smallest term similarity score between their annotation terms. In TAIR, there were 112,347 annotations assigned to Arabidopsis gene products. By focusing specifically on the terms in one of the three categories (i.e. molecular function, biological process, and cellular component), we computed three protein annotation similarity scores as three features.
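The annotation similarity score can be sketched on a toy term graph as follows. The data structures (a dictionary of parent terms and a dictionary of direct annotations), the function names, and the choice of the shared ancestor with the smallest annotated fraction as the "shared parent term" are assumptions made for illustration.

```python
def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def term_fractions(annotations, parents):
    """Fraction of proteins annotated to each term or to any of its descendants."""
    counts = {}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] = counts.get(term, 0) + 1
    total = len(annotations)
    return {term: count / total for term, count in counts.items()}

def term_similarity(term_1, term_2, parents, fractions):
    """Smallest annotated fraction among the shared ancestors (smaller = more similar)."""
    shared = ancestors(term_1, parents) & ancestors(term_2, parents)
    return min((fractions.get(t, 1.0) for t in shared), default=1.0)

def protein_similarity(protein_a, protein_b, annotations, parents, fractions):
    """Smallest term similarity score over all pairs of annotation terms."""
    return min(
        (term_similarity(t1, t2, parents, fractions)
         for t1 in annotations[protein_a] for t2 in annotations[protein_b]),
        default=1.0,
    )

# Toy graph: root -> {x, y}, x -> {x1, x2}; four proteins.
parents = {"root": [], "x": ["root"], "y": ["root"], "x1": ["x"], "x2": ["x"]}
annotations = {"A": {"x1"}, "B": {"x2"}, "C": {"y"}, "D": {"y"}}
fractions = term_fractions(annotations, parents)
score_ab = protein_similarity("A", "B", annotations, parents, fractions)
```

In this toy example, proteins A and B meet at term x, to which half of the proteins are annotated, so their score is 0.5, whereas A and C only share the root and score 1.0.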

The last type of indirect evidence was colocalization. Interacting proteins are usually colocalized. We obtained protein subcellular localization information from the Arabidopsis Subcellular Database (Heazlewood et al., 2007). The colocalization feature was then calculated as described by Qiu and Noble (2008). Let f(l) be the fraction of all proteins present in location l, and let I(i,l) be an indicator function indicating that protein i is observed to localize to l. The colocalization feature value is then calculated as follows: $F(A,B) = \max_{l} \{-\log(f(l))\, I(A,l)\, I(B,l)\}$. In other words, for protein pair (A,B), this feature was computed as the negative logarithm of the fraction of all proteins present in the most specific location where both proteins A and B were observed to localize.
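The colocalization feature can be sketched directly from this definition. The natural logarithm and the dictionary layout (protein to set of locations) are assumptions, and the example proteins and locations are illustrative.

```python
import math

def colocalization(protein_a, protein_b, locations):
    """F(A,B): largest -log(f(l)) over locations l shared by both proteins."""
    total = len(locations)
    counts = {}
    for observed in locations.values():
        for location in observed:
            counts[location] = counts.get(location, 0) + 1
    shared = locations[protein_a] & locations[protein_b]
    # With no shared location every indicator product is 0, so the feature is 0.
    return max((-math.log(counts[l] / total) for l in shared), default=0.0)

# Toy data: nucleus holds 2 of 4 proteins, cytosol holds 3 of 4.
locations = {
    "A": {"nucleus", "cytosol"},
    "B": {"nucleus", "cytosol"},
    "C": {"cytosol"},
    "D": set(),
}
feature_ab = colocalization("A", "B", locations)
```

For proteins A and B, which share both locations, the rarer location (nucleus) determines the value, giving -log(0.5).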

As shown in Table II, 14 features were computed for each pair of proteins: one feature based on coexpression, one feature based on colocalization, nine features based on domain interaction, and three features based on shared annotation.

Construction of a Prediction Model

An SVM model was constructed to predict Arabidopsis protein interactions. The SVM algorithm is well documented (Winters-Hilt et al., 2006). A brief introduction is also provided in Supplemental Text S1. For implementation, we used the public software package LIBSVM version 2.88 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). After comprehensive evaluation, we chose to use the radial basis function kernel for its best performance (data not shown). In this setup, two parameters, the regularization parameter C and the kernel width parameter γ, were optimized by a grid search in an empirical range (0.001 to 1,000 for C and $3 \times 10^{-5}$ to $4 \times 10^{4}$ for γ), maximizing the F1 measure as defined below. Model performance with each parameter set was estimated by 10-fold cross-validation.
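The parameter search can be outlined in a library-independent sketch. The callback `fit_predict` is a hypothetical stand-in for training an SVM (for example with LIBSVM) on the training folds and predicting the held-out fold. The logarithmic grid spacing, the number of grid points, and the pooling of predictions across folds before computing F1 are assumptions, because the text gives only the parameter ranges.

```python
import itertools
import random

def log_grid(low, high, points):
    """Return `points` values evenly spaced on a log scale from low to high."""
    if points < 2:
        raise ValueError("points must be at least 2")
    ratio = (high / low) ** (1.0 / (points - 1))
    return [low * ratio ** i for i in range(points)]

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); taken as 0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0

def grid_search(X, y, fit_predict, c_values, gamma_values, k=10):
    """Return (C, gamma, F1) with the best cross-validated F1."""
    folds = k_fold_indices(len(y), k)
    best = (None, None, -1.0)
    for C, gamma in itertools.product(c_values, gamma_values):
        y_true, y_pred = [], []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            predictions = fit_predict(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in fold], C, gamma,
            )
            y_true.extend(y[i] for i in fold)
            y_pred.extend(predictions)
        score = f1_score(y_true, y_pred)
        if score > best[2]:
            best = (C, gamma, score)
    return best

# Grids spanning the ranges quoted in the text (resolution is an assumption).
c_grid = log_grid(0.001, 1000.0, 7)
gamma_grid = log_grid(3e-5, 4e4, 10)
```

Passing a `fit_predict` function that wraps an actual SVM trainer would turn this outline into a working search over the two grids.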

Based on whether a real positive or negative example in the testing set was predicted as positive or negative, the prediction result for each protein pair can be divided into four categories: true positive (TP), when a real positive example was predicted correctly; true negative (TN), when a real negative example was predicted correctly; false positive (FP), when a real negative example was predicted as positive; and false negative (FN), when a real positive example was predicted as negative. These four indicators are collectively called the confusion matrix (Table III). Many accuracy measurements are derived from the confusion matrix, such as the sensitivity [or recall; TP/(TP + FN)], specificity [TN/(TN + FP)], and precision [TP/(TP + FP)]. In the application of predicting interactions, the ability to accurately and comprehensively recognize potential interactions is critical. Therefore, the measurements of precision and recall are most relevant. Precision conveys how confident we are when the model predicts an interaction. Recall measures what proportion of the real interactions can be recognized by the model. Giving equal weights to precision (p) and recall (r), Rijsbergen (1979) introduced the “F1 measure” as their harmonic mean [2rp/(r + p)]. This measurement favors a balanced accuracy on both interacting and noninteracting protein pairs. For example, if a model predicted all examples as noninteractions, the overall accuracy would be the fraction of noninteractions in the test data set. However, because no true positives would be recovered, the F1 measure would be zero, reflecting the complete failure to recognize interactions.
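These measurements can be written out in a short sketch. The function name is illustrative, and returning 0 when a denominator is zero is a convention assumed here. The example is a degenerate classifier that labels every pair of a balanced test set as a noninteraction.

```python
def accuracy_measures(tp, tn, fp, fn):
    """Derive common accuracy measurements from the confusion matrix."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0  # assumed convention
    recall = ratio(tp, tp + fn)        # sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# Every pair predicted as a noninteraction: 50 real positives, 50 real negatives.
degenerate = accuracy_measures(tp=0, tn=50, fp=0, fn=50)
```

The degenerate classifier still reaches an overall accuracy of 0.5, yet its F1 measure is 0, which is why F1 was preferred over overall accuracy as the optimization target.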

Comparison of Our Predictions with Other Predicted Interactomes

Our predicted interactions were compared with three published data sets. The Interologs data set was obtained as the supplemental data of Geisler-Lee et al. (2007). The AtPID data set was retrieved from the AtPID database (version 3; Cui et al., 2008). The ATTED-II data set was downloaded from the ATTED-II database (version 4.1; Obayashi et al., 2009). The ATTED-II database did not explicitly suggest a criterion by which potential interactions should be identified based on the expression correlations it provided. However, this database did provide a set of 48,276 highly coexpressed gene pairs, consisting of each gene paired with the three other genes that displayed the top three expression similarities. Coexpression clearly does not equal interaction. Yet for the purpose of comparing predicted interactomes with a coexpression network, we took these highly coexpressed gene pairs as the potential interactions identified by the coexpression method. It should also be noted that in our comparison, the Interologs set and the AtPID set included all of their predicted interactions. Their gold-standard examples not recognized by their respective prediction methods were not counted.

The overlaps between data sets were counted, and the probabilities of these overlaps appearing by chance were calculated as follows. For any two data sets, with 10,000 iterations, we first estimated the distribution of overlaps between two random sets of protein pairs having the same sizes and protein distributions as the two data sets being compared. Assuming a normal distribution, the probability of the observed overlap between these two data sets was then computed against the distribution of random overlaps. This probability was taken as the significance value for these two data sets being correlated.
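The significance calculation can be approximated with the sketch below. It is not the exact randomization used in the study: random sets are generated here by shuffling protein labels in one data set, which keeps the set size and the shape of its degree distribution, and the protein universe defaults to the proteins present in the two sets. Both choices, and all function names, are assumptions.

```python
import math
import random

def _as_pair_set(pairs):
    return {frozenset(pair) for pair in pairs}

def _shuffled_overlap(set_a, set_b, proteins, rng):
    """Overlap after randomly relabeling the proteins of set_b."""
    shuffled = proteins[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(proteins, shuffled))
    random_b = {frozenset(relabel[p] for p in pair) for pair in set_b}
    return len(set_a & random_b)

def overlap_significance(pairs_a, pairs_b, proteins=None, iterations=10000, seed=0):
    """Return (observed overlap, mean random overlap, upper-tail normal probability)."""
    set_a, set_b = _as_pair_set(pairs_a), _as_pair_set(pairs_b)
    if proteins is None:
        proteins = sorted({p for pair in set_a | set_b for p in pair})
    rng = random.Random(seed)
    observed = len(set_a & set_b)
    null = [_shuffled_overlap(set_a, set_b, proteins, rng) for _ in range(iterations)]
    mean = sum(null) / len(null)
    variance = sum((x - mean) ** 2 for x in null) / (len(null) - 1)
    if variance == 0:
        return observed, mean, (0.0 if observed > mean else 1.0)
    z = (observed - mean) / math.sqrt(variance)
    probability = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return observed, mean, probability
```

A small probability indicates that the two data sets share more protein pairs than expected from random sets of comparable size.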

Supplemental Data

The following materials are available in the online version of this article.

  • Supplemental Figure S1. ROC curve analysis of four prediction methods.
  • Supplemental Figure S2. Double-strand break repair model for meiotic recombination.
  • Supplemental Table S1. Rediscovery of newly reported interactions by our prediction models.
  • Supplemental Table S2. Eight experimentally confirmed interactions in our predicted meiotic recombination subnetwork.
  • Supplemental Table S3. Forty predicted interactions with known homologous interactions in our predicted meiotic recombination subnetwork.
  • Supplemental Table S4. The predicted meiotic recombination subnetwork.
  • Supplemental Text S1. A brief introduction to the SVM algorithm and the complete set of feature analysis diagrams.



We thank Lu Zhang for helpful discussions on feature extraction and Li Chen for help in Web site construction.


1This work was supported by the National Natural Science Foundation of China (grant no. 30600039), by the Chinese Ministry of Science and Technology (maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, for long-term running cost of the PAIR database), and by the National Basic Research Program of China (grant no. 2005CB20901 to P.W.).

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xin Chen (xinchen@zju.edu.cn).

[C]Some figures in this article are displayed in color online but in black and white in the print edition.

[W]The online version of this article contains Web-only data.



  • Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 33 D418–D424 [PMC free article] [PubMed]
  • Alvarez-Venegas R, Pien S, Sadder M, Witmer X, Grossniklaus U, Avramova Z (2003) ATX-1, an Arabidopsis homolog of trithorax, activates flower homeotic genes. Curr Biol 13 627–637 [PubMed]
  • Ben-Hur A, Noble WS (2006) Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics (Suppl 1) 7: S2 [PMC free article] [PubMed]
  • Ben-Yacoub S, Abdeljaoued Y, Mayoraz E (1999) Fusion of face and speech data for person identity verification. IEEE Trans Neural Netw 10 1065–1074 [PubMed]
  • Bhardwaj N, Lu H (2005) Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 21 2730–2738 [PubMed]
  • Borde V, Robine N, Lin W, Bonfils S, Geli V, Nicolas A (2009) Histone H3 lysine 4 trimethylation marks meiotic recombination initiation sites. EMBO J 28 99–111 [PMC free article] [PubMed]
  • Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97 262–267 [PMC free article] [PubMed]
  • Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem 26 5–14 [PubMed]
  • Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ (2003) SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 31 3692–3697 [PMC free article] [PubMed]
  • Chatr-aryamontri ACA, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35 D572–D574 [PMC free article] [PubMed]
  • Chen C, Zhang W, Timofejeva L, Gerardin Y, Ma H (2005) The Arabidopsis ROCK-N-ROLLERS gene encodes a homolog of the yeast ATP-dependent DNA helicase MER3 and is required for normal meiotic crossover formation. Plant J 43 321–334 [PubMed]
  • Coggill P, Finn RD, Bateman A (2008) Identifying protein domains with the Pfam database. Curr Protoc Bioinformatics 23 2.5.1–2.5.17
  • Cui J, Li P, Li G, Xu F, Zhao C, Li Y, Yang Z, Wang G, Yu Q, Shi T (2008) AtPID: Arabidopsis thaliana Protein Interactome Database. An integrative platform for plant systems biology. Nucleic Acids Res 36 D999–D1008 [PMC free article] [PubMed]
  • Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1 349–356 [PubMed]
  • de Vel O, Anderson A, Corney M, Mohay G (2001) Mining e-mail content for author identification forensics. Sigmod Record 30 55–64
  • Ding CH, Dubchak I (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17 349–358 [PubMed]
  • Dray E, Siaud N, Dubois E, Doutriaux MP (2006) Interaction between Arabidopsis Brca2 and its partners Rad51, Dmc1, and Dss1. Plant Physiol 140 1059–1069 [PMC free article] [PubMed]
  • Emmanuel E, Yehuda E, Melamed-Bessudo C, Avivi-Ragolsky N, Levy AA (2006) The role of AtMSH2 in homologous recombination in Arabidopsis thaliana. EMBO Rep 7 100–105 [PMC free article] [PubMed]
  • Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30 1575–1584 [PMC free article] [PubMed]
  • Geisler-Lee J, O'Toole N, Ammar R, Provart NJ, Millar AH, Geisler M (2007) A predicted interactome for Arabidopsis. Plant Physiol 145 317–329 [PMC free article] [PubMed]
  • Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al (2003) A protein interaction map of Drosophila melanogaster. Science 302 1727–1736 [PubMed]
  • Grelon M, Vezon D, Gendrot G, Pelletier G (2001) AtSPO11-1 is necessary for efficient meiotic recombination in plants. EMBO J 20 589–600 [PMC free article] [PubMed]
  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 29–36 [PubMed]
  • Hartung F, Plchova H, Puchta H (2000) Molecular characterisation of RecQ homologues in Arabidopsis thaliana. Nucleic Acids Res 28 4275–4282 [PMC free article] [PubMed]
  • Hartung F, Suer S, Puchta H (2007) Two closely related RecQ helicases have antagonistic roles in homologous recombination and DNA repair in Arabidopsis thaliana. Proc Natl Acad Sci USA 104 18836–18841 [PMC free article] [PubMed]
  • Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH (2007) SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res 35 D213–D218 [PMC free article] [PubMed]
  • Higgins JD, Armstrong SJ, Franklin FC, Jones GH (2004) The Arabidopsis MutS homolog AtMSH4 functions at an early step in recombination: evidence for two classes of recombination in Arabidopsis. Genes Dev 18 2557–2570 [PMC free article] [PubMed]
  • Hua S, Sun Z (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 308 397–407 [PubMed]
  • Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302 449–453 [PubMed]
  • Karlsen RE, Gorsich DJ, Gerhart GR (2000) Target classification via support vector machines. Opt Eng 39 704–711
  • Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, et al (2006) IntAct: open source resource for molecular interaction data. Nucleic Acids Res 35 D561–D565 [PMC free article] [PubMed]
  • Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod N, Bader GD, Xenarios I, Wojcik J, Sherman D, et al (2007) Broadening the horizon: level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol 5 44. [PMC free article] [PubMed]
  • Kim KI, Jung K, Park SH, Kim HJ (2001) Support vector machine-based text detection in digital video. Pattern Recognit 34 527–529
  • Kobbe D, Blanck S, Demand K, Focke M, Puchta H (2008) AtRECQ2, a RecQ helicase homologue from Arabidopsis thaliana, is able to disrupt various recombinogenic DNA structures in vitro. Plant J 55 397–405 [PubMed]
  • Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al (2004) A map of the interactome network of the metazoan C. elegans. Science 303 540–543 [PMC free article] [PubMed]
  • Liong SY, Sivapragasam C (2002) Flood stage forecasting with support vector machines. J Am Water Resour Assoc 38 173–186
  • Obayashi T, Hayashi S, Saeki M, Ohta H, Kinoshita K (2009) ATTED-II provides coexpressed gene networks for Arabidopsis. Nucleic Acids Res 37 D987–D991 [PMC free article] [PubMed]
  • O'Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33 D476–D480 [PMC free article] [PubMed]
  • Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 63 490–500 [PMC free article] [PubMed]
  • Qiu J, Noble WS (2008) Predicting co-complexed protein pairs from heterogeneous data. PLOS Comput Biol 4 e1000054. [PMC free article] [PubMed]
  • Raghavachari B, Tasneem A, Przytycka TM, Jothi R (2008) DOMINE: a database of protein domain interactions. Nucleic Acids Res 36 D656–D661 [PMC free article] [PubMed]
  • Ramani AK, Marcotte EM (2003) Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol 327 273–284 [PubMed]
  • Revenkova E, Jessberger R (2005) Keeping sister chromatids together: cohesins in meiosis. Reproduction 130 783–790 [PubMed]
  • Rhee SYBW, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31 224–228 [PMC free article] [PubMed]
  • Rhodes DRTS, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23 951–959 [PubMed]
  • Rijsbergen CJ (1979) Information Retrieval. Butterworths, London
  • Salwinski LMC, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32 D449–D451 [PMC free article] [PubMed]
  • Scott MS, Barton GJ (2007) Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8 239. [PMC free article] [PubMed]
  • Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13 2498–2504 [PMC free article] [PubMed]
  • Shen Q, Shi WM, Kong W, Ye BX (2007) A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta 71 1679–1683 [PubMed]
  • Siaud N, Dray E, Gy I, Gerard E, Takvorian N, Doutriaux MP (2004) Brca2 is involved in meiosis in Arabidopsis thaliana as suggested by its interaction with Dmc1. EMBO J 23 1392–1401 [PMC free article] [PubMed]
  • Stacey NJ, Kuromori T, Azumi Y, Roberts G, Breuer C, Wada T, Maxwell A, Roberts K, Sugimoto-Shirasu K (2006) Arabidopsis SPO11-2 functions with SPO11-1 in meiotic recombination. Plant J 48 206–216 [PubMed]
  • Stark CBB, Requly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34 D535–D539 [PMC free article] [PubMed]
  • Suthram SST, Ruppin E, Sharan R, Ideker T (2006) A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics 7 360. [PMC free article] [PubMed]
  • Szostak JW, Orr-Weaver TL, Rothstein RJ, Stahl FW (1983) The double-strand-break repair model for recombination. Cell 33 25–35 [PubMed]
  • Tsesmetzis N, Couchman M, Higgins J, Smith A, Doonan JH, Seifert GJ, Schmidt EE, Vastrik I, Birney E, Wu G, et al (2008) Arabidopsis reactome: a foundation knowledgebase for plant systems biology. Plant Cell 20 1426–1436 [PMC free article] [PubMed]
  • Uetz PGL, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, et al (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 503 623–627 [PubMed]
  • Vapnik VN (1998) Statistical Learning Theory. Wiley, New York
  • Vapnik VN (2000) The Nature of Statistical Learning Theory, Ed 2. Springer, New York
  • von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417 399–403 [PubMed]
  • Wang Y, Mallya SM, Sikpi MO (2000) Calmodulin antagonists and cAMP inhibit ionizing-radiation-enhancement of double-strand-break repair in human cells. Mutat Res 460 29–39 [PubMed]
  • Wijeratne AJ, Ma H (2007) Genetic analyses of meiotic recombination in Arabidopsis. J Integr Plant Biol 49 1199–1207
  • Winters-Hilt S, Yelundur A, McChesney C, Landry M (2006) Support vector machine implementations for classification and clustering. BMC Bioinformatics 7 (Suppl 2) S4 [PMC free article] [PubMed]
  • Xu H, Lin M, Wang W, Li Z, Huang J, Chen Y, Chen X (2007) Learning the drug target-likeness of a protein. Proteomics 7 4255–4263 [PubMed]
  • Xu JH, Li F, Sun QF (2008) Identification of microRNA precursors with support vector machine and string kernel. Genomics Proteomics Bioinformatics 6 121–128 [PubMed]
  • Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ (2004) Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci 44 1630–1638 [PubMed]
  • Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al (2008) High-quality binary protein interaction map of the yeast interactome network. Science 322 104–110 [PMC free article] [PubMed]
  • Yuan Z, Burrage K, Mattick JS (2002) Prediction of protein solvent accessibility using support vector machines. Proteins 48 566–570 [PubMed]
  • Zhang LV, Wong SL, King OD, Roth FP (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5 38 [PMC free article] [PubMed]
  • Zhao C, Zhang H, Zhang X, Zhang R, Luan F, Liu M, Hu Z, Fan B (2006a) Prediction of milk/plasma drug concentration (M/P) ratio using support vector machine (SVM) method. Pharm Res 23 41–48 [PubMed]
  • Zhao CY, Zhang HX, Zhang XY, Liu MC, Hu ZD, Fan BT (2006b) Application of support vector machine (SVM) for prediction toxic activity of different data sets. Toxicology 217 105–119 [PubMed]

Articles from Plant Physiology are provided here courtesy of American Society of Plant Biologists