BMC Bioinformatics. 2009; 10(Suppl 1): S57.
Published online Jan 30, 2009. doi:  10.1186/1471-2105-10-S1-S57
PMCID: PMC2648735

Finding motif pairs in the interactions between heterogeneous proteins via bootstrapping and boosting

Abstract

Background

Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so these must be generated. In protein-protein interactions and other molecular interactions as well, taking all non-positive interactions as negative interactions produces too many negative interactions for the positive interactions. Random selection from non-positive interactions is unsuitable, since the selected data may not reflect the original distribution of data.

Results

We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, for which structural data was not used to train the algorithm. Interacting motif pairs common to multiple folds of structural data for the complexes were proven to be statistically significant. The data set for interactions between human and virus proteins was extracted from BOND and is available at http://virus.hpid.org/interactions.aspx. The complexes of human and virus proteins were extracted from PDB and their identifiers are available at http://virus.hpid.org/PDB_IDs.html.

Conclusion

When the positive and negative training data sets are unbalanced, the result via the prediction model tends to be biased. Bootstrapping is effective for generating a negative data set, for which the size and distribution are easily controlled. Our boosting algorithm could efficiently predict interacting motif pairs from protein interaction and sequence data, which was trained with the balanced data sets generated via the bootstrapping method.

Background

Linear motifs are known to facilitate many protein-protein interactions [1]. Despite the availability of a large volume of data about protein-protein interactions and their sequences, linear motifs are difficult to discover, due to their short length, which is between three and ten amino acids [2]. Recently, several methods have been developed for discovering linear motifs of protein-protein interactions [1,3], but most methods focus on detecting individual linear motifs rather than interacting motif pairs. Motif pairs are more useful than motifs for filtering many spurious protein interactions in current high-throughput data, and for identifying a functional target.

Supervised learning or stochastic methods are often used to predict linear motifs involved in protein-protein interactions. Both negative and positive interactions are required to train these methods. Unlike positive interaction data, negative samples cannot be readily obtained from protein-protein interaction data. Assuming a negative interaction wherever there is no explicit evidence of a positive interaction results in a negative data set that is much larger than the positive data set. Such an imbalance between positive and negative data sets biases the prediction [4,5]. A negative data set generated via random selection often does not reflect the original distribution of the data, and thus does not produce a good prediction model.

There are a few methods for generating a negative data set. Jansen et al. [6] generate a data set of negative interactions by assuming that proteins in different subcellular compartments of a cell do not interact. However, different subcellular locations only indicate that the proteins have a lower chance of binding than those in the same location, and some proteins are found in more than one subcellular compartment of a cell [7]. The method developed by Gomez et al. [8] assumes a negative protein interaction, if there is no explicit evidence of an interaction. However, this assumption generates a negative data set that is too large, resulting in low sensitivity in interaction predictions. The method that uses the shortest path [7] has difficulty in obtaining a negative data set of the desired size. The method that uses sequence similarity [9] also has difficulty in controlling the size of the negative data set.

In this study, we developed a bootstrapping algorithm for generating a negative data set of protein-protein interactions, and a new boosting algorithm for finding interacting motif pairs from positive and negative data sets. The remainder of the paper describes the algorithms and their experimental results with various parameter values.

Results and discussion

We measured the prediction performance of the boosting algorithm in terms of sensitivity, specificity and accuracy.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$
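
As a minimal sketch of equations 1 to 3, the following Python function computes the three measures from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # equation 1
    specificity = tn / (tn + fp)                 # equation 2
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # equation 3
    return sensitivity, specificity, accuracy


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec, acc = classification_metrics(tp=80, fp=30, tn=70, fn=20)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```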

In the following description, the sampling size S is the number of negative samples that were examined to generate a single re-sampled negative sample via bootstrapping. When the fraction of these S samples whose m-th feature is 1 exceeds the acceptance ratio A, the m-th feature of the re-sampled negative sample is set to 1. The feature vector and the acceptance ratio are described in detail in the Methods section.

Effect of acceptance ratios

From the interactions between human and virus proteins, we generated four different negative data sets by executing the bootstrapping algorithm with four acceptance ratios (1/10, 1/8, 1/6, 1/4). Then, we used both the negative and positive data sets to test the boosting algorithm via five-fold cross validation. Motif pairs predicted from each fold were combined as follows: Mi = {motif pairs found in at least i folds}, where i = 1, 2, ..., 5 [7]. Table 1 shows the number of motif pairs predicted with different acceptance ratios.
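
The combination rule for the sets Mi can be sketched as follows. This is an illustration only: the function name and the string labels standing in for motif pairs are hypothetical.

```python
from collections import Counter


def combine_folds(fold_predictions):
    """Return {i: M_i}, where M_i holds the motif pairs found in at least i folds."""
    # Count in how many folds each motif pair was predicted.
    counts = Counter(pair for fold in fold_predictions for pair in set(fold))
    n_folds = len(fold_predictions)
    return {
        i: {pair for pair, c in counts.items() if c >= i}
        for i in range(1, n_folds + 1)
    }


# Hypothetical predictions from five folds.
folds = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}, {"a"}]
combined = combine_folds(folds)
for i in sorted(combined):
    print(i, sorted(combined[i]))
```

By construction the sets are nested, so M5 is a subset of M4, which is a subset of M3, and so on.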

Table 1
Motif pairs found during five-fold cross validation

As the acceptance ratio increases, re-sampled negative data have fewer nonzero features, resulting in more motif pairs. This is because the nonzero features of negative data are used to filter out the features that are also nonzero in positive data.

With a sampling size of 120, most non-interaction data were re-sampled to generate a negative data set. We compared the prediction performance of the algorithm with respect to four different acceptance ratios. As shown in Table 2, prediction of motif pairs with a larger acceptance ratio performs much better than prediction with a smaller acceptance ratio. As the acceptance ratio increases, negative data have fewer nonzero features. Hence, data with many zero features are easily classified as negative samples.

Table 2
Prediction performance with respect to acceptance ratios of bootstrapping

Effect of proportions of positive and negative data sets

To compare the prediction performance with respect to different proportions of positive and negative data sets, we generated three negative data sets with a sampling size of 120 and an acceptance ratio of 1/8. The data set of 1,712 interactions between human proteins and virus proteins was used as the positive data set. Table 3 and Figure 1 show the prediction performance with respect to three different proportions of positive and negative data sets. As the proportion of positive data increases, sensitivity increases, but specificity decreases. It is interesting to note that the size of the negative data sets alone affects the performance.

Figure 1
Sensitivity and specificity of predictions with respect to proportions of positive and negative data. As the proportion of positive data increases, the sensitivity increases but the specificity decreases.

Table 3
Prediction performance with respect to proportions of positive and negative data

Effect of boosting algorithms

The execution time of the boosting algorithm is influenced by the number of hypotheses (T; for Yu's AdaBoost algorithm only), the number of partitioned data sets (S), and the number of randomly selected training data for weak hypotheses (R). Suppose that we set the parameters to T = 4, S = 5 and R = 100,000. Yu's AdaBoost uses 5 × 4 = 20 weak hypotheses (four per data set), whereas our boosting algorithm uses only five (one per data set). Even with fewer weak hypotheses than Yu's AdaBoost algorithm, our algorithm performs better, as shown in Table 4.

Table 4
Prediction performance of two boosting algorithms

Motif pairs found in complexes of human and virus proteins

Table 5 shows the p-values for each set of motif pairs. The p-value of M1 was 1, implying that the motif pairs of M1 had no more significance than random motif pairs. However, the motif pairs of M2-M5 were more significant than random motif pairs. Figure 2 shows a complex of human and HIV-1 proteins (PDB ID: 1AGF). Among the total of 63 contact residues between chains A and C, 16 residue pairs were included in M2.

Figure 2
Motif pairs predicted for 1AGF. Red balls: contact residue pairs correctly predicted, Cyan balls: contact residue pairs missed in the prediction, Gray wireframe: non-contact residues
Table 5
Motif pairs found in each fold

Conclusion

When positive and negative training data sets are unbalanced, the result via the prediction model tends to be biased. We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, for which structural data was not used for training the algorithm. Interacting motif pairs common to multiple folds of structural data of complexes were proven to be statistically significant.

This method predicts protein-protein interactions and motif pairs using the protein sequence data. The sequence information alone is insufficient to predict motif pairs for some proteins, but our method provides a useful model for predicting motif pairs in protein-protein interactions when the sequence is the only information available. The data set for interactions between human and virus proteins was extracted from BOND and is available at http://virus.hpid.org/interactions.aspx. The complexes of human and virus proteins were extracted from PDB and their identifiers are available at http://virus.hpid.org/PDB_IDs.html.

Methods

Data set

We extracted the latest data of interactions between human and virus proteins from BOND [10]. As of May, 2008, there were 1,712 interactions between 1,029 human proteins and 603 virus proteins. These interactions were considered as positive data. From 1,712 interactions, we constructed three negative data sets of 2,252, 1,712, and 2,283 samples via the bootstrapping method.

Feature vector

The way of extracting features in our study was similar to the one used in the studies of Gomez et al. [8] and Yu et al. [7]. In the study by Gomez et al., four-tuple features were used to identify a subsequence of four amino acids. Based on biochemical similarities of amino acids, twenty amino acids were classified into six categories: {IVLM}, {FYW}, {HKR}, {DE}, {QNTP}, and {ACGS} [11]. After classification, there were $6^4$ = 1,296 possible substrings of length four.

For a given protein sequence, a four-tuple feature is represented as a 1,296-bit binary vector, in which each bit indicates whether the corresponding length-four string occurs in the protein. The encoding scheme for the interaction binary vector is described in Table 6.
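
A minimal sketch of the per-protein four-tuple encoding is given below. The six categories come from the text; the bit ordering (a base-6 index over the category numbers), the handling of non-standard residues, and all names are assumptions made for illustration. The encoding of the interaction vector for a protein pair follows Table 6 and is not reproduced here.

```python
# Six amino-acid categories as listed in the text.
CATEGORIES = ["IVLM", "FYW", "HKR", "DE", "QNTP", "ACGS"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CATEGORIES) for aa in group}
MOTIF_LENGTH = 4
N_FEATURES = len(CATEGORIES) ** MOTIF_LENGTH  # 6**4 = 1296


def four_tuple_vector(sequence):
    """Return a 1,296-bit list marking which reduced four-tuples occur."""
    bits = [0] * N_FEATURES
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for start in range(len(classes) - MOTIF_LENGTH + 1):
        window = classes[start:start + MOTIF_LENGTH]
        if None in window:
            # Assumption: skip windows containing non-standard residues.
            continue
        index = 0
        for c in window:
            # Assumption: base-6 positional index defines the bit order.
            index = index * len(CATEGORIES) + c
        bits[index] = 1
    return bits


vector = four_tuple_vector("MKVLDE")  # hypothetical toy sequence
print(len(vector), sum(vector))
```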

Table 6
Encoding scheme for the interacting motif pairs

Both our previous study [9] and the study of Yu et al. [7] found interacting motif pairs in yeast proteins. A binary vector representing an interacting motif pair is a palindrome, so the total number Msymmetric of possible motif pairs is determined by

$$M_{symmetric} = \binom{6^4}{2} + 6^4 = 840{,}456 \tag{4}$$

The interactions between human and virus proteins are the interactions between heterogeneous proteins. Hence, the total number Masymmetric of possible motif pairs is as follows.

$$M_{asymmetric} = 6^4 \cdot 6^4 = 1{,}679{,}616 \tag{5}$$

Our method is intended for finding motif pairs with 4 consecutive residues (i, i+1, i+2 and i+3) in each motif. Hence, a motif with non-consecutive residues cannot be found even if the residues are spatially close to each other. Since the total number of possible motif pairs is $6^m \cdot 6^m = (6^m)^2 = 6^{2m}$ for a motif of size m (equation 5), the total number of possible motif pairs increases exponentially as m increases. The total number of possible motif pairs can be reduced with a motif of a smaller size (e.g., 2 or 3 residues), but a motif of a small size has too many occurrences in the sequences, which significantly reduces the selectivity of the motif.
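
The counts in equations 4 and 5, and their growth with the motif size m, can be checked with a few lines of Python. The function name is hypothetical.

```python
from math import comb

N_CLASSES = 6  # number of amino-acid categories


def motif_pair_counts(m):
    """Return (symmetric, asymmetric) numbers of possible motif pairs for motif size m."""
    n_motifs = N_CLASSES ** m
    symmetric = comb(n_motifs, 2) + n_motifs   # unordered pairs, equation 4
    asymmetric = n_motifs * n_motifs           # ordered pairs, equation 5
    return symmetric, asymmetric


for m in range(2, 6):
    print(m, motif_pair_counts(m))
```

For m = 4 this reproduces 840,456 symmetric and 1,679,616 asymmetric motif pairs.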

Bootstrapping for re-sampling

As in Gomez et al.'s method [8], we assumed a negative interaction if there was no explicit evidence of an interaction. However, this assumption generates a much larger number of negative samples than positive samples. If we randomly select only some of the negative samples, we might miss information from unselected negative samples. Dupret and Koda [5] used bootstrapping to identify the optimal re-sampling proportions in binary classification experiments.

In our study, we used bootstrapping to generate negative data sets by re-sampling negative data. Algorithm 1 describes our bootstrapping method, which is controlled by the sampling size S and the acceptance ratio A. Each execution of the bootstrapping algorithm yields a single re-sampled negative sample from S negative samples. The re-sampled negative sample is represented as a feature vector Y = {y1, y2, ..., yM}. The number of 1's in the feature vector Y is controlled by the acceptance ratio A: a larger value of A produces a feature vector with fewer nonzero elements.

Algorithm 1 – Bootstrapping algorithm

This algorithm generates the feature vector Y for a single negative data from S samples, where S is the sampling size and A is the acceptance ratio for setting a feature to 1.

1. Randomly sample S protein pairs (Ps1, Ps2) with replacement from non-interacting protein pairs, where s = {1, 2, ..., S}.

2. Initialize ni = 0 for i = {1, 2, ..., M}

3. Initialize yi = 0 for i = {1, 2, ..., M}

4. For s = {1, 2, ..., S}

   a. Make a binary vector Xs = {xs1, xs2, ..., xsM} for a pair of proteins (Ps1, Ps2)

   b. For m = {1...M}

      If xsm = 1, nm = nm + 1 {nm is the number of samples for which the m-th feature = 1}

5. For m = {1...M}

      If nm/S > A, set ym = 1

6. Y = {y1, y2, ..., yM} is a feature vector representing re-sampled negative data.
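
The steps of Algorithm 1 can be sketched in Python as follows. As a simplifying assumption, the sketch takes precomputed binary feature vectors of non-interacting protein pairs as input instead of building them from sequences in step 4a; the function and parameter names are hypothetical.

```python
import random


def bootstrap_negative(feature_vectors, sample_size, acceptance_ratio, rng=random):
    """Build one re-sampled negative feature vector Y from S sampled vectors."""
    n_features = len(feature_vectors[0])
    counts = [0] * n_features            # n_m: samples whose m-th feature is 1
    for _ in range(sample_size):
        x = rng.choice(feature_vectors)  # sampling with replacement (step 1)
        for m, bit in enumerate(x):
            if bit == 1:
                counts[m] += 1
    # Step 5: set y_m = 1 when the fraction n_m / S exceeds A.
    return [1 if counts[m] / sample_size > acceptance_ratio else 0
            for m in range(n_features)]


# Hypothetical toy pool of negative feature vectors.
pool = [[1, 0, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
y = bootstrap_negative(pool, sample_size=120, acceptance_ratio=1 / 8,
                       rng=random.Random(0))
print(y)
```

Calling the function repeatedly yields a negative data set of any desired size, which is how the size of the negative set can be matched to the positive set.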

The boosting algorithm

In general, the boosting method finds a highly accurate hypothesis by combining weak hypotheses, each of which is only moderately accurate. Typically, each weak hypothesis is a simple classification rule. In AdaBoost (Adaptive Boosting), each weak hypothesis generates not only a classification rule but also a confidence score that estimates the reliability of the classification [12].

The study of Yu et al. [7] uses the AdaBoost algorithm for finding motif pairs in homogeneous protein interactions. One of the differences between Yu's algorithm and ours is the number of weak hypotheses used in the algorithms. In Yu's AdaBoost algorithm, if the weight (αs1) of the first weak hypothesis is much greater than the weights of other hypotheses, the final hypothesis is determined mainly by the first weak hypothesis and other hypotheses have negligible effect on the final hypothesis.

Our boosting algorithm determines the weights of weak hypotheses and uses the training data in a different way from Yu's algorithm. While Yu's AdaBoost algorithm uses different weights and the same training data per weak hypothesis, our algorithm uses the same weights and different training data per weak hypothesis. Our boosting algorithm uses fewer weak hypotheses than Yu's algorithm, and requires much less time than their algorithm.

Our algorithm consists of two parts: the boosting algorithm and the WINNOW2 algorithm. The boosting algorithm described in Algorithm 2 takes as input a training set (x1, y1), ..., (xn, yn), where each xi is a binary vector of length M that represents an interaction and each yi is a label in the label set Y = {-1, +1}, indicating whether the interaction is positive or negative. The boosting algorithm calls the WINNOW2 algorithm to obtain a weak hypothesis in each of an iterative series of rounds t = 1, ..., S. In each round, the boosting algorithm computes the weight (αt) of the weak hypothesis hc,t. The final hypothesis Ht for Sett is the weighted sum of the weak hypotheses hc,i (i = 1, ..., S and i ≠ t).

We used a regulated stochastic WINNOW2 algorithm [13] with R = 200,000 as a weak classifier (Algorithm 3). The WINNOW2 algorithm is similar to that of Yu et al. [7], except for the step of updating learner factors. Yu's algorithm updates learner factors when xki (feature vector) is 0, but our algorithm updates them when xki is 1. Yu's algorithm takes as input a training set and computes normalized sample weights in each boosting round. In the step of drawing a sample data, data with larger weights are drawn more frequently than those with smaller weights. Since the sample weights are difficult to adjust in each round, our algorithm uses the same weight for every sample and draws samples with equal frequency. But, the training data is changed in every round, and the call to the WINNOW2 algorithm produces different hypotheses according to the training data. Finally, additional regulation is performed to discover effective components. The components with large learner factors are identified as effective components. These effective components are considered as the motif pairs of protein-protein interactions.

Suppose that there are five data sets (S = 5) and four weak hypotheses (T = 4 in Yu's algorithm) per round. Yu's AdaBoost algorithm requires 5 × 4 = 20 weak hypotheses to classify the data. In contrast, our boosting algorithm requires only one weak hypothesis per round, and five weak hypotheses in total, so it does not need the parameter T. Since the execution times of the algorithms are proportional to the number of hypotheses, our algorithm is more than four times faster than Yu's algorithm for the same data set, without reducing the prediction accuracy [9]. The frameworks for both algorithms are shown in Figures 3 and 4.

Figure 3
Framework for Yu's AdaBoost algorithm. The AdaBoost algorithm requires 20 weak hypotheses for T = 4 and S = 5.
Figure 4
The framework of our boosting algorithm. Our algorithm requires only 5 weak hypotheses for S = 5.

Algorithm 2 – boosting algorithm

The boosting algorithm calls the WINNOW2 algorithm to obtain weak hypotheses. S is the number of divided data sets.

1. Given divided data sets Set1, Set2, ..., SetS, where $\bigcup_{t=1}^{S} Set_t = Set_{total}$.

2. For t = 1, ..., S

   a. Given training data (x1, y1), (x2, y2), ..., (xn, yn) from Sett, where $x_i \in \{0, 1\}^M$ and $y_i \in Y = \{-1, +1\}$ for i = 1, 2, ..., n

   b. Call the WINNOW2 algorithm to obtain the weak hypothesis hc,t.

   c. Compute the error rt of the weak hypothesis hc,t at level c.

$$r_t = \frac{1}{n} \sum_{i=1}^{n} y_i \, h_{c,t}(x_i).$$

   d. Compute the weight αt of the weak hypothesis

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right).$$

3. Output the final hypothesis for Sett:

$$H_t(x) = \mathrm{sign}\left(\sum_{i=1,\, i \neq t}^{S} \alpha_i \, h_{c,i}(x)\right).$$
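
The weight computation and the final vote of Algorithm 2 can be sketched as below. This is a minimal illustration in which weak hypotheses are plain Python callables returning -1 or +1; the clamping of r, the treatment of a zero score as negative, and all names are assumptions that are not stated in the algorithm.

```python
import math


def hypothesis_weight(hypothesis, samples):
    """Return (r_t, alpha_t) for a weak hypothesis on labelled samples (x, y)."""
    r = sum(y * hypothesis(x) for x, y in samples) / len(samples)
    # Assumption: clamp r away from +/-1 so the logarithm stays finite.
    r = max(min(r, 1 - 1e-9), -1 + 1e-9)
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    return r, alpha


def final_hypothesis(t, hypotheses, alphas):
    """Return H_t: the sign of the weighted vote of all hypotheses except the t-th."""
    def classify(x):
        score = sum(a * h(x)
                    for i, (h, a) in enumerate(zip(hypotheses, alphas))
                    if i != t)
        return 1 if score > 0 else -1  # assumption: a zero score counts as negative
    return classify


# Hypothetical weak hypothesis and toy data: predict +1 when the first bit is set.
weak = lambda x: 1 if x[0] == 1 else -1
data = [([1], 1), ([0], -1), ([1], -1), ([0], -1)]
print(hypothesis_weight(weak, data))
```

A hypothesis that agrees with the labels more often has r closer to 1 and therefore a larger weight in the vote.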

Algorithm 3 – WINNOW2 algorithm

The WINNOW2 algorithm trains the weak hypothesis. R is the number of randomly selected data.

1. Given training data (x1, y1), (x2, y2)..., (xn, yn).

2. Initialize learner factor wi = 1 for i = {1, 2, ..., M}, and threshold θ = M/2

3. For r = {1, ..., R}

   a. Randomly select a sample data (xk, yk), and let vector xk denote (xk1, xk2, ..., xkM)

   b. The learner responds as follows:

$$h(x_k) = \begin{cases} 1 & \text{if } \sum_{i=1}^{M} w_i x_{ki} > \theta \\ -1 & \text{if } \sum_{i=1}^{M} w_i x_{ki} \le \theta \end{cases}$$

   c. Update learner factors: $w_i = w_i \cdot 2^{x_{ki}(y_k - h(x_k))/2}$

4. Define a regulated classifier hc at level c as follows:

$$h_c(x_k) = \begin{cases} 1 & \text{if } \sum_{i=1}^{M} w_{i,c} x_{ki} > \theta \\ -1 & \text{if } \sum_{i=1}^{M} w_{i,c} x_{ki} \le \theta \end{cases}$$

where wi,c = wi if wi ≥ c, and wi,c = 0 otherwise.

5. Let Nc denote the number of positive predictions by classifier hc in the training data and N0 denote the number of positive predictions with the cutoff of 0.

   Output the classifier hC where C = arg max {c | Nc = N0}.

6. The features with non-zero wi,c are effective motif pairs.
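
A compact Python sketch of Algorithm 3 follows. It assumes labels in {-1, +1}, so that the update doubles the factors of active features after a missed positive and halves them after a false positive. Restricting the candidate cutoffs to the distinct learned factors, and all function names, are assumptions made for illustration.

```python
import random


def predict(w, theta, x):
    """Linear-threshold response: +1 if the weighted sum exceeds theta, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else -1


def winnow2(samples, rounds, rng=random):
    """Train learner factors on randomly drawn samples (steps 2 and 3)."""
    n_features = len(samples[0][0])
    w = [1.0] * n_features
    theta = n_features / 2
    for _ in range(rounds):
        x, y = rng.choice(samples)
        h = predict(w, theta, x)
        exponent = (y - h) // 2          # +1 promote, -1 demote, 0 no change
        if exponent:
            w = [wi * 2.0 ** (xi * exponent) for wi, xi in zip(w, x)]
    return w, theta


def regulate(w, cutoff):
    """Zero out learner factors below the cutoff c (step 4)."""
    return [wi if wi >= cutoff else 0.0 for wi in w]


def select_cutoff(w, theta, samples):
    """Largest candidate cutoff that keeps the number of positive predictions (step 5)."""
    def n_positive(c):
        wc = regulate(w, c)
        return sum(predict(wc, theta, x) == 1 for x, _ in samples)
    n0 = n_positive(0.0)
    # Assumption: only 0 and the distinct learned factors are tried as cutoffs.
    return max(c for c in [0.0] + sorted(set(w)) if n_positive(c) == n0)


# Hypothetical toy data: the first feature marks positive interactions.
toy = [([1, 0, 0, 0], 1), ([0, 1, 1, 1], -1)]
w, theta = winnow2(toy, rounds=200, rng=random.Random(0))
c = select_cutoff(w, theta, toy)
print(w, c, regulate(w, c))
```

The features that keep a non-zero factor after regulation correspond to the effective motif pairs of step 6.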

Verification with structural data

To further evaluate the algorithm for the structures of heterogeneous multi-protein complexes, we extracted structural data for complexes of human and virus proteins from PDB [14]. Complexes with RNA or DNA chains were not retrieved. Circa June 2008, there were a total of 105 complexes of human and virus proteins in PDB.

We used five-fold cross validation to evaluate the algorithm. The data set was split into five parts of equal size. The boosting algorithm using the WINNOW2 algorithm for weak hypotheses was trained with one part and tested with the remaining four parts. The train-test procedure consisted of five iterations.

When a residue pair in different chains contained an atomic pair within the distance of 5 Å, we considered the residue pair as a contact residue pair. If a motif pair had at least one contact residue pair, we considered the motif pair as a verifiable motif pair [7]. To assess the statistical significance of motif pairs predicted by our algorithm, we estimated the p-value of motif pairs by executing Algorithm 4 with m = 100,000 [9]. Motif pairs with lower p-values are more significant than those with higher p-values.
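
The contact criterion can be illustrated with a short sketch. Representing a residue as a list of (x, y, z) atom coordinates in ångströms is an assumption, as are the function names and the toy coordinates; parsing of PDB files is omitted.

```python
from math import dist

CONTACT_CUTOFF = 5.0  # ångströms, as stated in the text


def is_contact(residue_a, residue_b, cutoff=CONTACT_CUTOFF):
    """True if any atom pair across the two residues lies within the cutoff."""
    return any(dist(a, b) <= cutoff for a in residue_a for b in residue_b)


def is_verifiable(motif_a, motif_b):
    """True if at least one residue pair across the two motifs is in contact."""
    return any(is_contact(ra, rb) for ra in motif_a for rb in motif_b)


# Hypothetical coordinates for two residues in different chains.
res_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
res_b = [(0.0, 0.0, 4.0)]
res_far = [(0.0, 0.0, 20.0)]
print(is_contact(res_a, res_b), is_contact(res_a, res_far))
```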

Algorithm 4 – Estimation of p-values of motif pairs

A motif pair with a smaller p-value is more significant than a random motif pair Ri.

1. Given a set S of motif pairs collected by weak hypotheses.

2. Randomly draw m sets of motif pairs {R1, R2, ..., Rm}, where each Ri has the same size as Mk (k = 1, 2, ..., 5)

3. Compute the p-value of the set S as follows:

$$p(S) = \frac{\#\{\, i : V(R_i) \ge V(S) \,\}}{m}, \quad i = 1, 2, \ldots, m.$$

where V(S) is the number of verifiable motif pairs.
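
The estimate in Algorithm 4 is a Monte Carlo fraction, sketched below. The sketch assumes that random sets are drawn without replacement from a list of all candidate motif pairs and that the verifiable motif pairs are known as a set; the names and toy data are hypothetical.

```python
import random


def estimate_p_value(predicted, all_pairs, verifiable, m=1000, rng=random):
    """Fraction of m random sets with at least as many verifiable pairs as `predicted`."""
    v_s = len(set(predicted) & verifiable)           # V(S)
    hits = 0
    for _ in range(m):
        sample = rng.sample(all_pairs, len(predicted))  # R_i, same size as S
        if len(set(sample) & verifiable) >= v_s:        # V(R_i) >= V(S)
            hits += 1
    return hits / m


# Hypothetical toy universe of 10 motif pairs, 2 of them verifiable.
universe = list(range(10))
p = estimate_p_value(predicted=[0, 1], all_pairs=universe,
                     verifiable={0, 1}, m=1000, rng=random.Random(0))
print(p)
```

A set whose verifiable count is no better than chance yields a p-value near 1, which matches the interpretation given for M1 in the results.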

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2006-D00038).

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

References

1. Davey NE, Shields DC, Edwards RJ. SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res. 2006;34:3546–3554. doi: 10.1093/nar/gkl486.
2. Neduva V, Russell RB. Linear motifs: Evolutionary interaction switches. FEBS Letters. 2005;579:3342–3345. doi: 10.1016/j.febslet.2005.04.005.
3. Neduva V, Russell RB. DILIMOT: discovery of linear motifs in proteins. Nucleic Acids Res. 2006;34:W350–W355. doi: 10.1093/nar/gkl159.
4. Olson DL. Data Set Balancing. Lecture Notes in Artificial Intelligence. 2004;3327:71–80.
5. Dupret G, Koda M. Bootstrap re-sampling for unbalanced data in supervised learning. European Journal of Operational Research. 2001;134:141–156. doi: 10.1016/S0377-2217(00)00244-7.
6. Jansen R, Gerstein M. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Current Opinion in Microbiology. 2004;7:535–545. doi: 10.1016/j.mib.2004.08.012.
7. Yu H, Qian M, Deng M. Using a Stochastic AdaBoost Algorithm to Discover Interactome Motif Pairs from Sequences. Lecture Notes in Bioinformatics. 2006;4115:622–630.
8. Gomez SM, Noble WS, Rzhetsky A. Learning to Predict Protein-Protein Interactions from Protein Sequences. Bioinformatics. 2003;19:1875–1881. doi: 10.1093/bioinformatics/btg352.
9. Kim J, Park B, Han K. Prediction of Interacting Motif Pairs using Stochastic Boosting. Proceedings of Frontiers in the Convergence of Bioscience and Information Technologies. 2007. pp. 95–100.
10. Alfarano C, Andrade CE, Anthony K, et al. The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005;33:D418–D424. doi: 10.1093/nar/gki051.
11. Taylor WR, Jones DT. Deriving an amino acid distance matrix. Journal of Theoretical Biology. 1993;164:65–83. doi: 10.1006/jtbi.1993.1140.
12. Schapire RE, Singer Y. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning. 1999;37:297–336. doi: 10.1023/A:1007614523901.
13. Littlestone N. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning. 1988;2:285–318.
14. Deshpande N, Addess KJ, Bluhm WF, et al. The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 2005;33:D233–D237. doi: 10.1093/nar/gki057.

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central