Bioinformatics. 2011 Nov 1;27(21):3024-8. doi: 10.1093/bioinformatics/btr514. Epub 2011 Sep 9.

Revisiting the negative example sampling problem for predicting protein-protein interactions.

Author information

Center for Systems and Synthetic Biology, Institute of Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas 78712, USA. yungki@mail.utexas.edu

Abstract

MOTIVATION:

A number of computational methods have been proposed that predict protein-protein interactions (PPIs) based on protein sequence features. Since the number of potential non-interacting protein pairs (negative PPIs) is very high both in absolute terms and in comparison to that of interacting protein pairs (positive PPIs), computational prediction methods rely upon subsets of negative PPIs for training and validation. Hence, the need arises for subset sampling for negative PPIs.

RESULTS:

We clarify that there are two fundamentally different types of subset sampling for negative PPIs. One is subset sampling for cross-validated testing, where one desires unbiased subsets so that predictive performance estimated with them can be safely assumed to generalize to the population level. The other is subset sampling for training, where one desires the subsets that best train predictive algorithms, even if these subsets are biased. We show that confusion between these two fundamentally different types of subset sampling led one study recently published in Bioinformatics to the erroneous conclusion that predictive algorithms based on protein sequence features are hardly better than random in predicting PPIs. Rather, both protein sequence features and the 'hubbiness' of interacting proteins contribute to effective prediction of PPIs. We provide guidance for appropriate use of random versus balanced sampling.
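The difference between the two sampling schemes can be illustrated with a minimal sketch. It assumes that "balanced" sampling means each protein appears in the negative set no more often than it does in the positive set, which the abstract does not spell out. The function names, the toy protein labels, and the greedy quota scheme are all hypothetical and are not taken from the study's own code.

```python
import itertools
import random
from collections import Counter


def random_negatives(proteins, positives, n, seed=0):
    """Draw n non-interacting pairs uniformly from all unordered protein pairs."""
    rng = random.Random(seed)
    positive_set = {frozenset(pair) for pair in positives}
    candidates = [
        pair
        for pair in itertools.combinations(sorted(proteins), 2)
        if frozenset(pair) not in positive_set
    ]
    return rng.sample(candidates, n)


def balanced_negatives(positives, seed=0, max_tries=10000):
    """Draw negatives so that no protein appears more often than in the positives.

    Assumption: this greedy quota scheme is one plausible reading of
    "balanced" sampling, not the published procedure.
    """
    rng = random.Random(seed)
    positive_set = {frozenset(pair) for pair in positives}
    # Each protein starts with a quota equal to its number of positive pairs.
    quota = Counter(protein for pair in positives for protein in pair)
    negatives = set()
    for _ in range(max_tries):
        open_proteins = sorted(p for p, q in quota.items() if q > 0)
        if len(open_proteins) < 2:
            break
        # Pick two proteins with probability proportional to remaining quota.
        weights = [quota[p] for p in open_proteins]
        a, b = rng.choices(open_proteins, weights=weights, k=2)
        pair = frozenset((a, b))
        if a == b or pair in positive_set or pair in negatives:
            continue
        negatives.add(pair)
        quota[a] -= 1
        quota[b] -= 1
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy network (hypothetical): one hub "H" with four partners, plus one isolated pair.
toy_positives = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("E", "F")]
toy_proteins = {protein for pair in toy_positives for protein in pair}

unbiased = random_negatives(toy_proteins, toy_positives, n=5)
biased = balanced_negatives(toy_positives)
```

Under this reading, the random sample reflects the full population of non-interacting pairs, which is what unbiased testing calls for. The balanced sample gives hub proteins more room in the negative set, so it is biased by construction and is a candidate for training only.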

AVAILABILITY:

The datasets used for this study are available at http://www.marcottelab.org/PPINegativeDataSampling.

CONTACT:

yungki@mail.utexas.edu; marcotte@icmb.utexas.edu.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

PMID: 21908540
PMCID: PMC3198576
DOI: 10.1093/bioinformatics/btr514
[Indexed for MEDLINE]
Free PMC Article