^{1}Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA. rgentlem@fhcrc.org

Abstract

We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.

Interpreting results on direct physical interactions from Y2H experiments. (a) The observation of interactions A-B and B-C in a Y2H experiment does not indicate whether the two interactions can take place simultaneously (center) or whether they are exclusive of each other (right). (b) The ability of two proteins to interact may depend on post-translational modifications whose presence or absence may be actively regulated. Proteins D and E interact (center) in the absence of a certain post-translational modification (red shape), whose presence inhibits the interaction (right).

The manifestation of protein complexes in Y2H and AP-MS data. AP-MS experiments measure complex co-membership, and the fact that a prey is found by a certain bait means that there is either a direct physical interaction or an indirect physical interaction mediated by a protein complex. The set of proteins pulled down by a particular bait therefore cannot be equated with a single complex: if the bait is part of several different complexes, then the set of prey will be the union of all proteins in all of those complexes. (a) Protein B is involved in three different multiprotein complexes. In two of these it directly interacts with C, which itself can also interact with proteins F, G or H, whereas in the third complex, B interacts with D and E. (b) The bipartite graph between proteins B, ..., H and complexes 1, 2 and 3, assuming there are no other interactions under the conditions of the experiment. (c,d) The result of a hypothetical AP-MS experiment with no false positives and no false negatives when (c) B is used as a bait and (d) F is used as a bait. (e,f) The result of a hypothetical Y2H experiment with a genome-wide set of preys and with no false positives and no false negatives when (e) B is used as a bait and (f) F is used as a bait. (g,h) The results of (g) an ideal AP-MS experiment and (h) an ideal Y2H experiment if all proteins were used as baits. The Y2H data in (e,f,h) identify the direct interactions, but they do not contain information on the number and architecture of the complexes. The maximal cliques identified by the AP-MS experiment in (g) correspond to the complexes in (a). However, the AP-MS data do not contain information on the topology of the direct interactions within each complex.
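The difference between the two idealized readouts described in this caption can be illustrated with a small sketch. The complex compositions and the direct-interaction edges below are hypothetical, chosen only to be compatible with the caption's description (the figure itself is not reproduced here), and the function names `apms_prey` and `y2h_prey` are ours.

```python
# Hypothetical bipartite graph: complex id -> member proteins.
# The exact compositions are an assumption made for illustration.
complexes = {
    1: {"B", "C", "F", "G"},
    2: {"B", "C", "H"},
    3: {"B", "D", "E"},
}

# Hypothetical direct physical interactions (undirected edges).
direct = {
    frozenset(pair)
    for pair in [("B", "C"), ("C", "F"), ("C", "G"),
                 ("C", "H"), ("B", "D"), ("B", "E")]
}

def apms_prey(bait, complexes):
    """Error-free AP-MS: union of all complexes containing the bait."""
    prey = set()
    for members in complexes.values():
        if bait in members:
            prey |= members
    prey.discard(bait)  # the bait is not reported as its own prey
    return prey

def y2h_prey(bait, direct):
    """Error-free Y2H: only the direct interaction partners of the bait."""
    return {p for edge in direct if bait in edge for p in edge if p != bait}

print(sorted(apms_prey("B", complexes)))
print(sorted(y2h_prey("B", direct)))
```

With B as bait, the AP-MS readout merges the members of all three complexes into one prey set, whereas the Y2H readout returns only B's direct partners and says nothing about complex membership.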

Graph theory offers a convenient and useful set of terms and concepts to represent relationships between entities. Graphs most commonly represent binary relationships and these can be either directed or undirected. A further type of graph is needed to represent the membership of proteins in complexes: this relationship is not binary and requires a type of graph called a bipartite graph. Box gives precise definitions of these concepts and an overview of how they apply to protein-interaction data.

Standard definitions of various error statistics [] are given in Box . We give them to enable a coherent dialog and to address some of the confusion in the literature. For example, a widely cited evaluation study by Edwards et al. [] reported a "false positive rate" defined as FP/(TP + FP), where FP is the number of false positives and TP the number of true positives. However, the more common name for this quantity is the 'false-discovery rate' (see Box ). The difference between the false-positive rate, usually defined as FP/N, and the false-discovery rate can be substantial because their denominators are very different: N is the number of tested true non-interactions, given by TN + FP, whereas TP + FP is the number of reported interactions (see Box ). Incompatible terminology leads to confusion and makes it difficult to compare error rates reported in different studies.
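A short numerical sketch shows how far apart the two quantities can be. The counts below are invented purely for illustration and are not taken from any dataset; the function name `error_rates` is ours.

```python
def error_rates(tp, fp, tn):
    """Return (false-positive rate, false-discovery rate) from raw counts."""
    fpr = fp / (fp + tn)  # FP / N, where N = TN + FP true non-interactions
    fdr = fp / (tp + fp)  # FP / all reported interactions
    return fpr, fdr

# Invented counts: one million tested true non-interactions,
# of which 500 are wrongly reported, alongside 500 true positives.
fpr, fdr = error_rates(tp=500, fp=500, tn=999_500)
print(f"false-positive rate:  {fpr:.4%}")
print(f"false-discovery rate: {fdr:.1%}")
```

Because true non-interactions vastly outnumber true interactions, a very small false-positive rate is compatible with half of the reported interactions being wrong.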

Scatterplot of n_{in} and n_{out} for the AP-MS data of Krogan et al. [11]. Each point in the plot corresponds to one protein. n_{in} is the number of times that the protein was found as a prey; n_{out} is the number of prey it found when used as a bait. The two lines mark contours of probability p = 10^{-4} according to the Binomial model in Equation (3). Outlying proteins (dark blue) show a significantly large difference between n_{in} and n_{out}, suggesting that at least one of them is wrong. For example, if n_{out} >> n_{in}, one possible reason is that a protein is not expressed when used as prey or of such low abundance that it is outcompeted, but when tagged and expressed as a bait, it will identify and pull down its interaction partners as prey. Further validation experiments are needed to determine in each case whether the unreciprocated interactions correspond to false-positive or false-negative observations.
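As a rough sketch of how such outliers could be flagged, the code below assumes a symmetric null model in which, given n = n_{in} + n_{out}, the count n_{in} follows a Binomial(n, 1/2) distribution. Equation (3) is not reproduced in this section, so this model, the two-sided form of the test, and the function name `two_sided_binomial_p` are assumptions made for illustration rather than the authors' exact procedure.

```python
from math import comb

def two_sided_binomial_p(n_in, n_out):
    """Two-sided p-value for n_in under an assumed Binomial(n_in + n_out, 1/2)."""
    n = n_in + n_out
    if n == 0:
        return 1.0  # no observations, no evidence of asymmetry
    k = min(n_in, n_out)
    # Probability of a count at least as small as the smaller of the two,
    # doubled by symmetry of the p = 1/2 binomial and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical proteins: (n_in, n_out)
for n_in, n_out in [(5, 5), (2, 30), (0, 20)]:
    p = two_sided_binomial_p(n_in, n_out)
    flag = "outlier" if p < 1e-4 else "ok"
    print(n_in, n_out, f"{p:.3g}", flag)
```

A protein flagged in this way indicates only that n_{in} and n_{out} are inconsistent with each other under the assumed model; it does not say which of the two is in error.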
