###### From the Cover

# Estimating the size of the human interactome

^{†}

^{‡}

^{§}Thomas Thorne,

^{†}Eric de Silva,

^{†}Ronald Stewart,

^{†}Hyeong Jun An,

^{¶}Michael Lappe,

^{¶}and Carsten Wiuf

^{§}

^{‖}

^{†}Division of Molecular Biosciences, Imperial College London, Wolfson Building, London SW7 2AZ, United Kingdom;

^{‡}Institute of Mathematical Sciences, Imperial College London, London SW7 2AZ, United Kingdom;

^{‖}Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany; and

^{¶}Bioinformatics Research Center, University of Aarhus, 8000 Aarhus C, Denmark

^{§}To whom correspondence may be addressed. E-mail: ku.ca.lairepmi@fpmuts.m or ; Email: kd.ua.crib@fuiw

Author contributions: M.P.H.S., M.L., and C.W. designed research; M.P.H.S., T.T., E.d.S., M.L., and C.W. performed research; M.P.H.S., T.T., E.d.S., R.S., H.J.A., and C.W. analyzed data; and M.P.H.S., M.L., and C.W. wrote the paper.

## Abstract

After the completion of the human and other genome projects it emerged that the number of genes in organisms as diverse as fruit flies, nematodes, and humans does not reflect our perception of their relative complexity. Here, we provide reliable evidence that the size of protein interaction networks in different organisms appears to correlate much better with their apparent biological complexity. We develop a stable and powerful, yet simple, statistical procedure to estimate the size of the whole network from subnet data. This approach is then applied to a range of eukaryotic organisms for which extensive protein interaction data have been collected and we estimate the number of interactions in humans to be ≈650,000. We find that the human interaction network is one order of magnitude bigger than the *Drosophila melanogaster* interactome and ≈3 times bigger than in *Caenorhabditis elegans*.

**Keywords:**evolutionary systems biology, network inference, network sampling theory, network evolution

One of the perhaps most surprising results of the genome-sequencing projects was that the number of genes is much lower than had been expected and is, in fact, surprisingly similar for very different organisms (1, 2). For example, the nematode *Caenorhabditis elegans* appears to have a similar number of genes as humans, whereas rice and maize appear to have even more genes than humans. It was then quickly suggested that the biological complexity of organisms is not reflected merely by the number of genes but by the number of physiologically relevant interactions (1, 3). In addition to alternative splice variants (4), posttranslational processes (5), and other (e.g., genetic) factors influencing gene expression (6, 7), the structure of interactome is one of the crucial factors underlying the complexity of biological organisms. Here, we focus on the wealth of available protein interaction data and demonstrate that it is possible to arrive at a reliable statistical estimate for the size of these interaction networks. This approach is then used to assess the complexity of protein interaction networks in different organisms from present incomplete and noisy protein interaction datasets.

There are now fairly extensive protein interaction network (PIN) datasets in a number of species, including humans (8, 9). These have been generated by a variety of experimental techniques (as well as some *in silico* inferences). Although these techniques and the resulting data are (*i*) notoriously prone to false positives and negatives (10, 11), and (*ii*) result in highly idealized and averaged network structures (12), such interaction datasets are increasingly turning into useful tools for the analysis of the functional (e.g., ref. 13) and evolutionary properties (14) of biological systems. In particular, in *Saccharomyces cerevisiae* we are beginning to have a fairly complete description of the protein interaction network that is accessible with current experimental technologies; the recent high-quality literature-curated dataset of Reguly *et al.* (15) provides us with a dataset that should be almost completely free from false positives. For most other organisms, however, interaction data are still far from complete and it has recently been shown that subnetworks, in general, have qualitatively different properties from the true network (16–18). Although the importance of network-sampling properties had only been realized relatively recently, this aspect of most systems biology data are increasingly being recognized (11, 19) as important.

There are, however, some properties of the true network that can be inferred even from subnet data, and here we show that the total network size is one property for which this is the case. Present protein-interaction datasets enable us to estimate the size of the interactomes in different species by using graph theoretical invariants. This is particularly interesting for species where more than one experimental dataset is available. Below we first describe a robust and very general estimator of network size from partial network data that overcomes this problem. We then apply it to available PIN data in a range of eukaryotic organisms. In supporting information (SI) *Text* we demonstrate the power of this approach by using extensive simulation studies.

## Estimating Interactome Size

Here, we develop an approach for estimating the size of a network from incomplete data. We will show below (and by using extensive simulations in *SI Text*) that for a given species estimates from different independent datasets—generated by different methods such as yeast-two-hybrid and TAP tagging—yield estimates for the interactome size that are in excellent agreement.

We are concerned with a true network, , which has *N*_{} nodes and *M*_{} edges. The sets of nodes and edges are given by _{} and _{}, respectively; these define the graph representation of the true network:

We pick a subset of nodes _{} _{} and study properties of the subgraph *G*_{} induced by the nodes in _{}

where the set of edges observed in the is a subset of the total set of edges, _{} _{}. Our aim is to predict the number of interactions in the true network *G*_{} based on the available data in the subnet, *G*_{}.

We assume that the network, *G*_{}, is generated according to some (unknown) model characterized by a parameter (vector) θ, and subsequently the observed network, *G*_{} is sampled from it. Then

where it is assumed that the sampling is independent of the network-generating model. The parameter *p* refers to a general sampling process, and not only independent node sampling. Furthermore, we assume the order *N*_{} of the network is known and allow nodes to be annotated with information not related to the wiring of the network (e.g., GO terms or protein family classes). Consequently, the sum is over networks, *G*_{}, with *N*_{} (labeled) nodes only. For convenience, we take labeling information to be included in *N*_{} and *N*_{} (the order of *G*_{}).

If sampling only depends on the nodes in the network and not on their connections, then *P*_{p}(*G*_{}|*G*_{}) splits into a product of two terms,

where *Q _{p}*(

*N*

_{}) is a term denoting the probability of sampling the nodes in the observed PIN and

*q*(

*G*

_{},

*G*

_{}) denotes how many ways this can be done given the (labeled) nodes in

*G*

_{}—by assumption, labeling of nodes is the same in all possible

*G*

_{}s. For example, if the nodes are unlabeled and have degree zero, then

*q*(

*G*

_{},

*G*

_{}) = (

_{N}

^{N}). If all nodes have degree one, a similar factor can be derived based on the number of degree one (

*d*

_{1}) and degree zero (

*d*

_{0}) nodes that are observed in the PIN:

*q*(

*G*

_{},

*G*

_{}) is the number of ways one can choose

*d*

_{0}and

*d*

_{1}out of the

*N*

_{}/2 pairs of connected nodes in

*G*

_{}. If all nodes are labeled then

*q*(

*G*

_{},

*G*

_{}) = 1, because one can only select the nodes in the PIN in one way.

It follows that *Q _{p}*(

*N*

_{}) is sufficient for inference on

*p*(the remaining part of the likelihood does not depend on

*p*). In the case of independent node sampling, each node with probability

*p*, we have

*Q*(

_{p}*N*

_{}) =

*p*

^{N}(1 −

*p*)

^{N − N}and the maximum likelihood estimate of

*p*is

which is unbiased and consistent.

From the likelihood Eq. **4** it follows that

where *G**_{} is a specific network (to distinguish it from the sum over all networks in the denominator). Note that this conditional probability does not depend on *p* and that, in principle, we can only gain knowledge about the interactome if something is assumed about the network-generating model. Note also that this is a general restriction that is not related to independent node sampling alone.

A reasonable estimate of the edge probability in *G*_{} is

where *M*_{} is the number of edges in the PIN. It leads to the following estimate of the interactome size:

The estimate is unbiased and consistent provided the network-generating mechanism ensures some form of uniformity, as is the case for random graphs (Figs. S1 and S2). For example, if *G*_{} has a star topology with one node of degree *N*_{} − 1 and the remaining of degree 1, then *M*_{} = 0 with probability 1 − *p* and *M*_{} = (*N*_{} − 1)/ with probability *p*; hence, _{} is not consistent. We will demonstrate below that the assumption of independent sampling of nodes is not too restrictive and should apply to many, in particular, high-throughput, experimental studies.

So far we have assumed that the number of ORFs, *N*_{} in an organism is known from genome surveys. Total genome size is, however, still not precisely known in most organisms. Uncertainty in the *N*_{} is, however, easily incorporated. Assume that the value *N*_{} is associated with an error or uncertainty ε (i.e., if the genome contains *N*_{0} protein-coding genes of which *N*_{} are known, then ε = (*N*_{0} − *N*_{})/*N*_{0}. Then let *N*_{} := *N*_{0} (1 ± ε) and for ε 0.1 we have

Replacing in Eq. **8** with ^{~}*p* yields the error-corrected estimate for the true network size

Thus, an uncertainty of ε in the number of nodes in the true network results in an uncertainty of 2ε for the number of edges in the true network.

To assess the variability of the estimator we can construct approximate bootstrap confidence intervals (CI) (20). The number of edges is given by

in terms of the degree sequence. Now let **d** = {*d*_{1}, *d*_{2}, …, *d*_{}} be the set of degrees of all of the nodes in the graph _{} describing the subnet. Then we generate bootstrap replicates, **d***, by sampling the degrees of the nodes in the sample with replacement *N*_{}. For each bootstrap replicate, **d***, we obtain an estimate *M**_{} (which may be a noninteger because of the factor 1/2 in Eq. **11**; this does not affect the estimator). Creating a sufficiently large number of bootstrap replicates, **d***, thus allows us to calculate the bootstrap CIs; these have very good coverage properties, as shown in Figs. S3 and S4.

The derivation of Eq. **8** does not depend on any restrictive assumptions (see *SI Text*) but is a generic property of random graphs and their subnets. Crucially Eq. **8** is valid irrespective of the degree sequence or other summary statistics of the networks^{††}; confidence intervals (CI) and their coverage properties (20) may, however, depend on the degree sequence or network structure. Because there is no sufficient statistic for general networks (17) [i.e., a summary statistic that would include all information about the likelihood (21) of a network] it is also not possible to improve on these estimators by, for example., including the numbers of observed triangles or the clustering coefficient. The only limitation is the assumption of independent sampling. This is, however, also implicit in all previous attempts at estimating interactome sizes (22–24). Below we show how nonrandom sampling schemes can be described and how false-positive and false-negative rates of PIN data affect our estimate.

## Other Node-Sampling Schemes

The above approach can be generalized for datasets that are ascertained in certain ways and can thus also deal with experimental bias.

### Independent but Nonuniform Sampling.

We assume independent sampling of nodes. Let node *i* have a probability *p _{i}* for being included in the subnet. We allow

*p*≠

_{i}*p*

_{j}and only assume that the

*p*values are drawn independently from the same probability distribution,

_{i} where α is a parameter (potentially vector valued). The properties of *F* are not of importance. It follows that is unbiased

and also consistent, because

for large networks. Now consider an edge *e*_{ij} (*i* ≠ *j*); then the probability of observing this edge in the subnet is

and

Likewise ^{2} is consistent (25), hence also unbiased for large networks.

### Dependent Sampling.

Here, we assume as above that *p _{i}* is drawn from some probability distribution,

*p*~

_{i}*F*(α) that might, however, depend on information related to node

_{i}*i*, for example, the degree or functional classification of

*i*; that is,

*F*(α) =

_{i}*F*(α;

*D*), where

_{i}*D*denotes this information. Although measures for expression abundance may be such a factor, this appears not to be the case for the datasets considered here (Fig. S5). Hence, we might take

_{i}*D*as an additional parameter in the function

_{i}*F*.

In addition, we assume the network is uncorrelated with respect to this information, that is, *P*(*D _{i}*,

*D*) =

_{j}*P*(

*D*)

_{i}*P*(

*D*); and, given the probabilities

_{j}*p*, we assume nodes are drawn independently of each other. This assumption is justified for all networks in which the degree–degree correlation of interacting nodes is determined by the degree distribution. This is approximately the case for the networks considered here

_{i}^{‡‡}. It follows that

that is, is unbiased. Note that

which in turn leads to

and consequently consistency. Likewise, it follows that *E*(π* _{ij}*) =

*E*(

*p*) =

_{i}p_{j}*p*

^{2}, and that the edge sampling probability consistently is estimated by

^{2}.

### Effects of Uncertain Data on Estimated Interactome Sizes

So far we have assumed that the interaction data are correct. This is not the case for protein interaction data (10–12, 26–28). Here, we show that it is possible to include noisy data and that the estimates given in Table 1(see also Fig. 2) are not likely to change severely (e.g., by an order of magnitude) for realistic rates of false positives and false negatives. We note that the sampling theory developed in the previous sections needs modification to take false positives and false negatives into account; for example, the sum in Eq. **4** should be over all possible networks and not just those containing the observed PIN data.

_{}, and the horizontal bars

**...**

Let the number of true interactions in a network with *N* nodes be denoted by *M*; if the data collection process is not perfect, then (assuming independence) the number of reported interactions, will generally be different from *M*. Now let *M*_{TP}, *M*_{FN}, *M*_{FP}, and *M*_{TN} denote the true-positive, false-negative, false-positive, and true-negative results, respectively. We trivially have

and

The rates for true positives and false negatives are defined by

Thus, for a given number of reported edges/interactions and estimates of the true-positive and false-negative rates, and , we obtain an estimate for the true number of interactions

Thus, for a fixed network (or subnet) the false-positive and false-negative rates affect the estimates of the true number of interactions in a simple linear manner (see Fig. S6).

## Results

We use Eq. **8** to estimate interactome sizes in humans and three other eukaryotic organisms: *S. cerevisiae* (29–32), *C. elegans* (33), and *D. melanogaster* (34–36). But we begin with an illustration of the power of this simple estimator by applying it to *S. cerevisiae* PIN data; here, we have treated the presently available PIN data as a proxy for a complete “interaction network” whose size we are trying to predict. In Fig. 1*A* we show the distributions of estimates obtained from 1,000 randomly chosen subnets covering 20%, 40%, 60%, and 80% of the available PIN data [taken from the Database of Interacting Proteins (DIP) (37)]. In Fig. 1*B* we show the coverage properties of the bootstrap 95% CIs for sampling the same sampling fractions. Together with the simulation studies discussed in *SI Text*, the results in Fig. 1 suggest that the estimator _{} provides an accurate and reliable way of estimating interactome sizes from present data. Interactome size estimates and their CIs for experimental PIN datasets are shown in Table 1 and Fig. 2 for the organisms considered here. The DIP datasets (always shown in green) are mainly based on high-throughput studies, supplemented by interactions collected from the literature; as such, they generally cannot be treated as independent from the other datasets. For humans, however, there is negligible overlap between the DIP databases and the two recent high-throughput surveys and we can treat the three estimators as approximately independent.

Based on the results in Table 1 and Fig. 2, we would therefore expect—given present experimental methods and ignoring multiple splice variants—the human interactome to contain ≈650,000 protein interactions. Thus, it is approximately an order of magnitude larger than the estimated *D. melanogaster* interactome, and a factor of 3 more complex than the estimated *C. elegans* interactome; this contrasts with relative genome sizes of ≈1.8 and ≈1.2, respectively. The results for the *S. cerevisiae* PIN suggest that it will ultimately contain ≈25,000–35,000 interactions (see also Table 1); this agrees well with previous estimates (22, 23). It also agrees well with estimates obtained from the recent data generated by Reguly *et al.* (15): for the pure literature-curated set we obtained 37,000 interactions; for the complete network data we obtained an estimate of ≈35,000 interactions in the yeast PIN. These two datasets were, however, collected from the literature and the sampling process is thus much harder, perhaps even impossible, to model accurately.

By using Eq. **24** the impacts of false-positive and false-negative rates are easily assessed (see also ref. 38). We find that the linear effect of the error rates on the estimated number of true interactions results in a comparatively modest effect. The estimates of the true-positive rates in PIN datasets range from 35% (33) to 84% (34); there are fewer estimates for the false-negative rate that are on the order of 20–40% (10) obtained for different *S. cerevisiae* datasets. It appears that, for realistic rates of true positive and false positive, the estimate of the human interactome size remains very similar compared to the simple estimate obtained in this article of ≈650,000 protein–protein interactions. Similar curves can be drawn for the other species, too, and in each case we obtain comparable values for most combinations of realistic error rates. Thus, we believe that error rates exert a comparatively moderate effect on the estimator (Eq. **8**).

Overall, it therefore appears that estimates obtained from Eq. **8** should be accurate to within less than an order of magnitude even under the very worst circumstances. A much more realistic estimate, however, can be obtained from comparing the different and essentially independent estimates for *S. cerevisiae*. These findings suggest that an accuracy of approximately a factor of 2 is more realistic. Reassuringly, these results are confirmed when applying a recent multimodel inference procedure (39) that deals with incomplete network data.

## Discussion

We have shown that it is possible to estimate the size of interactomes reliably from present partial interaction data. Our estimator is powerful and robust, relying on assumptions that appear to be met by typical systematic high-throughput studies. Unlike the previous approach of Hart *et al.* (24), who implicitly assume that interactions do not occur between surveyed proteins and those not yet surveyed, our estimate deals with missing data in a coherent and statistically meaningful manner; the route taken by Grigoriev (23) can be understood as a special case of the present approach when two or more datasets are available. Moreover, noise and different sampling/ascertainment strategies are straightforwardly included in the analysis (38, 40). We have illustrated the power of this approach by using simulated sampling processes in *S. cerevisiae* and have found that the estimator, Eq. **8**, and the bootstrap confidence intervals have very good coverage properties. We have then applied this inferential framework to published datasets in four eukaryotic organisms. We found that the predicted interactome sizes differ quite considerably between these species. For example, the human interactome appears to be an order of magnitude larger than the *D. melanogaster* interactome. Unfortunately, for maize and rice, which have comparable or even larger number of genes to humans, only tiny PIN datasets are available and we cannot obtain useful estimates for their respective interactome sizes. If conventional assumptions about the different complexity of organisms are indeed correct, and if interactome size does reflect organismic complexity (1–3, 41), then we would expect these organisms to have smaller interactomes than humans. The increase of interactome size with number of proteins/ORFs should thus not be uniform or even monotonic. We note that the estimate of ≈ 650,000 interactions means that the human PIN will still be relatively sparse: this corresponds to only ≈0.2% of all possible pairwise interactions being present; for most other species, however, the network is even sparser.

There are a number of other factors that may contribute to an explanation of the increase in phenotypic *bauplan* complexity between species: the diversity of the transcriptome (42) and protein-domain architecture (43) have all been implicated in the literature. Here, we have demonstrated that interactome sizes are consistent with biological intuition about the complexity of eukaryotic organisms. We note that our estimator is very flexible and reflects the quality of present data: we predict the number of interactions that are detectable given present experimental technology. For example, we have not considered (physiologically probably very important) transient or condition-specific interactions. Should more sensitive and reliable experimental methodologies or better estimates of experimental error rates become available in the future, then Eq. **8** can, of course, be used to predict an updated number of protein–protein interactions for an organism. Our formalism is also readily extended to directed network data (such as gene-regulation networks).

As a final note, we want to stress that the estimates necessarily reflect experimental technology. Thus, the estimates in Table 1 refer only to the types of interactions that are detectable given present experimental methods and protocols. The estimator for the size of the true network, however, will remain universally correct for suitable datasets and for all types of networks. We will thus be able to use it in the future and apply it to other network datasets as well.

## Acknowledgments.

This work was supported by the Wellcome Trust (M.P.H.S., E.d.S., and T.T.), the Royal Society and the Carlsberg Foundation (M.P.H.S. and C.W.), and an EMBO Young Investigator fellowship (to M.P.H.S.). C.W. is supported by the Danish Research Council.

## Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

See Commentary on page 6795.

This article contains supporting information online at www.pnas.org/cgi/content/full/0708078105/DCSupplemental.

^{††}Eq. **8** is a general result for general (random) graphs; it is equally true for all ensembles of random graphs such as Erdös–Rényi and scale-free random graphs. In *SI Text* we further illustrate the simple quadratic relationship by using simulations.

^{‡‡}The degree–degree distribution is not significantly different from the product degree distribution (by using the Kolmogorov–Smirnov test); that is, *P*(*k*, *l*) ≈ *P*(*k*)*P*(*l*) for the datasets considered here.

## References

*Saccharomyces cerevisiae*. J Biol. 2006;5:11. [PMC free article] [PubMed]

*Saccharomyces cerevisiae*. Nature. 2000;403:623–627. [PubMed]

*Saccharomyces cerevisiae*by mass spectrometry. Nature. 2002;415:180–183. [PubMed]

*C. elegans*. Science. 2004;303:540–543. [PMC free article] [PubMed]

*Drosophila melanogaster*. Science. 2003;302:1727–1736. [PubMed]

*Drosophila*protein-interaction map centered on cell-cycle regulators. Genome Biol. 2004;5:R96. [PMC free article] [PubMed]

*Drosophila*case study. Genome Res. 2005;15:376–384. [PMC free article] [PubMed]

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (295K) |
- Citation

- Protein interaction networks of Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster: large-scale organization and robustness.[Proteomics. 2006]
*Li D, Li J, Ouyang S, Wang J, Wu S, Wan P, Zhu Y, Xu X, He F.**Proteomics. 2006 Jan; 6(2):456-61.* - The identification of similarities between biological networks: application to the metabolome and interactome.[J Mol Biol. 2007]
*Cootes AP, Muggleton SH, Sternberg MJ.**J Mol Biol. 2007 Jun 15; 369(4):1126-39. Epub 2007 Mar 14.* - Information flow analysis of interactome networks.[PLoS Comput Biol. 2009]
*Missiuro PV, Liu K, Zou L, Ross BC, Zhao G, Liu JS, Ge H.**PLoS Comput Biol. 2009 Apr; 5(4):e1000350. Epub 2009 Apr 10.* - Computational reconstruction of protein-protein interaction networks: algorithms and issues.[Methods Mol Biol. 2009]
*Franzosa E, Linghu B, Xia Y.**Methods Mol Biol. 2009; 541:89-100.* - The Cartographers toolbox: building bigger and better human protein interaction networks.[Brief Funct Genomic Proteomic. 2009]
*Sanderson CM.**Brief Funct Genomic Proteomic. 2009 Jan; 8(1):1-11. Epub 2009 Mar 12.*

- A versatile platform to analyze low affinity and transient protein-protein interactions in living cells in real time[Cell reports. 2014]
*Li YC, Rodewald LW, Hoppmann C, Wong ET, Lebreton S, Safar P, Patek M, Wang L, Wertman KF, Wahl GM.**Cell reports. 2014 Dec 11; 9(5)1946-1958* - Reconstruction of large-scale regulatory networks based on perturbation graphs and transitive reduction: improved methods and their evaluation[BMC Systems Biology. ]
*Pinna A, Heise S, Flassig RJ, Fuente AD, Klamt S.**BMC Systems Biology. 773* - The role of protein interaction networks in systems biomedicine[Computational and Structural Biotechnology ...]
*Sevimoglu T, Arga KY.**Computational and Structural Biotechnology Journal. 11(18)22-27* - The Proteome of a Healthy Human during Physical Activity under Extreme Conditions[Acta Naturae. 2014]
*Larina IM, Ivanisenko VA, Nikolaev EN, Grigorev AI.**Acta Naturae. 2014; 6(3)66-75* - Estimation of Global Network Statistics from Incomplete Data[PLoS ONE. ]
*Bliss CA, Danforth CM, Dodds PS.**PLoS ONE. 9(10)e108471*

- PubMedPubMedPubMed citations for these articles

- From the Cover: Estimating the size of the human interactomeFrom the Cover: Estimating the size of the human interactomeProceedings of the National Academy of Sciences of the United States of America. May 13, 2008; 105(19)6959

Your browsing activity is empty.

Activity recording is turned off.

See more...