###### From the Cover

# Inferring network mechanisms: The *Drosophila melanogaster* protein interaction network

^{†}Department of Physics,

^{‡}College of Physicians and Surgeons,

^{§}Department of Applied Physics and Applied Mathematics, and

^{¶}Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10027

To whom correspondence should be addressed at: Department of Applied Physics and Applied Mathematics, Columbia University, 500 West 120th Street, New York, NY 10027. E-mail: chris.wiggins@columbia.edu.

## Abstract

Naturally occurring networks exhibit quantitative features revealing underlying growth mechanisms. Numerous network mechanisms have recently been proposed to reproduce specific properties such as degree distributions or clustering coefficients. We present a method for inferring the mechanism most accurately capturing a given network topology, exploiting discriminative tools from machine learning. The *Drosophila melanogaster* protein network is confidently and robustly (to noise and training data subsampling) classified as a duplication–mutation–complementation network over preferential attachment, small-world, and a duplication–mutation mechanism without complementation. Systematic classification, rather than statistical study of specific properties, provides a discriminative approach to understand the design of complex networks.

**Keywords:** machine learning, systems biology, motifs, classification, evolution

Recent research activity in biological networks has often focused on understanding the emergence of specific features such as scale-free degree distributions (1–3), short mean geodesic lengths, or clustering coefficients (4). The insights gained into the topological patterns have motivated various network growth and evolution models to determine what simple mechanisms can reproduce the features observed. Among these are the preferential attachment model (3, 5), exhibiting scale-free degree distributions, and the small-world model (4), exhibiting high clustering coefficients despite short mean geodesics. Additionally, various duplication–mutation mechanisms have been proposed to describe biological networks (6–11) and the World Wide Web (12). However, in most cases model parameters can be tuned such that multiple models of widely varying mechanisms perfectly fit the motivating real network in terms of single selected features such as the scale-free exponent and the clustering coefficient (compare Fig. 1). Because networks with several thousands of vertices and edges are highly complex, it is also clear that these statistics can capture only limited structural information.

(*a*) Cumulative degree distribution *p*(*k* > *k*_{0}), average clustering **...**

Here, we make use of *discriminative classification* techniques recently developed in machine learning (13, 14) to classify a given real network as one of many proposed network mechanisms by enumerating local substructures. Determining what simple mechanism is responsible for a natural network's architecture (*i*) facilitates the development of correct priors for constraining network inference and reverse engineering (15–18); (*ii*) specifies the appropriate null model relative to which one evaluates statistical significance (19–29); (*iii*) guides the development of improved network models; and (*iv*) reveals underlying design principles of evolved biological networks. It is therefore desirable to develop a method to determine which proposed mechanism models a given complex network without prior selection of features or null models.

Enumeration of subgraphs has been successfully used in the past few years to find network motifs (19, 20, 23–29) and is historically a well established method in the sociology community (30–32). Recently, the idea of clustering real networks based on their “significance profiles” has been proposed (33). The method assesses significance of given subgraphs relative to an assumed null model, generated by Monte Carlo sampling of networks with a degree distribution identical to that of the network of interest. The significance profiles are then shown to be similar for various groups of naturally occurring networks.

Both clustering and assessing statistically significant motifs can be characterized as schemes to identify reduced-complexity descriptions of the networks. We here present an approach that is instead *predictive,* using labeled graphs of known growth mechanisms as training data for a discriminative classifier. This classifier, then, presented with a new graph of interest, can reliably and robustly predict the growth mechanism that gave rise to that graph. Within the machine learning community, such predictive, *supervised learning*, techniques are differentiated from descriptive, *unsupervised learning*, techniques such as clustering.

We apply our method to the recently published *Drosophila melanogaster* protein–protein interaction network (34) and find that a duplication–mutation–complementation (DMC) mechanism (6) best reproduces the *Drosophila* network. The prediction is robust against noise, even after random rewiring of up to 45% of the network edges. As validation, we also show that beyond 80% random rewiring the correct (Erdős–Rényi) classification is obtained.

## Methods

**The Data Set.** We use a protein–protein interaction map based on yeast two-hybrid screening (34). Because the data are subject to numerous false positives, Giot *et al.* (34) assign a confidence score *p* ∈ [0, 1], measuring how likely it is that the interaction occurs *in vivo*. To exclude unlikely interactions and focus on a core network that retains significant global features, we determine a confidence threshold *p*^{*} based on percolation: measurements of the size of the components for all possible values of *p*^{*} show that the two largest components are connected for *p*^{*} = 0.65 (see the supporting information, which is published on the PNAS web site). Edges in the graph correspond to interactions for which *p* > *p*^{*}. To reveal possible structural changes in *Drosophila* for less stringent thresholds, we also present results for *p*^{*} = 0.5 as suggested in ref. 34. We remove self-interactions from the network because none of the proposed mechanisms allow for them. After eliminating isolated vertices the resulting networks consist of 3,359 (4,625) vertices and 2,795 (4,683) edges for *p*^{*} = 0.65 (0.5).
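The thresholding step can be illustrated with a minimal sketch. It assumes the interaction map is available as a list of `(u, v, p)` triples with integer vertex labels; the function names (`component_sizes`, `threshold_network`, `scan_thresholds`) are hypothetical and are not taken from the authors' code. For each candidate threshold it reports the sizes of the two largest connected components, which is the quantity inspected when choosing *p*^{*} by percolation.

```python
from collections import Counter


def component_sizes(num_vertices, edges):
    """Return connected-component sizes (largest first) via union-find."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    sizes = Counter(find(x) for x in range(num_vertices))
    return sorted(sizes.values(), reverse=True)


def threshold_network(scored_edges, p_star):
    """Keep interactions with confidence p > p_star; drop self-interactions."""
    return [(u, v) for u, v, p in scored_edges if p > p_star and u != v]


def scan_thresholds(num_vertices, scored_edges, thresholds):
    """Map each threshold to the sizes of the two largest components."""
    report = {}
    for p_star in thresholds:
        edges = threshold_network(scored_edges, p_star)
        sizes = component_sizes(num_vertices, edges) + [0, 0]  # pad if < 2
        report[p_star] = (sizes[0], sizes[1])
    return report
```

In this sketch isolated vertices count as components of size one; removing them, as described above, would be a separate step.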

**Network Mechanisms.** We generate 7,000 graphs, 1,000 for each of seven different models drawn from the literature, as training data. Every graph is generated with the same number of edges and number of vertices as measured in *Drosophila*; all other existing parameters are sampled uniformly (see supporting information). The models, many of which were explicitly intended to model protein interaction networks, manifest various simple network growth mechanisms. As an example, the DMC algorithm (6) is inspired by an evolutionary model of the genome (35, 36) proposing that most of the duplicate genes observed today have been preserved by functional complementation. If either copy of the gene loses one of its functions (edges), the other becomes essential in ensuring the organism's survival. There is thus an increased preservation of duplicate genes induced by null mutations. The algorithm features a duplication step followed by mutations that preserve functional complementarity. At every iteration one chooses a vertex *v* at random. A twin vertex *v*_{twin} is then introduced, copying all of *v*'s edges. For each edge of *v*, one deletes with probability *q*_{del} either the original edge or its corresponding edge of *v*_{twin}. The twins themselves are conjoined with an independent probability *q*_{con}, representing an interaction of a protein with its own copy. Note that no new edges are created by mutations. The DMC mechanism thus assumes that the probability of creating new advantageous functions by random mutations is negligible.
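The DMC iteration described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the seed graph (a single edge), the equal chance of deleting the original or the twin's copy of an edge, and the name `dmc_graph` are all choices made here. The sketch also does not match a prescribed number of edges, which the training graphs in the text do.

```python
import random


def dmc_graph(num_vertices, q_del, q_con, seed=None):
    """Grow a graph by duplication-mutation-complementation (sketch).

    Returns an adjacency dict {vertex: set(neighbors)}.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}  # assumed seed graph: one edge
    while len(adj) < num_vertices:
        v = rng.randrange(len(adj))  # vertices are labeled 0..n-1
        twin = len(adj)
        # Duplication: the twin copies all of v's edges.
        adj[twin] = set(adj[v])
        for nbr in adj[twin]:
            adj[nbr].add(twin)
        # Mutation with complementation: for each duplicated edge, with
        # probability q_del delete exactly one of the two copies, so at
        # least one of v and twin always keeps the interaction.
        for nbr in list(adj[twin]):
            if rng.random() < q_del:
                loser = v if rng.random() < 0.5 else twin  # assumed 50/50
                adj[loser].discard(nbr)
                adj[nbr].discard(loser)
        # Conjoin the twins with independent probability q_con.
        if rng.random() < q_con:
            adj[v].add(twin)
            adj[twin].add(v)
    return adj
```

Because mutations only delete one copy of a duplicated edge, no new edges arise except between twins, which is the property the text highlights.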

A slightly different implementation of duplication–mutation is realized in ref. 7 by using random mutations (DMR). Possible interactions between twins are neglected. Instead, edges between *v*_{twin} and the neighbors of *v* can be removed with a probability *q*_{del} and new edges can be created at random between *v*_{twin} and any other vertices with a probability *q*_{new}/*N*, where *N* is the current total number of vertices. DMR thus emphasizes the creation of new advantageous functions by mutation.
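For comparison, a DMR step can be sketched in the same style. Several details are assumptions made for this illustration and are labeled in the comments: the single-edge seed graph, taking *N* as the vertex count before the twin is added, and excluding the original vertex *v* from the randomly created edges (the text only says that interactions between twins are neglected). The name `dmr_graph` is hypothetical.

```python
import random


def dmr_graph(num_vertices, q_del, q_new, seed=None):
    """Grow a graph by duplication with random mutations (sketch).

    Returns an adjacency dict {vertex: set(neighbors)}.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}  # assumed seed graph: one edge
    while len(adj) < num_vertices:
        n_current = len(adj)  # assumed: N counted before adding the twin
        v = rng.randrange(n_current)
        twin = n_current
        # Duplicated edges to v's neighbors survive with probability 1 - q_del.
        inherited = {nbr for nbr in adj[v] if rng.random() >= q_del}
        # New edges to other vertices, each with probability q_new / N.
        # Assumption: v itself is excluded, so twins never interact.
        p_new = q_new / n_current
        created = {u for u in range(n_current)
                   if u != v and rng.random() < p_new}
        adj[twin] = inherited | created
        for u in adj[twin]:
            adj[u].add(twin)
    return adj
```

Unlike the DMC sketch, deletions here touch only the twin's edges and new edges can appear anywhere, which is the contrast between the two mechanisms drawn in the text.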

In addition to (*i*) DMC and (*ii*) DMR, we generate training data for (*iii*) linear preferential attachment (LPA) networks (3, 5) (growing graphs with a probability of attaching new vertices to existing vertices proportional to *k* + *a*, *a* being a constant parameter and *k* being the degree of the existing vertex); (*iv*) random static (RDS) networks (37) (also known as Erdős–Rényi graphs; vertices are connected randomly); (*v*) random growing (RDG) networks (38) (growing graphs where new edges are created randomly between existing vertices); (*vi*) aging vertex (AGV) networks (39) (growing graphs modeling citation networks, where the probability for new edges decreases with the age of the vertex); and (*vii*) small-world (SMW) networks (4) (an interpolation between regular ring lattices and randomly connected graphs). For descriptions of the specific algorithms we refer the reader to the supporting information.
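As one example from this list, the LPA attachment rule (probability proportional to *k* + *a*) can be sketched as below. The exact algorithm used for the training data is given in the supporting information, so the details here are assumptions: a single-edge seed graph, a fixed number of edges per new vertex (`edges_per_vertex`, a hypothetical parameter), and *a* > −1 so that every attachment weight stays positive.

```python
import random


def lpa_graph(num_vertices, a, edges_per_vertex=1, seed=None):
    """Grow a graph by linear preferential attachment (sketch).

    Each new vertex attaches to distinct existing vertices chosen with
    probability proportional to (degree + a). Assumes a > -1.
    Returns (edge list, degree list).
    """
    rng = random.Random(seed)
    degree = [1, 1]   # assumed seed graph: one edge between 0 and 1
    edges = [(0, 1)]
    while len(degree) < num_vertices:
        new = len(degree)
        weights = [k + a for k in degree]
        wanted = min(edges_per_vertex, new)
        targets = set()
        while len(targets) < wanted:
            # Weighted draw; repeated draws are discarded by the set.
            targets.add(rng.choices(range(new), weights=weights)[0])
        for t in targets:
            edges.append((new, t))
            degree[t] += 1
        degree.append(len(targets))
    return edges, degree
```

Large values of `a` flatten the weights toward uniform attachment, while values near −1 concentrate new edges on high-degree vertices.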

**Subgraph Census.** We quantify the topology of a network by exhaustive subgraph census (31) up to a given subgraph size; note that we do *not* assume a specific network randomization or test for statistical significance as in refs. 19, 20, 23–29, 31, and 32, but we instead *classify* network mechanisms by using the raw subgraph counts. Rather than choosing most important features *a priori*, we count all possible subgraphs up to a given cut-off, which can be made in the number of vertices, number of edges, or the length of a given walk. To show robustness to this choice, we present results for two different cut-offs. We first count all subgraphs that can be constructed by a walk of length eight (148 nonisomorphic^{††} subgraphs); second, we consider all subgraphs up to a total number of seven edges (130 nonisomorphic subgraphs). Their counts are the input features for our classifier. It is worth noting that the mean geodesic length (average shortest path between two vertices) of the *Drosophila* network's giant component is 11.6 (9.4) for *p*^{*} = 0.65 (0.5). Walks of length eight are therefore able to traverse large parts of the network and can also reveal global structures.
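To make the idea of raw subgraph counts concrete, the following sketch counts only a handful of very small subgraphs (edges, 2-edge paths, 3-stars, and 3-cycles). It is far short of the full census of 148 or 130 nonisomorphic subgraphs used here, and the function name and adjacency-dict input format are assumptions made for illustration. Counts are of subgraphs that need not be induced, so a 3-cycle also contributes three 2-edge paths.

```python
from itertools import combinations
from math import comb


def small_subgraph_counts(adj):
    """Count a few small subgraphs in an undirected simple graph.

    adj: {vertex: set(neighbors)} with comparable (e.g. integer) labels.
    """
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = {
        "edges": sum(degrees) // 2,
        # A 2-edge path is a center vertex plus any pair of its neighbors.
        "two_edge_paths": sum(comb(k, 2) for k in degrees),
        # A 3-star is a center vertex plus any triple of its neighbors.
        "three_stars": sum(comb(k, 3) for k in degrees),
        "three_cycles": 0,
    }
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            # Count each triangle once, from its smallest vertex v < u < w.
            if v < u and w in adj[u]:
                counts["three_cycles"] += 1
    return counts
```

A vector of such counts, one entry per subgraph type, is the kind of feature vector the classifier receives.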

**Learning Algorithm.** Our classifier is a generalized decision tree called an *alternating decision tree* (ADT) (40), which is trained by using the AdaBoost algorithm (41), itself related to additive logistic regression (42). AdaBoost is a general discriminative learning algorithm proposed in 1997 by Freund and Schapire (41, 43) and has since been used successfully in numerous and varied applications [e.g., in text categorization (44, 45) and gene expression prediction (46)].

An example of an ADT is shown in Fig. 2. A given network's subgraph counts determine paths in the ADT dictated by inequalities specified by the *decision nodes* (rectangles) (subgraphs associated with Fig. 2 are shown in Fig. 3). For each class, the ADT outputs a real-valued *prediction score*, which is the sum of all weights over all paths. The class with the highest score wins. The prediction score *y*(*c*) for class *c* is related to the probability *p*(*c*) for the tested network to be in class *c* by *p*(*c*) = *e*^{2*y*(*c*)}/(1 + *e*^{2*y*(*c*)}) (42). (The supporting information gives additional details on the exact learning algorithm. Source code is available from C.H.W. on request.)
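The relation between prediction score and class probability can be evaluated directly. The sketch below uses hypothetical function names and implements only the formula *p*(*c*) = *e*^{2*y*(*c*)}/(1 + *e*^{2*y*(*c*)}) and the "highest score wins" rule; it does not implement the ADT itself. The two branches are algebraically identical and exist only to avoid overflow for large |*y*|.

```python
import math


def score_to_probability(y):
    """Return p = e^{2y} / (1 + e^{2y}) for a prediction score y."""
    if y >= 0:
        # Same expression, with numerator and denominator divided by e^{2y}.
        return 1.0 / (1.0 + math.exp(-2.0 * y))
    z = math.exp(2.0 * y)
    return z / (1.0 + z)


def classify(scores):
    """Pick the class with the highest score; also report each p(c).

    scores: {class_name: prediction score y(c)}.
    """
    best = max(scores, key=scores.get)
    probabilities = {c: score_to_probability(y) for c, y in scores.items()}
    return best, probabilities
```

Each probability is computed from its own class score, so the values are not normalized to sum to one across classes. A score of 0 corresponds to *p* = 0.5, and a score of about 3.45 corresponds to *p* ≈ 1 − 10^{−3}.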


An advantage of ADTs is that they do not assume a specific geometry of the input space; that is, features are not coordinates in a metric space (as in support vector machines or *k*-nearest-neighbors classifiers), and the classification is thus independent of normalization. The algorithm assumes neither independence nor dependence among subgraph counts. The subgraphs reveal their importance themselves solely by their abilities to discriminate among different classes.

## Results

We perform cross-validation (ref. 13 and supporting information) with multiclass ADTs, thus determining an empirical estimate of the generalization error, i.e., the probability of mislabeling an unseen test datum. Table 1 relates truth and prediction for the test sets. Five of seven classes have nearly perfect prediction accuracy. Because AGV is constructed to be an interpolation between LPA and a ring lattice, the AGV, LPA, and SMW mechanisms are equivalent in specific parameter regimes and correspondingly show a nonnegligible overlap. Nevertheless, the overall prediction accuracy on the test sets still lies between 94.6% and 95.8% for different choices of *p*^{*} and subgraph size cut-off. Note that preferential attachment is completely distinguishable from duplication–mutation despite the fact that a duplication mechanism is sometimes described as an *effective* preferential attachment (ref. 47 and supporting information). Even models that are based on the same fundamental mechanism, such as duplication–mutation in DMC and DMR, are perfectly separable. Even small algorithmic changes in network mechanisms can thus give rise to easily detectable differences in substructures. Our results (see Fig. 1) confirm that although many of these models have similar degree distributions, clustering coefficients, or mean geodesic lengths, they have indeed distinguishable topologies.

Fig. 2 shows the first few decision nodes of a resulting ADT. The prediction scores reveal that a high count of 3-cycles suggests a DMC network (node 3). The DMC mechanism indeed facilitates the creation of many 3-cycles by allowing two copies to attach to each other, thus creating 3-cycles with their common neighbors. In particular a few combinations are good predictors for some classes. For example, a low count in 3-cycles combined with a high count in 8-edge linear chains is a good predictor for LPA and DMR networks (nodes 3 and 4). Because of the sparseness of the networks preferential attachment does not lead to a clustered structure. While LPA readily yields hubs, cycles are less probable. (Larger ADTs can be viewed in the supporting information.)

Having built a classifier enjoying good prediction accuracy, we can now determine the network mechanism that best reproduces the *Drosophila* protein network (or in principle any network of the same size) by using the trained ADTs for classification. Table 2 gives the prediction scores of the *Drosophila* network for each of the seven classes, averaged over folds.

**Table 2.** Prediction scores for the *Drosophila* protein network for different confidence thresholds *p*^{*} and different cut-offs in subgraph size

The DMC mechanism is the only class having a positive prediction score in every case. In particular, for *p*^{*} = 0.65 the DMC classification has a high score of 8.2 ± 1.0 for eight-step subgraphs and 8.6 ± 1.1 for subgraphs with up to seven edges. Also, the comparatively small standard deviations over different folds indicate robustness of the classification against data subsampling. While the high rankings of both duplication–mutation classes confirm our biological understanding of protein network evolution, our findings strongly support an evolution restricted by functional complementarity over an evolution that creates and deletes functions at random.

Notably, for *p*^{*} = 0.65 the RDG mechanism of random growth (edges are connected randomly between existing vertices) has a higher prediction score than the LPA or AGV growing graph mechanisms. Growth without any underlying mechanism other than chance therefore generates networks closer in topology to the core network (*p*^{*} = 0.65) of *Drosophila* than growth governed by preferential attachment. We also emphasize that even though *Drosophila* exhibits the SMW *character* of high clustering and short mean geodesic length (34), the SMW *model* (4) (an interpolation between regular ring lattices and randomly connected graphs) does not accurately reproduce the *Drosophila* network. The classification for *p*^{*} = 0.5 is less confident, probably because of the additional noise present in the data when including low *p* value (improbable) interactions, as we discuss below.

Although not necessary for the classification itself, visualizing the distribution for each model and each subgraph, compared with that subgraph's census in *Drosophila,* can give a qualitative and more intuitive way of interpreting the classification result and a better understanding of the topological differences between *Drosophila* and each of the seven mechanisms. To this end we determine *rank scores* for every subgraph and mechanism, defined as the percentages of sampled networks that have a subgraph count above *Drosophila*'s count. A rank score of 50% corresponds to a distribution whose median is equal to *Drosophila*'s subgraph count. Fig. 4 shows the color-coded rank scores for every mechanism and every subgraph (only the subset of 51 subgraphs, which appear in the learned ADT, is shown here; see the supporting information for the full set). The subgraphs are ordered by similarity in rank scores (see caption of Fig. 4). A few subgraphs (S36–S51) featuring hubs without cycles are best modeled by the LPA mechanism; i.e., these subgraphs have rank scores close to 50%. For almost all other subgraphs, both duplication–mutation mechanisms (DMC and DMR) consistently have better rank scores than the other models. Notably, the SMW and RDS mechanisms have rank scores of 0 for all subgraphs; i.e., all sampled networks have lower subgraph counts than *Drosophila*. For a few subgraphs that feature long linear chains (S27–S33), the DMR model has better rank scores than DMC, whereas for almost all other subgraphs DMC has the best rank scores. In particular, DMC is the only model that can reach *Drosophila*'s counts for subgraphs S1–S26, which show complex cyclic structure.
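The rank score defined above reduces to a simple percentage. The sketch below uses hypothetical names and input formats (per-model lists of sampled counts for one subgraph); the counts in the usage note are made up for illustration and are not data from this study.

```python
def rank_score(sampled_counts, observed_count):
    """Percentage of sampled networks whose count exceeds the observed one."""
    above = sum(1 for c in sampled_counts if c > observed_count)
    return 100.0 * above / len(sampled_counts)


def rank_scores_by_model(samples_by_model, observed_count):
    """Apply rank_score to each model's samples for a single subgraph.

    samples_by_model: {model_name: [subgraph count in each sampled network]}.
    """
    return {model: rank_score(counts, observed_count)
            for model, counts in samples_by_model.items()}
```

For example, `rank_score([1, 2, 3, 4], 2)` gives 50.0, since two of the four sampled counts lie above the observed count. A score of 0 means that no sampled network reaches the observed count, as reported above for the SMW and RDS mechanisms.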

*Drosophila* and each of the seven mechanisms. Color-coded rank scores are shown for a representative set of 51 subgraphs and every mechanism. The rank score *r*_{*i*α} for model *i* and subgraph α **...**

Because yeast two-hybrid data are known to be susceptible to numerous errors (34), network analyses are reliable only if they are robust against noise. To confirm that our method shows this robustness, we classify the *Drosophila* network for various levels of artificially introduced noise by replacing existing edges with edges chosen at random. Fig. 5 shows the prediction scores for all seven classes as functions of the fraction of edges replaced. As validation, the network is correctly and confidently (*p* value > 1 − 10^{−3}) classified as an RDS graph when >80% of the edges are randomized. About 30% of *Drosophila*'s edges can be replaced without any significant change in any of the seven prediction scores, and 40% can be replaced before *Drosophila* is no longer classified confidently as a DMC network. At this point the prediction scores of DMC, DMR, and AGV are very close, which is also observed for the prediction scores for *p*^{*} = 0.5 (see Table 2), where they rank as the top three in this order. The results therefore suggest that the less confident classification for *p*^{*} = 0.5 could be mainly due to the presence of more noise in the data after inclusion of low confidence-value edges.
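The noise procedure, replacing a fraction of existing edges with edges chosen at random, can be sketched as below. The text does not specify details such as whether a replacement may coincide with a removed edge, so this sketch makes its own assumptions: the edge count is preserved, self-loops and duplicate edges are rejected, and a removed edge may be drawn again by chance. The name `replace_edges` is hypothetical.

```python
import random


def replace_edges(edges, num_vertices, fraction, seed=None):
    """Replace a fraction of edges with uniformly random new edges.

    edges: iterable of (u, v) pairs on vertices 0..num_vertices-1.
    Returns a sorted list of (u, v) pairs with u < v and the same edge count.
    Assumes the graph is sparse enough that free vertex pairs exist.
    """
    rng = random.Random(seed)
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    n_replace = round(fraction * len(edge_set))
    target_size = len(edge_set)
    # Remove a uniformly chosen subset of the existing edges.
    removed = rng.sample(sorted(edge_set), n_replace)
    edge_set.difference_update(removed)
    # Add random edges until the original edge count is restored.
    while len(edge_set) < target_size:
        u, v = rng.sample(range(num_vertices), 2)  # two distinct vertices
        edge_set.add((min(u, v), max(u, v)))
    return sorted(edge_set)
```

Reclassifying the output for increasing values of `fraction`, averaged over many independent replacements, yields curves of the kind described for Fig. 5.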

*Drosophila* are randomly replaced, and the network is reclassified. Plotted are prediction scores for each of the seven classes as more and more edges are replaced. Each point is an average over 200 independent random **...**

We have presented a method to infer growth mechanisms for naturally occurring networks. Advantageous properties include robustness against both noise and data subsampling, and the absence of any prior assumptions about which network features are important. Moreover, because the learning algorithm does not assume any relationships among features, the input space can be generalized to include any additional statistics as potentially discriminative features. We find that the *Drosophila* protein interaction network is confidently classified as a DMC network, a result that strongly supports ideas presented by Vazquez *et al.* (6) and Force *et al.* (36) about the nature of genetic evolution, as well as recent direct experimental evidence presented by Wang *et al.* (48) for a single DMC event in *Drosophila melanogaster*. We also showed that different mechanisms, such as DMR, LPA, and RDG, model *Drosophila* well for different sets of subgraphs, a result which suggests that a model that mixes several mechanisms might be able to reproduce *Drosophila* even more accurately. Preliminary studies on the yeast protein–protein interaction network, as produced by an analysis that integrates multiple data sources (49), also strongly favor the DMC mechanism. We anticipate that further use of machine learning techniques will answer a number of such questions of interest in systems biology.

## Acknowledgments

We acknowledge insightful discussions with Christina Leslie and Yoav Freund and key suggestions on the manuscript by G. Stolovitzky. This work was supported by National Science Foundation Grants ECS-0332479, ECS-0425850, and DMS-9810750 and National Institutes of Health Grant GM036277 (to C.H.W.).

## Notes

Author contributions: M.M., E.Z., and C.H.W. designed research; M.M., E.Z., and C.H.W. performed research; and M.M. wrote the paper.

Abbreviations: DMC, duplication–mutation–complementation; DMR, duplication–mutation using random mutations; LPA, linear preferential attachment; RDS, random static; RDG, random growing; AGV, aging vertex; SMW, small-world; ADT, alternating decision tree.

See Commentary on page 3173.

## Footnotes

^{††}Two graphs are isomorphic if there exists a relabeling of their vertices such that the two graphs are identical.

## References

**,**268-276. [PubMed]

**,**167-256.

**,**509-512. [PubMed]

**,**202-204.

**,**510-515. [PubMed]

**,**38-44.

**,**43-54.

**,**988-996. [PubMed]

**,**673-681. [PubMed]

**,**1486-1493. [PubMed]

**,**756-763. [PubMed]

**,**4372-4376. [PMC free article] [PubMed]

**,**64-68. [PubMed]

**,**824-827. [PubMed]

**,**1107. [PubMed]

**,**1107.

**,**224-230. [PubMed]

*et al.*(2002) Science 298

**,**799-804. [PubMed]

**,**176-179. [PubMed]

**,**118-119. [PubMed]

**,**785-793. [PubMed]

**,**11980-11985. [PMC free article] [PubMed]

**,**1-45.

**,**1132-1140.

**,**1538-1542. [PubMed]

*et al.*(2003) Science 302

**,**1727-1736. [PubMed]

**,**119-124. [PubMed]

**,**1531-1545. [PMC free article] [PubMed]

**,**290-297.

**,**041902-041908. [PubMed]

**,**036123-036127. [PubMed]

**,**337-407.

**,**119-139.

**,**135-168.

**,**711-780.

**,**Suppl. 1, I232-I240. [PubMed]

**,**056104-056118. [PubMed]

**,**523-527. [PubMed]

**,**8348-8353. [PMC free article] [PubMed]

**,**298-305.

**National Academy of Sciences**
