

# A comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks

Nikita I. Lytkin,^{1} Constantin F. Aliferis,^{1,2,3} and Alexander Statnikov^{1,4,*}

^{1}Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY 10016, USA

^{2}Department of Pathology, New York University School of Medicine, New York, NY 10016, USA

^{3}Department of Biostatistics, Vanderbilt University, Nashville, TN, 37232, USA

^{4}Department of Medicine, New York University School of Medicine, New York, NY 10016, USA

^{*}Correspondence to: Dr. Alexander Statnikov (Email: Alexander.Statnikov@med.nyu.edu); Phone: 1 (212) 263-3641; Fax: 1 (212) 263-5995; Address: Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 E30th Street, 7th Floor, Office #736, New York, NY 10016, USA

## Abstract

De-novo reverse-engineering of genome-scale regulatory networks is an increasingly important objective for biological and translational research. While many methods have been recently developed for this task, their absolute and relative performance remains poorly understood. The present study conducts a rigorous performance assessment of 32 computational methods/variants for de-novo reverse-engineering of genome-scale regulatory networks by benchmarking these methods in 15 high-quality datasets and gold-standards of experimentally verified mechanistic knowledge. The results of this study show that some methods need to be substantially improved upon, while others should be used routinely. Our results also demonstrate that several univariate methods provide a “gatekeeper” performance threshold that should be applied when method developers assess the performance of their novel multivariate algorithms. Finally, the results of this study can be used to show practical utility and to establish guidelines for everyday use of reverse-engineering algorithms, aiming towards creation of automated data-analysis protocols and software systems.

**Keywords:** Regulatory network de-novo reverse-engineering, computational methods, evaluation, gene expression microarray analysis

## Introduction

The cell is a dynamic system of molecules that interact and regulate each other. Discovering these regulatory interactions is essential to expanding our understanding of normal and pathologic cellular physiology, and can lead to the development of drugs that manipulate cellular pathways to fight disease. A global model of gene regulation will also be essential for the design of synthetic genomes with targeted properties, such as the production of biofuels and medically relevant molecules [11]. There exist many databases that encapsulate biological pathways (e.g., KEGG and BioCarta); however, these databases are often inaccurate and incomplete and do not correspond to the studied biological system and experimental conditions [1;29;42;45]. Therefore, there is a strong need for the reverse-engineering of genome-scale regulatory networks using de-novo methods.

There is no doubt that data from targeted knockout/overexpression and/or longitudinal experiments provide the richest information about gene interactions that can be used by de-novo reverse-engineering methods. Unfortunately, such data is not currently abundant enough to enable discovery of regulatory networks, whereas there are thousands of available observational datasets from non-longitudinal case-control or case-series studies [7;39]. In addition, obtaining data from targeted knockout/overexpression experiments can be more expensive, unethical and/or infeasible for many biological systems and conditions. Thus, current methods are forced to utilize non-longitudinal case-control or case-series genome-scale data to reverse-engineer regulatory networks.

Over the last decade, many methods have been developed to reverse-engineer regulatory networks from observational data. However, their absolute and relative performance remains poorly understood [34]. Typically, a study that introduces a novel method performs only a small-scale evaluation using one or two datasets [21], and without comprehensive benchmarking against the best-performing techniques [38]. Such studies can show that the novel method is promising, but cannot demonstrate its empirical superiority and utility in general. Similarly, the past international competitions for reverse-engineering of regulatory networks have not provided a definitive answer as to what the best performing techniques are for genome-scale non-longitudinal observational data. The only competition that used real gene expression data for the inference of genome-scale network was DREAM2 [46]. However, since this competition involved a single dataset to which many methods had been applied, the results may be overfitted and thus may not generalize to other datasets [18].

The present study, for the first time, conducts a rigorous performance assessment of methods for reverse-engineering of genome-scale regulatory networks by benchmarking state-of-the-art methods (from bioinformatics/systems biology and quantitative disciplines such as computer science and biostatistics) in multiple high-quality datasets and gold-standards of experimentally verified mechanistic knowledge. The results of this study can be used to show practical utility and to establish guidelines for everyday use of network reverse-engineering algorithms, with ancillary benefits providing guidance about “best of breed” inference engines suitable for automated data-analysis protocols and software systems.

## Results

This work assessed the accuracy of 32 state-of-the-art network reverse-engineering methods/variants in 15 genome-scale real and simulated datasets/gold-standards. Since reverse-engineering methods are used in a variety of contexts, a single metric cannot be used to assess their accuracy. In order to capture the broad applicability of reverse-engineering algorithms, four benchmarks were conducted in this study, and each of them used a different metric to evaluate accuracy of reverse-engineering (details about metrics are provided in the Materials and Methods section). In each benchmark, methods were ranked based on their accuracy, and the top-five scoring methods were considered “best of breed”. Methods that were winners in at least one of the four benchmarks should be used routinely by biologists and bioinformaticians for reverse-engineering purposes, while other methods should be substantially improved upon.

### Benchmark #1: Which methods have the best combined positive predictive value (PPV) and negative predictive value (NPV)?

Implementations of LGL-Bach, regardless of parameters, constituted all of the top-five performing techniques (Tables 1 and 5). This method output few regulatory interactions relative to the size of the gold-standard; however, a larger percentage of these were true-positive interactions than for any other algorithm. Moreover, for most datasets, >98–99% of the regulatory interactions not predicted by LGL-Bach did not exist in the gold-standard. Such a relatively accurate list of putative interactions can be fruitful for biologists because it limits the effort wasted on experimentally validating the false-positive interactions of a reverse-engineering method. Of note, Graphical Lasso performed best on the GNW(A), GNW(B), and ECOLI(D) datasets. However, it performed poorly on all other datasets and therefore ranks only seventh among all methods.

**...**

### Benchmark #2: Which methods have the best combined sensitivity and specificity?

The methods that produced the best combined sensitivity and specificity were Relevance Networks 2, CLR (Stouffer MI estimator; α = 0.05), Fisher (FDR = 0.05), SA-CLR (α = 0.05), and CLR (Normal MI estimator; α = 0.05) (Tables 2 and 5). These methods discovered more true regulatory interactions than LGL-Bach did. However, this came at the expense of a larger proportion of false-positive interactions. Biologists with limited resources may prefer results from methods such as LGL-Bach that are less complete (i.e., with smaller sensitivity) but more accurate (i.e., with larger PPV). Of note, Relevance Networks 1 produced the best performing results on three of the four GNW datasets and ECOLI(D). However, its poor performance on the other ECOLI datasets and REGED lowered its overall ranking to seventh. LGL-Bach and Aracne had the best performance among all methods on REGED, but performed poorly on all other datasets.

### Benchmark #3: Which methods have the best area under the ROC (AUROC) curve?

The area under the ROC curve was measured for the 12 methods/variants that produce scores for graph edges (Table 3), and it provides a threshold-independent metric of the classification power^{1} of each method. In order from first to fifth place, the best performing algorithms were qp-graphs (q = 200), CLR (Normal MI estimator), CLR (Stouffer MI estimator), qp-graphs (q = 20), and MI 2 (Table 5). Notably, the Fisher method produced top-5 AUROC scores across all REGED, GNW, and ECOLI datasets, but its performance on the YEAST datasets was statistically indistinguishable from random. It is important to note that qp-graphs performed very well with respect to the threshold-independent AUROC metric, but very poorly in terms of the combined sensitivity and specificity. This discrepancy accentuates the difficulty of choosing an optimal threshold for this method, as discussed below.

### Benchmark #4: Which methods have the best area under the precision-recall (AUPR) curve?

The area under the precision-recall curve was also measured for all 12 score-based methods/variants (Table 4). Methods that perform well according to this metric produce a list of putative interactions that strike a balance between recall (or sensitivity) and precision (or PPV). CLR (Stouffer MI estimator) and CLR (Normal MI estimator) were the best performing methods, occupying first and second places, respectively. qp-graphs (q=200) ranked third, SA-CLR ranked fourth, and MI 2 ranked fifth (Table 5). Notably, MI 2 turned out to be among the top performing methods because of its performance in REGED, GNW, and ECOLI datasets; its performance in YEAST datasets was statistically indistinguishable from random.

### Some methods often outperform other techniques, while others are consistent underperformers

Operationally, we define a method to be an underperformer if it did not score among the top-5 methods/variants for any of the four performance metrics. According to our study, the underperforming methods are Aracne, Relevance Networks 1, Hierarchical Clustering, Graphical Lasso, GeneNet, and MI 1. This implies that other state-of-the-art algorithms can produce better results across a wide range of gold-standards/datasets and performance metrics. Hence, the underperforming algorithms should be revisited and substantially improved upon.

Since there is no single performance metric that fully captures the power of a method in all conceivable contexts of application, all algorithms that scored well with respect to at least one metric should be used in the context in which they performed best. Our analysis shows that CLR is a top performer for three metrics, qp-graphs and SA-CLR are top performers for two metrics, while the LGL-Bach, Relevance Networks 2, Fisher, and MI 2 are top performers with respect to one metric.

### Univariate methods^{2} provide a “gatekeeper” performance and should be used when method developers assess performance of their novel algorithms

In the last several years there has been an emergence of mathematically and computationally complex novel methods for reverse-engineering of regulatory networks [34;45;46]. We believe that the cost of added complexity should be offset by an increased performance of the method. Hence, the simplest (univariate) methods should provide a “gatekeeper” performance threshold, above which all novel complex algorithms should perform.

With respect to the combined positive and negative predictive value metric, the added complexity of the winning LGL-Bach method is justified by its superior performance compared to the highest-ranking univariate method (CLR, only ninth place). Similarly, qp-graphs (q = 200) achieves a better AUROC than any univariate method and should be used despite its increased complexity. On the other hand, the three top performing methods with respect to the combined sensitivity and specificity are all univariate methods. Similarly, the univariate CLR method performs optimally with respect to AUPR. Therefore, researchers interested in methods that currently produce the best results with respect to the above two metrics do not need to use computationally more expensive multivariate methods.

### It is challenging to select an optimal threshold for a method that outputs scores for edges rather than a network graph

Recall that the score-based methods output scores for all possible edges in a graph. A regulatory network is then obtained by choosing a threshold and pruning all edges whose scores are below the threshold. Therefore, the quality of the produced network largely depends on the choice of a threshold. *However, finding a threshold that optimizes either combined sensitivity and specificity or combined PPV and NPV is challenging*.

If one has access to a partial gold-standard, it may be feasible to optimize the threshold for the combined sensitivity and specificity because this metric often has a single (global) minimum (see Figure S1 in the Online Supplement). In general, this result follows from the fact that sensitivity and specificity are monotonically decreasing and increasing functions of the threshold, respectively. Thus, one can apply a greedy search procedure to find a threshold value corresponding to the optimal combined sensitivity and specificity.
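The greedy scan described above can be sketched as follows (an illustrative Python sketch, not the study's implementation; it assumes edge scores and a partial gold-standard are available as flat NumPy arrays, and the function names are ours):

```python
import numpy as np

def sens_spec_distance(scores, truth, threshold):
    """Euclidean distance from perfect sensitivity and specificity
    for the network obtained by keeping edges with score >= threshold."""
    predicted = scores >= threshold
    tp = np.sum(predicted & truth)
    fn = np.sum(~predicted & truth)
    tn = np.sum(~predicted & ~truth)
    fp = np.sum(predicted & ~truth)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return np.sqrt((1 - sens) ** 2 + (1 - spec) ** 2)

def best_threshold(scores, truth):
    """Scan candidate thresholds: because sensitivity decreases and
    specificity increases monotonically with the threshold, the
    combined metric typically has a single global minimum."""
    candidates = np.unique(scores)
    values = [sens_spec_distance(scores, truth, t) for t in candidates]
    return candidates[int(np.argmin(values))]
```

A plain scan over all candidate thresholds is shown for clarity; the monotonicity argument is what licenses replacing it with a greedy (e.g., bisection-style) search in practice.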

However, the combined PPV and NPV and in general all metrics that incorporate PPV and NPV, do not increase or decrease monotonically with the threshold (see Online Supplement for an explanation). Figure S2 in the Online Supplement depicts the highly oscillatory nature of the combined PPV and NPV metric as a function of the threshold. In this case, a greedy search procedure that has access to a partial gold-standard would only find a local minimum.

On the other hand, if one does not have access to a partial gold-standard, finding an optimal threshold is infeasible for both combined sensitivity and specificity, and combined PPV and NPV metrics. These nuances in the interpretation of metric-specific performance are critical for practical applications of the methods.

## Discussion

This benchmarking study shows the absolute and comparative performance of 32 network reverse-engineering methods/variants in 15 genome-scale real and simulated datasets/gold-standards using several metrics for assessing the accuracy of reverse-engineering. The methods used in this study include a broad array of state-of-the-art algorithms from bioinformatics and systems biology. In addition, algorithms from quantitative disciplines such as statistics and computer science were used. The results of this study show that some methods need to be substantially improved upon, while others should be used routinely. Those that should be improved are Aracne, Relevance Networks 1, Hierarchical Clustering, Graphical Lasso, GeneNet, and MI 1. The following methods should be routinely used: CLR, SA-CLR, qp-graphs, LGL-Bach, Relevance Networks 2, Fisher, and MI 2. Among the latter group of methods are LGL-Bach and qp-graphs, both of which are state-of-the-art techniques from computer science that deserve routine use in network inference tasks in bioinformatics and systems biology.

Our results also show that several univariate methods provide a “gatekeeper” performance threshold that should be used when method developers assess the performance of their novel algorithms. Furthermore, our analysis highlights the difficulty in determining optimal thresholds for algorithms that output scores for network edges rather than a network graph. The thresholds reported in primary publications of the score-based methods may be overfitted to the specific datasets used and therefore may not be universally applicable. Moreover, our results show that there is often no systematic way of searching for the best threshold over various performance metrics. Finally, our findings articulate the need for comprehensive benchmarking studies of future network reverse-engineering algorithms.

### Comparison to prior research in evaluation of network reverse-engineering algorithms

The need for a comprehensive evaluation of reverse-engineering algorithms is well understood by the scientific community. This led to the formation of the DREAM project – a series of four competitions designed to assess the accuracy of network reverse-engineering [40;45;46]. With only one exception, none of the DREAM challenges addressed the specific problem of de-novo reverse-engineering of genome-scale regulatory networks from real non-longitudinal observational microarray data. Instead, the challenges used data that was in-silico, non-genome-scale, and/or from longitudinal or controlled experiments. Moreover, the data often incorporated partial biological knowledge. Thus, the findings of the DREAM challenges are outside the scope of this work and of many practical applications of reverse-engineering methods in real datasets. An exception is the DREAM2 challenge that included a task to reverse-engineer a network from a single E.coli microarray dataset^{3}. Six algorithms were submitted, and the best performing method SA-CLR was considered to be a winner. However, as was mentioned previously, a winning performance in a single dataset may be a result of overfitting. Thus, one really has to assess algorithms over several datasets to reach reproducible conclusions. In addition to using 15 gold-standards/datasets in our study, we improve on the DREAM2 genome-scale challenge by using more methods for reverse-engineering, including newer methods that either were not available at the time of the DREAM2 challenge or did not participate in that competition.

To investigate the possibility of overfitting of SA-CLR to DREAM2 results, we included this method and the original DREAM2 E.coli dataset (labeled as “ECOLI(D)”) in our evaluation and obtained the same AUPR and AUROC scores as in the DREAM2 challenge (see results for the ECOLI(D) dataset in Tables 3 and 4). However, SA-CLR was not a top-5 method across all 15 gold-standards/datasets in our study according to the AUROC metric (Table 5). This suggests possible overfitting of this method to the DREAM2 dataset and highlights the need for multiple datasets in the evaluation of methods.

It is also worthwhile mentioning the study of Bansal et al. who performed an evaluation of reverse-engineering methods on both real and simulated microarray datasets and ran algorithms de-novo using non-longitudinal observational data [6]. Our work significantly extends this prior work. First, the authors of that work assessed only 2 methods on real non-longitudinal genome-scale data, whereas our study compared 32 methods/variants. Second, the work of [6] involved only 2 gold-standards of genome-scale sizes: one for the Yeast regulatory network [31] and the other for the 26-gene local pathway of *MYC* gene [8]. However, the latter gold-standard is incomplete (see Online Supplement), whereas the former one is outdated and not comprehensive in comparison with the most recent version of the Yeast regulatory map used in our evaluation [33]. Third, unlike [6], the synthetic gold-standards and data used in our study were generated to resemble real biological data (see Materials and Methods section), and can therefore provide better estimates of anticipated performance of the methods in real data. Lastly, our study utilized a suite of more sophisticated and informative performance metrics than sensitivity and PPV in order to evaluate the output of reverse-engineering algorithms from multiple perspectives.

Other recent efforts in comprehensive evaluation of reverse-engineering methods are typically limited to simulated data with a small number of genes, e.g. [26].

### How accurately can the employed gold-standards be inferred from real gene expression microarray data?

Despite our rigor in using the correct implementation/application of each method and the most comprehensive gold-standards available to date, there are currently limits to the predictive power of reverse-engineering methods. Suppose there exists an optimal algorithm that could accurately discover all existing regulatory mechanisms from data using tests of statistical independence/association or functional equivalents. Unfortunately, the network produced by this algorithm would remain different from our gold-standards for a number of reasons. First, the Yeast and E.coli gold-standards are largely produced from experiments that show the physical binding of a transcription factor (TF) to DNA. However, such a binding event often does not lead to a functional change in gene expression, and hence one may not observe a corresponding statistical dependence in the microarray data [32]. Second, the regulatory network learned will likely be significantly dependent on the set of microarray experiments available. Often a transcription factor will affect the expression of different sets of genes in a condition-dependent manner. If certain TF-gene interactions only occur under conditions that are missing or underrepresented in the data, then a significant statistical dependence between the variables will be “drowned out” by the other samples. Third, some TFs or genes may have inherently low expression values that cannot be measured accurately by microarrays. They might be normalized out, as small changes in expression could be masked by noise in the data. Fourth, cellular aggregation and sampling from mixtures of distributions that are abundant in microarray data can also hide some statistical independence/association relations [15]. The above limitations are not specific to reverse-engineering algorithms, but are specific to assays and experimental design. 
Therefore, the performance results obtained in our study can be considered as lower bounds on performance achievable by these algorithms.

### On performance metrics for assessing accuracy of network reverse-engineering algorithms

Since there is no single context-independent metric to assess the accuracy of reverse-engineering methods, we used four different metrics to evaluate the results from different “angles.” The combined PPV and NPV represents a measure of how precisely positive and negative interactions are predicted. The combined sensitivity and specificity favors methods that find an equally balanced trade-off between false-positive and false-negative interactions. AUROC represents the probability that a method ranks a true edge higher than a false edge, and hence quantifies the classification power of an algorithm. Finally, AUPR provides threshold-independent assessment of both the completeness (recall) and precision of a method.

One of the goals of reverse-engineering methods is to present experimentalists with a manageable list of putative regulatory interactions associated with a biological context. Many experimentalists are only concerned with the pathway (local network) around a single transcription factor, while others may be interested in broader network motifs. As a result, certain statistical metrics are more applicable to a specific biological context than others. For example, a biologist with limited resources who is interested in learning a pathway should use a method that scores well with respect to the combined PPV and NPV metric, such as LGL-Bach. On the other hand, biologists more interested in general regulatory patterns in a network, or with the resources to perform large-scale silencing experiments or binding analysis, might be more interested in an algorithm that scores well with respect to the combined sensitivity and specificity or AUROC metric. Part of our ongoing work is to elucidate the biological context specificity of reverse-engineering algorithms and derive context-specific performance metrics.

## Materials and Methods

### Real datasets and gold-standards

Real gold-standards and microarray datasets were obtained for both Yeast and E.coli. The Yeast gold-standard was built by identifying the promoter sequences that are both bound by TFs (according to ChIP-on-chip data) and conserved within the Saccharomyces genus [28;33]. Binding information is essential because TFs must first bind to a gene to induce or suppress expression, while conservation information is important because true-positive TF-DNA interactions are often conserved within a genus. This study used a set of Yeast gold-standard networks that ranged from conservative to liberal. To obtain this range, networks were chosen with different ChIP-on-chip binding significance levels α = 0.001 or 0.005, and were required to have a binding sequence that is conserved in C = 0, 1 or 2 of the related Saccharomyces species (Table 6). Hence, the most conservative gold-standard, YEAST(C), was built from TF-DNA interactions with α = 0.001, such that bound DNA sequence was conserved in at least 2 Yeast relatives. A compendium of 530 Yeast microarray samples was taken from the Many Microbe Microarray Database [20].

The E.coli gold-standard network was obtained from RegulonDB (version 6.4), a manually curated database of regulatory interactions obtained mainly through a literature search [24]. ChIP-qPCR data has shown RegulonDB to be approximately 85% complete [46]. Evidence for each regulatory interaction in RegulonDB is classified as “strong” or “weak”, depending on the type of experiment used to predict the interaction. For example, binding of a TF to a promoter is considered strong evidence, whereas gene-expression based computational predictions are considered weak evidence. For the purposes of our study, we created a conservative gold-standard of only strong interactions, and a liberal gold-standard that includes both strong and weak interactions. To ensure that our results are directly comparable with the DREAM2 challenge, we also included an earlier version of the RegulonDB gold-standard (see Table 6). A compendium of 907 E.coli microarray samples was taken from the Many Microbe Microarray Database [20]. We also used gene expression data from the DREAM2 challenge that was a subset of the previous dataset.

### Simulated datasets and gold-standards

In addition to using real gene expression data with approximate gold-standards, we also used simulated data where gold-standards are known exactly (Table 7). We focused here exclusively on cutting-edge simulation methods that produce artificial data that resembles real biological data.

The Resimulated Gene Expression Dataset (REGED) is based on a high-fidelity resimulation technique for generating synthetic data that is statistically indistinguishable from real expression data [25;44]. The REGED network was induced from 1,000 randomly selected genes in a lung cancer gene expression dataset [9]. This network displays a power-law connectivity [30] and generates data that is statistically indistinguishable from real data according to an SVM classifier [47]. Moreover, statistical dependencies and independencies are consistent between the real and synthetic data according to the Fisher’s Z test. Note that the REGED dataset was used in the Causality and Prediction Challenge [25].

The GeneNetWeaver (GNW) simulation method attempts to mimic real biological data by using topology of known regulatory networks [34;35]. Stochastic dynamics that are meant to model transcriptional regulation were applied to the extracted networks to generate simulated data.

### Network reverse-engineering methods

This study used univariate methods as well as four classes of multivariate network reverse-engineering methods: correlation-based, mutual information-based, causal graph-based, and Gaussian graphical models (Table 8). While Aracne, Relevance Networks, LGL-Bach, Graphical Lasso, and Hierarchical Clustering output a graph (adjacency matrix), other methods output a symmetric matrix of scores that represent the relative likelihood of a regulatory interaction between any two genes. To obtain a graph for the latter methods, a threshold was chosen and an edge was formed between genes that have a score larger than the threshold. Methods MI 1 and MI 2 were used without thresholding, because otherwise they become equivalent to Relevance Networks that were already included in the study.
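The score-to-graph conversion just described can be expressed as a minimal sketch (assuming the scores form a symmetric NumPy matrix; the function name is illustrative, not from the study):

```python
import numpy as np

def scores_to_graph(score_matrix, threshold):
    """Turn a symmetric matrix of edge scores into an undirected
    adjacency matrix: keep an edge only if its score exceeds
    the chosen threshold."""
    adjacency = score_matrix > threshold
    np.fill_diagonal(adjacency, False)  # self-regulation edges are excluded
    return adjacency
```

Because the input matrix is symmetric, the resulting adjacency matrix is symmetric as well, i.e., it encodes an undirected network.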

Since the problem of regulatory network reverse-engineering is NP-hard, only an algorithm that is worst-case exponential in the number of genes in the dataset can be both sound and complete^{4} [14]. With the exception of LGL-Bach that is sound and complete^{5}, all other algorithms used in this study have by design low-order polynomial complexity (Table 8) and therefore cannot possibly be sound and complete. Notably, LGL-Bach is not always exponential, but rather adjusts its complexity to the network that produced the data, and in many distributions runs faster than other tested methods.

We used the original author implementations of all methods except for Relevance Networks 2 and Fisher (see Table S3 in the Online Supplement). The original implementations for the latter two methods were not available, and we programmed them in Matlab. We used default author-recommended parameters for all methods whenever they were programmed in the software, stated in the original manuscript, or provided by the authors. In addition, we used popular statistical thresholds, as described in Table S3 in the Online Supplement. This allowed us to explore different configurations of the algorithms and assess their performance characteristics. All methods except for SA-CLR were run on a high performance computing facility in the Center for Health Informatics and Bioinformatics (CHIBI) at New York University Langone Medical Center. SA-CLR was run by its creators on a Columbia University cluster.

### Performance assessment metrics

For the methods that directly output a network graph, we first computed positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity. PPV measures the probability that a regulatory interaction discovered by the algorithm exists in the gold-standard (i.e., the precision of the output graph), while NPV measures the probability that an interaction *not* predicted by the algorithm does *not* exist in the gold-standard. Sensitivity measures the proportion of interactions in the gold-standard that are discovered by the algorithm (i.e., the completeness of the output graph), whereas specificity measures the proportion of interactions absent in the gold-standard that are *not* predicted by the algorithm. Then, PPV and NPV were combined into a single metric by computing the Euclidean distance from the optimal algorithm with PPV = 1 and NPV = 1: $\sqrt{(1-\text{PPV})^{2}+(1-\text{NPV})^{2}}$. Similarly, we combined sensitivity and specificity by computing the Euclidean distance to the optimal algorithm with sensitivity = 1 and specificity = 1: $\sqrt{(1-\text{sensitivity})^{2}+(1-\text{specificity})^{2}}$ [22]. These metrics take values between 0 and $\sqrt{2}$, where 0 denotes performance of the optimal algorithm and $\sqrt{2}$ denotes performance of the worst possible algorithm. A *smaller* value for either of these two metrics implies a more accurate algorithm.
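Both combined metrics follow directly from the four confusion-matrix counts; a minimal sketch (function names are ours):

```python
import math

def combined_ppv_npv(tp, fp, tn, fn):
    """Euclidean distance from the optimal algorithm with
    PPV = NPV = 1 (0 = best, sqrt(2) = worst)."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return math.sqrt((1 - ppv) ** 2 + (1 - npv) ** 2)

def combined_sensitivity_specificity(tp, fp, tn, fn):
    """Euclidean distance from the optimal algorithm with
    sensitivity = specificity = 1 (0 = best, sqrt(2) = worst)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return math.sqrt((1 - sens) ** 2 + (1 - spec) ** 2)
```

For example, an algorithm with no false positives and no false negatives scores 0 on both metrics, while one whose predictions are entirely wrong scores $\sqrt{2}$.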

For the methods that do not directly output a network graph, but rather output scores for the edges, we computed the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR) [16;27]. These metrics take values between 0 and 1, where 0 denotes performance of the worst possible algorithm and 1 denotes performance of the optimal algorithm. For AUROC, 0.5 denotes performance of an algorithm that randomly scores edges. A *larger* value for either of these two metrics implies a more accurate algorithm.
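The probabilistic reading of AUROC (the chance that a true edge receives a higher score than a false edge) can be computed directly from the edge scores, without building a ROC curve. A sketch under that definition, with ties counted as one half (array names are illustrative):

```python
import numpy as np

def auroc(scores, truth):
    """AUROC as the probability that a randomly chosen true edge is
    scored higher than a randomly chosen false edge (ties count 1/2)."""
    pos = scores[truth]      # scores of edges present in the gold-standard
    neg = scores[~truth]     # scores of edges absent from the gold-standard
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise formulation is quadratic in the number of edges; rank-based implementations achieve the same result more efficiently for genome-scale edge lists.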

We note that all of the above metrics were used in this study to measure the performance based on the *undirected* graphs output by each algorithm. Inference of *directed* graphs from data remains a more challenging problem that is beyond the scope of the present study.

### Statistical analysis

The performance ranks of all algorithms were computed taking into consideration 95% confidence intervals around all point estimates. For example, if some method is the best performing one with AUROC = 0.98 and 95% confidence interval = [0.95, 1], then a method with AUROC = 0.96 is assigned the same rank as the best performing method. The confidence intervals were obtained using the methods of [17] for AUROC; [43] for AUPR; and the hyper-geometric test for combined sensitivity and specificity and combined PPV and NPV.
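One plausible reading of this CI-aware ranking rule can be sketched as follows (illustrative only; the exact tie-handling used in the study may differ). Methods are sorted by point estimate, and a method whose estimate falls inside the current group leader's 95% confidence interval shares the leader's rank:

```python
def ci_aware_ranks(estimates, ci_lower):
    """Rank methods (higher estimate = better).

    estimates -- dict: method name -> point estimate (e.g., AUROC);
    ci_lower  -- dict: method name -> lower bound of its 95% CI.
    A method whose estimate lies within the current group leader's CI
    is assigned the same rank as the leader.
    """
    order = sorted(estimates, key=lambda m: -estimates[m])
    ranks, rank, leader_lo = {}, 0, None
    for i, m in enumerate(order):
        if leader_lo is None or estimates[m] < leader_lo:
            rank, leader_lo = i + 1, ci_lower[m]  # start a new rank group
        ranks[m] = rank
    return ranks

# Example from the text: 0.96 falls inside the best method's CI [0.95, 1]
ranks = ci_aware_ranks({"A": 0.98, "B": 0.96, "C": 0.90},
                       {"A": 0.95, "B": 0.93, "C": 0.87})
```

Here methods A and B tie for the top rank, while C, falling below A's confidence interval, starts a new rank group.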

Because the maximum rank may differ from dataset to dataset (e.g., due to ties), the obtained “raw” ranks were normalized and averaged over all datasets^{6}. These average ranks were then used to obtain the final ranking of all methods (Table 5) according to the given performance metric.
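The normalization and averaging described above (and in footnote 6) can be illustrated with a short sketch, assuming raw ranks are normalized by the maximum rank within each dataset, averaged over dataset versions, and then averaged over dataset types:

```python
def normalized_average_ranks(raw_ranks):
    """Average normalized ranks over datasets.

    raw_ranks -- dict: dataset type (e.g., 'GNW') -> list of dicts,
    one per dataset version, each mapping method name -> raw rank.
    Ranks are divided by the maximum rank in each dataset, averaged
    over versions of a type, then grand-averaged over types.
    """
    methods = {m for versions in raw_ranks.values()
                 for ranks in versions for m in ranks}
    final = {}
    for m in methods:
        type_means = []
        for versions in raw_ranks.values():
            normalized = [ranks[m] / max(ranks.values()) for ranks in versions]
            type_means.append(sum(normalized) / len(normalized))
        final[m] = sum(type_means) / len(type_means)
    return final

# Two dataset types; GNW has two versions (cf. GNW(A)...GNW(D) in the study)
final = normalized_average_ranks({
    "GNW":   [{"A": 1, "B": 2}, {"A": 2, "B": 2}],
    "REGED": [{"A": 1, "B": 1}],
})
```

A lower average normalized rank indicates a method that performs consistently well across datasets.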

## Acknowledgements

Alexander Statnikov and Constantin F. Aliferis acknowledge support from grants R56 LM007948-04A1 from the National Library of Medicine, National Institutes of Health, and 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health. Varun Narendra was supported by the New York University Medical Scientist Training Program. We would also like to thank Dimitris Anastassiou and John Watkinson for modifying the SA-CLR algorithm for our experiments and for running it on the Columbia University high-performance computing facility; Boris Hayete for providing details on reproducing the results of [21]; Robert Castelo for providing details on reproducing the results of [13]; Peng Qiu for providing code for fast computation of pairwise mutual information as in [41]; and Thomas Schaffter and Daniel Marbach for assistance with the GeneNetWeaver gene network simulator.

## Footnotes

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

^{1}In this context, classification power refers to the ability to correctly classify each pair of genes as having a direct regulatory interaction, or not.

^{2}A univariate method refers to a method that only tests for pairwise association between a target gene and a single gene.

^{3}To be precise, this DREAM2 challenge was not completely de-novo because a list of 152 transcription factors was given to each participant.

^{4}An algorithm is considered “sound” if it outputs only true-positive gene interactions. An algorithm is “complete” if it produces all true-positive gene interactions, i.e., the entire network.

^{5}LGL-Bach is provably sound and complete for learning the graph skeleton (undirected graph) [2;3].

^{6}The ranks were first averaged over multiple versions of datasets (e.g., GNW(A), GNW(B), GNW(C), GNW(D)), and then the grand average was obtained over 4 dataset types (REGED, GNW, ECOLI, and YEAST).

## References
