HHS Public Access. Author manuscript; accepted for publication in a peer-reviewed journal.
Genomics. Author manuscript; available in PMC 2012 Jan 1.
Published in final edited form as:
PMCID: PMC3132400

A comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks


De-novo reverse-engineering of genome-scale regulatory networks is an increasingly important objective for biological and translational research. While many methods have been recently developed for this task, their absolute and relative performance remains poorly understood. The present study conducts a rigorous performance assessment of 32 computational methods/variants for de-novo reverse-engineering of genome-scale regulatory networks by benchmarking these methods in 15 high-quality datasets and gold-standards of experimentally verified mechanistic knowledge. The results of this study show that some methods need to be substantially improved upon, while others should be used routinely. Our results also demonstrate that several univariate methods provide a “gatekeeper” performance threshold that should be applied when method developers assess the performance of their novel multivariate algorithms. Finally, the results of this study can be used to show practical utility and to establish guidelines for everyday use of reverse-engineering algorithms, aiming towards creation of automated data-analysis protocols and software systems.

Keywords: Regulatory network de-novo reverse-engineering, computational methods, evaluation, gene expression microarray analysis


The cell is a dynamic system of molecules that interact and regulate each other. Discovering these regulatory interactions is essential to expanding our understanding of normal and pathologic cellular physiology, and can lead to the development of drugs that manipulate cellular pathways to fight disease. A global model of gene regulation will also be essential for the design of synthetic genomes with targeted properties, such as the production of biofuels and medically relevant molecules [11]. There exist many databases that encapsulate biological pathways (e.g., KEGG and BioCarta); however, these databases are often inaccurate and incomplete and do not correspond to the studied biological system and experimental conditions [1;29;42;45]. Therefore, there is a strong need for the reverse-engineering of genome-scale regulatory networks using de-novo methods.

There is no doubt that data from targeted knockout/overexpression and/or longitudinal experiments provide the richest information about gene interactions that can be used by de-novo reverse-engineering methods. Unfortunately, such data is not currently abundant enough to enable discovery of regulatory networks, whereas there are thousands of available observational datasets from non-longitudinal case-control or case-series studies [7;39]. In addition, obtaining data from targeted knockout/overexpression experiments can be more expensive, unethical and/or infeasible for many biological systems and conditions. Thus, current methods are forced to utilize non-longitudinal case-control or case-series genome-scale data to reverse-engineer regulatory networks.

Over the last decade, many methods have been developed to reverse-engineer regulatory networks from observational data. However, their absolute and relative performance remains poorly understood [34]. Typically, a study that introduces a novel method performs only a small-scale evaluation using one or two datasets [21], and without comprehensive benchmarking against the best-performing techniques [38]. Such studies can show that the novel method is promising, but cannot demonstrate its empirical superiority and utility in general. Similarly, the past international competitions for reverse-engineering of regulatory networks have not provided a definitive answer as to what the best performing techniques are for genome-scale non-longitudinal observational data. The only competition that used real gene expression data for the inference of genome-scale network was DREAM2 [46]. However, since this competition involved a single dataset to which many methods had been applied, the results may be overfitted and thus may not generalize to other datasets [18].

The present study, for the first time, conducts a rigorous performance assessment of methods for reverse-engineering of genome-scale regulatory networks by benchmarking state-of-the-art methods (from bioinformatics/systems biology and quantitative disciplines such as computer science and biostatistics) in multiple high-quality datasets and gold-standards of experimentally verified mechanistic knowledge. The results of this study can be used to show practical utility and to establish guidelines for everyday use of network reverse-engineering algorithms, with ancillary benefits providing guidance about “best of breed” inference engines suitable for automated data-analysis protocols and software systems.


This work assessed the accuracy of 32 state-of-the-art network reverse-engineering methods/variants in 15 genome-scale real and simulated datasets/gold-standards. Since reverse-engineering methods are used in a variety of contexts, a single metric cannot be used to assess their accuracy. In order to capture the broad applicability of reverse-engineering algorithms, four benchmarks were conducted in this study, and each of them used a different metric to evaluate accuracy of reverse-engineering (details about metrics are provided in the Materials and Methods section). In each benchmark, methods were ranked based on their accuracy, and the top-five scoring methods were considered “best of breed”. Methods that were winners in at least one of the four benchmarks should be used routinely by biologists and bioinformaticians for reverse-engineering purposes, while other methods should be substantially improved upon.

Benchmark #1: Which methods have the best combined positive predictive value (PPV) and negative predictive value (NPV)?

Implementations of LGL-Bach, regardless of parameters, constituted all of the top-five performing techniques (Tables 1 and 5). This method output few regulatory interactions relative to the size of the gold-standard. However, a larger percentage of these were true-positive interactions than for any other algorithm. Moreover, for most datasets more than 98–99% of the regulatory interactions not predicted by LGL-Bach did not exist in the gold-standard. Such a relatively accurate list of putative interactions can be fruitful for biologists because it limits the effort spent on experimentally validating the false-positive interactions of a reverse-engineering method. Of note, Graphical Lasso performed the best on the GNW(A), GNW(B), and ECOLI(D) datasets. However, it performed poorly on all other datasets, and therefore ranks only seventh among all methods.

Table 1
Combined PPV and NPV metric (Euclidean distance from the optimal algorithm with PPV = 1 and NPV = 1) for 30 methods/variants over 15 datasets. Methods denoted “Full Graph” and “Empty Graph” output the fully connected and ...
Table 5
Final ranking of methods according to each of the four performance metrics (benchmarks). The top-5 ranking methods for each benchmark are highlighted with red; other methods are highlighted with blue. Methods that are top-5 performers in at least one ...

Benchmark #2: Which methods have the best combined sensitivity and specificity?

The methods that produced the best combined sensitivity and specificity were Relevance Networks 2, CLR (Stouffer MI estimator; α = 0.05), Fisher (FDR = 0.05), SA-CLR (α = 0.05), and CLR (Normal MI estimator; α = 0.05) (Tables 2 and 5). These methods discovered more true regulatory interactions than LGL-Bach did. However, this came at the expense of a larger proportion of false-positive interactions. Biologists with limited resources may prefer results from methods such as LGL-Bach that are less complete (i.e., with smaller sensitivity), but more accurate (i.e., with larger PPV). Of note, Relevance Networks 1 produced the best performing results on three of the four GNW datasets and ECOLI(D). However, its poor performance on the other ECOLI datasets and REGED lowered its overall ranking to seventh. LGL-Bach and Aracne had the best performance among all methods on REGED, but performed poorly on all other datasets.

Table 2
Combined sensitivity and specificity metric (Euclidean distance from the optimal algorithm with sensitivity = 1 and specificity = 1) for 30 methods/variants over 15 datasets. Methods denoted as “Full Graph” and “Empty Graph” ...

Benchmark #3: Which methods have the best area under the ROC (AUROC) curve?

The area under the ROC curve was measured for the 12 methods/variants that produce scores for graph edges (Table 3), and it provides a threshold-independent metric of the classification power of each method. In order from first to fifth place, the best performing algorithms were qp-graphs (q = 200), CLR (Normal MI estimator), CLR (Stouffer MI estimator), qp-graphs (q = 20), and MI 2 (Table 5). Notably, the Fisher method produced top-5 AUROC scores over all REGED, GNW, and ECOLI datasets, but performed statistically indistinguishably from random on YEAST datasets. It is important to note that qp-graphs performed very well with respect to the threshold-independent AUROC metric, but very poorly in terms of the combined sensitivity and specificity. This discrepancy accentuates the difficulty in choosing an optimal threshold for this method as discussed below.

Table 3
Area under ROC curve (AUROC) for 12 methods/variants over 15 datasets. Cells with bold values correspond to AUROC estimates that are statistically different from random (AUROC = 0.5) according to the method of [17]. Details about methods and their parameters ...

Benchmark #4: Which methods have the best area under the precision-recall (AUPR) curve?

The area under the precision-recall curve was also measured for all 12 score-based methods/variants (Table 4). Methods that perform well according to this metric produce a list of putative interactions that strike a balance between recall (or sensitivity) and precision (or PPV). CLR (Stouffer MI estimator) and CLR (Normal MI estimator) were the best performing methods, occupying first and second places, respectively. qp-graphs (q=200) ranked third, SA-CLR ranked fourth, and MI 2 ranked fifth (Table 5). Notably, MI 2 turned out to be among the top performing methods because of its performance in REGED, GNW, and ECOLI datasets; its performance in YEAST datasets was statistically indistinguishable from random.

Table 4
Area under precision-recall curve (AUPR) for 12 methods/variants over 15 datasets. Cells with bold values correspond to AUPR estimates that are statistically different from random according to the method of [43]. Details about methods and their parameters ...

Some methods often outperform other techniques, while others are consistent underperformers

Operationally we define a method to be an underperformer if it did not score in the top-5 methods/variants for at least one of the four performance metrics. According to our study, the underperforming methods are Aracne, Relevance Networks 1, Hierarchical Clustering, Graphical Lasso, GeneNet, and MI 1. This implies that other state-of-the-art algorithms can produce better results across a wide range of gold-standards/datasets and performance metrics. Hence, the underperforming algorithms should be revisited and substantially improved upon.

Since there is no single performance metric that fully captures the power of a method in all conceivable contexts of application, all algorithms that scored well with respect to at least one metric should be used in the context in which they performed best. Our analysis shows that CLR is a top performer for three metrics, qp-graphs and SA-CLR are top performers for two metrics, while LGL-Bach, Relevance Networks 2, Fisher, and MI 2 are top performers with respect to one metric.

Univariate methods provide a “gatekeeper” performance and should be used when method developers assess performance of their novel algorithms

In the last several years there has been an emergence of mathematically and computationally complex novel methods for reverse-engineering of regulatory networks [34;45;46]. We believe that the cost of added complexity should be offset by an increased performance of the method. Hence, the simplest (univariate) methods should provide a “gatekeeper” performance threshold, above which all novel complex algorithms should perform.

With respect to the combined positive and negative predictive value metric, the added complexity of the winning LGL-Bach method is justified by its superior performance compared to the highest-ranking univariate method (CLR, only ninth place). Similarly, qp-graphs (q = 200) achieves a better AUROC than any univariate method, and should be used despite its increased complexity. On the other hand, the three top performing methods with respect to the combined sensitivity and specificity are all univariate methods. Similarly, the univariate CLR method performs optimally with respect to AUPR. Therefore, researchers interested in methods that currently produce the best results with respect to the latter two metrics do not need to use computationally more expensive multivariate methods.

It is challenging to select an optimal threshold for a method that outputs scores for edges rather than a network graph

Recall that the score-based methods output scores for all possible edges in a graph. A regulatory network is then obtained by choosing a threshold and pruning all edges whose scores are below the threshold. Therefore, the quality of the produced network largely depends on the choice of a threshold. However, finding a threshold that optimizes either combined sensitivity and specificity or combined PPV and NPV is challenging.

If one has access to a partial gold-standard, it may be feasible to optimize the threshold for the combined sensitivity and specificity because this metric often has a single (global) minimum (see Figure S1 in the Online Supplement). In general, this result follows from the fact that sensitivity and specificity are monotonically decreasing and increasing functions of the threshold, respectively. Thus, one can apply a greedy search procedure to find a threshold value corresponding to the optimal combined sensitivity and specificity.

However, the combined PPV and NPV, and in general all metrics that incorporate PPV and NPV, do not increase or decrease monotonically with the threshold (see Online Supplement for an explanation). Figure S2 in the Online Supplement depicts the highly oscillatory nature of the combined PPV and NPV metric as a function of the threshold. In this case, a greedy search procedure that has access to a partial gold-standard would only find a local minimum.

On the other hand, if one does not have access to a partial gold-standard, finding an optimal threshold is infeasible for both combined sensitivity and specificity, and combined PPV and NPV metrics. These nuances in the interpretation of metric-specific performance are critical for practical applications of the methods.
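The greedy search described above for the combined sensitivity and specificity can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function names, the toy edge scores, and the convention of scoring an undefined ratio as 0 are all assumptions made for the example. The search walks the candidate thresholds upward and stops as soon as the metric gets worse, which reaches the global minimum only when the metric has a single minimum.

```python
import math

def sens_spec_distance(scored_edges, gold, threshold):
    """Distance from (sensitivity, specificity) = (1, 1) at a given threshold.

    scored_edges maps an edge to its score; gold is the set of true edges
    in the (partial) gold-standard. An edge is predicted if score > threshold.
    """
    tp = fp = fn = tn = 0
    for edge, score in scored_edges.items():
        predicted = score > threshold
        if edge in gold:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    # Assumed convention: an undefined ratio (0/0) is scored as 0.
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.hypot(1 - sens, 1 - spec)

def greedy_threshold(scored_edges, gold):
    """Walk thresholds upward and stop at the first one that is worse."""
    candidates = [float("-inf")] + sorted(set(scored_edges.values()))
    best_t = candidates[0]
    best_d = sens_spec_distance(scored_edges, gold, best_t)
    for t in candidates[1:]:
        d = sens_spec_distance(scored_edges, gold, t)
        if d > best_d:
            break  # the metric started increasing: keep the current minimum
        best_t, best_d = t, d
    return best_t, best_d

# Hypothetical scores for the six possible edges among genes a, b, c, d.
toy_scores = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.6,
    ("b", "c"): 0.3, ("a", "d"): 0.2, ("c", "d"): 0.1,
}
toy_gold = {("a", "b"), ("a", "c"), ("b", "d")}
best_threshold, best_distance = greedy_threshold(toy_scores, toy_gold)
```

Because the loop stops at the first increase, applying the same search to the oscillating combined PPV and NPV metric would generally return a local rather than a global minimum.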


This benchmarking study shows the absolute and comparative performance of 32 network reverse-engineering methods/variants in 15 genome-scale real and simulated datasets/gold-standards using several metrics for assessing the accuracy of reverse-engineering. The methods used in this study include a broad array of state-of-the-art algorithms from bioinformatics and systems biology. In addition, algorithms from quantitative disciplines such as statistics and computer science were used. The results of this study show that some methods need to be substantially improved upon, while others should be used routinely. Those that should be improved are Aracne, Relevance Networks 1, Hierarchical Clustering, Graphical Lasso, GeneNet, and MI 1. The following methods should be routinely used: CLR, SA-CLR, qp-graphs, LGL-Bach, Relevance Networks 2, Fisher, and MI 2. Among the latter group of methods are LGL-Bach and qp-graphs, both of which are state-of-the-art techniques from computer science that deserve routine use in network inference tasks in bioinformatics and systems biology.

Our results also show that several univariate methods provide a “gatekeeper” performance threshold that should be used when method developers assess the performance of their novel algorithms. Furthermore, our analysis highlights the difficulty in determining optimal thresholds for algorithms that output scores for network edges rather than a network graph. The thresholds reported in primary publications of the score-based methods may be overfitted to the specific datasets used and therefore may not be universally applicable. Moreover, our results show that there is often no systematic way of searching for the best threshold over various performance metrics. Finally, our findings articulate the need for comprehensive benchmarking studies of future network reverse-engineering algorithms.

Comparison to prior research in evaluation of network reverse-engineering algorithms

The need for a comprehensive evaluation of reverse-engineering algorithms is well understood by the scientific community. This led to the formation of the DREAM project, a series of four competitions designed to assess the accuracy of network reverse-engineering [40;45;46]. With one exception, the DREAM challenges did not address the specific problem of de-novo reverse-engineering of genome-scale regulatory networks from real non-longitudinal observational microarray data. Instead, the challenges used data that was in-silico, non-genome-scale, and/or from longitudinal or controlled experiments. Moreover, the data often incorporated partial biological knowledge. Thus, the findings of the DREAM challenges are outside the scope of this work and of many practical applications of reverse-engineering methods in real datasets. The exception is the DREAM2 challenge, which included a task to reverse-engineer a network from a single E. coli microarray dataset. Six algorithms were submitted, and the best performing method, SA-CLR, was considered the winner. However, as was mentioned previously, a winning performance in a single dataset may be a result of overfitting. Thus, algorithms have to be assessed over several datasets to reach reproducible conclusions. In addition to using 15 gold-standards/datasets in our study, we improve on the DREAM2 genome-scale challenge by using more methods for reverse-engineering, including newer methods that either were not available at the time of the DREAM2 challenge or did not participate in that competition.

To investigate the possibility of overfitting of SA-CLR to DREAM2 results, we included this method and the original DREAM2 E. coli dataset (labeled as “ECOLI(D)”) in our evaluation and obtained the same AUPR and AUROC scores as in the DREAM2 challenge (see results for the ECOLI(D) dataset in Tables 3 and 4). However, SA-CLR was not a top-5 method across all 15 gold-standards/datasets in our study according to the AUROC metric (Table 5). This suggests possible overfitting of this method to the DREAM2 dataset and highlights the need for multiple datasets in the evaluation of methods.

It is also worthwhile mentioning the study of Bansal et al. who performed an evaluation of reverse-engineering methods on both real and simulated microarray datasets and ran algorithms de-novo using non-longitudinal observational data [6]. Our work significantly extends this prior work. First, the authors of that work assessed only 2 methods on real non-longitudinal genome-scale data, whereas our study compared 32 methods/variants. Second, the work of [6] involved only 2 gold-standards of genome-scale sizes: one for the Yeast regulatory network [31] and the other for the 26-gene local pathway of MYC gene [8]. However, the latter gold-standard is incomplete (see Online Supplement), whereas the former one is outdated and not comprehensive in comparison with the most recent version of the Yeast regulatory map used in our evaluation [33]. Third, unlike [6], the synthetic gold-standards and data used in our study were generated to resemble real biological data (see Materials and Methods section), and can therefore provide better estimates of anticipated performance of the methods in real data. Lastly, our study utilized a suite of more sophisticated and informative performance metrics than sensitivity and PPV in order to evaluate the output of reverse-engineering algorithms from multiple perspectives.

Other recent efforts in comprehensive evaluation of reverse-engineering methods are typically limited to simulated data with a small number of genes, e.g. [26].

How accurately can the employed gold-standards be inferred from real gene expression microarray data?

Despite our rigor in using the correct implementation/application of each method and the most comprehensive gold-standards available to date, there are currently limits to the predictive power of reverse-engineering methods. Suppose there exists an optimal algorithm that could accurately discover all existing regulatory mechanisms from data using tests of statistical independence/association or functional equivalents. Unfortunately, the network produced by this algorithm would remain different from our gold-standards for a number of reasons. First, the Yeast and E. coli gold-standards are largely produced from experiments that show the physical binding of a transcription factor (TF) to DNA. However, such a binding event often does not lead to a functional change in gene expression, and hence one may not observe a corresponding statistical dependence in the microarray data [32]. Second, the regulatory network learned will likely be significantly dependent on the set of microarray experiments available. Often a transcription factor will affect the expression of different sets of genes in a condition-dependent manner. If certain TF-gene interactions only occur under conditions that are missing or underrepresented in the data, then a significant statistical dependence between the variables will be “drowned out” by the other samples. Third, some TFs or genes may have inherently low expression values that cannot be measured accurately by microarrays. They might be normalized out, as small changes in expression could be masked by noise in the data. Fourth, cellular aggregation and sampling from mixtures of distributions that are abundant in microarray data can also hide some statistical independence/association relations [15]. The above limitations are not specific to reverse-engineering algorithms, but are specific to assays and experimental design. Therefore, the performance results obtained in our study can be considered as lower bounds on performance achievable by these algorithms.

On performance metrics for assessing accuracy of network reverse-engineering algorithms

Since there is no single context-independent metric to assess the accuracy of reverse-engineering methods, we used four different metrics to evaluate the results from different “angles.” The combined PPV and NPV represents a measure of how precisely positive and negative interactions are predicted. The combined sensitivity and specificity favors methods that find an equally balanced trade-off between false-positive and false-negative interactions. AUROC represents the probability that a method ranks a true edge higher than a false edge, and hence quantifies the classification power of an algorithm. Finally, AUPR provides threshold-independent assessment of both the completeness (recall) and precision of a method.

One of the goals of reverse-engineering methods is to present experimentalists with a manageable list of putative regulatory interactions associated with a biological context. Many experimentalists are only concerned with the pathway (local network) around a single transcription factor, while others may be interested in broader network motifs. As a result, certain statistical metrics are more applicable to a specific biological context than others. For example, a biologist with limited resources who is interested in learning a pathway should use a method that scores well with respect to the combined PPV and NPV metric, such as LGL-Bach. On the other hand, biologists more interested in general regulatory patterns in a network, or with the resources to perform large-scale silencing experiments or binding analysis, might be more interested in an algorithm that scores well with respect to the combined sensitivity and specificity or AUROC metric. Part of our ongoing work is to elucidate the biological context specificity of reverse-engineering algorithms and derive context-specific performance metrics.

Materials and Methods

Real datasets and gold-standards

Real gold-standards and microarray datasets were obtained for both Yeast and E. coli. The Yeast gold-standard was built by identifying the promoter sequences that are both bound by TFs (according to ChIP-on-chip data) and conserved within the Saccharomyces genus [28;33]. Binding information is essential because TFs must first bind to a gene to induce or suppress expression, while conservation information is important because true-positive TF-DNA interactions are often conserved within a genus. This study used a set of Yeast gold-standard networks that ranged from conservative to liberal. To obtain this range, networks were chosen with different ChIP-on-chip binding significance levels α = 0.001 or 0.005, and were required to have a binding sequence that is conserved in C = 0, 1 or 2 of the related Saccharomyces species (Table 6). Hence, the most conservative gold-standard, YEAST(C), was built from TF-DNA interactions with α = 0.001, such that the bound DNA sequence was conserved in at least 2 Yeast relatives. A compendium of 530 Yeast microarray samples was taken from the Many Microbe Microarray Database [20].

Table 6
Description of the real gold-standards used in this study, along with the gene-expression data used for reverse-engineering the transcriptional network. See text for detailed description of gold-standards and datasets.

The E. coli gold-standard network was obtained from RegulonDB (version 6.4), a manually curated database of regulatory interactions obtained mainly through a literature search [24]. ChIP-qPCR data has shown RegulonDB to be approximately 85% complete [46]. Evidence for each regulatory interaction in RegulonDB is classified as “strong” or “weak”, depending on the type of experiment used to predict the interaction. For example, binding of a TF to a promoter is considered strong evidence, whereas gene-expression based computational predictions are considered weak evidence. For the purposes of our study, we created a conservative gold-standard of only strong interactions, and a liberal gold-standard that includes both strong and weak interactions. To ensure that our results are directly comparable with the DREAM2 challenge, we also included an earlier version of the RegulonDB gold-standard (see Table 6). A compendium of 907 E. coli microarray samples was taken from the Many Microbe Microarray Database [20]. We also used gene expression data from the DREAM2 challenge, which is a subset of this compendium.

Simulated datasets and gold-standards

In addition to using real gene expression data with approximate gold-standards, we also used simulated data where gold-standards are known exactly (Table 7). We focused here exclusively on cutting-edge simulation methods that produce artificial data that resembles real biological data.

Table 7
Description of the simulated gold-standards and dataset used in this study. See text for detailed description of gold-standards and datasets.

The Resimulated Gene Expression Dataset (REGED) is based on a high-fidelity resimulation technique for generating synthetic data that is statistically indistinguishable from real expression data [25;44]. The REGED network was induced from 1,000 randomly selected genes in a lung cancer gene expression dataset [9]. This network displays a power-law connectivity [30] and generates data that is statistically indistinguishable from real data according to an SVM classifier [47]. Moreover, statistical dependencies and independencies are consistent between the real and synthetic data according to the Fisher’s Z test. Note that the REGED dataset was used in the Causality and Prediction Challenge [25].

The GeneNetWeaver (GNW) simulation method attempts to mimic real biological data by using topology of known regulatory networks [34;35]. Stochastic dynamics that are meant to model transcriptional regulation were applied to the extracted networks to generate simulated data.

Network reverse-engineering methods

This study used both univariate and four classes of multivariate network reverse-engineering methods: correlation-based, mutual information-based, causal graph-based and Gaussian graphical models (Table 8). While Aracne, Relevance Networks, LGL-Bach, Graphical Lasso, and Hierarchical Clustering output a graph (adjacency matrix), other methods output a symmetric matrix of scores that represent the relative likelihood of a regulatory interaction between any two genes. To obtain a graph for the latter methods, a threshold was chosen and an edge was formed between genes that have a score larger than the threshold. Methods MI 1 and MI 2 were used without thresholding, because otherwise they become equivalent to Relevance Networks that were already included in the study.
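The thresholding step for score-based methods can be illustrated with a short sketch. The function name, the toy score matrix, and the threshold value are hypothetical and serve only to show how a symmetric score matrix is turned into an undirected graph.

```python
def threshold_scores(scores, threshold):
    """Return undirected edges (i, j), i < j, whose score exceeds the threshold.

    scores is a symmetric square matrix (list of lists); only the upper
    triangle is read, so each gene pair is considered once.
    """
    n = len(scores)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i][j] > threshold:
                edges.add((i, j))
    return edges

# Hypothetical symmetric score matrix for four genes.
toy_matrix = [
    [0.0, 0.9, 0.2, 0.4],
    [0.9, 0.0, 0.7, 0.1],
    [0.2, 0.7, 0.0, 0.3],
    [0.4, 0.1, 0.3, 0.0],
]
toy_edges = threshold_scores(toy_matrix, 0.35)
```

Raising the threshold can only remove edges from the resulting graph, which is why the quality of the network depends so strongly on how the threshold is chosen.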

Table 8
The list of reverse-engineering methods along with a brief description, computational complexity, and references.

Since the problem of regulatory network reverse-engineering is NP-hard, only an algorithm that is worst-case exponential in the number of genes in the dataset can be both sound and complete [14]. With the exception of LGL-Bach, which is sound and complete, all other algorithms used in this study have by design low-order polynomial complexity (Table 8) and therefore cannot possibly be sound and complete. Notably, LGL-Bach is not always exponential, but rather adjusts its complexity to the network that produced the data, and in many distributions runs faster than other tested methods.

We used the original author implementations of all methods except for Relevance Networks 2 and Fisher (see Table S3 in the Online Supplement). The original implementations for the latter two methods were not available, and we programmed them in Matlab. We used default author-recommended parameters for all methods whenever they were programmed in the software, stated in the original manuscript, or provided by the authors. In addition, we used popular statistical thresholds, as described in Table S3 in the Online Supplement. This allowed us to explore different configurations of the algorithms and assess their performance characteristics. All methods except for SA-CLR were run on a high performance computing facility in the Center of Health Informatics and Bioinformatics (CHIBI) at New York University Langone Medical Center. SA-CLR was run by its creators on a Columbia University cluster.

Performance assessment metrics

For the methods that directly output a network graph, we first computed positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity. PPV measures the probability that a regulatory interaction discovered by the algorithm exists in the gold-standard (i.e., the precision of the output graph), while NPV measures the probability that an interaction not predicted by the algorithm does not exist in the gold-standard. Sensitivity measures the proportion of interactions in the gold-standard that are discovered by the algorithm (i.e., the completeness of the output graph), whereas specificity measures the proportion of interactions absent in the gold-standard that are not predicted by the algorithm. Then, PPV and NPV were combined in a single metric by computing the Euclidean distance from the optimal algorithm with PPV = 1 and NPV = 1: $\sqrt{(1-\mathrm{PPV})^2+(1-\mathrm{NPV})^2}$. Similarly, we combined sensitivity and specificity by computing the Euclidean distance to the optimal algorithm with sensitivity = 1 and specificity = 1: $\sqrt{(1-\mathrm{sensitivity})^2+(1-\mathrm{specificity})^2}$ [22]. These metrics take values between 0 and $\sqrt{2}$, where 0 denotes performance of the optimal algorithm and $\sqrt{2}$ denotes performance of the worst possible algorithm. A smaller value for either of these two metrics implies a more accurate algorithm.
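As an illustration of the two combined metrics, the sketch below computes them for an undirected predicted graph against a gold-standard. The function names and toy edge sets are hypothetical, and the handling of undefined ratios (scored as 0, e.g. PPV for an empty graph) is an assumption of this example, not a convention stated in the study.

```python
import math

def confusion_counts(predicted, gold, n_genes):
    """Count TP, FP, FN, TN over all unordered gene pairs."""
    total_pairs = n_genes * (n_genes - 1) // 2
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = total_pairs - tp - fp - fn
    return tp, fp, fn, tn

def ratio(num, den):
    # Assumed convention: an undefined ratio (0/0) is scored as 0.
    return num / den if den else 0.0

def combined_metrics(predicted, gold, n_genes):
    """Euclidean distances from the ideal points (1, 1); smaller is better."""
    tp, fp, fn, tn = confusion_counts(predicted, gold, n_genes)
    ppv = ratio(tp, tp + fp)
    npv = ratio(tn, tn + fn)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "ppv_npv": math.hypot(1 - ppv, 1 - npv),
        "sens_spec": math.hypot(1 - sensitivity, 1 - specificity),
    }

# Hypothetical four-gene example: edges are pairs (i, j) with i < j.
toy_gold = {(0, 1), (1, 2), (2, 3)}
toy_predicted = {(0, 1), (0, 3), (1, 2)}
toy_result = combined_metrics(toy_predicted, toy_gold, n_genes=4)
```

In this toy case two of the three predicted edges are correct and two of the three absent edges are correctly left out, so all four base quantities equal 2/3 and both distances come out the same.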

For the methods that do not directly output a network graph, but rather output scores for the edges, we computed the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR) [16;27]. These metrics take values between 0 and 1, where 0 denotes performance of the worst possible algorithm and 1 denotes performance of the optimal algorithm. For AUROC, 0.5 denotes performance of an algorithm that randomly scores edges. A larger value for either of these two metrics implies a more accurate algorithm.
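As an illustration of the score-based metrics, the sketch below computes AUROC through its pairwise-comparison interpretation and approximates AUPR by average precision. Both choices are assumptions made for brevity: the study's own AUPR computation follows [16] and may interpolate the curve differently, and the function names `auroc` and `average_precision` are hypothetical. The pairwise loop is quadratic in the number of gene pairs, so it is suitable only for small examples.

```python
def auroc(scores, labels):
    """Probability that a random true edge outscores a random non-edge.

    Tied scores count as half a win. labels are 1 for gold-standard edges, else 0.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one true edge and one non-edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise approximation of the area under the precision-recall curve.

    Assumption: tied scores are ordered arbitrarily by the sort.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y)
    if n_pos == 0:
        raise ValueError("need at least one true edge")
    true_positives = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            true_positives += 1
            total += true_positives / k  # precision at this recall step
    return total / n_pos


# Five candidate edges with scores; the first and third are true edges.
edge_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
edge_labels = [1, 0, 1, 0, 0]
roc_area = auroc(edge_scores, edge_labels)
pr_area = average_precision(edge_scores, edge_labels)
```

Here the two true edges beat five of the six (true edge, non-edge) pairs, so AUROC is 5/6; average precision is (1/1 + 2/3)/2, which is also 5/6 for this particular example.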

We note that all of the above metrics were used in this study to measure the performance based on the undirected graphs output by each algorithm. Inference of directed graphs from data remains a more challenging problem that is beyond the scope of the present study.

Statistical analysis

The performance ranks of all algorithms were computed taking into consideration 95% confidence intervals around all point estimates. For example, if some method is the best performing one with AUROC = 0.98 and 95% confidence interval = [0.95, 1], then a method with AUROC = 0.96 is assigned the same rank as the best performing method. The confidence intervals were obtained using the methods of [17] for AUROC; [43] for AUPR; and the hyper-geometric test for combined sensitivity and specificity and combined PPV and NPV.
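One way to read the ranking rule in the example above is sketched below in Python. It is an interpretation and not the study's procedure: methods are sorted by point estimate, the best not-yet-ranked method becomes the leader of the next rank, and every method whose point estimate is at or above the leader's lower confidence bound shares that rank. The function name `ci_aware_ranks` and the input layout are assumptions, and the sketch covers only metrics where larger is better.

```python
def ci_aware_ranks(estimates, intervals):
    """Assign tied ranks using confidence intervals (larger estimate is better).

    estimates: {method: point estimate}
    intervals: {method: (lower bound, upper bound)}
    Hypothetical helper illustrating one reading of the ranking rule.
    """
    order = sorted(estimates, key=estimates.get, reverse=True)
    ranks = {}
    rank = 0
    i = 0
    while i < len(order):
        rank += 1
        leader = order[i]
        lower_bound = intervals[leader][0]
        ranks[leader] = rank
        i += 1
        # Methods not below the leader's lower bound share the leader's rank.
        while i < len(order) and estimates[order[i]] >= lower_bound:
            ranks[order[i]] = rank
            i += 1
    return ranks


ranks = ci_aware_ranks(
    {"M1": 0.98, "M2": 0.96, "M3": 0.90, "M4": 0.87},
    {"M1": (0.95, 1.00), "M2": (0.93, 0.99), "M3": (0.86, 0.94), "M4": (0.83, 0.91)},
)
```

With these made-up numbers, M2 falls inside the interval of M1 and shares rank 1, while M3 starts rank 2 and M4 falls inside the interval of M3.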

Because the maximum rank may differ from dataset to dataset (e.g., due to ties), the obtained “raw” ranks were normalized and averaged over all datasets⁶. These average ranks were then used to obtain the final ranking of all methods (Table 5) according to the given performance metric.
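The normalization and two-level averaging can be sketched as follows. The text does not give the normalization formula, so dividing each raw rank by the maximum rank in its dataset is an assumption, as are the function names and the nested-dictionary layout. The averaging order (first over versions of a dataset type, then over dataset types) follows the footnote to this paragraph.

```python
from statistics import mean


def normalized_ranks(raw_ranks):
    """Scale raw ranks by the largest rank in the dataset (assumed formula)."""
    largest = max(raw_ranks.values())
    return {method: rank / largest for method, rank in raw_ranks.items()}


def average_rank(ranks_by_type):
    """Average normalized ranks within each dataset type, then across types.

    ranks_by_type: {dataset type: {dataset version: {method: raw rank}}}
    """
    per_type = []
    for versions in ranks_by_type.values():
        normed = [normalized_ranks(raw) for raw in versions.values()]
        methods = list(normed[0])
        per_type.append({m: mean(n[m] for n in normed) for m in methods})
    methods = list(per_type[0])
    return {m: mean(t[m] for t in per_type) for m in methods}


# Made-up raw ranks for three hypothetical methods X, Y, Z.
final = average_rank({
    "GNW": {
        "GNW(A)": {"X": 1, "Y": 2, "Z": 2},
        "GNW(B)": {"X": 1, "Y": 2, "Z": 4},
    },
    "ECOLI": {
        "ECOLI": {"X": 2, "Y": 1, "Z": 2},
    },
})
```

Averaging within a dataset type first keeps a type with many versions from dominating the grand average.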

Supplementary Material



Acknowledgments

Alexander Statnikov and Constantin F. Aliferis acknowledge support from grants R56 LM007948-04A1 from the National Library of Medicine, National Institutes of Health, and 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health. Varun Narendra was supported by the New York University Medical Science Training Program. We would also like to acknowledge Dimitris Anastassiou and John Watkinson for modifying the SA-CLR algorithm for our experiments and for running it on the Columbia University high performance computing facility; Boris Hayete for providing us with details on reproducing the results of [21]; Robert Castelo for providing us with details on reproducing the results of [13]; Peng Qiu for providing us with code for fast computation of pairwise mutual information as in [41]; and Thomas Schaffter and Daniel Marbach for assistance with the GeneNetWeaver gene network simulator.


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

¹ In this context, classification power refers to the ability to correctly classify each pair of genes as having a direct regulatory interaction, or not.

² A univariate method refers to a method that only tests for pairwise association between a target gene and a single gene.

³ To be precise, this DREAM2 challenge was not completely de-novo because a list of 152 transcription factors was given to each participant.

⁴ An algorithm is considered “sound” if it outputs only true-positive gene-interactions. An algorithm is “complete” if it produces all true-positive gene-interactions, i.e., the entire network.

⁵ LGL-Bach is provably sound and complete for learning the graph skeleton (undirected graph) [2;3].

⁶ The ranks were first averaged over multiple versions of datasets (e.g., GNW(A), GNW(B), GNW(C), GNW(D)), and then the grand average was obtained over 4 dataset types (REGED, GNW, ECOLI, and YEAST).


1. Adriaens ME, Jaillard M, Waagmeester A, Coort SL, Pico AR, Evelo CT. The public road to high-quality curated biological pathways. Drug Discov.Today. 2008;13:856–862.
2. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part I: Algorithms and Empirical Evaluation. Journal of Machine Learning Research. 2010;11:171–234.
3. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part II: Analysis and Extensions. Journal of Machine Learning Research. 2010;11:235–284.
4. Anderson TW. An introduction to multivariate statistical analysis. Hoboken, N.J: Wiley-Interscience; 2003.
5. Bach FR, Jordan MI. Learning graphical models with Mercer kernels. Advances in Neural Information Processing Systems (NIPS) 2003;15:1009–1016.
6. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol.Syst.Biol. 2007;3:78.
7. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009;37:D885–D890.
8. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A. Reverse engineering of regulatory networks in human B cells. Nat.Genet. 2005;37:382–390.
9. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc.Natl.Acad.Sci.U.S.A. 2001;98:13790–13795.
10. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac.Symp.Biocomput. 2000:418–429.
11. Carrera J, Rodrigo G, Jaramillo A. Towards the automated engineering of a synthetic genome. Mol.Biosyst. 2009;5:733–743.
12. Castelo R, Roverato A. A robust procedure for Gaussian graphical model search from microarray data with p larger than n. Journal of Machine Learning Research. 2006;7:2650.
13. Castelo R, Roverato A. Reverse engineering molecular regulatory networks from microarray data with qp-graphs. J.Comput.Biol. 2009;16:213–227.
14. Chickering DM, Heckerman D, Meek C. Large-sample learning of Bayesian networks is NP-hard. The Journal of Machine Learning Research. 2004;5:1287–1330.
15. Chu T, Glymour C, Scheines R, Spirtes P. A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays. Bioinformatics. 2003;19:1147–1152.
16. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning; 2006. p. 240.
17. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
18. Duda RO, Hart PE, Stork DG. Pattern classification. New York: Wiley; 2001.
19. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc.Natl.Acad.Sci.U.S.A. 1998;95:14863–14868.
20. Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36:D866–D870.
21. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5:e8.
22. Frey L, Fisher D, Tsamardinos I, Aliferis CF, Statnikov A. Identifying Markov blankets with decision tree induction. Proceedings of the Third IEEE International Conference on Data Mining (ICDM); 2003.
23. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441.
24. Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, Bonavides-Martinez C, Abreu-Goodger C, Rodriguez-Penagos C, Miranda-Rios J, Morett E, Merino E, Huerta AM, Trevino-Quintanilla L, Collado-Vides J. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 2008;36:D120–D124.
25. Guyon I, Aliferis C, Cooper G, Elisseeff A, Pellet JP, Spirtes P, Statnikov A. Design and analysis of the causation and prediction challenge. Journal of Machine Learning Research, Workshop and Conference Proceedings. 2008;3:1–33.
26. Hache H, Lehrach H, Herwig R. Reverse engineering of gene regulatory networks: a comparative study. EURASIP.J.Bioinform.Syst.Biol. 2009:617281.
27. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning. 2001;45:171–186.
28. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104.
29. Huttenhower C, Hibbs MA, Myers CL, Caudy AA, Hess DC, Troyanskaya OG. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction. Bioinformatics. 2009;25:2404–2410.
30. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000;407:651–654.
31. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799–804.
32. Li XY, MacArthur S, Bourgon R, Nix D, Pollard DA, Iyer VN, Hechmer A, Simirenko L, Stapleton M, Luengo Hendriks CL, Chu HC, Ogawa N, Inwood W, Sementchenko V, Beaton A, Weiszmann R, Celniker SE, Knowles DW, Gingeras T, Speed TP, Eisen MB, Biggin MD. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS.Biol. 2008;6:e27.
33. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC.Bioinformatics. 2006;7:113.
34. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proc.Natl.Acad.Sci.U.S.A. 2010;107:6286–6291.
35. Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J.Comput.Biol. 2009;16:229–239.
36. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7 Suppl 1:S7.
37. Meinshausen N, Buhlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
38. Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC.Syst.Biol. 2007;1:37.
39. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA, Sklyar N, Zhao M, Sarkans U, Brazma A. ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009;37:D868–D872.
40. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS.One. 2010;5:e9202.
41. Qiu P, Gentles AJ, Plevritis SK. Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Comput.Methods Programs Biomed. 2009;94:177–180.
42. Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nat.Genet. 2005;37 Suppl:S31–S37.
43. Richardson M, Domingos P. Markov logic networks. Machine Learning. 2006;62:107–136.
44. Statnikov A, Aliferis CF. Analysis and Computational Dissection of Molecular Signature Multiplicity. PLoS Computational Biology. 2010;6:e1000790.
45. Stolovitzky G, Monroe D, Califano A. Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann.N.Y.Acad.Sci. 2007;1115:1–22.
46. Stolovitzky G, Prill RJ, Califano A. Lessons from the DREAM2 Challenges. Ann.N.Y.Acad.Sci. 2009;1158:159–195.
47. Vapnik VN. Statistical learning theory. New York: Wiley; 1998.
48. Watkinson J, Liang KC, Wang X, Zheng T, Anastassiou D. Inference of regulatory gene interactions from expression data using three-way mutual information. Ann.N.Y.Acad.Sci. 2009;1158:302–313.