# Validation of Inference Procedures for Gene Regulatory Networks

^{*}Address correspondence to this author at the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA; E-mail: edward@ece.tamu.edu

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/) which permits unrestrictive use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

The availability of high-throughput genomic data has motivated the development of numerous algorithms to infer gene regulatory networks. The validity of an inference procedure must be evaluated relative to its ability to infer a model network close to the ground-truth network from which the data have been generated. The input to an inference algorithm is a sample set of data and its output is a network. Since input, output, and algorithm are mathematical structures, the validity of an inference algorithm is a mathematical issue. This paper formulates validation in terms of a semi-metric distance between two networks, or the distance between two structures of the same kind deduced from the networks, such as their steady-state distributions or regulatory graphs. The paper sets up the validation framework, provides examples of distance functions, and applies them to some discrete Markov network models. It also considers approximate validation methods based on data for which the generating network is not known, the kind of situation one faces when using real data.

**Key Words:** Epistemology, gene network, inference, validation.

## 1. INTRODUCTION

The construction of gene regulatory networks is among the most important problems in systems biology [1-2]. Network models provide quantitative knowledge concerning gene regulation and, from a translational perspective, they provide a basis for mathematical analyses leading to systems-based therapeutic strategies [3]. Network models run the gamut from coarse-grained discrete networks to the detailed description of stochastic differential equations. The availability of high-throughput genomic data has motivated the development of numerous inference algorithms. The performance, or validity, of these algorithms must be quantified. An inference algorithm takes a sample set of data as input and outputs a model network. Its validity must be evaluated relative to its ability to infer a model network close to the ground-truth network from which the data have been generated. The evaluation proceeds as follows: given a hypothetical model, generate data from the model, apply the inference procedure to construct an inferred model, and compare the hypothetical and inferred models *via* some objective function.

This paper mathematically formulates validation in terms of the distance between two networks, or the distance between two structures of the same kind deduced from the networks, such as their steady-state distributions. As a function from a sample set to a class of network models, an inference procedure is a mathematical operator and its performance must be evaluated within a mathematical framework, in this case, distance functions. The paper sets up the validation framework in general terms, provides examples of distance functions, and applies them to some basic network models. It also considers approximate validation methods based on data for which the generating network is not known, the situation one faces when using real data.

It is hoped that this paper will help to motivate the study of network validation procedures. We believe it describes the general setting and the basic requirements for validation; however, as will be pointed out, very little study has so far been devoted to network validation. There are many subtle statistical issues. If we are to judge the worth of proposed algorithms, then these issues need to be addressed within a formal mathematical framework.

## 2. BACKGROUND: NETWORK MODELS

Although our aim is to consider network inference from a fairly general perspective, to give concrete examples we require some specific models. Thus, we assume the underlying network structure is composed of a finite node (gene) set, *V* = {*X*_{1}, *X*_{2},…, *X*_{n}}, with each node taking discrete values in [0, *d* – 1]. The corresponding state space possesses *N* = *d*^{n} states, which we denote by **x**_{1}, **x**_{2},…, **x**_{N}. We express the state **x**_{j} in vector form by **x**_{j} = (*x*_{j1}, *x*_{j2},…, *x*_{jn}). For notational convenience we write vectors in row form but treat them as columns when multiplied by a matrix. The corresponding dynamical system is based on discrete time, *t* = 0, 1, 2,…, with the state-vector transition **X**(*t*) → **X**(*t* + 1) at each time instant. The state **X** = (*X*_{1}, *X*_{2},…, *X*_{n}) is often referred to as a gene activity profile (GAP).

### Markov Chains

We assume that the process **X**(*t*) is a *Markov chain*, meaning that the probability of **X**(*t*) conditioned on **X** at *t*_{1} < *t*_{2} < … < *t*_{s} < *t* is equal to the probability of **X**(*t*) conditioned on **X**(*t*_{s}). We also assume the process is *homogeneous*, meaning that the transition probabilities depend only on the time difference, that is, for any *t* and *u*, the *u*-step transition probability,

$$p_{jk}(u)=P\left(\mathbf{X}(t+u)=\mathbf{x}_{k}\mid \mathbf{X}(t)=\mathbf{x}_{j}\right),$$

depends only on *u*. We are not asserting that the Markov property and homogeneity are necessary assumptions for gene regulatory networks. We make these assumptions to facilitate mathematically tractable modeling for the current study. Under these assumptions, we need only consider the one-step transition probability matrix,

$$\mathbf{P}=\begin{pmatrix}p_{11} & p_{12} & \cdots & p_{1N}\\ p_{21} & p_{22} & \cdots & p_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ p_{N1} & p_{N2} & \cdots & p_{NN}\end{pmatrix},$$

where the one-step transition probability, *p*_{jk}, is given by *p*_{jk} = *p*_{jk}(1). We refer to **P** simply as the *transition probability matrix*. For *t* = 0, 1, 2,..., the state probability structure of the network is given by the *t*-state probability vector

$$\mathbf{p}(t)=\left(p_{1}(t),p_{2}(t),\ldots,p_{N}(t)\right),$$

where *p*_{j}(*t*) = *P*(**X**(*t*) = **x**_{j}). **p**(0) is the initial-state probability vector.

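Besides the state one-step probabilities, we can consider the gene one-step probabilities,

$$P\left(X_{i}(t+1)=x\mid \mathbf{X}(t)=\mathbf{x}_{j}\right).$$

If, given the GAP at *t*, the conditional probabilities of the genes are independent, then

$$p_{jk}=\prod_{i=1}^{n}P\left(X_{i}(t+1)=x_{ki}\mid \mathbf{X}(t)=\mathbf{x}_{j}\right).$$

Suppose that gene *X*_{i} at time *t* + 1 depends only on values of genes (predictors) in a regulatory set, $R_{i}\subset V$, at time *t*, the dependency being independent of *t*. Then the gene one-step probabilities are given by

$$P\left(X_{i}(t+1)=x\mid \mathbf{X}(t)=\mathbf{x}_{j}\right)=P\left(X_{i}(t+1)=x\mid X_{l}(t)=x_{jl}\text{ for all }X_{l}\in R_{i}\right).$$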

In this form, we see that the Markov dependencies are restricted to regulatory genes.

The network has a *regulatory graph* consisting of the *n* genes and a directed edge from gene *X*_{i} to gene *X*_{j} if $X_{i}\in R_{j}$.
There is also a *state-transition graph* whose nodes are the *N* state vectors. There is a directed edge from state **x**_{j} to state **x**_{k} if and only if **x**_{j} = **X**(*t*) implies **x**_{k} = **X**(*t* + 1).

A homogeneous, discrete-time Markov chain with state space {**x**_{1}, **x**_{2},…, **x**_{N}} possesses a steady-state distribution (π_{1}, π_{2},…, π_{N}) if, for all pairs of states **x**_{k} and **x**_{j}, *p*_{jk}(*u*) → π_{k} as *u* → ∞. If there exists a steady-state distribution, then, regardless of the initial state **x**_{j}, the probability of the Markov chain being in state **x**_{k} in the long run is π_{k}. In particular, for any initial distribution **p**(0), *p*_{k}(*t*) → π_{k} as *t* → ∞. Not all Markov chains possess steady-state distributions.

### Rule-Based Networks

A basic type of regulatory model occurs when the transition **X**(*t*) → **X**(*t* + 1) is governed by a rule-based structure, meaning there exists a state function **f** = (*f*_{1}, *f*_{2},…, *f*_{n}) such that *X*_{i}(*t* + 1) = *f*_{i}(*R*_{i}(*t*)). A classical example of a rule-based network is a Boolean network (BN), where the values are binary, 0 or 1, and the function *f*_{i} can be defined *via* a logic expression or a truth table consisting of 2^{n} rows, with each row assigning a 0 or 1 as the value for the GAP defined by the row [4-5]. As defined, the BN is deterministic and the entries in its transition probability matrix are either 0 or 1. The *connectivity* of the BN is the maximum number of predictors for a gene. If each gene has the same number of predictors, then we say that the network has uniform connectivity.

The model becomes stochastic if the BN is subject to perturbation, meaning that at any time point, instead of necessarily being governed by the state function **f** = (*f*_{1}, *f*_{2},…, *f*_{n}), there is a positive probability *p* < 1 that the GAP may randomly switch to another GAP. There are more refined ways of characterizing perturbations, such as defining perturbations at the gene level rather than the state level, but state-level perturbation is easy to describe and is sufficient for our purposes here. For a BN with perturbation, the corresponding Markov chain possesses a steady-state distribution.
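To make the link between a perturbed BN and its Markov chain concrete, the following minimal sketch builds the transition probability matrix of a small Boolean network with state-level perturbation and approximates its steady-state distribution by repeated multiplication. The function names, the two-gene example rule, and the choice that a perturbation moves the GAP to a uniformly selected different state are assumptions made for illustration, not part of the model definition above.

```python
from itertools import product

def bn_transition_matrix(rules, n, p):
    """Transition matrix of an n-gene Boolean network with perturbation.

    rules maps a state tuple to the next state tuple.
    With probability p the state jumps to a uniformly chosen *different*
    state (an assumption of this sketch); otherwise the rules are applied.
    """
    states = list(product((0, 1), repeat=n))
    index = {s: j for j, s in enumerate(states)}
    size = len(states)
    P = [[0.0] * size for _ in range(size)]
    for j, s in enumerate(states):
        for k in range(size):
            if k != j:
                P[j][k] = p / (size - 1)
        P[j][index[rules(s)]] += 1.0 - p
    return states, P

def steady_state(P, iterations=2000):
    """Approximate the steady-state distribution by iterating pi <- pi P."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iterations):
        pi = [sum(pi[j] * P[j][k] for j in range(size)) for k in range(size)]
    return pi

# Hypothetical two-gene network: X1' = X1 AND X2, X2' = X1 OR X2.
states, P = bn_transition_matrix(lambda s: (s[0] & s[1], s[0] | s[1]), 2, 0.1)
pi = steady_state(P)
for s, mass in zip(states, pi):
    print(s, round(mass, 4))
```

In this example the state (1, 0) is not an attractor of the unperturbed network, so it can only be entered through perturbation and receives the smallest steady-state mass.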

The long-run behavior of a deterministic BN depends on the initial state and the network will eventually settle down and cycle endlessly through a set of states called an *attractor cycle*. The set of all initial states that reach a particular attractor cycle forms the *basin of attraction* for the cycle. Attractor cycles are disjoint. With perturbation, in the long run the network may randomly escape an attractor cycle, be reinitialized, and then begin its transition process anew.

## 3. QUANTIFYING THE DIFFERENCE BETWEEN NETWORKS

### Distance Functions

To discuss validity, we must first discuss the manner in which we are to compare two networks. Given networks *H* and *M*, we need a function, µ(*M*, *H*), quantifying the difference between them. We require that µ be a *semi-metric*, meaning that it satisfies the following four properties:

- $\mu(M,H)\ge 0,$
- $\mu(M,M)=0,$
- $\mu(M,H)=\mu(H,M)$ [*symmetry*],
- $\mu(M,H)\le \mu(M,N)+\mu(N,H)$ [*triangle inequality*].

As a semi-metric, µ is called a *distance function*. If µ should satisfy a fifth condition,

- $\mu(M,H)=0\Rightarrow M=H,$

then it is a *metric*. A distance function is often defined in terms of some characteristic, by which we mean some structure associated with a network, such as its regulatory graph, steady-state distribution, or probability transition matrix. This is why we do not require the fifth condition, $\mu(M,H)=0\Rightarrow M=H$, for a network distance function.

If we want to approximate one network by another, say for reasons of computational complexity, then a distance function can be used to measure the goodness of the approximation. If *M*_{1} and *M*_{2} are two approximations of network *H*, then *M*_{1} is a better approximation than *M*_{2} relative to µ if $\mu(M_{1},H)<\mu(M_{2},H)$.

Because a network distance function need only be a semi-metric, one must be careful in applying propositions from the theory of metric spaces. For instance, in a metric space, if a sequence of points in the space is convergent, then the limit of the sequence is unique. When the points are networks, this is not necessarily true. A sequence of networks can converge to two distinct networks: {*H*_{i}} can converge to both *M* and *N*, with $M\ne N$.

### Rule-Based Distance

For Boolean networks (with or without perturbation) possessing the same gene set, a distance is given by the proportion of incorrect rows in the function-defining truth tables. Denoting the state functions for networks *H* and *M* by **f** = (*f*_{1}, *f*_{2},…, *f*_{n}) and **g** = (*g*_{1}, *g*_{2},…, *g*_{n}), respectively, since there are *n* truth tables consisting of 2^{n} rows each, this distance is given by

$$\mu_{fun}(M,H)=\frac{1}{n2^{n}}\sum_{i=1}^{n}\sum_{k=1}^{2^{n}}I\left[g_{i}(\mathbf{x}_{k})\ne f_{i}(\mathbf{x}_{k})\right], \tag{7}$$

where *I* denotes the indicator function, *I*[*A*] = 1 if *A* is a true statement and *I*[*A*] = 0 otherwise [6]. If we wish to give more weight to those states more likely to be observed in the steady state, then we can weight the inner sums in Eq. 7 by the corresponding terms in the steady-state distribution, **π** = (π_{1}, π_{2},…, π_{N}). For Boolean networks without perturbation, µ_{fun} is a metric. If there is perturbation, then µ_{fun} is not a metric because two distinct networks may be identical with regard to the rules but possess different perturbation probabilities.
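As an illustration of the rule-based distance, the sketch below counts the truth-table rows on which two Boolean state functions disagree and divides by the total number of rows, *n*·2^{n}. The function name `mu_fun` and the two example networks are hypothetical; the unweighted version is shown, without the steady-state weighting mentioned above.

```python
from itertools import product

def mu_fun(f, g, n):
    """Proportion of truth-table rows on which state functions f and g differ.

    f and g map a state tuple of length n to the next state tuple.
    """
    states = list(product((0, 1), repeat=n))
    mismatches = sum(
        1
        for x in states
        for fi, gi in zip(f(x), g(x))
        if fi != gi
    )
    return mismatches / (n * len(states))

# Hypothetical two-gene networks that differ only in the rule for X2.
f = lambda x: (x[0] & x[1], x[0] | x[1])   # X2' = X1 OR X2
g = lambda x: (x[0] & x[1], x[0] ^ x[1])   # X2' = X1 XOR X2
print(mu_fun(g, f, 2))
```

OR and XOR disagree only on the row (1, 1), so one of the eight truth-table rows is incorrect.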

### Topology-Based Distance

If one’s focus is on the topology of a network, then a straightforward approach is to construct the adjacency matrix. Given an *n*-gene network, for *i*, *j* = 1, 2,…, *n*, the (*i*, *j*) entry in the matrix is 1 if there is a directed edge from the *i*th to the *j*th gene; otherwise, the (*i*, *j*) entry is 0. If **A** = (*a*_{ij}) and **B** = (*b*_{ij}) are the adjacency matrices for networks *H* and *M*, respectively, where *H* and *M* possess the same gene set, then the *hamming* distance between the networks is defined by

$$\mu_{ham}(M,H)=\sum_{i=1}^{n}\sum_{j=1}^{n}\left|a_{ij}-b_{ij}\right|. \tag{8}$$

Alternatively, the hamming distance may be computed by normalizing the sum, such as by the number of genes or the number of edges in one of the networks, for instance, when one of the networks is considered as representing ground truth. The hamming distance is a coarse measure since it contains no steady-state or dynamic information. Two networks can be very different and yet have $\mu_{ham}(M,H)=0$.

If one of the networks in Eq. 8 is considered as ground truth, then the hamming distance can be reformulated in terms of the numbers of false-negative and false-positive edges. If *H* is the ground-truth network, then a false-negative edge is a directed edge not in *M* that is in *H* and a false-positive edge is a directed edge in *M* that is not in *H*. Letting *FN* and *FP* be the numbers of false-negative and false-positive edges, respectively, the hamming distance is given by *FN* + *FP*. Because we are considering directed graphs, an incorrectly oriented edge in *M* between two genes is both a false-negative and false-positive edge, although one can slightly alter the definitions to avoid this kind of double counting. If we were to consider undirected graphs, then this anomaly would not occur because an edge would either be present or absent. In this case, the hamming distance is still defined by Eq. 8 but the adjacency matrix is symmetric.
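The relation between the hamming distance and the edge-error counts can be checked on a small case. The sketch below uses hypothetical function names and a hypothetical three-gene example in which the inferred graph reverses one edge of the ground truth, so the reversed edge is counted once as a false negative and once as a false positive.

```python
def mu_ham(B, A):
    """Unnormalized hamming distance between two 0/1 adjacency matrices."""
    return sum(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def edge_errors(truth, inferred):
    """Return (FN, FP) relative to the ground-truth adjacency matrix."""
    fn = fp = 0
    for row_t, row_m in zip(truth, inferred):
        for a, b in zip(row_t, row_m):
            if a == 1 and b == 0:
                fn += 1      # edge in the truth, missing from the inferred graph
            elif a == 0 and b == 1:
                fp += 1      # edge inferred but absent from the truth
    return fn, fp

# Ground truth H: X1 -> X2, X2 -> X3.  Inferred M: X2 -> X1, X2 -> X3.
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
B = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]
fn, fp = edge_errors(A, B)
print(fn, fp, mu_ham(B, A))
```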

Since our interest is measuring the closeness of an inferred network to the network generating the data, we concentrate on distance functions, in particular, the hamming distance, which has been used for this purpose [7, 8]. Non-distance measures related to the hamming distance have been used in the context of regulatory graphs. Again let *H* denote the ground-truth network. A true-positive edge is a directed edge in both *H* and *M*, and a true-negative edge is a directed edge in neither *H* nor *M* (with analogous definitions holding for undirected graphs). Let *TP* and *TN* be the numbers of true-positive and true-negative edges, respectively. The *positive predictive value* is defined by *TP*/(*TP* + *FP*), the *sensitivity* is defined by *TP*/(*TP* + *FN*), and the *specificity* is defined by *TN*/(*TN* + *FP*). These kinds of measures have been used in several regulatory-graph inference papers [8-12] and a study using these measures has been performed to evaluate a number of inference procedures [13].
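These ratios can be computed directly from two adjacency matrices, as in the following sketch. The function names are hypothetical, all *n*^{2} ordered gene pairs (including self-loops) are counted as potential edges, and a ratio with an empty denominator is reported as NaN; both conventions are assumptions of the sketch.

```python
def edge_counts(truth, inferred):
    """Count TP, FP, FN, TN over all ordered gene pairs."""
    tp = fp = fn = tn = 0
    for row_t, row_m in zip(truth, inferred):
        for a, b in zip(row_t, row_m):
            if a and b:
                tp += 1
            elif b:
                fp += 1
            elif a:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def ratio(num, den):
    # Undefined ratios (empty denominator) are reported as NaN.
    return num / den if den else float("nan")

def graph_scores(truth, inferred):
    tp, fp, fn, tn = edge_counts(truth, inferred)
    return {
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

# Ground truth: X1 -> X2, X2 -> X3.  Inferred: X2 -> X1, X2 -> X3.
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
B = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]
print(graph_scores(A, B))
```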

### Transition-Probability-Based Distance

Distances for Markov networks can be defined *via* their probability transition matrices by considering matrix norms. A *norm* is a function $\Vert \cdot \Vert$ on a linear (vector) space, *L*, such that:

- $\Vert v\Vert \ge 0,$
- $\Vert v\Vert =0\Rightarrow v=0,$
- $\Vert av\Vert =\left|a\right|\cdot \Vert v\Vert$ [*homogeneity*],
- $\Vert v+w\Vert \le \Vert v\Vert +\Vert w\Vert$ [*triangle inequality*].

Given a norm on *L*, a metric is defined on *L* by $\Vert v-w\Vert$.

For an *n* x *n* matrix $P=(p_{ij})$ and *r* ≥ 1, the *r*-norm is defined by

$$\Vert P\Vert_{r}=\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\left|p_{ij}\right|^{r}\right)^{1/r}.$$

The *supremum* norm is defined by

$$\Vert P\Vert_{\infty}=\max_{i,j}\left|p_{ij}\right|.$$

These norms are well-studied in linear algebra. Each yields a metric defined by $\Vert P-Q\Vert_{r}$. If $P=(p_{ij})$ and $Q=(q_{ij})$ are the probability transition matrices for networks *H* and *M*, respectively, then a network distance function is defined by

$$\mu_{prob}^{r}(M,H)=\Vert P-Q\Vert_{r}.$$

Whereas $\Vert \cdot \Vert_{r}$ defines a matrix metric, $\mu_{prob}^{r}$ is only a network semi-metric because two distinct networks may have the same transition probability matrix.
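A transition-probability-based distance can be sketched as follows, assuming the entrywise form of the *r*-norm (the sum of the *r*th powers of the absolute entries, raised to the power 1/*r*) and the entrywise maximum for the supremum norm. The function names and the two-state matrices are hypothetical.

```python
import math

def entrywise_norm(M, r=1.0):
    """Entrywise r-norm of a matrix given as a list of rows (r >= 1 or math.inf)."""
    entries = [abs(x) for row in M for x in row]
    if math.isinf(r):
        return max(entries)
    return sum(x ** r for x in entries) ** (1.0 / r)

def mu_prob(P, Q, r=1.0):
    """Norm of the difference of two transition probability matrices."""
    diff = [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]
    return entrywise_norm(diff, r)

# Hypothetical two-state chains that differ only in their first row.
P = [[0.9, 0.1], [0.2, 0.8]]
Q = [[0.8, 0.2], [0.2, 0.8]]
for r in (1.0, 2.0, math.inf):
    print(r, mu_prob(Q, P, r))
```

A probability vector can be passed as a one-row matrix, so the same function also measures the difference between two distributions on the state space.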

### Long-Run Distance

Since steady-state behavior is of particular interest, for instance, being associated with phenotypes, a natural choice for a network distance is to measure the difference between steady-state distributions [14]. If **π** = (π_{1}, π_{2},…, π_{N}) is a probability vector, then its *r*-norm is defined by

$$\Vert \boldsymbol{\pi}\Vert_{r}=\left(\sum_{k=1}^{N}\left|\pi_{k}\right|^{r}\right)^{1/r}$$

for *r* ≥ 1, and its supremum norm is defined by

$$\Vert \boldsymbol{\pi}\Vert_{\infty}=\max_{1\le k\le N}\left|\pi_{k}\right|.$$

If **π** = (π_{1}, π_{2},…, π_{N}) and **ω** = (ω_{1}, ω_{2},…, ω_{N}) are the steady-state distributions for networks *H* and *M*, respectively, then a network distance is defined by

$$\mu_{stead}^{r}(M,H)=\Vert \boldsymbol{\pi}-\boldsymbol{\omega}\Vert_{r}.$$

Other norms can be used to define the distance function.

Not all networks possess steady-state distributions. The long-run behavior of a deterministic rule-based network, such as a Boolean network, depends on the initial state. A rule-based finite-value network possesses attractor cycles that characterize its long-run behavior and we can consider comparing this long-run behavior. This can be done by considering the proportion of time spent in a state once an attractor cycle has been entered. For any initial state **x**_{k}, the network eventually enters the attractor cycle, *C*_{k}, whose basin contains **x**_{k}. An arbitrary state **x**_{j} either lies in *C*_{k} or it does not. Let *m*_{k} denote the number of states in *C*_{k} and *p*_{k} be the probability that the initial state is **x**_{k}. We define the long-run probability of **x**_{j} by

$$\zeta_{j}=\sum_{k=1}^{N}\frac{p_{k}\,I\left[\mathbf{x}_{j}\in C_{k}\right]}{m_{k}}. \tag{15}$$

Letting **ζ** = (ζ_{1}, ζ_{2},…, ζ_{N}), we can proceed analogously to the steady-state case by replacing **π** by **ζ** to define the *r*-norm, and then define the distance function $\mu_{long}^{r}(M,H)$ in the usual way.

Suppose all attractor cycles are singletons, so that *m*_{k} = 1. Moreover, suppose we do not know the initial-state probabilities and we set *p*_{k} = 1/*N*. If **x**_{k} is an attractor, let *b*_{k} denote the number of states in its basin; if **x**_{k} is not an attractor, let *b*_{k} = 0. Then Eq. 15 reduces to ζ_{j} = *b*_{j}/*N*. To this point, **ζ** = (ζ_{1}, ζ_{2},…, ζ_{N}) describes a probability density because its components sum to 1. Now suppose we ignore the basin sizes so that ζ_{j} = 1/*N* if **x**_{j} is an attractor and ζ_{j} = 0 otherwise. If **ζ** = (ζ_{1}, ζ_{2},…, ζ_{N}) and **ξ** = (ξ_{1}, ξ_{2},…, ξ_{N}) correspond to networks *H* and *M*, respectively, then the network distance induced by the 1-norm is given by

$$\mu_{att}(M,H)=\sum_{j=1}^{N}\left|\zeta_{j}-\xi_{j}\right|=\frac{\left|A_{H}\,\Delta\,A_{M}\right|}{N},$$

where *A*_{H} and *A*_{M} are the attractor sets for *H* and *M*, respectively,

$$A_{H}\,\Delta\,A_{M}=\left(A_{H}\setminus A_{M}\right)\cup\left(A_{M}\setminus A_{H}\right)$$

is the symmetric difference of *A*_{H} and *A*_{M}, and $\left|\cdot\right|$ denotes the number of elements in a set. The distance μ_{att}(*M*, *H*) compares the attractor sets of the two networks. μ_{att}(*M*, *H*) = 0 if and only if the attractor sets are the same. We have derived μ_{att}(*M*, *H*) from μ_{long}(*M*, *H*) assuming only singleton attractors, but μ_{att}(*M*, *H*) can be applied to any rule-based discrete network.
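The attractor-set comparison can be sketched for small deterministic Boolean networks by following every initial state until a state repeats. The function names and the two example networks are hypothetical, and exhaustive enumeration of the 2^{n} states is practical only for small *n*.

```python
from itertools import product

def attractor_states(rules, n):
    """Set of states lying on attractor cycles of a deterministic Boolean network."""
    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = []
        x = start
        while x not in seen:
            seen.append(x)
            x = rules(x)
        # x is the first repeated state, so the cycle is the tail starting at x.
        attractors.update(seen[seen.index(x):])
    return attractors

def mu_att(rules_m, rules_h, n):
    """Size of the symmetric difference of the attractor sets, divided by N = 2**n."""
    a_m = attractor_states(rules_m, n)
    a_h = attractor_states(rules_h, n)
    return len(a_m ^ a_h) / 2 ** n

# Hypothetical two-gene networks.
rules_h = lambda x: (x[0] & x[1], x[0] | x[1])   # attractors: 00, 01, 11
rules_m = lambda x: (x[0] & x[1], x[0] ^ x[1])   # attractors: 00, 01
print(mu_att(rules_m, rules_h, 2))
```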

### Trajectory-Based Distance

Continuing with rule-based finite-value networks, rather than simply focusing on the long-run probabilities, one can take a more refined perspective by considering differences in the trajectories. Continue to let *m*_{k} denote the number of states in the cycle *C*_{k} for initial state **x**_{k} and let *t*_{k} be the time it takes **x**_{k} to reach *C*_{k}. The time trajectory of the network is given by **X**(*t*) = (*X*_{1}(*t*), *X*_{2}(*t*),…, *X*_{n}(*t*)). For a given initial state this trajectory is deterministic. For initial state **x**_{k}, denote the trajectory by

$$\mathbf{x}^{(k)}(t)=\left(x_{1}^{(k)}(t),x_{2}^{(k)}(t),\ldots,x_{n}^{(k)}(t)\right).$$

Given the initial state is **x**_{k}, we define the *amplitude cumulative distribution* of gene *X*_{i} by

$$F_{i}(z\mid k)=\frac{1}{m_{k}}\sum_{t=t_{k}}^{t_{k}+m_{k}-1}I\left[x_{i}^{(k)}(t)\le z\right].$$

This increasing function of *z* counts the fraction of time that $x_{i}^{(k)}(t)\le z$ in the cycle *C*_{k}.

Given two attractor cycles, *C*_{k} and *C*_{j}, resulting from initializations **x**_{k} and **x**_{j}, respectively, we define a distance between the cycles relative to gene *X*_{i} using the amplitude cumulative distributions, $F_{i}(\cdot \mid k)$ and $F_{i}(\cdot \mid j)$, by

$$\delta_{i}(C_{k},C_{j})=\Vert F_{i}(\cdot \mid k)-F_{i}(\cdot \mid j)\Vert$$

for some function norm $\Vert \cdot \Vert$. For example, we could use the *L*_{1} norm

$$\delta_{i}(C_{k},C_{j})=\int_{-\infty}^{\infty}\left|F_{i}(z\mid k)-F_{i}(z\mid j)\right|\,dz.$$

The *L*_{1} norm possesses an interesting interpretation if gene *X*_{i} has constant amplitude values, *a* and *b*, on cycles *C*_{k} and *C*_{j}, respectively. In this case, $F_{i}(\cdot \mid k)$ and $F_{i}(\cdot \mid j)$ are unit step functions with steps at *a* and *b*, respectively. Hence, in this case the *L*_{1} norm reduces to

$$\delta_{i}(C_{k},C_{j})=\left|a-b\right|$$

and gives the distance, in amplitude, between the values of gene *X*_{i} on the two cycles. For a Boolean network, $\delta_{i}(C_{k},C_{j})=0$ if the gene is ON on both cycles or OFF on both cycles, and $\delta_{i}(C_{k},C_{j})=1$ if *X*_{i} is ON for one cycle and OFF for the other (assuming *X*_{i} is constant on both cycles).

Considering the full set of genes, we define a distance between two attractor cycles, *C*_{k} and *C*_{j}, by

Now consider two networks, *M* and *H*, having the same genes. We define the distance between *M* and *H* as the expected distance between attractor cycles over all possible initial states:

$$\mu_{traj}(M,H)=\sum_{k=1}^{N}p_{k}\,\delta\left(C_{k}^{M},C_{k}^{H}\right),$$

where $C_{k}^{M}$ and $C_{k}^{H}$ are the attractor cycles corresponding to initialization by state **x**_{k} in networks *M* and *H*, respectively [15].
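A trajectory-based comparison can be sketched for Boolean networks as follows. For a binary gene the amplitude cumulative distribution is determined by the fraction of cycle time the gene is OFF, so the *L*_{1} distance between two such distributions is the absolute difference of those fractions. The function names, the summation of the gene-wise distances to obtain a cycle distance, and the uniform weighting of initial states are assumptions of this sketch; the cited work [15] should be consulted for the exact definitions.

```python
from itertools import product

def attractor_cycle(rules, start):
    """List of states on the attractor cycle reached from the given initial state."""
    seen = []
    x = start
    while x not in seen:
        seen.append(x)
        x = rules(x)
    return seen[seen.index(x):]

def cycle_distance(cycle_a, cycle_b, n):
    """Sum over genes of the L1 distance between amplitude CDFs (binary genes)."""
    total = 0.0
    for i in range(n):
        off_a = sum(1 for x in cycle_a if x[i] == 0) / len(cycle_a)  # F_i(0 | a)
        off_b = sum(1 for x in cycle_b if x[i] == 0) / len(cycle_b)  # F_i(0 | b)
        total += abs(off_a - off_b)
    return total

def mu_traj(rules_m, rules_h, n):
    """Average cycle distance over all initial states, each weighted 1/N."""
    states = list(product((0, 1), repeat=n))
    return sum(
        cycle_distance(attractor_cycle(rules_m, x), attractor_cycle(rules_h, x), n)
        for x in states
    ) / len(states)

# Hypothetical two-gene networks.
rules_h = lambda x: (x[0] & x[1], x[0] | x[1])
rules_m = lambda x: (x[0] & x[1], x[0] ^ x[1])
print(mu_traj(rules_m, rules_h, 2))
```

In this example the two networks reach different attractors only from the initial state (1, 1), where gene *X*_{1} is ON in one network and OFF in the other.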

### Equivalence Classes of Networks

The previous examples of network distance functions demonstrate a common scenario: a network semi-metric is defined by a metric on some network characteristic, for instance, its regulatory graph, its transition probability matrix, etc. The metric requirement, $\mu(M,H)=0\Rightarrow M=H$, fails because distinct networks may possess the same characteristic. To formalize the situation, let λ_{M} and λ_{H} denote the characteristic λ corresponding to networks *M* and *H*, respectively. If ν is a metric on a space of characteristics (directed graphs, matrices, probability densities, etc.), then a semi-metric μ_{ν} is induced on the network space according to

$$\mu_{\nu}(M,H)=\nu\left(\lambda_{M},\lambda_{H}\right).$$

This is quite natural if our main interest is with the characteristic, not the specific network itself.

Focus on network characteristics leads to the identification of networks possessing the same characteristic. Given any set, *U*, a relation ~ between elements of *U* is called an *equivalence relation* if it satisfies the following three properties for $a,b,c\in U$:

- $a\sim a$ [*reflexivity*],
- $a\sim b\Rightarrow b\sim a$ [*symmetry*],
- $a\sim b$ and $b\sim c\Rightarrow a\sim c$ [*transitivity*].

If *a* ~ *b*, then *a* and *b* are said to be *equivalent*. An equivalence relation on *U* induces a partition of *U*. The subsets forming the partition are defined by the rule that *a* and *b* lie in the same subset if and only if *a* ~ *b*. The subsets are called *equivalence classes*. The equivalence class of elements equivalent to *a* is denoted by [*a*]^{~}. According to the definitions, $[a]^{\sim}=[b]^{\sim}$ if and only if *a* ~ *b*.

If ν is a semi-metric on a set *U* and we define *a* ~ *b* if and only if ν(*a*, *b*) = 0, then

$$\mu\left([a]^{\sim},[b]^{\sim}\right)=\nu(a,b)$$

defines a metric on the space of equivalence classes because $\mu\left([a]^{\sim},[b]^{\sim}\right)=0\iff \nu(a,b)=0\iff a\sim b\iff [a]^{\sim}=[b]^{\sim}$.

If we define *M* ~ *H* if λ_{M} = λ_{H}, then this is a network equivalence relation. If we focus on equivalence classes of networks rather than the networks themselves, we are in effect identifying equivalent networks. For instance, if we are only interested in steady-state distributions, then it may be advantageous to identify networks possessing the same steady-state distribution.

## 4. INFERENCE PERFORMANCE

An inference procedure operates on data generated by a network *H* and constructs an inferred network *M* to serve as an estimate of *H*, or it constructs a characteristic to serve as an estimate of the corresponding characteristic of *H*. For instance, the data may be used to infer a distribution that estimates the steady-state distribution of *H*. The data could be dynamical, consisting of time-course observations, or they might be taken from the steady state, as with microarray measurements assumed to come from the steady state of some phenotypic class. In the latter case, it makes sense to consider inference accuracy relative to the steady-state distribution of *H*, rather than *H* itself. For full network inference, the inference procedure is a mathematical operation, a mapping from a space of samples to a space of networks, and it must be evaluated as such. There is a generated data set *S* and the inference procedure is of the form ψ(*S*) = *M*. If a characteristic is being estimated, then ψ(*S*) is a characteristic, for instance, ψ(*S*) = *F*, a probability distribution.

### Measuring Inference Performance Using Distance Functions

Focusing on full network inference, the goodness of an inference procedure ψ is measured relative to some distance, μ, specifically, $\mu(M,H)=\mu(\psi(S),H)$, which is a function of the sample *S*. In fact, *S* is a realization of a random set process, Σ, governing data generation from *H*. In general, there is no assumption on the nature of Σ. It might be directly generated by *H* or it might result from directly generated data corrupted by noise of some sort. $\mu(\psi(\Sigma),H)$ is a random variable and the performance of ψ is characterized by the distribution of $\mu(\psi(\Sigma),H)$, which depends on the distribution of Σ. The salient statistic regarding the distribution of $\mu(\psi(\Sigma),H)$ is its mean, $E_{\Sigma}\left[\mu(\psi(\Sigma),H)\right]$, where the expectation is taken with respect to Σ.

Rather than considering a single network, we can consider a distribution, **H**, of random networks, where, by definition, the occurrences of realizations *H* of **H** are governed by a probability distribution. This is precisely the situation with regard to the classical study of random Boolean networks. Averaging over the class of random networks, our interest focuses on

$$\mu^{\ast}(\mathbf{H},\Sigma,\psi)=E_{\mathbf{H}}\left[E_{\Sigma}\left[\mu(\psi(\Sigma),H)\right]\right].$$

It is natural to define the inference procedure ψ_{1} to be better than the inference procedure ψ_{2} relative to the distance μ, the random network **H**, and the sampling procedure Σ if

$$\mu^{\ast}(\mathbf{H},\Sigma,\psi_{1})<\mu^{\ast}(\mathbf{H},\Sigma,\psi_{2}).$$

Whether an inference procedure is “good” is not only relative to the distance function, it is relative to how one views the value of the expected distance. Indeed, it is not really possible to determine an absolute notion of goodness.

In practice, the expectation is estimated by an average,

$$\frac{1}{m}\sum_{j=1}^{m}\mu\left(\psi(S_{j}),H_{j}\right),$$

where *S*_{1}, *S*_{2},..., *S*_{m} are sample point sets generated according to Σ from networks *H*_{1}, *H*_{2},…, *H*_{m} randomly chosen from **H**.

The preceding analysis applies virtually unchanged when a characteristic is being estimated. One need only replace *H* and **H** by λ and Λ, where λ and Λ are a characteristic and a random characteristic, respectively, and replace the network distance μ by the characteristic distance.
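The averaging procedure can be sketched generically. In the toy setting below, which is an assumption made purely for illustration, the quantity being estimated is a characteristic: a probability distribution on four states, drawn at random, sampled independently, estimated by empirical frequencies, and compared using the 1-norm. All function names are hypothetical.

```python
import random

def estimate_expected_distance(draw_truth, draw_sample, infer, distance, m, rng):
    """Average of distance(infer(S_j), truth_j) over m independently drawn pairs."""
    total = 0.0
    for _ in range(m):
        truth = draw_truth(rng)
        sample = draw_sample(truth, rng)
        total += distance(infer(sample), truth)
    return total / m

N_STATES = 4  # size of the toy state space

def draw_distribution(rng):
    """Random characteristic: a probability vector on N_STATES states."""
    weights = [rng.random() + 0.1 for _ in range(N_STATES)]
    total = sum(weights)
    return [w / total for w in weights]

def make_sampler(size):
    """Sampling procedure: 'size' independent draws from the distribution."""
    def draw_sample(pi, rng):
        return rng.choices(range(N_STATES), weights=pi, k=size)
    return draw_sample

def empirical(sample):
    """Inference procedure: empirical state frequencies."""
    return [sample.count(k) / len(sample) for k in range(N_STATES)]

def one_norm(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
for size in (20, 2000):
    est = estimate_expected_distance(
        draw_distribution, make_sampler(size), empirical, one_norm, 200, rng
    )
    print(size, round(est, 3))
```

Passing a different inference function with the same sampler and distance gives a direct numerical comparison of two procedures in the sense described above.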

We next present three examples using previously introduced distance functions to measure inference performance. Algorithm descriptions are kept brief in order to avoid long digressions from the issue of distance illustration. We defer to the cited literature for details.

#### Example 1.

The Boolean network model has been in existence for a long time and various inference procedures have been proposed [16-18]. One proposed method for Boolean networks with perturbation is based on the observation of a single dynamic realization of the network [6]. This method will be discussed in some detail in Section 5 in regard to consistent inference; for now, we are only concerned with the distance between the inferred network and the original network generating the data, where the distance function is given by µ_{fun}(*M*, *H*) in Eq. 7. Fig. (**1**) shows the average (in percentage) of the distance function using 80 data sequences generated from 16 randomly generated Boolean networks with 7 genes, perturbation probability *p* = 0.01, uniform connectivity *k* = 2 or *k* = 3, and data sequence lengths varying from 500 to 40,000. The reduction in performance from connectivity 2 to connectivity 3 is not surprising because the number of truth-table lines jumps dramatically.

#### Example 2.

There have been a number of papers addressing the inference of connectivity graphs using information-theoretic approaches [9, 10, 19]. In a study proposing the use of the minimum description length (MDL) principle to infer regulatory graphs [8], the hamming distance was used to compare the performance of the newly proposed algorithm with an earlier information-theoretic algorithm, called REVEAL [9]. Fig. (**2**) compares the hamming distances between the inferred networks and the corresponding synthetic networks that generated the data relative to increasing sample size. It does so for the REVEAL algorithm and the MDL algorithm using three different settings for a user-defined parameter. The performance measures are obtained by averaging over 30 randomly generated networks, each containing 20 genes and 30 edges, with the distance function being normalized over 30, the number of edges in the synthetic networks.

#### Example 3.

A probabilistic Boolean network (PBN) is a network defined as a collection of discrete-valued networks such that at any point in time one of the constituent networks is controlling the network dynamics [20]. In a context-sensitive PBN there is a binary random variable determining whether there should be a switch of constituent networks at that time point, the modeling assumption being that there are latent variables outside the model network whose changes induce stochasticity into the PBN [21]. Typically, there is also a probability of permutation. This example considers a Bayesian connectivity-based inference procedure for designing PBNs from steady-state data [22]. A synthetic PBN, *H*, composed of two constituent Boolean networks is used to generate a random sample of size 60 from its steady-state distribution and the inference procedure is used to construct a designed PBN composed of ten constituent PBNs (note that the inference procedure does not have input relating to the number of constituent BNs of the generating network). According to definition, the attractors of a PBN are the attractors of its constituent BNs. *H* has six singleton attractors, two of which, call them **x**_{a} and **x**_{b}, contain 0.99 of the steady-state mass. The designed PBN has more attractors, which is not uncommon, but **x**_{a} and **x**_{b} appear in all ten constituent networks as singleton attractors and they contain 0.78 of the steady-state mass. Since, for PBNs with low probability of network switching almost all of the steady-state mass lies in its attractors [21],
${\mathrm{\mu}}_{\mathit{stead}}^{1}\left(\mathrm{\psi}\left(S\right),\mathrm{H}\right)\le 0.21$
(or approximately so), the maximum 1-norm being 2.
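A minimal sketch of a 1-norm steady-state distance follows, assuming that the distance is the sum of absolute differences between the two steady-state probability vectors (which is consistent with the stated maximum of 2). The function name `l1_distance` and the toy distributions are hypothetical; they are not the distributions of the networks in this example.

```python
from typing import Mapping


def l1_distance(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Sum of absolute differences between two distributions over network states.

    States missing from one distribution are treated as having probability 0.
    """
    states = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)


# Hypothetical steady-state distributions over 3-gene states.
generating = {"000": 0.6, "111": 0.4}
designed = {"000": 0.5, "111": 0.3, "010": 0.2}
distance = l1_distance(generating, designed)

# Distributions with disjoint supports attain the maximum value of 2.
maximum = l1_distance({"00": 1.0}, {"11": 1.0})
print(distance, maximum)
```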

## 5. CONSISTENCY

The greater the amount of data, the better inference one can expect. The hope is that, for large data sets, the inferred network will be close to the generating network. We define an inference procedure, ψ, to be *consistent* if ${\mathrm{\mu}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)\to 0$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$. We illustrate consistency using Boolean networks with perturbation. We use the inference procedure referred to in Example 1 that applies to a single observed time series [6] and the distance function µ_{fun} of Eq. 7.

Owing to perturbation, the network has a steady-state distribution and all states communicate with each other. Hence, given a long time series we are likely to observe most of the states and their corresponding state-to-state transitions **x**_{k} → **x**_{(k)}, for *k* = 1, 2,…, *N*, where **x**_{(k)} denotes the next state following **x**_{k} under the network state function. If we ignore perturbation, then using the observed state-to-state transitions we can construct a table of state-to-gene transitions of the form **x**_{k} → *x*_{i}, for *k* = 1, 2,…, *N* and *i* = 1, 2,…, *n*. These define the functions *f*_{1}, *f*_{2},…, *f*_{n} accordingly. Because the truth table for function *f*_{i} has 2^{n} rows of the form *f*_{i}(**x**_{k}), some rows may be empty owing to insufficient observations and these rows can be filled in randomly. As the length of the time series increases, the probability of not observing the state **x**_{k} goes to 0. Indeed, for any positive integer *c*, if we let η(**x**_{k}) denote the number of times **x**_{k} is observed in the time series, then $P\left(\mathrm{\eta}\left({\text{x}}_{k}\right)\ge c\right)\to 1$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$, where the probability is with respect to the time series Σ.

With perturbation, the state-to-state transitions do not directly define functions because state **x**_{k} may transition to more than one state. However, assuming a perturbation probability less than 0.5, the transitions from **x**_{k} will be dominated by the single transition determined by the state function **f**, and this dominating choice can be used for inference. Letting η_{j}(**x**_{k}) denote the number of times we observe the transition ${\text{x}}_{k}\to {\text{x}}_{j}$, if $\text{f}\left({\text{x}}_{k}\right)={\text{x}}_{j}$ is the function-defined transition, then
$P\left({\mathrm{\eta}}_{j}\left({\text{x}}_{k}\right)>\underset{i\ne j}{\max}\,{\mathrm{\eta}}_{i}\left({\text{x}}_{k}\right)\right)\to 1$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$.

Thus, if $\hat{\text{f}}$ denotes the inferred state function, then $P\left(\hat{\text{f}}=\text{f}\right)\to 1$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$. Similar asymptotic statements hold for *f*_{1}, *f*_{2},…, *f*_{n}. This ensures that, for any τ > 0, $P\left({\mathrm{\mu}}_{\mathit{fun}}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{H}\right)<\mathrm{\tau}\right)\to 1$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$ for any Boolean network *H*. Since µ_{fun}(ψ(Σ), *H*) ≤ 1, this is equivalent to ${E}_{\mathit{\Sigma}}\left[{\mathrm{\mu}}_{\mathit{fun}}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{H}\right)\right]\to 0$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$. Finally, if H is the class of all Boolean networks on *n* genes with perturbation probability *p*, then, since H is a finite set,

${\mathrm{\mu}}_{\mathit{fun}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)={E}_{\text{H}}\left[{E}_{\mathit{\Sigma}}\left[{\mathrm{\mu}}_{\mathit{fun}}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\text{H}\right)\right]\right]\to 0$ as $\left|\mathrm{\Sigma}\right|\to \mathrm{\infty}$,

and the inference procedure is consistent relative to µ_{fun}.
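The consistency argument can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not the procedure of [6]. It assumes that at each time step every gene flips independently with probability *p*, and that the state function is applied only when no gene flips. It infers each state's successor by majority vote and fills unobserved rows randomly. As a stand-in for µ_{fun}, it uses the fraction of truth-table entries on which the two networks disagree. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product


def random_network(n, rng):
    """Random state function f mapping each n-bit state to a next state."""
    return {x: tuple(rng.randint(0, 1) for _ in range(n))
            for x in product((0, 1), repeat=n)}


def simulate(f, n, p, length, rng):
    """Single time series from a Boolean network with gene-perturbation probability p."""
    x = tuple(rng.randint(0, 1) for _ in range(n))
    series = [x]
    for _ in range(length):
        flips = [rng.random() < p for _ in range(n)]
        if any(flips):   # perturbation: flip the selected genes
            x = tuple(bit ^ int(flip) for bit, flip in zip(x, flips))
        else:            # no perturbation: apply the state function
            x = f[x]
        series.append(x)
    return series


def infer(series, n, rng):
    """Majority-vote successor for each observed state; random fill for unobserved states."""
    counts = defaultdict(Counter)
    for x, y in zip(series, series[1:]):
        counts[x][y] += 1
    f_hat = {}
    for x in product((0, 1), repeat=n):
        if counts[x]:
            f_hat[x] = counts[x].most_common(1)[0][0]
        else:
            f_hat[x] = tuple(rng.randint(0, 1) for _ in range(n))
    return f_hat


def mu_fun(f, g, n):
    """Assumed distance: fraction of truth-table entries on which f and g disagree."""
    differing = sum(a != b for x in f for a, b in zip(f[x], g[x]))
    return differing / (n * len(f))


rng = random.Random(0)
n, p = 3, 0.05
f = random_network(n, rng)
results = {}
for length in (20, 500, 50000):
    series = simulate(f, n, p, length, rng)
    results[length] = mu_fun(f, infer(series, n, rng), n)
print(results)
```

Majority voting works here because, for *p* < 0.5, the unperturbed transition is more probable than any single perturbed one, so its count eventually dominates for every state that is visited often enough.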

The preceding argument assumes that the perturbation probability is known. A modification of the inference procedure yields an estimator for *p* [6]; however, if *p* is also being estimated, then the model space H is no longer finite and the consistency proof has to be modified. We do not believe this is the proper place to go into such mathematical issues.

## 6. APPROXIMATION

Inference performance is evaluated based on the ability of an inference procedure to identify the network from which the data have been derived. This can only be done exactly if the data-generating network is known. Suppose we do not know the random network, H, generating the data for which we want to evaluate the inference procedure, ψ, but know a network *N* that we believe to be a good approximation to the networks in H. We might then compare the inferred network to *N*. In effect, such a comparison is approximating ${\mathrm{\mu}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)$ by ${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]$.

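The key issue is approximation accuracy. The triangle inequality implies

$\mathrm{\mu}\left(\mathrm{\psi}\left(S\right),\mathrm{N}\right)-\mathrm{\mu}\left(\mathrm{N},\mathrm{H}\right)\le \mathrm{\mu}\left(\mathrm{\psi}\left(S\right),\mathrm{H}\right)\le \mathrm{\mu}\left(\mathrm{\psi}\left(S\right),\mathrm{N}\right)+\mathrm{\mu}\left(\mathrm{N},\mathrm{H}\right)$

for any sample set *S* and $\mathrm{H}\in \text{H}$. Hence,

${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]-{E}_{\text{H}}\left[\mathrm{\mu}\left(\mathrm{N},\text{H}\right)\right]\le {\mathrm{\mu}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)\le {E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]+{E}_{\text{H}}\left[\mathrm{\mu}\left(\mathrm{N},\text{H}\right)\right]\qquad (33)$

If *E*_{H}[µ(*N*, H)] ≈ 0, meaning that *E*_{H}[µ(*N*, H)] is small, then the preceding inequality leads to the approximate inequality

${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]\lesssim {\mathrm{\mu}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)\lesssim {E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]\qquad (34)$

Thus, if *E*_{H}[µ(*N*, H)] ≈ 0, then

${\mathrm{\mu}}^{\ast}\left(\text{H},\mathrm{\Sigma},\mathrm{\psi}\right)\approx {E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]\qquad (35)$

and it is reasonable to judge the performance of ψ relative to H by ${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]$. On the other hand, if *E*_{H}[µ(*N*, H)] is not small, then both bounds in Eq. 33 are loose and nothing can be asserted regarding the performance of ψ relative to the data sets to which it is being applied. Therefore, unless *E*_{H}[µ(*N*, H)] is small, the entire validation procedure is flawed because the approximation of H by *N* is confounding the procedure. In addition, even if ${E}_{\text{H}}\left[\mathrm{\mu}\left(\mathrm{N},\text{H}\right)\right]\approx 0$, one still has to estimate ${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{N}\right)\right]$, which generally requires that the number of sample sets be sufficiently large that the expectation is well-estimated by the average distance.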
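The effect of the approximating network on the two-sided bound can be illustrated for a single realization (no expectations taken) using an unnormalized Hamming distance on edge vectors. This is only a sketch: the graphs are random toy vectors, and the names `bounds`, `corrupt`, `good_proxy`, and `bad_proxy` are hypothetical. The bracket around the true distance has width at most twice the distance between the approximating graph and the true graph.

```python
import random


def hamming(a, b):
    """Unnormalized Hamming distance between two equal-length 0/1 edge vectors."""
    return sum(x != y for x, y in zip(a, b))


def bounds(d_inferred_vs_proxy, d_proxy_vs_truth):
    """Triangle-inequality bracket for the distance from the inferred graph to the truth."""
    lower = max(0, d_inferred_vs_proxy - d_proxy_vs_truth)
    upper = d_inferred_vs_proxy + d_proxy_vs_truth
    return lower, upper


rng = random.Random(0)


def corrupt(vector, k):
    """Copy of vector with exactly k randomly chosen entries flipped."""
    flipped = set(rng.sample(range(len(vector)), k))
    return [1 - x if i in flipped else x for i, x in enumerate(vector)]


truth = [rng.randint(0, 1) for _ in range(20)]   # unknown in practice
inferred = corrupt(truth, 3)                     # output of some inference procedure
good_proxy = corrupt(truth, 1)                   # literature graph close to the truth
bad_proxy = corrupt(truth, 8)                    # literature graph far from the truth

report = {}
for name, proxy in (("good", good_proxy), ("bad", bad_proxy)):
    report[name] = bounds(hamming(inferred, proxy), hamming(proxy, truth))
    print(name, report[name], "true distance:", hamming(inferred, truth))
```

With the close proxy the bracket pins the true distance down to within two edges; with the distant proxy the bracket is so wide that the comparison says little about the inference procedure.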

The preceding approximation methodology is common in the literature. A proposed inference procedure is applied to one or more real data sets. The inferred network is compared, not to the unknown random network generating the data, but to a model network that has been human-constructed from the literature (and implicitly assumed to approximate the data-generating network). For instance, a directed graph (adjacency matrix), **A**, is constructed from relations found in the literature and the Hamming distance is used in the approximating expectation, ${E}_{\mathit{\Sigma}}\left[{\mathrm{\mu}}_{\mathit{ham}}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\text{A}\right)\right]$, in Eq. 35. The aim is to compare the result of the inference procedure to some characteristic related to existing biological knowledge. The problem is that the constructed regulatory graph may not be a good approximation to the regulatory graph for the system generating the data. This can happen because the literature is incomplete, because connections reported in the literature are insufficiently validated, or because the conditions under which connections have been discovered, or not discovered, in certain papers are not compatible with the conditions under which the current data have been derived. As a result of any of these situations, the overall validation procedure is confounded by the precision (or lack thereof) of the approximation.

## 7. VALIDATION FROM EXPERIMENTAL DATA

Another form of approximation results from using experimental data for validation rather than synthetic data generated from a known, ground-truth model. In this situation, there is a *test-data* sampling procedure generating data from which an estimate of the desired characteristic corresponding to the underlying physical network is formed. Validation is then *via* the random variable $\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{\xi}\left(\mathrm{\Omega}\right)\right)$, where Σ is the *training-data* sampling procedure used to design the network, Ω is a real-data *test-sampling* procedure to validate the designed network by direct construction of the characteristic *via* independent sampling, and ξ(Ω) is the estimate of the characteristic formed from the test data. To simplify the notation we consider a single underlying network *H* rather than a random network H. In this situation, ${E}_{\text{H}}\left[\mathrm{\mu}\left(\mathrm{N},\text{H}\right)\right]$ in Eq. 33 is replaced by $\mathrm{\mu}\left(\mathrm{\xi}\left(\mathrm{\Omega}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)$, where λ_{H} is the characteristic for *H*, and Eq. 33 takes the form

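${E}_{\mathit{\Sigma},\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{\xi}\left(\mathrm{\Omega}\right)\right)\right]-{E}_{\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\xi}\left(\mathrm{\Omega}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)\right]\le {E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)\right]\le {E}_{\mathit{\Sigma},\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{\xi}\left(\mathrm{\Omega}\right)\right)\right]+{E}_{\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\xi}\left(\mathrm{\Omega}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)\right]\qquad (36)$

If *E*_{Ω}[µ(ξ(Ω), λ_{H})] ≈ 0, then

${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)\right]\approx {E}_{\mathit{\Sigma},\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{\xi}\left(\mathrm{\Omega}\right)\right)\right]\qquad (37)$

If ξ is a consistent estimator of λ_{H}, so that ${E}_{\mathrm{\Omega}}\left[\mathrm{\mu}\left(\mathrm{\xi}\left(\mathrm{\Omega}\right),{\mathrm{\lambda}}_{\mathrm{H}}\right)\right]\approx 0$ for large test samples, then, on average, the approximation is good.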

Consider what happens if one only has data to estimate (train) the model, which may happen when data are limited on account of cost or the availability of samples. In this case, one tests on the same data, thereby having Ω = Σ in Eq. 36 and the *resubstitution* estimate, ${E}_{\mathit{\Sigma}}\left[\mathrm{\mu}\left(\mathrm{\psi}\left(\mathrm{\Sigma}\right),\mathrm{\xi}\left(\mathrm{\Sigma}\right)\right)\right]$, in Eq. 37. If ξ is a consistent estimator of λ_{H} and the single training sample is large, then the conclusion of Eq. 37 again holds. But we do not have a large sample. Hence, Eq. 36 cannot be used to ensure good average performance. But it also cannot be used to ensure good performance when there is a small independent test-data sample. In the independent case, we are concerned with the absolute difference

When the same data are used for training and testing, our interest is with

As with classification, where resubstitution error estimation is usually biased low owing to overfitting by the classification rule, in the case of network validation, resubstitution is risky because the characteristic of the designed network is being compared to a characteristic inferred from the same data with which the network has been designed. According to Eq. 36, as in the case of classification, this is not a problem for large samples, but it can be a serious problem for small samples because overfitting can cause Δ_{train} to be much less than Δ_{test}. Whereas substantial effort has gone into studying these kinds of problems in pattern recognition, there appears to be an absence of the analogous study for network validation.
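The small-sample behavior described above can be sketched with a toy simulation. Everything here is an assumption made for illustration: the "designed network" is represented only by a smoothed histogram standing in for its steady-state distribution, the characteristic estimator ξ is the raw histogram, the distance is the 1-norm, and `TRUE_DIST` is a hypothetical steady-state distribution over eight states. None of this reproduces an actual PBN design procedure.

```python
import random
from collections import Counter

STATES = list(range(8))   # stand-in for the 2^3 states of a 3-gene network
TRUE_DIST = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]   # hypothetical lambda_H


def histogram(sample):
    """xi: raw relative frequencies of the states in a sample."""
    counts = Counter(sample)
    return [counts[s] / len(sample) for s in STATES]


def design(sample, alpha=0.5):
    """psi (stand-in): additively smoothed histogram fitted to the training sample."""
    counts = Counter(sample)
    total = len(sample) + alpha * len(STATES)
    return [(counts[s] + alpha) / total for s in STATES]


def l1(p, q):
    """1-norm distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_distances(m, reps, rng):
    """Average resubstitution, independent-test, and true distances for sample size m."""
    resub = test = true = 0.0
    for _ in range(reps):
        train = rng.choices(STATES, weights=TRUE_DIST, k=m)
        held_out = rng.choices(STATES, weights=TRUE_DIST, k=m)
        designed = design(train)
        resub += l1(designed, histogram(train))      # same data for design and test
        test += l1(designed, histogram(held_out))    # small independent test sample
        true += l1(designed, TRUE_DIST)              # unavailable with real data
    return resub / reps, test / reps, true / reps


avg_resub, avg_test, avg_true = average_distances(m=10, reps=500, rng=random.Random(1))
print(avg_resub, avg_test, avg_true)
```

In this toy setting the resubstitution average falls below the true average distance, while the small independent test sample inflates it, so neither estimate is reliable at this sample size.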

### Example 4.

An attractor-preserving inference method for PBNs based on steady-state data has been proposed and applied to PBNs [23]. A PBN has been designed from cDNA microarray data using 7 genes: WNT5A, pirin, S100P, RET1, MART1, HADHB, and STC2. The steady-state distribution of the designed network has been compared to the histogram of the data, the histogram serving as an estimate of the steady-state distribution of the underlying physical network. Fig. (**3**) illustrates the comparison of the portion of the steady-state distribution corresponding to the data states with the data histogram. Referring to Eq. 14, the 1-norm and 2-norm yield the resubstitution error distances ${\mathrm{\mu}}_{\mathit{stead}}^{1}\left(\mathrm{\psi}\left(S\right),\mathrm{\xi}\left(S\right)\right)=0.45$ (out of a maximum of 2) and ${\mathrm{\mu}}_{\mathit{stead}}^{2}\left(\mathrm{\psi}\left(S\right),\mathrm{\xi}\left(S\right)\right)=0.1262$, respectively, the latter being the root-mean-square error.

## 8. CONCLUSION

This paper has proposed a mathematically rigorous framework for the validation of inference procedures for gene regulatory networks and has illustrated this framework employing validation methods used in the literature. Owing to the central role of regulatory networks in systems biology and the need to apply inference procedures to the massive data sets resulting from high-throughput technologies, validation cannot be left to *ad hoc* methods whose own performances are not understood. A formal framework is necessary. As should be clear from the paper, a great deal of work needs to be done to establish the properties of inference procedures under various conditions, such as the sampling procedure, model class, and validation criterion (distance function). Absent rigorous results in this regard, proposed inference procedures will remain speculative and the quality of their performances unknown. A sound epistemology will be lacking.

## ACKNOWLEDGEMENTS

I would like to acknowledge the National Science Foundation (CCF-0514644) and the National Cancer Institute (2R25CA090301-06) for supporting this work. I also wish to thank Wentao Zhao for assisting with the literature search and Barak Faryabi for proofreading the manuscript.

## REFERENCES

*via* steady state trajectories. EURASIP J. Bioinformat. Syst. Biol. 2007:11 pages. Article ID 82702.

*via* best-fit extensions. In: W Zhang, I Shmulevich, editors. Computat. Statis. Approaches Genom. Boston: Kluwer Academic Publishers; 2002. pp. 197–210.

**Bentham Science Publishers**

Validation of Inference Procedures for Gene Regulatory Networks. Current Genomics. Sep 2007; 8(6): 351.
