• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of currgenoLink to Publisher's site
Curr Genomics. Sep 2007; 8(6): 351–359.
PMCID: PMC2671720

Validation of Inference Procedures for Gene Regulatory Networks

Abstract

The availability of high-throughput genomic data has motivated the development of numerous algorithms to infer gene regulatory networks. The validity of an inference procedure must be evaluated relative to its ability to infer a model network close to the ground-truth network from which the data have been generated. The input to an inference algorithm is a sample set of data and its output is a network. Since input, output, and algorithm are mathematical structures, the validity of an inference algorithm is a mathematical issue. This paper formulates validation in terms of a semi-metric distance between two networks, or the distance between two structures of the same kind deduced from the networks, such as their steady-state distributions or regulatory graphs. The paper sets up the validation framework, provides examples of distance functions, and applies them to some discrete Markov network models. It also considers approximate validation methods based on data for which the generating network is not known, the kind of situation one faces when using real data.

Key Words: Epistemology, gene network, inference, validation.

1. INTRODUCTION

The construction of gene regulatory networks is among the most important problems in systems biology [1-2]. Network models provide quantitative knowledge concerning gene regulation and, from a translational perspective, they provide a basis for mathematical analyses leading to systems based therapeutic strategies [3]. Network models run the gamut from coarse-grained discrete networks to the detailed description of stochastic differential equations. The availability of high-throughput genomic data has motivated the development of numerous inference algorithms. The performance, or validity, of these algorithms must be quantified. An inference algorithm takes a sample set of data as input and outputs a model network. Its validity must be evaluated relative to its ability to infer a model network close to the ground-truth network from which the data have been generated. Given a hypothetical model, generate data from the model, apply the inference procedure to construct an inferred model, and compare the hypothetical and inferred models via some objective function.

This paper mathematically formulates validation in terms of the distance between two networks, or the distance between two structures of the same kind deduced from the networks, such as their steady-state distributions. As a function from a sample set to a class of network models, an inference procedure is a mathematical operator and its performance must be evaluated within a mathematical framework, in this case, distance functions. The paper sets up the validation framework in general terms, provides examples of distance functions, and applies them to some basic network models. It also considers approximate validation methods based on data for which the generating network is not known, the situation one faces when using real data.

It is hoped that this paper will help to motivate the study of network validation procedures. While we believe it describes the general setting and the basic requirements for validation, as will be pointed out, to date, there has been very little study devoted to network validation. There are many subtle statistical issues. If we are to be able to judge the worth of proposed algorithms, then these issues need to be addressed within a formal mathematical framework.

2. BACKGROUND: NETWORK MODELS

Although our aim is to consider network inference from a fairly general perspective, to give concrete examples we require some specific models. Thus, we assume the underlying network structure is composed of a finite node (gene) set, V = {X1, X2,…, Xn}, with each node taking discrete values in [0, d – 1]. The corresponding state space possesses N = dn states, which we denote by x1, x2,…, xN. We express the state xj in vector form by xj = (xj1, xj2,…, xjn). For notational convenience we write vectors in row form but treat them as columns when multiplied by a matrix. The corresponding dynamical system is based on discrete time, t = 0, 1, 2,…, with the state-vector transition X(t) → X(t + 1) at each time instant. The state X = (X1, X2,…, Xn) is often referred to as a gene activity profile (GAP).

Markov Chains

We assume that the process X(t) is a Markov chain, meaning that the probability of X(t) conditioned on X at t1 < t2 < … < ts < t is equal to the probability of X(t) conditioned on X(ts). We also assume the process is homogeneous, meaning that the transition probabilities depend only on the time difference, that is, for any t and u, the u-step transition probability,

pjku=PXt+u=xk|Xt=xj
(1)

depends only on u. We are not asserting that the Markov property and homogeneity are necessary assumptions for gene regulatory networks. We make these assumptions to facilitate mathematically tractable modeling for the current study. Under these assumptions, we need only consider the one-step transition probability matrix,

equation image
(2)

where the one-step transition probability, pjk, is given by pjk = pjk(1). We refer to P simply as the transition probability matrix. For t = 0, 1, 2,..., the state probability structure of the network is given by the t-state probability vector

pt=p1t,p2t,...,pNt
(3)

where pj(t) = P(X(t) = xj). p(0) is the initial-state probability vector.

Besides the state one-step probabilities, we can consider the gene one-step probabilities,

pij,r=PXit+1=r|Xt=xj
(4)

If, given the GAP at t, the conditional probabilities of the genes are independent, then

pjk=p1xj1,xkp2xj2,xk...pnxjn,xk
(5)

Suppose that gene Xi at time t + 1 depends only on values of genes (predictors) in a regulatory set, RiV at time t, the dependency being independent of t. Then the gene one-step probabilities are given by

pij,r=PXit+1=r|Xlt=xjl for XlRi
(6)

In this form, we see that the Markov dependencies are restricted to regulatory genes.

The network has a regulatory graph consisting of the n genes and a directed edge from gene xi to gene xj if xiRj. There is also a state-transition graph whose nodes are the N state vectors. There is a directed edge from state xj to state xk if and only if xj = X(t) implies xk = X(t + 1).

A homogeneous, discrete-time Markov chain with state space {x1, x2,…, xN} possesses a steady-state distribution (π1, π2,…, πN) if, for all pairs of states xk and xj, pjk(u) → πk as u → ∞. If there exists a steady-state distribution, then, regardless of the state xk, the probability of the Markov chain being in state xk in the long run is πk. In particular, for any initial distribution p(0), pk(t) → πk as t → ∞. Not all Markov chains possess steady-state distributions.

Rule-Based Networks

A basic type of regulatory model occurs when the transition X(t) → X(t + 1) is governed by a rule-based structure, meaning there exists a state function f = (f1, f2,…, fn) such that Xi(t + 1) = fi(Ri(t)). A classical example of a rule-based network is a Boolean network (BN), where the values are binary, 0 or 1, and the function fi can be defined via a logic expression or a truth table consisting of 2n rows, with each row assigning a 0 or 1 as the value for the GAP defined by the row [4-5]. As defined, the BN is deterministic and the entries in its transition probability matrix are either 0 or 1. The connectivity of the BN is the maximum number of predictors for a gene. If each has the same number of predictors, then we say that the network has uniform connectivity.

The model becomes stochastic if the BN is subject to perturbation, meaning that at any time point, instead of necessarily being governed by the state function f = (f1, f2,…, fn), there is a positive probability p < 1 that the GAP may randomly switch to another GAP. There are more refined ways of characterizing perturbations, such as defining perturbations at the gene level rather than the state level, but state-level perturbation is easy to describe and is sufficient for our purposes here. For a BN with perturbation, the corresponding Markov chain possesses a steady-state distribution.

The long-run behavior of a deterministic BN depends on the initial state and the network will eventually settled down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction for the cycle. Attractor cycles are disjoint. With perturbation, in the long run the network may randomly escape an attractor cycle, be reinitialized, and then begin its transition process anew.

3. QUANTIFYING THE DIFFERENCE BETWEEN NETWORKS

Distance Functions

To discuss validity, we must first discuss the manner in which we are to compare two networks. Given networks H and M, we need a function, µ(M, H), quantifying the difference between them. We require that µ be a semi-metric, meaning that it satisfies the following four properties:

  1. μM,H0,
  2. μM,M=0,
  3. μM,H=μH,Msymmetry,
  4. μM,HμM,N+μN,Htriangle inequality.
    As a semi-metric, µ is called a distance function. If µ should satisfy a fifth condition,
  5. μM,H=0M=H,

then it is a metric. A distance function is often defined in terms of some characteristic, by which we mean some structure associated with a network, such as its regulatory graph, steady-state distribution, or probability transition matrix. This is why we do not require the fifth condition, μM,H=0M=H, for a network distance function.

If we want to approximate one network by another, say for reasons of computational complexity, then a distance function can be used to measure the goodness of the approximation. If M1 and M2 are two approximations of network H, then M1 is a better approximation than M2 relative to µ if µ μM1,H<μM2,H.

Because a network distance function need only be a semi-metric, one must be careful in applying propositions from the theory of metric spaces. For instance, in a metric space, if a sequence of points in the space is convergent, then the limit of the sequence is unique. When the points are networks, this is not necessarily true. A sequence of networks can converge to two distinct networks: {Hi} can converge to both M and N, with MN.

Rule-Based Distance

For Boolean networks (with or without perturbation) possessing the same gene set, a distance is given by the proportion of incorrect rows in the function-defining truth tables. Denoting the state functions for networks H and M by f = (f1, f2,…, fn) and g = (g1, g2,…, gn), respectively, since there are n truth tables consisting of 2n rows each, this distance is given by

μfunM,H=1n2ni=1nk=1NIgixkfixk
(7)

where I denotes the indicator function, I[A] = 1 if A is a true statement and I[A] = 0 otherwise [6]. If we wish to give more weight to those states more likely to be observed in the steady state, then we can weight the inner sums in Eq. 7 by the corresponding terms in the steady-state distribution, π = (π1, π2,…, πN). For Boolean networks without perturbation, µfun is a metric. If there is perturbation, then µfun is not a metric because two distinct networks may be identical with regard to the rules but possess different perturbation probabilities.

Topology-Based Distance

If one’s focus is on the topology of a network, then a straightforward approach is to construct the adjacency matrix. Given an n-gene network, for i, j = 1, 2,…, n, the (i, j) entry in the matrix is 1 if there is a directed edge from the ith to the jth gene; otherwise, the (i, j) entry is 0. If A = (aij) and B = (bij) are the adjacency matrices for networks H and M, respectively, where H and M possess the same gene set, then the hamming distance between the networks is defined by

μhamM,H=i,j=1naijbij
(8)

Alternatively, the hamming distance may be computed by normalizing the sum, such as by the number of genes or the number of edges in one of the networks, for instance, when one of the networks is considered as representing ground truth. The hamming distance is a coarse measure since it contains no steady-state or dynamic information. Two networks can be very different and yet have μhamM,H=0 .

If one of the networks in Eq. 8 is considered as ground truth, then the hamming distance can be reformulated in terms of the numbers of false-negative and false-positive edges. If H is the ground-truth network, then a false-negative edge is a directed edge not in M that is in H and a false-positive edge is directed edge in M that is not in H. Letting FN and FP be the numbers of false-negative and false-positive edges, respectively, the hamming distance is given by FN + FP. Because we are considering directed graphs, an incorrectly oriented edge in M between two genes is both a false-negative and false-positive edge, although one can slightly alter the definitions to avoid this kind of double counting. If we were to consider undirected graphs, then this anomaly would not occur because an edge would either be present or absent. In this case, the hamming distance is still defined by Eq. 8 but the adjacency matrix is symmetric.

Since our interest is measuring the closeness of an inferred network to the network generating the data, we concentrate on distance functions, in particular, the hamming distance, which has been used for this purpose [7, 8]. Non-distance measures related to the hamming distance have been used in the context of regulatory graphs. Again let H denote the ground-truth network. A true-positive edge is a directed edge in both H and M, and a true-negative edge is a directed edge in neither H nor M (with analogous definitions holding for undirected graphs). Let TP and TN be the numbers of true-positive and true-negative edges, respectively. The positive predictive value is defined by TP/(TP + FP), the sensitivity is defined by TP/(TP + FN), and the specificity is defined by TN(TN + FP). These kinds of measures have been used in several regulatory-graph inference papers [8-12] and a study using these measures has been performed to evaluate a number of inference procedures [13].

Transition-Probability-Based Distance

Distances for Markov networks can be defined via their probability transition matrices by considering matrix norms. A norm is a function on a linear (vector) space, L, such that:

  1. v0,
  2. v=0v=0,
  3. av=avhomogeneity,
  4. v+wv+wtriangle inequality.

Given a norm on L, a metric is defined on L by vw .

For an n x n matrix and r ≥ 1, the r-norm is defined by

Pr=i,j=1npijr1/r
(9)

The supremum norm is defined by

P=maxpij:i,j=1,2,...,n
(10)

These norms are well-studied in linear algebra. Each yields a metric defined by PQr If P=pij and Q=qij are the probability transition matrices for networks H and M, respectively, then a network distance function is defined by

μprobrM,H=PQr
(11)

Whereas r defines a matrix metric, μprobr is only a network semi-metric because two distinct networks may have the same transition probability matrix.

Long-Run Distance

Since steady-state behavior is of particular interest, for instance, being associated with phenotypes, a natural choice for a network distance is to measure the difference between steady-state distributions [14]. If π = (π1, π2,…, πN) is a probability vector, then its r-norm is defined by

πr=i=1Nπir1/r
(12)

for r ≥ 1, and its supremum norm is defined by

π=maxπi:i=1,2,...,N
(13)

If π = (π1, π2,…, πN) and ω = (ω1, ω2,…, ωN) are the steady-state distributions for networks H and M, respectively, then a network distance is defined by

μsteadrM,H=πωr
(14)

Other norms can be used to define the distance function.

Not all networks possess steady-state distributions. The long-run behavior of a deterministic rule-based network, such as a Boolean network, depends on the initial state. A rule-based finite-value network possesses attractor cycles that characterize its long-run behavior and we can consider comparing this long-run behavior. This can be done by considering the proportion of time spent in a state once an attractor cycle has been entered. For any initial state xk, the network eventually enters the attractor cycle, Ck, whose basin contains xk. An arbitrary state xj either lies in Ck or it does not. Let mk denote the number of states in Ck and pk be the probability that the initial state is xk. We define the long-run probability of xj by

ζj=K=1NpkmkIxjCk
(15)

Letting ζ = (ζ1, ζ2, …ζN), we can proceed analogously to the steady-state case by replacing π by ζ to define the r-norm, and then define the distance function μlongrM,H in the usual way.

Suppose all attractor cycles are singletons, so that mk = 1. Moreover, suppose we do not know the initial-state probabilities and we set pk = 1/N. If xk is an attractor, let bk denote the number of states in its basin; if xk is not an attractor, let bk = 0. Then Eq. 15 reduces to ζj = bj/N. To this point, ζ = (ζ1, ζ2,…, ζN) describes a probability density because its components sum to 1. Now suppose we ignore the basin sizes so that ζj = 1/N if xj is an attractor and ζj = 0 otherwise. If ζ = (ζ1, ζ2,…, ζN) and ξ = (ξ1, ξ2,…, ξN) correspond to networks H and M, respectively, then the network distance induced by the 1-norm is given by

μattM,H=j1N|ζjξj|=1N|AHΔAM|
(16)

where AH and AM are the attractor sets for H and M, respectively,

AHΔAM=AHAMAMAH
(17)

is the symmetric difference of AH and AM, and denotes the number of elements in a set. The distance μatt(M, H) compares the attractor sets of the two networks. μatt(M,H) = 0 if and only the attractor sets are the same. We have derived μatt(M, H) from μlong(M, H) assuming only singleton attractors, but μatt(M, H) can be applied to any rule-based discrete network.

Trajectory-Based Distance

Continuing with rule-based finite-value networks, rather than simply focusing on the long-run probabilities, one can take a more refined perspective by considering differences in the trajectories. Continue to let mk denote the number of states in the cycle Ck for initial state xk and let tk be the time it takes xk to reach Ck. The time trajectory of the network is given by X(t) = (X1(t), X2(t),…, Xn(t)). For a given initial state this trajectory is deterministic. For initial state xk, denote the trajectory by

xkt=x1kt,x2kt,...,xnkt
(18)

Given the initial state is xk, we define the amplitude cumulative distribution of gene Xi by

Fizk=1mkt=tktk+mk1Ixiktz
(19)

This increasing function of z counts the fraction of time that xiktz in the cycle Ck.

Given two attractor cycles, Ck and Cj, resulting from initializations xk and xj, respectively, we define a distance between the cycles relative to gene Xi using the amplitude cumulative distributions, Fi|k and Fi|j , by

δiCk,Cj=FikFij
(20)

for some function norm . For example, we could use the L1 norm

FikFij1=0Fiz|kFiz|jdz
(21)

The L1 norm possesses an interesting interpretation if gene Xi has constant amplitude values, a and b, on cycles Ck and Cj, respectively. In this case, Fi|k and Fi|j are unit step functions with steps at a and b, respectively. Hence, in this case the L1 norm reduces to

δiCk,Cj=ab
(22)

and gives the distance, in amplitude, between the values of gene Xi on the two cycles. For a Boolean network, δiCk,Cj=0 if the gene is either ON or OFF on both cycles and δiCk,Cj=1 if Xi is ON for one cycle and OFF for the other (assuming Xi is constant on both cycles).

Considering the full set of genes, we define a distance between two attractor cycles, Ck and Cj by

δCk,Cj=1ni=1nδiCk,Cj
(23)

Now consider two networks, M and H, having the same genes. We define the distance between M and H as the expected distance between attractor cycles over all possible initial states:

μampM,H=EδCkM,CkH=k=1NpkδCkM,CkH
(24)

where CkM and CkH are the attractor cycles corresponding to initialization by state xk in networks M and H, respectively [15].

Equivalence Classes of Networks

The previous examples of network distance functions demonstrate a common scenario: a network semi-metric is defined by a metric on some network characteristic, for instance, its regulatory graph, its transition probability matrix, etc. The metric requirement, μM,H=0M=H fails because distinct networks possess the same characteristic. To formalize the situation, let λM and λH denote the characteristic λ corresponding to networks M and H, respectively. If ν is a metric on a space of characteristics (directed graphs, matrices, probability densities, etc.), then a semi-metric μν is induced on the network space according to

μvM,H=vλM,λH
(25)

This is quite natural if our main interest is with the characteristic, not the specific network itself.

Focus on network characteristics leads to the identification of networks possessing the same characteristic. Given any set, U, a relation ~ between elements of U is called an equivalence relation if it satisfies the following three properties for a,b,cU:

  1. aareflexivity,
  2. abbasymmetry,
  3. ab and bcactransitivity.

If a ~ b, then a and b are said to be equivalent. An equivalence relation on U induces a partition of U. The subsets forming the partition are defined according to a and b lie in the same subset if and only if a ~ b. The subsets are called equivalence classes. The equivalence class of elements equivalent to a is denoted by [a]~. According to the definitions, a=b if and only if a ~ b .

If ν is a semi-metric on a set U and we define a ~ b if and only if ν(a, b) = 0, then

μa,b=νa,b
(26)

defines a metric on the space of equivalence classes because μa,b=0νa,b=0aba=b.

If we define M ~ H if λM = λH, then this is a network equivalence relation. If we focus on equivalence classes of networks rather than the networks themselves, we are in effect identifying equivalent networks. For instance, if we are only interested in steady-state distributions, then it may be advantageous to identify networks possessing the same steady-state distribution.

4. INFERENCE PERFORMANCE

An inference procedure operates on data generated by a network H and constructs an inferred network M to serve as an estimate of H, or it constructs a characteristic to serve as an estimate of the corresponding characteristic of H. For instance, the data may be used to infer a distribution that estimates the steady-state distribution of H. The data could be dynamical, consisting of time-course observations, or it might be taken from the steady state, as with microarray measurements assumed to come from the steady state of some phenotypic class. In the latter case, it makes sense to consider inference accuracy relative to the steady-state distribution of H, rather than H itself. For full network inference, the inference procedure is a mathematical operation, a mapping from a space of samples to a space of networks, and it must be evaluated as such. There is a generated data set S and the inference procedure is of the form ψ(S) = M. If a characteristic is being estimated, then ψ(S) is a characteristic, for instance, ψS = F, a probability distribution.

Measuring Inference Performance Using Distance Functions

Focusing on full network inference, the goodness of an inference procedure ψ is measured relative to some distance, μ, specifically, µ μM,H=μψS,H , which is a function of the sample S. In fact, S is a realization of a random set process, Σ, governing data generation from H. In general, there is no assumption on the nature of Σ. It might be directly generated by H or it might result from directly generated data corrupted by noise of some sort. μψΣ,H is a random variable and the performance of ψ is characterized by the distribution of μψΣ,H , which depends on the distribution of Σ. The salient statistic regarding the distribution of μψΣ,H is its mean, EΣμψΣ,H , where the expectation is taken with respect to Σ.

Rather than considering a single network, we can consider a distribution, H, of random networks, where, by definition, the occurrences of realizations H of H are governed by a probability distribution. This is precisely the situation with regard to the classical study of random Boolean networks. Averaging over the class of random networks, our interest focuses on

μH,Σ,ψ=EHEΣμψΣ,H
(27)

It is natural to define the inference procedure ψ1 better than the inference procedure ψ2 relative to the distance μ, the random network H, and the sampling procedure Σ if

μH,Σ,ψ1<μH,Σ,ψ2
(28)

Whether an inference procedure is “good” is not only relative to the distance function, it is relative to how one views the value of the expected distance. Indeed, it is not really possible to determine an absolute notion of goodness.

In practice, the expectation is estimated by an average,

μˆH,Σ,ψ=1mj=1mμψSj,Hj
(29)

where S1, S2,..., Sm are sample point sets generated according to Σ from networks H1, H2,…, Hm randomly chosen from H.

The preceding analysis applies virtually unchanged when a characteristic is being estimated. One need only replace H and H by λ and Λ, where λ and Λ are a characteristic and a random characteristic, respectively, and replace the network distance μ by the characteristic distance.

We next present three examples using previously introduced distance functions to measure inference performance. Algorithm description will be sketchy in order to avoid long digressions from the issue of distance illustration. We defer to the cited literature for details.

Example 1.

The Boolean network model has been in existence for a long time and various inference procedures have been proposed [16-18]. One proposed method for Boolean networks with perturbation is based on the observation of a single dynamic realization of the network [6]. This method will be discussed in some detail in Section 5 in regard to consistent inference; for now, we are only concerned with the distance between the inferred network and the original network generating the data, where the distance function is given by µfun(M, H) in Eq. 7. Fig. (11) shows the average (in percentage) of the distance function using 80 data sequences generated from 16 randomly generated Boolean networks with 7 genes, perturbation probability p = 0.01, uniform connectivity k = 2 or k = 3, and data sequence lengths varying from 500 to 40,000. The reduction in performance from connectivity 2 to connectivity 3 is not surprising because the number of truth-table lines jumps dramatically.

Fig. (1)
Rule-based distance performance for Boolean-network inference for connectivity k = 2, 3.

Example 2.

There have been a number of papers addressing the inference of connectivity graphs using information-theoretic approaches [9, 10, 19]. In a study proposing using the minimum description length (MDL) principle to infer regulatory graphs [8], the hamming distance was used to compare the performance of the newly proposed algorithm with an earlier information-theoretic algorithm, called REVEAL [9]. Fig. (22) compares the hamming distances between the inferred networks and the corresponding synthetic networks that generated the data relative to increasing sample size. It does so for the REVEAL algorithm and the MDL algorithm using three different settings for a user-defined parameter. The performance measures are obtained by averaging over 30 randomly generated networks, each containing 20 genes and 30 edges, with the distance function being normalized over 30, the number of edges in the synthetic networks.

Fig. (2)
Hamming distance performance for inferring regulatory graphs using information theory.

Example 3.

A probabilistic Boolean network (PBN) is a network defined as a collection of discrete-valued networks such that at any point in time one of the constituent networks is controlling the network dynamics [20]. In a context-sensitive PBN there is a binary random variable determining whether there should be a switch of constituent networks at that time point, the modeling assumption being that there are latent variables outside the model network whose changes induce stochasticity into the PBN [21]. Typically, there is also a probability of permutation. This example considers a Bayesian connectivity-based inference procedure for designing PBNs from steady-state data [22]. A synthetic PBN, H, composed of two constituent Boolean networks is used to generate a random sample of size 60 from its steady-state distribution and the inference procedure is used to construct a designed PBN composed of ten constituent PBNs (note that the inference procedure does not have input relating to the number of constituent BNs of the generating network). According to definition, the attractors of a PBN are the attractors of its constituent BNs. H has six singleton attractors, two of which, call them xa and xb, contain 0.99 of the steady-state mass. The designed PBN has more attractors, which is not uncommon, but xa and xb appear in all ten constituent networks as singleton attractors and they contain 0.78 of the steady-state mass. Since, for PBNs with low probability of network switching almost all of the steady-state mass lies in its attractors [21], μstead1ψS,H0.21 (or approximately so), the maximum 1-norm being 2.

5. CONSISTENCY

The greater the amount of data, the better inference one can expect. The hope is that, for large data sets, the inferred network will be close to the generating network. We define an inference procedure, ψ, to be consistent if μH,Σ,ψ0 as Σ . We illustrate consistency using Boolean networks with perturbation. We use the inference procedure referred to in Example 1 that applies to a single observed time series [6] and the distance function µfun of Eq. 7.

Owing to perturbation, the network has a steady-state distribution and all states communicate with each other. Hence, given a long time series we are likely to observe most of the states and their corresponding state-to-state transitions xkx(k), for k = 1, 2,…, N, where x(k) denotes the next state following xk under the network state function. If we ignore perturbation, then using the observed state-to-state transitions we can construct a table of state-to-gene transitions of the form xkxi, for k = 1, 2,…, N and i = 1, 2,…, n. These define the functions f1, f2,…, fn accordingly. Because the truth table for function fi has 2n rows of the form fi(xk), some rows may be empty owing to insufficient observations and these rows can be filled in randomly. As the length of the time series increases, the probability of not observing the state xk goes to 0. Indeed, for any positive integer c, if we let η(xk) denote the number of times xk is observed in the time series, then Pηxkc1 as Σ , where the probability is with respect to the time series Σ.

With perturbation, the state-to-state transitions do not directly define functions because state xk may transition to more than one state. However, assuming a perturbation probability less than 0.5, the transitions from xk will be dominated by the single transition determined by the state function f and this dominating choice can be used for inference. Letting ηj(xk) denote the number times we observe the transition xkxj if fxk=xk is the function-defined transition, then

Pηkxk>maxη1xk,η2xk,...,ηNxk1 as Σ
(30)

Thus, if fˆ denotes the inferred state function, then Pfˆ=f1 as Σ.Similar asymptotic statements hold for f1, f2,…, fn. This insures that, for any τ>0,PμfunψΣ,H<τ1 as Σ for any Boolean network H. Since µfun(ψ(Σ), H) ≤ 1, this is equivalent to EΣμfunψΣ,H0 as Σ Finally, if H is the class of all Boolean networks on n genes with perturbation probability p, then, since H is a finite set,

μfunH,Σ,ψ=EHEΣμfunψΣ,H0 as Σ
(31)

and the inference procedure is consistent relative to µfun.

The preceding argument assumes that the perturbation probability is known. A modification of the inference procedure yields an estimator for p [6]; however, if p is also being estimated, then the model space H is no longer finite and the consistency proof has to be modified. We do not believe this is the proper place to go into such mathematical issues.

6. APPROXIMATION

Inference performance is evaluated based on the ability of an inference procedure to identify the network from which the data have been derived. This can only be done exactly if the data-generating network is known. Suppose we do not know the random network, H, generating the data for which we want to evaluate the inference procedure, ψ, but know a network N that we believe to be a good approximation to the networks in H. We might then compare the inferred network to N. In effect, such a comparison is approximating μH,Σ,ψ by EΣμψΣ,N .

The key issue is approximation accuracy. The triangle inequality implies

μψS,NμN,HμψS,HμψS,N+μN,H
(32)

for any sample set S and HH . Hence,

EΣμψΣ,NEHμN,H EHEΣμψΣ,H EΣμψΣ,N+EHμN,H
(33)

If EH[µ(N, H)] ≈ 0, meaning that EH[µ(N, H)] is small, then the preceding inequality leads to the approximate inequality

EΣμψΣ,NμH,Σ,ψEΣμψΣ,N
(34)

Thus, if EH[µ(N, H)] ≈ 0, then

μH,Σ,ψEΣμψΣ,N
(35)

and it is reasonable to judge the performance of ψ relative to H by EΣμψΣ,N . On the other hand, if EH[µ(N, H)] is not small, then both bounds in Eq. 33 are loose and nothing can be asserted regarding the performance of ψ relative to the data sets on which it is being applied. Therefore, unless EH[µ(N, H)] is small, the entire validation procedure is flawed because the approximation of H by N is confounding the procedure. In addition, if EHμN,H0, one still has to estimate EΣμψΣ,N, which generally means that the number of sample sets is sufficiently large that the expectation is well-estimated by the average distance.

The preceding approximation methodology is common in the literature. A proposed inference procedure is applied to one or more real data sets. The inferred network is compared, not to the unknown random network generating the data, but to a model network that has been human-constructed from the literature (and implicitly assumed to approximate the data-generating network). For instance, a directed graph (adjacency matrix), A, is constructed from relations found in the literature and the hamming distance is used in the approximating expectation, EΣμhamψΣ,A, in Eq. 35. The aim is to compare the result of the inference procedure to some characteristic related to existing biological knowledge. The problem is that the constructed regulatory graph may not be a good approximation to the regulatory graph for the system generating the data. This can happen because the literature is incomplete, there are insufficiently validated connections reported in the literature, or the conditions under which connections have been discovered, or not discovered, in certain papers are not compatible with the conditions under which the current data have been derived. As a result of any of these situations, the overall validation procedure is confounded by the precision (or lack thereof) of the approximation.

7. VALIDATION FROM EXPERIMENTAL DATA

Another form of approximation results from using experimental data for validation rather than synthetic data generated from a known, ground-truth model. In this situation, there is a test-data sampling procedure generating data from which an estimate of the desired characteristic corresponding to the underlying physical network is formed. Validation is then via the random variable μψΣ,ξΩ, where Σ is the training-data sampling procedure used to design the network and Ω is a real-data test-sampling procedure to validate the designed network by direct construction of the characteristic via independent sampling. To simplify the notation we consider a single underlying network H rather than a random network H. In this situation, EHμN,H in Eq. 33 is replaced by μξΩ,λH , where λH is the characteristic for H, and Eq. 33 takes the form

EΩEΣμψΣ,ξΩEΩμξΩ,λHEΣμψΣ,λHEΩEΣμψΣ,ξΩ+EΩμξΩ,λH
(36)

If EΩ[µ(ξ(Ω), λH)] ≈ 0, then

μH,Σ,ψEΩEΣμψΣ,ξΩ
(37)

If ξ is a consistent estimator of λH, so that EΩμξΩ,λH0 for large test samples, then, on average, the approximation is good.

Consider what happens if one only has data to estimate (train) the model, which may happen when data are limited on account of cost or the availability of samples. In this case, one tests on the same data, thereby having Ω = Σ in Eq. 36 and the resubstitution estimate, EΣμψΣ,ξΣ , in Eq. 37. If ξ is a consistent estimator of λH and the single training sample is large, then the conclusion of Eq. 37 again holds. But we do not have a large sample. Hence, Eq. 36 cannot be used to insure good average performance. But it also cannot be used to insure good performance when there is a small independent test-data sample. In the independent case, we are concerned with the absolute difference

Δtest=EΣμψΣ,λHEΩEΣ[μψΣ,ξΩ]
(38)

When the same data are used for training and testing, our interest is with

Δtrain=EΣμψΣ,λHEΣμψΣ,ξΣ
(39)

As with classification, where resubstitution error estimation is usually biased low owing to overfitting by the classification rule, in the case of network validation, resubstitution is risky because the characteristic of the designed network is being compared to a characteristic inferred from the same data with which the network has been designed. According to Eq. 36, as in the case of classification, this is not a problem for large samples, but it can be a serious problem for small samples because overfitting can cause Δtrain to be much less than Δtest. Whereas substantial effort has gone into studying these kinds of problems in pattern recognition, there appears to be an absence of the analogous study for network validation.

Example 4.

An attractor-preserving inference method for PBNs based on steady-state data has been proposed and applied to PBNs [23]. A PBN has been designed from cDNA microarray data using 7 genes: WNT5A, pirin, S100P, RET1, MART1, HADHB, and STC2. The steady-state distribution of the designed network has been compared to the histogram of the data, the histogram serving as an estimate of the steady-state distribution of the underlying physical network. Fig. (33) illustrates the comparison of the portion of the steady-state distribution corresponding to the data states with the data histogram. Referring to Eq. 14, the 1-norm and 2-norm yield the resubstitution error distances μstead1ψS,ξS=0.45 (out of a maximum of 2) and μstead2ψS,ξS=0.1262, respectively, the latter being the root-mean-square error.

Fig. (3)
Comparison of steady state distribution for a designed network and data histogram.

8. CONCLUSION

This paper has proposed a mathematically rigorous framework for the validation of inference procedures for gene regulatory networks and has illustrated this framework employing validation methods used in the literature. Owing to the central role of regulatory networks in systems biology and the need to apply inference procedures to the massive data sets resulting from high-throughput technologies, validation cannot be left to ad hoc methods whose own performances are not understood. A formal framework is necessary. As should be clear from the paper, a great deal of work needs to be done to establish the properties of inference procedures under various conditions, such as the sampling procedure, model class, and validation criterion (distance function). Absent rigorous results in this regard, proposed inference procedures will remain speculative and the quality of their performances unknown. A sound epistemology will be lacking.

ACKNOWLEDGEMENTS

I would like to acknowledge the National Science Foundation (CCF-0514644) and the National Cancer Institute (2R25CA090301-06) for supporting this work. I also wish to thank Wentao Zhao for assisting with the literature search and Barak Faryabi for proof reading the manuscript.

REFERENCES

1. de Jong H. Modeling and simulation of genetic regulatory systems: a literature review. Computat. Biol. 2002;9(1):67–103. [PubMed]
2. Shmulevich I, Dougherty E. R. Genomic Signal Processing. Princeton: Princeton University Press; 2007.
3. Datta A, Pal R, Dougherty ER. Intervention in probabilistic gene regulatory networks. Curr. Bioinformat. 2006;1(2):167–184.
4. Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. Theoret. Biol. 1969;22:437–467. [PubMed]
5. Kauffman SA. The Origins of Order. New York: Oxford University Press; 1993.
6. Marshall S, Yu L, Xiao Y, Dougherty ER. Inference of probabilistic Boolean networks from a single observed temporal sequence. EURASIP J. Bioinformat. Sys. Biol. 2007:15 pages. Article ID 32454. [PMC free article] [PubMed]
7. Bernard A, Hartemink A. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of data. Pacific Sympo. Biocomput. 2005;10:459–470. [PubMed]
8. Zhao W, Serpedin E, Dougherty ER. Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics. 2006;22(17):2129–2135. [PubMed]
9. Liang S., Fuhrman S., Somogyi R. REVEAL a general reverse engineering algorithm for inference of genetic network architectures. Pacific Sympo. Biocomput. 1998;3:18–29. [PubMed]
10. Margolin A, Nemenman I, Basso K, Klein U, Wiggins C, Stolovitzky G, Favera RD, Califano A. ARACNE: an algorithm for reconstruction of genetic networks in a mammalian cellular context. BMC Bioinformat. 2006;7(Suppl. 1):S7. [PMC free article] [PubMed]
11. Chen X, Anantha G, Wang X. An effective structure learning method for constructing gene networks. Bioinformatics. 2006;22(11):1367–1374. [PubMed]
12. Zou M, Conzen SD. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics. 2005;21(1):71–79. [PubMed]
13. Bansal M, Belcastro V, Ambesi-Implombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol. Syst. Biol. 2006;3:78. [PMC free article] [PubMed]
14. Kim S, Li H, Dougherty ER, Chao N, Chen Y, Bittner ML, Suh EB. Can Markov chain models mimic biological regulation. Biol. Syst. 2002;10(4):447–458.
15. Brun M, Kim S, Choi W, Dougherty ER. Comparison of network models via steady state trajectories. EURASIP J. Bioinformat. Syst. Biol. 2007:11 pages. Article ID 82702. [PMC free article] [PubMed]
16. Akutsu T, Miyano S, Kuhara S. Identification of genetic networks from a small number of gene expression patterns under the Boolean model. Pacific Sympo. Biocomput. 1999;4:17–28. [PubMed]
17. Lahdesmaki H, Shmulevich I, Yli-Harja O. On learning gene regulatory networks under the Boolean network model. Machine Learn. 2003;52:147–167.
18. Shmulevich I, Saarinen A, Yli-Harja O, Astola J. Inference of genetic regulatory networks via best-fit extensions. In: W Zhang, I Shmulevich., editors. Computat. Statis. Approaches Genom. Boston: Kluwer Academic Publishers; 2002. pp. 197–210.
19. Nemenman I. Information theory, multivariate dependence, and genetic network inference. Technical Report NSF-KITP-4-54, KITP, UCSB. 2004.
20. Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002;18:261–274. [PubMed]
21. Brun M, Dougherty ER, Shmulevich I. steady state probabilities for attractors in probabilistic Boolean networks. EURASIP J. Signal Process. 2005;85(10):1993–2013.
22. Zhou X, Wang X, Pal R, Ivanov I, Bittner ML, Dougherty ER. A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks. Bioinformatics. 2004;20(17):2918–2927. [PubMed]
23. Pal R, Ivanov I, Datta A, Dougherty ER. Generating Boolean networks with a prescribed attractor structure. Bioinformatics. 2005;21(21):4021–4025. [PubMed]

Articles from Current Genomics are provided here courtesy of Bentham Science Publishers

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...