# Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data

## Abstract

**Motivation:** Although it is widely accepted that high-throughput biological data are typically highly noisy, the effects that this uncertainty has upon the conclusions we draw from these data are often overlooked. However, in order to assign any degree of confidence to our conclusions, we must quantify these effects. *Bootstrap* resampling is one method by which this may be achieved. Here, we present a parametric bootstrapping approach for time-course data, in which Gaussian process regression (GPR) is used to fit a probabilistic model from which replicates may then be drawn. This approach implicitly allows the time dependence of the data to be taken into account, and is applicable to a wide range of problems.

**Results:** We apply GPR bootstrapping to two datasets from the literature. In the first example, we show how the approach may be used to investigate the effects of data uncertainty upon the estimation of parameters in an ordinary differential equations (ODE) model of a cell signalling pathway. Although we find that the parameter estimates inferred from the original dataset are relatively robust to data uncertainty, we also identify a distinct second set of estimates. In the second example, we use our method to show that the topology of networks constructed from time-course gene expression data appears to be sensitive to data uncertainty, although there may be individual edges in the network that are robust in light of present data.

**Availability:** Matlab code for performing GPR bootstrapping is available from our web site: http://www3.imperial.ac.uk/theoreticalsystemsbiology/data-software/

**Contact:** paul.kirk@imperial.ac.uk, m.stumpf@imperial.ac.uk

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

The use of data obtained from high-throughput technologies such as microarrays has become standard in systems biology. There are many ways in which these data are exploited, such as reverse engineering putative pathways and networks directly from the data (e.g. Lèbre, 2007; Opgen-Rhein and Strimmer, 2007), or inferring the values of unknown parameters in mechanistic models (e.g. Barenco *et al.*, 2006; Swameye *et al.*, 2003). The methods used to obtain the data are often subject to significant levels of measurement noise, and so we might expect repetitions of the experiments to yield quantitatively different datasets. However, the costs associated with high-throughput experiments usually mean that the number of technical replicates is restricted, and so it is difficult to quantify the effects of data uncertainty upon the inferences we draw. Clearly, if our aim is to attach biological meaning to our results (for example, by proposing putative pathways), then we need to have some degree of confidence that any conclusions we make are robust to the uncertainty in the data. That is, we need to be sure that what we infer (whether it be the rate constants of a biochemical reaction, the topology of a gene regulatory network or any other unknown quantity or reverse engineered model) is not specific to the particular noisy dataset that we happened to observe.

Bootstrapping is a well-known resampling method that may be used to assess properties (such as the standard error) of an inferred quantity or statistical estimator (Efron, 1979; Efron and Tibshirani, 1993). The process that generated the data is estimated by an approximating distribution from which samples may be drawn. Bootstrap datasets are then obtained from this distribution, and the statistical estimator is calculated for each. This induces a sampling distribution over the estimator, from which we may assess, for example, its variance amongst all of the bootstrap datasets. Previous biological applications of bootstrapping include, to name a few examples, placing confidence intervals on phylogenies (Felsenstein, 1985), assessing the reliability of conclusions drawn from clustering expression data (Kerr and Churchill, 2001), and constructing ‘robust’ estimates of gene networks (Imoto *et al.*, 2005).
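The generic procedure described above can be sketched in a few lines. The following is a minimal illustration only, not the method used later in this article: it assumes (purely for the sake of the example) that the data-generating process is approximated by a normal distribution fitted to the data, and the function name `parametric_bootstrap_se` and the data values are hypothetical.

```python
import random
import statistics

def parametric_bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """Estimate the standard error of `estimator` by a parametric bootstrap.

    Assumption for this sketch: the process that generated `data` is
    approximated by a normal distribution with the sample mean and
    sample standard deviation.
    """
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    estimates = []
    for _ in range(n_boot):
        # Draw one bootstrap dataset of the same size as the original.
        replicate = [rng.gauss(mu, sd) for _ in data]
        # Recompute the estimator on the bootstrap dataset.
        estimates.append(estimator(replicate))
    # The spread of the estimates approximates the estimator's standard error.
    return statistics.stdev(estimates)

# Made-up measurements, used only to exercise the function.
data = [2.1, 1.9, 2.4, 2.0, 2.6, 1.8, 2.2, 2.3]
se_median = parametric_bootstrap_se(data, statistics.median)
```

Any estimator that maps a dataset to a number can be passed in place of `statistics.median`; the collection of bootstrap estimates is the sampling distribution referred to in the text.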

We here consider a parametric bootstrap for time-course data in which the time-dependent process that generated the data is modelled using Gaussian process regression (GPR). In recent years, this Bayesian non-linear regression technique has grown in popularity, and has been applied in several systems biology contexts (Gao *et al.*, 2008; Lawrence *et al.*, 2007; Yuan, 2006). To our knowledge, GPR has not previously been used as a method for bootstrapping time-course data. However, it would seem to be ideally suited to this task, since it provides a method for fitting a plausible probabilistic model that captures the time dependence of the data, and from which it is easy to draw bootstrap samples.

We demonstrate GPR bootstrapping using two examples from the systems biology literature: estimating the parameters of an ordinary differential equation model for the STAT5 signalling pathway (Swameye *et al.*, 2003); and inference of gene regulatory networks in *Arabidopsis thaliana* (Smith *et al.*, 2004).

Below, we first provide an overview of GPR and how it may be used in general as a bootstrapping method (Section 2), and then we describe how the approach may be applied (Section 3). In Section 4, we summarize our findings for two examples, and discuss the implications in Section 5. We conclude by highlighting the importance of bootstrapping in general as a method for assessing the effects of data uncertainty.

## 2 APPROACH

GPR is a Bayesian non-linear regression method, which has been used to good effect in a number of studies (Gao *et al.*, 2008; Lawrence *et al.*, 2007; Yuan, 2006). Formally, a Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian (normal) distribution (Rasmussen and Williams, 2006). A GP is defined by a mean function and a covariance function, which specify the mean vectors and covariance matrices for each finite collection of the random variables. GP theory is discussed in more detail by MacKay (1998) and Rasmussen and Williams (2006), but for completeness and convenience we present an overview of standard GPR theory in addition to our own contribution of how it may be used to perform a parametric bootstrap.

### 2.1 Regression

In a regression problem, we are interested in elucidating the relationship between a collection of covariates or inputs *x*_{1}, …, *x*_{p}, and a continuous dependent output variable, *z*. We assume that *x*_{1}, …, *x*_{p}, *z* are all real-valued, and we write the collection of covariates as a *p*-component column vector, **x**=[*x*_{1}, …, *x*_{p}]^{⊤} ∈ℝ^{p}. It is assumed that there is an unknown deterministic function, *f*, which wholly describes the relationship between *z* and **x**, so that *z*=*f*(**x**). Our aim is therefore to find the function, *f*.

In practice, the methods by which measurements of *z* are obtained introduce experimental noise. We hence define a random variable, *y*, to represent the experimentally observable version of *z*. We assume that *y* may be written as *y*=*z*+ɛ, where ɛ is a noise term. For convenience, we also assume that ɛ ∼ 𝒩(0, σ^{2}) and that ɛ is independent of **x**. For the time being, we consider the case in which the variance, σ^{2}, is known, but shall return later to the problem of how it may be estimated.

To summarize, we have,

$$
y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}). \tag{1}
$$

One way to approach the regression problem is to impose a fixed parametric form on *f* [such as *f*(**x**)=∑_{i=1}^{M}β_{i}φ_{i}(**x**), where the β_{i} are parameters, the φ_{i} are a set of basis functions and *M* ∈ ℕ], and then to estimate its parameters from a set of experimentally obtained observations using methods such as ordinary least squares. An alternative is to recognize that the function, *f*, is unknown, and hence is itself a source of uncertainty; GPR provides us with a means by which to do this.

GPR belongs to a class of approaches known as *non-parametric Bayesian methods*. Such methods can be viewed broadly as providing probability models on function spaces (Müller and Quintana, 2004). Apart from GPs, the other well-known non-parametric Bayesian methods are those based on Dirichlet processes (DPs). These were introduced by Ferguson (1973) and Antoniak (1974), and provide a framework for the probabilistic modelling of unknown probability distributions. That is, rather than assuming that a given sample has been drawn from a probability distribution of known parametric form (but with unknown parameters), DP-based approaches model the uncertainty in the probability distribution itself. In contrast, GP approaches provide a framework for the probabilistic modelling of unknown functions rather than unknown distributions.

### 2.2 GP priors

In GPR, we assume a GP prior for *f*(**x**) with mean function *m* and covariance function *k*. This means the following:

- For any input, **x**_{*} ∈ ℝ^{p}, we regard the value taken by *f*(**x**) at **x** = **x**_{*} to be a random variable. The notation *f*(**x**_{*}) should now be understood to denote this random variable.
- Given a finite collection of covariate vectors, **x**_{1}, …, **x**_{n}, the random variables *f*(**x**_{1}), …, *f*(**x**_{n}) are assumed to be jointly distributed according to a multivariate Gaussian with mean **m** = [*m*(**x**_{1}), …, *m*(**x**_{n})]^{⊤} and covariance matrix (K)_{ij} = *k*(**x**_{i}, **x**_{j}).

We thus write *f*(**x**) ∼ 𝒢𝒫(*m*, *k*).

Note that if we assume the regression model of Equation (1), the GP prior over *f*(**x**) induces a GP prior over the observable outputs *y*(**x**). That is, assuming Equation (1) and that *f*(**x**) ∼ 𝒢𝒫(*m*, *k*), it follows that,

$$
y(\mathbf{x}) \sim \mathcal{GP}(m, l), \tag{2}
$$

where *l*(**x**_{i}, **x**_{j})=*k*(**x**_{i}, **x**_{j})+σ^{2} δ(**x**_{i}, **x**_{j}). Here, δ(**x**_{i}, **x**_{j}) is the standard Kronecker delta function.

### 2.3 From prior to posterior

We now suppose that, having assumed a GP prior *f*(**x**) ∼ 𝒢𝒫(*m*, *k*), we proceed to obtain a set of output measurements *y*_{1}, …, *y*_{r} at the covariate vectors **x**_{1}, …, **x**_{r}. We are interested in determining how we may update our GP prior in light of these observed data. We show below that, given any finite collection **x**_{1}^{*}, …, **x**_{s}^{*} of covariate vectors, the joint conditional probability of the function values *f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*}) given the observations is again described by a multivariate normal. We hence obtain a GP posterior over *f*(**x**).

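We view *y*_{1}, …, *y*_{r} as realizations of the random variable *y*(**x**) at the inputs **x**_{1}, …, **x**_{r}. We know from Equation (2) that,

$$
[y(\mathbf{x}_{1}), \ldots, y(\mathbf{x}_{r})]^{\top} \sim \mathcal{N}(\mathbf{m}_{o}, K_{o}), \tag{3}
$$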

where **m**_{o} = [*m*(**x**_{1}), …, *m*(**x**_{r})]^{⊤} and (*K*_{o})_{ij}=*k*(**x**_{i}, **x**_{j})+σ^{2}δ(**x**_{i}, **x**_{j}). For notational brevity, we henceforth write [**y**(**x**)]^{⊤} to mean [*y*(**x**_{1}), …, *y*(**x**_{r})]^{⊤}.

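Let **x**_{1}^{*}, …, **x**_{s}^{*} be another finite collection of covariate vectors. From our assumption of a GP prior over *f*, together with Equation (3), it is straightforward to see that,

$$
[y(\mathbf{x}_{1}), \ldots, y(\mathbf{x}_{r}), f(\mathbf{x}_{1}^{*}), \ldots, f(\mathbf{x}_{s}^{*})]^{\top} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m}_{o} \\ \mathbf{m}_{*} \end{bmatrix}, \begin{bmatrix} K_{o} & \mathbf{K}_{*} \\ \mathbf{K}_{*}^{\top} & \mathbf{K}_{**} \end{bmatrix} \right), \tag{4}
$$

where **m**_{*}=[*m*(**x**_{1}^{*}), …, *m*(**x**_{s}^{*})]^{⊤}, (**K**_{**})_{ij}=*k*(**x**_{i}^{*}, **x**_{j}^{*}) and (**K**_{*})_{ij}=*k*(**x**_{i},**x**_{j}^{*}).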

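From Equations (3) and (4), and using standard properties of Gaussian distributions (von Mises, 1964), it follows that the function values *f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*}) conditioned on the observed outputs **y** are also jointly distributed according to a multivariate normal. Specifically,

$$
[f(\mathbf{x}_{1}^{*}), \ldots, f(\mathbf{x}_{s}^{*})]^{\top} \mid \mathbf{y} \sim \mathcal{N}(\mathbf{m}_{post}, K_{post}), \tag{5}
$$

where **y**=[*y*_{1}, …, *y*_{r}]^{⊤}, and,

$$
\mathbf{m}_{post} = \mathbf{m}_{*} + \mathbf{K}_{*}^{\top} (K + \sigma^{2} I_{r})^{-1} (\mathbf{y} - \mathbf{m}_{o}), \tag{6}
$$

$$
K_{post} = \mathbf{K}_{**} - \mathbf{K}_{*}^{\top} (K + \sigma^{2} I_{r})^{-1} \mathbf{K}_{*}. \tag{7}
$$

Here, *I*_{r} is the *r* × *r* identity matrix and *K* is the *r* × *r* matrix with entries (*K*)_{ij}=*k*(**x**_{i}, **x**_{j}), so that *K* + σ^{2}*I*_{r} coincides with *K*_{o}.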

Since Equation (5) is true for any *s* ∈ ℕ, it follows that the function outputs, *f*(**x**), conditioned on the observations, **y**, define a GP, which is referred to as the GP posterior.

### 2.4 Using the GP posterior

Equation (5) provides the joint posterior distribution of the function values *f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*}), given the GP prior and the observations, **y**. Since the mean of a Gaussian distribution is also its mode, the maximum *a posteriori* prediction of [*f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*})]^{⊤} is simply the mean vector **m**_{post}. Thus, GPR allows the prediction of *f*(**x**) at any finite collection of covariate vectors, **x**_{1}^{*}, …, **x**_{s}^{*}. The covariance matrix, *K*_{post}, describes the variability of the distribution about the mean, and hence may be used to place confidence intervals around this prediction. Figure 1A illustrates the use of GPR to make predictions and specify confidence intervals.
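As an illustration of how the posterior mean and covariance of Equation (5) may be computed in practice, the following is a minimal sketch for one-dimensional inputs and a zero mean function. It is not the `gpml` implementation: the function names, the particular parameterization of the squared exponential kernel (`sigma_f`, `length`) and the example data are all assumptions made for this sketch, and only the Python standard library is used so that it has no dependencies.

```python
import math

def k_se(x1, x2, sigma_f=1.0, length=1.0):
    """Squared exponential covariance; this parameterization is an assumption."""
    return sigma_f ** 2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length ** 2))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def chol_solve(low, b):
    """Solve (L L^T) v = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (z[i] - sum(low[k][i] * v[k] for k in range(i + 1, n))) / low[i][i]
    return v

def gp_posterior(x_obs, y_obs, x_star, noise_var, kernel=k_se):
    """Posterior mean vector and covariance matrix for a zero-mean GP prior."""
    r, s = len(x_obs), len(x_star)
    # Covariance of the noisy observations: kernel matrix plus noise on the diagonal.
    k_o = [[kernel(x_obs[i], x_obs[j]) + (noise_var if i == j else 0.0)
            for j in range(r)] for i in range(r)]
    low = cholesky(k_o)
    alpha = chol_solve(low, y_obs)  # K_o^{-1} y
    # Column j holds the covariances between all observed inputs and x_star[j].
    k_star_cols = [[kernel(x_obs[i], x_star[j]) for i in range(r)] for j in range(s)]
    solved = [chol_solve(low, col) for col in k_star_cols]  # K_o^{-1} K_*
    m_post = [sum(k_star_cols[a][i] * alpha[i] for i in range(r)) for a in range(s)]
    k_post = [[kernel(x_star[a], x_star[b])
               - sum(k_star_cols[a][i] * solved[b][i] for i in range(r))
               for b in range(s)] for a in range(s)]
    return m_post, k_post

# Made-up observations, used only to exercise the functions.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [0.0, 0.8, 0.9, 0.1]
m_post, k_post = gp_posterior(x_obs, y_obs, x_obs, noise_var=0.01)
m_far, k_far = gp_posterior(x_obs, y_obs, [10.0], noise_var=0.01)
```

An approximate 95% interval around the prediction at the *a*-th test input is then `m_post[a] ± 1.96 * math.sqrt(k_post[a][a])`. Far from the data (`m_far`, `k_far`) the posterior reverts to the prior mean and variance.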

In this article, we are concerned not only with fitting a regressor to the dataset, but also with sampling from the regression model in order to obtain bootstrap datasets. This is similar to the work of Kerr and Churchill (2001), who also generate bootstrap samples by first fitting a model to a set of time-course data (in their case, an ANOVA model). The advantages of GPR are its non-linearity, that it implicitly allows us to model the uncertainty in the underlying function, *f*, and that it is relatively easy to apply. Generating samples from our GP regressor is also fairly simple. We know that the joint posterior distribution of any finite collection, *f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*}), is a multivariate normal [as given in Equation (5)], and hence we may simulate samples using standard methods for such distributions (Press *et al.*, 2007).

Since we are concerned with the generation of plausible datasets, rather than just plausible samples of the underlying function values, it follows that we are actually interested in *y*(**x**) rather than *f*(**x**). However, if we can sample function outputs, *f*(**x**), and if we know (or can estimate) the variance σ^{2}, then we can use Equation (1) in order to obtain samples of *y*(**x**). Thus, in practice, we proceed by first sampling [*f*(**x**_{1}^{*}), …, *f*(**x**_{s}^{*})]^{⊤} from the multivariate normal described by Equations (5), (6) and (7), and then adding Gaussian noise sampled from 𝒩(0, σ^{2}**I**_{s}).

In this study, we generate bootstrap samples at the same points as those at which the data were observed (i.e. we choose *s* = *r* and set **x**_{1}^{*} = **x**_{1}, …, **x**_{s}^{*} = **x**_{r}). Figure 1B provides an example of a number of bootstrap samples obtained using a GP regressor fitted to a gene expression time course.
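The two-step sampling procedure (draw function values from the posterior multivariate normal, then add observation noise) can be sketched as follows. This is a minimal illustration under stated assumptions: the posterior mean `m_post` and covariance `k_post` are taken as given (the numbers below are made up), the multivariate normal is sampled via a Cholesky factor, and the small `jitter` term and the function names are choices made for this sketch.

```python
import math
import random

def cholesky(a, jitter=1e-10):
    """Lower-triangular L with L L^T = a; jitter guards against round-off."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s + jitter) if i == j else s / low[j][j]
    return low

def sample_bootstrap_datasets(m_post, k_post, noise_var, n_boot, seed=0):
    """Draw n_boot datasets y = f + noise, with f ~ N(m_post, k_post)."""
    rng = random.Random(seed)
    low = cholesky(k_post)
    s = len(m_post)
    noise_sd = math.sqrt(noise_var)
    datasets = []
    for _ in range(n_boot):
        # f = m_post + L u, with u a vector of independent standard normals.
        u = [rng.gauss(0.0, 1.0) for _ in range(s)]
        f = [m_post[i] + sum(low[i][k] * u[k] for k in range(i + 1))
             for i in range(s)]
        # Add independent observation noise to obtain a bootstrap dataset.
        datasets.append([f_i + rng.gauss(0.0, noise_sd) for f_i in f])
    return datasets

# Hypothetical posterior mean and covariance at three time points.
m_post = [0.0, 0.8, 0.9]
k_post = [[0.02, 0.01, 0.00],
          [0.01, 0.02, 0.01],
          [0.00, 0.01, 0.02]]
noise_var = 0.01
datasets = sample_bootstrap_datasets(m_post, k_post, noise_var, n_boot=2000)
```

Each element of `datasets` plays the role of one bootstrap replicate of the time course, to which the inference procedure of interest can then be applied.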

### 2.5 The mean and covariance functions

In order to specify a GP prior, it is clearly necessary to provide a mean function, *m*, and a covariance function, *k*. The covariance function is the more important of these, as it describes how we believe the value of the function outputs, *f*(**x**), covary with one another, and hence allows us to express our beliefs about fundamental properties of *f*, such as how rapidly it changes. For the sake of simplicity and parsimony, the mean function is often chosen to be zero, and this is the approach we adopt here. This does not present a serious limitation: as we can see from the regressor in Figure 1A, the mean of the posterior process (represented by the red line) is certainly not constrained to be zero, and we are able to obtain a good fit to the data. Of course, other mean functions may be chosen to express stronger prior beliefs about the underlying function. There are many possible choices for the covariance function, *k*, and we here consider two of the more popular options, the *squared exponential* covariance function,

where | ·| denotes the Euclidean distance; and a standard *Matérn* covariance function,

Here, the constants σ_{g}, σ_{f}, *l*_{1} and *l*_{2} are *hyperparameters*. Although its smoothness properties have been criticized as unrealistic (Stein, 1999), the squared exponential covariance function remains the most frequently used ‘default’ choice for GPR, largely because of its simplicity. There are many other examples of covariance functions (Rasmussen and Williams, 2006, ch. 4), which allow the GP prior to be tailored to specific scenarios. In this article, we employ *k*_{SE} and *k*_{M}, as they are simple, yet sufficiently flexible to allow a good fit to the data.

The hyperparameters of the covariance function provide us with another means to encode our prior beliefs about the nature of *f*. We can see, for example, that if *l*_{1} (or *l*_{2}) is very large, then *f*(**x**_{i}) and *f*(**x**_{j}) will only tend to vary together if |**x**_{i} − **x**_{j}| is small: the value of the function at **x**_{i} will only affect the value at **x**_{j} if **x**_{i} and **x**_{j} are close together. Ideally, we would either use prior knowledge to specify the hyperparameters, or adopt a fully Bayesian approach and integrate them out. However, we are rarely able to express our prior beliefs so precisely, and while a full Bayesian approach is possible (using, for example, Markov chain Monte Carlo (MCMC)), the associated computational expense is often undesirable. This is certainly the case here: in the example presented in Section 3.2 (which we expect may represent a typical application), we are required to fit regressors to 800 gene expression time-course datasets, so we wish to minimize the costs of fitting the GPR model. An alternative and computationally cheaper method is to estimate the hyperparameters in order to maximize the (log) likelihood of the observed data. We also use this approach to estimate the variance of the noise term, σ^{2}, in Equation (1). From Equation (3) and the definition of a multivariate normal, the likelihood of **y** is given by,

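$$
p(\mathbf{y} \mid \theta, \sigma^{2}) = \frac{1}{(2\pi)^{r/2} \det\left(K_{o}(\theta, \sigma^{2})\right)^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{y} - \mathbf{m}_{o})^{\top} K_{o}(\theta, \sigma^{2})^{-1} (\mathbf{y} - \mathbf{m}_{o}) \right),
$$

where θ is the vector of the covariance function's hyperparameters, det(·) denotes the determinant, and we write *K*_{o}(θ, σ^{2}) to make explicit the dependence of *K*_{o} on the hyperparameters. Taking logs and removing constant terms, we deduce that the maximum likelihood values for θ and σ^{2} are given by,

$$
(\hat{\theta}, \hat{\sigma}^{2}) = \operatorname*{arg\,max}_{\theta, \sigma^{2}} \left[ -\tfrac{1}{2} \log \det\left(K_{o}(\theta, \sigma^{2})\right) - \tfrac{1}{2} (\mathbf{y} - \mathbf{m}_{o})^{\top} K_{o}(\theta, \sigma^{2})^{-1} (\mathbf{y} - \mathbf{m}_{o}) \right].
$$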

This optimization can be carried out using standard methods (Press *et al.*, 2007), such as the Nelder–Mead simplex method or gradient descent.
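To make the maximum likelihood step concrete, the sketch below evaluates the negative log likelihood (with constant terms dropped) for a zero-mean GP and picks the best hyperparameters from a small grid. The grid search is a deliberately simple stand-in for the Nelder–Mead or gradient-based optimizers mentioned above; the kernel parameterization, function names, grid values and example data are assumptions of this sketch.

```python
import math
from itertools import product

def k_se(x1, x2, sigma_f, length):
    """Squared exponential covariance (parameterization assumed for this sketch)."""
    return sigma_f ** 2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length ** 2))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def neg_log_likelihood(x, y, sigma_f, length, noise_var):
    """0.5*log det(K_o) + 0.5*y^T K_o^{-1} y for a zero-mean GP (constants dropped)."""
    r = len(x)
    k_o = [[k_se(x[i], x[j], sigma_f, length) + (noise_var if i == j else 0.0)
            for j in range(r)] for i in range(r)]
    low = cholesky(k_o)
    # Forward substitution: solve L z = y, so that y^T K_o^{-1} y = z^T z.
    z = []
    for i in range(r):
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    # log det(K_o) = 2 * sum(log L_ii), hence the half log-determinant below.
    half_log_det = sum(math.log(low[i][i]) for i in range(r))
    return half_log_det + 0.5 * sum(v * v for v in z)

def fit_by_grid(x, y, sigma_f_grid, length_grid, noise_grid):
    """Return the (sigma_f, length, noise_var) triple with the lowest objective."""
    return min(product(sigma_f_grid, length_grid, noise_grid),
               key=lambda h: neg_log_likelihood(x, y, *h))

# Made-up smooth time course, used only to exercise the functions.
x = [0.5 * i for i in range(10)]
y = [math.sin(v) for v in x]
best = fit_by_grid(x, y, [0.5, 1.0, 2.0], [0.1, 0.5, 1.0, 2.0], [0.01, 0.1, 1.0])
```

Minimizing this objective is equivalent to maximizing the log likelihood. In a real application a continuous optimizer would normally replace `fit_by_grid`, with `neg_log_likelihood` as the objective.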

## 3 APPLICATIONS

In order to demonstrate the potential applications of the GPR bootstrap, we consider two examples from the literature: estimating the parameters of a model of the STAT5 signalling pathway, and inferring a gene network.

### 3.1 Parametric ODE modelling of signalling pathways

The JAK-STAT pathway is a well-studied signalling pathway that describes a mechanism by which signals carried by cytokines may be transduced to the cell nucleus via STAT activation, dimerization and relocation to the nucleus (Aaronson and Horvath, 2002; Horvath, 2000). Swameye *et al.* (2003) suggested a number of parametric ODE models to describe the JAK2-STAT5 signalling pathway, the parameters of which were estimated from experimental data. We consider one of the proposed models (taken from Swameye *et al.*, 2003; Supplementary Material) and, using data from the original experiments, apply our GPR bootstrapping approach in order to assign confidence intervals to the parameter estimates.

#### 3.1.1 The Model

The model we consider is as follows,

Here, *v*_{1}, *v*_{2} and *v*_{3} represent the concentrations of (respectively) unphosphorylated STAT5, phosphorylated monomeric STAT5 and phosphorylated dimeric STAT5 in the cytoplasm. The variable *v*_{4} denotes the concentration of STAT5 in the nucleus, and *D* is an experimentally determined quantity (which varies over time) related to the amount of Epo-induced phosphorylation of the EpoR (Swameye *et al.*, 2003). The *r*_{i}'s are parameters (see Swameye *et al.*, 2003; Supplementary Material). The initial values of *v*_{2}, *v*_{3} and *v*_{4} at time *t*=0 are assumed to be zero (since it is supposed that all STAT5 in the cell is initially cytoplasmic and unphosphorylated), while the initial concentration of unphosphorylated cytoplasmic STAT5, *v*_{1}(*t*=0), is treated as an unknown parameter.

The quantities *v*_{1}, *v*_{2}, *v*_{3} and *v*_{4} were not measured individually. Instead, the amount of phosphorylated STAT5 in the cytoplasm, *y*_{1}, and the total amount of cytoplasmic STAT5 (phosphorylated and unphosphorylated), *y*_{2}, were recorded. These can be written in terms of the *v*_{i}'s as follows,

where *r*_{5} and *r*_{6} are two unknown scaling parameters, which must also be estimated. In total, there are thus six unknown parameters in this model [*r*_{1}, *r*_{3}, *r*_{4}, *r*_{5}, *r*_{6} and *v*_{1}(0)].

#### 3.1.2 GPR bootstrapping and parameter estimation

Swameye *et al.* (2003) measured *y*_{1} and *y*_{2} at a number of discrete time points in order to obtain several sets of experimental data. We focus on just one of these datasets (the ‘DATA1_hall’ set, available from the original authors at http://webber.physik.uni-freiburg.de/~jeti/), which we use together with our GPR bootstrapping approach in order to obtain 1500 bootstrapped datasets. To define our GP prior, we choose a zero mean function and the squared exponential function of Equation (8).

In order to learn the hyperparameters and fit the GP regressor to the dataset, we make use of the `gpml` suite of Matlab functions accompanying Rasmussen and Williams (2006), available from http://www.gaussianprocess.org/gpml/

For each of our bootstrapped datasets, we estimate the unknown parameters of the ODE system presented in Equation (10) using the stochastic ranking evolutionary strategy (SRES) of Runarsson and Yao (2000), as implemented in the libSRES C library (Ji and Xu, 2006). This allows us to find the parameter values which minimize the sum of squared differences between the data and the predictions made by the ODE model. This optimization problem is susceptible to the usual difficulty of becoming stuck in a local minimum. The evolutionary nature of SRES goes some of the way toward mitigating this difficulty, but to reduce the impact of becoming stuck in a local minimum yet further, we also run the algorithm for a large number of iterations and rerun eight times for each dataset (taking as our final estimate the ‘best’ amongst these eight runs). Before considering the bootstrapped data, we use SRES to estimate parameter values from the original dataset. The values so obtained are: *v*_{1}(0)=0.996, *r*_{1}=2.43, *r*_{3}=0.256, *r*_{4}=0.303, *r*_{5}=1.27 and *r*_{6}=0.944. For this ‘optimal’ set of parameters the model provides a reasonable fit to the observed data which is comparable with the fit obtained in the original paper (Swameye *et al.*, 2003). The aim of our bootstrapping approach is to determine whether or not these parameter estimates are robust to the uncertainty in the data.
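The general pattern described above (minimize the sum of squared differences, restart the optimizer several times, keep the best run) can be sketched as follows. This is emphatically not SRES and not the STAT5 model: the one-compartment decay model in `simulate`, the simple stochastic hill climber in `local_search`, and all names and numbers are hypothetical stand-ins chosen to keep the example short and dependency-free.

```python
import math
import random

def simulate(rate, v0, times):
    """Hypothetical toy model dv/dt = -rate * v, using its closed-form solution."""
    return [v0 * math.exp(-rate * t) for t in times]

def sse(params, times, data):
    """Sum of squared differences between the data and the model predictions."""
    rate, v0 = params
    return sum((d - p) ** 2 for d, p in zip(data, simulate(rate, v0, times)))

def local_search(start, times, data, rng, n_iter=2000, step=0.1):
    """Simple stochastic hill climbing; a stand-in for SRES, not SRES itself."""
    best, best_err = start, sse(start, times, data)
    for _ in range(n_iter):
        # Propose a perturbed parameter vector, kept strictly positive.
        cand = tuple(max(1e-6, p + rng.gauss(0.0, step)) for p in best)
        err = sse(cand, times, data)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err

def fit_with_restarts(times, data, n_restarts=8, seed=0):
    """Run the optimizer from several random starts and keep the best result."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_restarts):
        start = (rng.uniform(0.01, 5.0), rng.uniform(0.01, 5.0))
        runs.append(local_search(start, times, data, rng))
    return min(runs, key=lambda run: run[1])

# Noise-free synthetic data generated from the toy model itself.
times = [0.0, 0.5, 1.0, 2.0, 4.0]
data = simulate(0.7, 1.5, times)
(best_rate, best_v0), best_err = fit_with_restarts(times, data)
```

In a bootstrap analysis, `fit_with_restarts` would be applied once to each bootstrap dataset, yielding one parameter estimate per dataset.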

### 3.2 Gene network inference

To investigate how our GPR bootstrapping approach may be applied to assess the effects of data uncertainty on the reverse engineering of gene regulatory networks, we consider only relevance networks (Butte *et al.*, 2000) and graphical Gaussian models (GGMs). However, our method could just as easily be applied in order to assess the effects of data uncertainty on network inference methods (such as that of Lèbre, 2007) more generally. We consider temporal expression data for the 800 *A. thaliana* genes from Smith *et al.* (2004), which are provided in the ‘arth800’ dataset of the `R` package ‘GeneNet’ (Schäfer *et al.*, 2006).

#### 3.2.1 Gene relevance networks

Butte *et al.* (2000) introduced the idea of a *gene relevance network*—a type of graphical model in which vertices represent genes and in which we draw an edge between genes *g*_{1} and *g*_{2} if and only if the expressions of *g*_{1} and *g*_{2} are correlated. Thus, relevance networks provide us with a means to represent (linear) dependencies between genes. Correlations are calculated between genes in a pairwise fashion; it is decided whether or not to draw an edge between *g*_{1} and *g*_{2} without reference to the presence or absence of edges between any other genes. To determine whether or not to place an edge between genes *g*_{1} and *g*_{2}, we first calculate the (in our case, Pearson) correlation between their gene expression time courses, square this value to obtain a score *s*, and then place an edge if *s* > *r* for some prespecified threshold value *r*.
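The edge rule described above (square the pairwise Pearson correlation and compare it with a threshold) can be sketched directly. The gene names and expression values below are made up for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def relevance_network(expression, threshold):
    """Return the edges (g1, g2) whose squared correlation exceeds threshold."""
    edges = set()
    # Each pair is scored on its own, without reference to any other edge.
    for g1, g2 in combinations(sorted(expression), 2):
        score = pearson(expression[g1], expression[g2]) ** 2
        if score > threshold:
            edges.add((g1, g2))
    return edges

# Made-up expression time courses for four hypothetical genes.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],    # rises with geneA
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],    # falls as geneA rises
    "geneD": [1.0, -1.0, 0.0, -1.0, 1.0],  # unrelated pattern
}
edges = relevance_network(expression, threshold=0.85)
```

Because the score is the squared correlation, strongly anti-correlated genes (here `geneA` and `geneC`) are connected just as strongly correlated ones are.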

#### 3.2.2 Graphical Gaussian models

GGMs are used to represent dependencies between genes that have been detected by *partial* correlations. In contrast to relevance networks (where a missing edge between two genes indicates marginal independence), the absence of an edge between genes *g*_{1} and *g*_{2} in a GGM means that *g*_{1} and *g*_{2} are *conditionally* independent. We use the R package ‘GeneNet’ in order to infer GGMs from time-course gene expression data according to Opgen-Rhein and Strimmer (2007), and make use of the package's capability to calculate an empirical posterior probability, *p*_{e}(*g*_{1}, *g*_{2}) (Schäfer and Strimmer, 2005), for the existence of the edge between *g*_{1} and *g*_{2}. If *p*_{e}(*g*_{1}, *g*_{2}) > τ, where τ is some prespecified threshold (cut-off) value for the probability, then an edge is drawn between *g*_{1} and *g*_{2}.
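The distinction between marginal and partial correlation can be illustrated with three variables. The sketch below uses the textbook first-order partial correlation formula; it is not the shrinkage-based estimator used by ‘GeneNet’, and the data are constructed by hand so that `y` and `z` are related only through their common dependence on `x`.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(a, b, given):
    """First-order partial correlation of a and b, conditioning on `given`."""
    r_ab = pearson(a, b)
    r_ag = pearson(a, given)
    r_bg = pearson(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1.0 - r_ag ** 2) * (1.0 - r_bg ** 2))

# Hand-built example: y and z each equal x plus an unrelated perturbation.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e1 = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
e2 = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
y = [a + b for a, b in zip(x, e1)]
z = [a + b for a, b in zip(x, e2)]

marginal = pearson(y, z)                # high: both variables track x
partial = partial_correlation(y, z, x)  # close to zero once x is accounted for
```

A relevance network would connect `y` and `z` on the strength of `marginal`, whereas a GGM, which works with partial correlations, would be expected to omit that edge.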

#### 3.2.3 Bootstrapping the data

We apply our GPR bootstrapping approach to the *A. thaliana* data. This dataset comprises measurements taken for 800 genes at 11 times, with two measurements at each time point (Fig. 1A illustrates the data for one of the genes). We proceed as in the previous example, but this time make use of the following covariance function,

where *k*_{SE} and *k*_{M} are as previously described. Using the method of Section 2 we obtain 1000 bootstrap datasets, each one consisting of two measurements at 11 time points for 800 genes.

## 4 RESULTS

### 4.1 Parametric ODE modelling of signalling pathways

For each of our 1500 bootstrapped datasets, we find the ‘optimal’ set of parameters using SRES. This induces a joint sampling distribution over the optimal parameters for the ODE model, whose marginals are represented by the histograms in Figure 2.

**...**

Note that the joint sampling distribution is conceptually very different to the joint posterior parameter distribution that might be sought using a Bayesian approach: in the former, we find a single parameter estimate for each of a large number of sampled datasets, whereas in the latter we first specify a prior distribution over the parameters and then seek to update this in light of observed data.

Figure 2 shows that the marginal sampling distributions are generally quite narrow. This can be quantified by calculating the coefficient of variation, *c*_{v}, for each of the parameters, *c*_{v}(*v*_{1}(0))=0.0338, *c*_{v}(*r*_{1})=0.175, *c*_{v}(*r*_{3})=1.77, *c*_{v}(*r*_{4})=0.214, *c*_{v}(*r*_{5})=0.0914 and *c*_{v}(*r*_{6})=0.0434. The coefficient of variation for *r*_{3} is significantly greater than for the other parameters. This is due to the influence of bootstrap samples for which the optimal estimate for *r*_{3} was ∼5 (see the red bar in Fig. 2). Indeed, across all of the bootstrap samples, there appear to be two distinct sets of estimated optimal parameter values. The first (much larger) set comprises estimates centred around the values obtained from the original dataset. The second set comprises estimates for which *r*_{3} ≈ 5, and contains only 28 elements. Although obtained for just a small number of the bootstrap samples (≈2%), the parameter estimates in this second set still provide a good fit to the original data (see Supplementary Fig. 1). To quantify this, the average mean square error (MSE) obtained using parameter estimates from the second set was 0.10 for the fit to the *y*_{1} values, and 0.031 for the fit to the *y*_{2} values. By comparison, the MSE obtained using the parameter values estimated from the original data was 0.026 for the fit to the *y*_{1} values, and 0.0043 for the *y*_{2} values. This suggests that parameter estimates from the second set provide (on average) a slightly worse fit overall, but do a marginally better job of fitting *y*_{2} than those derived from the original dataset.
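The effect that a small second cluster of estimates has on the coefficient of variation can be reproduced with a toy calculation. The numbers below are invented for illustration and are not the bootstrap estimates reported above; the sketch also assumes the sample (rather than population) standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute value of the mean."""
    return statistics.stdev(values) / abs(statistics.mean(values))

# Invented estimates: one tight cluster, and the same kind of cluster
# contaminated by a few values from a distant second mode.
tight = [0.24, 0.25, 0.26, 0.25, 0.24, 0.26]
with_second_mode = [0.25] * 98 + [5.0] * 2

cv_tight = coefficient_of_variation(tight)
cv_bimodal = coefficient_of_variation(with_second_mode)
```

Even when only 2% of the values fall in the second mode, the coefficient of variation grows by well over an order of magnitude, which is why a large *c*_{v} is worth following up by inspecting the histogram itself.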

### 4.2 Gene network inference

We start by considering the inferred GGMs. Let *N*_{O}(τ) denote the network inferred from the original dataset using threshold τ, and similarly let *N*_{B}^{(i)}(τ) be the network inferred from the *i*-th bootstrap dataset. To assess how similar the bootstrapped networks are to *N*_{O}(τ), we calculate the proportion, ρ^{(i)}(τ), of edges in the original network that also appear in *N*_{B}^{(i)}(τ). We hence obtain a sampling distribution over ρ^{(i)}(τ) for a given τ. In addition to considering the data sampled using GPR bootstrapping, we also performed a non-parametric bootstrap of the data, and calculated a sampling distribution over ρ^{(i)}(τ) based upon 1000 (non-parametric) bootstrap datasets. Histograms describing the sampling distributions obtained for different values of τ are presented in Figure 3.

Figure 3. Histograms of the sampling distributions over ρ^{(i)}(τ) for different values of τ. Black: GPR bootstrap. White: non-parametric bootstrap.

In each case, the sampling distribution is approximately normal. For smaller values of τ, the mean of the sampling distribution over ρ^{(i)}(τ) is greater than for larger values. This is unsurprising: as τ gets smaller, the value of the empirical posterior probability required for an edge gets lower (until at the extreme case, τ=0, edges are drawn between *all* vertices, regardless of the data). Thus, for smaller values of τ, the sensitivity to the data is reduced. This has the effect of making the inferred network more robust to perturbations in the data, but unfortunately also makes the network increasingly meaningless. For more meaningful values of τ (say, of around 0.85 or above), the degree of similarity between the bootstrapped networks and *N*_{O}(τ) [as measured by the mean value of ρ^{(i)}(τ)] is disappointingly low. This suggests that the original inferred GGM is highly sensitive to uncertainty in the data.
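The similarity measure ρ^{(i)}(τ) is simply the fraction of edges in the original network that reappear in a bootstrapped network. A minimal sketch, treating edges as undirected and using made-up networks and hypothetical names, is:

```python
def edge_recovery(original_edges, bootstrap_edges):
    """Proportion of edges in the original network found in a bootstrap network."""
    # Edges are undirected, so (g1, g2) and (g2, g1) are treated as the same edge.
    original = {frozenset(edge) for edge in original_edges}
    bootstrap = {frozenset(edge) for edge in bootstrap_edges}
    if not original:
        raise ValueError("the original network has no edges")
    return len(original & bootstrap) / len(original)

# Made-up networks on four hypothetical genes.
original = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]
bootstrap_networks = [
    [("g2", "g1"), ("g2", "g3"), ("g3", "g4"), ("g2", "g4")],
    [("g1", "g2"), ("g1", "g4"), ("g1", "g3")],
    [("g3", "g4")],
]
rhos = [edge_recovery(original, net) for net in bootstrap_networks]
mean_rho = sum(rhos) / len(rhos)
```

The list `rhos` is the sampling distribution whose histogram and mean are discussed in the text. Note that the measure ignores extra edges present only in the bootstrapped network.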

We repeated the above analysis for relevance networks, and obtained similar results (see Supplementary Material). To summarize, the mean values of ρ^{(i)}(*r*) for *r* values of 0.95, 0.85, 0.75, 0.65, 0.55 and 0.45 were (respectively) 0.062, 0.18, 0.29, 0.42, 0.51 and 0.61. Yet again, these values are low, indicating that the topology of relevance networks is sensitive to uncertainty in the data (note that *r* and τ are not directly comparable, so it is difficult to compare the relative tolerance to data uncertainty of relevance networks and GGMs). Although the overall topology of the network seems to be sensitive to data uncertainty, there are individual edges that demonstrate a much higher degree of tolerance. For example, taking *r* to be 0.85, we may look for edges that appear in 100% of the networks obtained from the bootstrapped datasets. If we do this, then we find 16 edges connecting 13 vertices, as shown in Figure 4. We can use this approach more generally to construct networks that have a required level of tolerance to data uncertainty, omitting any edges that do not appear in at least *q*% of the bootstrap samples. In this way, we can construct ‘high confidence’ relevance networks.
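Constructing a ‘high confidence’ network amounts to counting how often each edge occurs across the bootstrapped networks and keeping those that reach the required percentage *q*. A minimal sketch, with made-up networks and hypothetical names, is:

```python
from collections import Counter

def high_confidence_edges(bootstrap_networks, q):
    """Edges that appear in at least q% of the bootstrap networks."""
    counts = Counter()
    for network in bootstrap_networks:
        # Count each undirected edge at most once per network.
        counts.update({frozenset(edge) for edge in network})
    needed = q / 100.0 * len(bootstrap_networks)
    return {tuple(sorted(edge)) for edge, n in counts.items() if n >= needed}

# Made-up bootstrap networks on four hypothetical genes.
networks = [
    [("g1", "g2"), ("g2", "g3"), ("g3", "g4")],
    [("g2", "g1"), ("g2", "g3")],
    [("g1", "g2"), ("g3", "g4")],
    [("g1", "g2"), ("g2", "g3"), ("g1", "g4")],
]
always = high_confidence_edges(networks, q=100)
often = high_confidence_edges(networks, q=75)
```

Setting `q=100` corresponds to the strictest case discussed above, in which an edge must appear in every bootstrapped network; lowering `q` trades robustness for coverage.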

## 5 DISCUSSION

Our results highlight the necessity of accounting for data uncertainty when trying to draw conclusions from experimentally obtained data. In the case of the parametric ODE model of the JAK2-STAT5 signalling pathway, we showed that in addition to the parameter estimates obtained from the original dataset, there is a second set of plausible estimates which (had stochastic effects provided us with a slightly different dataset) we may well have concluded were the ‘correct’ values. We believe that the presence of the distinct second set of plausible parameter estimates indicates that the error function (i.e. the sum of squared differences between the data and the model predictions) that was minimized in order to fit the model to the original data possesses a second (local) minimum. As well as demonstrating the importance of taking into account the noise in experimental datasets, our results could also be viewed as an endorsement for Bayesian methods, which do not seek to identify a single optimal parameter set, but instead approximate the whole posterior parameter distribution.

The results from our network inference investigation perhaps provide an even better illustration of the effect that data uncertainty can have upon inference. Our approach demonstrates that the inference of an edge between two vertices is highly sensitive to the level of noise in the data, and hence it is likely that the false positive rate for each individual edge is high. We also showed how GPR bootstrapping may be used to construct ‘high confidence’ relevance networks for which we would expect the false positive rate to be lower.

## 6 CONCLUSION

Determining the effects of uncertainty in experimental data is imperative if we are to have any degree of confidence in the conclusions that we draw or the models that we reverse engineer from biological data. GPR bootstrapping is a widely applicable and easily implemented approach that allows us to investigate and quantify these effects. We have illustrated the use of GPR bootstrapping using two examples, and discussed the impact that data uncertainty has upon inference. Although we have here concentrated on time-course data, our approach could easily be applied in situations where the independent variable is something other than time. Given the current levels of noise in post-genomic data, approaches such as GPR bootstrapping are vital in order to allow us to make the most of currently available information and to provide us with a means to assess the conclusions we draw.

*Funding:* Wellcome Trust (080713/Z/06/Z).

*Conflict of Interest:* none declared.

## REFERENCES

- Aaronson DS, Horvath CM. A road map for those who don't know JAK-STAT. Science. 2002;296:1653–1655. [PubMed]
- Antoniak CE. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 1974;2:1152–1174.
- Barenco M, et al. Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biol. 2006;7:R25. [PMC free article] [PubMed]
- Butte AJ, et al. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA. 2000;97:12182–12186. [PMC free article] [PubMed]
- Efron B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979;7:1–26.
- Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
- Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39:783–791.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973;1:209–230.
- Gao P, et al. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics. 2008;24:i70–i75. [PubMed]
- Horvath CM. STAT proteins and transcriptional responses to extracellular signals. Trends Biochem. Sci. 2000;25:496–502. [PubMed]
- Imoto S, et al. Residual bootstrapping and median filtering for robust estimation of gene networks from microarray data. In: Computational Methods in Systems Biology. Vol. 3082. Berlin: Springer-Verlag; 2005. pp. 149–160.
- Ji X, Xu Y. libSRES: a C library for stochastic ranking evolution strategy for parameter estimation. Bioinformatics. 2006;22:124–126. [PubMed]
- Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA. 2001;98:8961–8965. [PMC free article] [PubMed]
- Lawrence ND, et al. Modelling transcriptional regulation using Gaussian processes. In: Schölkopf B, et al., editors. Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press; 2007. pp. 785–792.
- Lèbre S. Inferring dynamic genetic networks with low order independencies. arXiv.org. 2007:1–36. arXiv:0704.2551v4. [PubMed]
- MacKay DJC. Introduction to Gaussian processes. In: Bishop CM, editor. Neural Networks and Machine Learning. Berlin: Springer; 1998. pp. 133–165. NATO ASI Series.
- Müller P, Quintana FA. Nonparametric Bayesian data analysis. Stat. Sci. 2004;19:95–110.
- Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst. Biol. 2007;1:37. [PMC free article] [PubMed]
- Press WH, et al. Numerical Recipes: The Art of Scientific Computing. New York: Cambridge University Press; 2007.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). Cambridge, MA: The MIT Press; 2006.
- Runarsson TP, Yao X. Stochastic ranking for constrained evolutionary optimization. IEEE Trans. Evol. Comput. 2000;4:284–294.
- Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21:754–764. [PubMed]
- Schäfer J, et al. Reverse engineering genetic networks using the GeneNet package. R News. 2006;6:50–53.
- Smith SM, et al. Diurnal changes in the transcriptome encoding enzymes of starch metabolism provide evidence for both transcriptional and posttranscriptional regulation of starch metabolism in Arabidopsis leaves. Plant Physiol. 2004;136:2687–2699. [PMC free article] [PubMed]
- Stein ML. Interpolation of Spatial Data: Some Theory for Kriging (Springer Series in Statistics). New York: Springer; 1999.
- Swameye I, et al. Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling. Proc. Natl Acad. Sci. USA. 2003;100:1028–1033. [PMC free article] [PubMed]
- von Mises R. Mathematical Theory of Probability and Statistics. New York: Academic Press; 1964.
- Yuan M. Flexible temporal expression profile modelling using the Gaussian process. Comput. Stat. Data Anal. 2006;51:1754–1764.

**Oxford University Press**

Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data. Bioinformatics. 2009 May 15; 25(10):1300.