Gaussian Process Based Expected Information Gain Computation for Bayesian Optimal Design

Optimal experimental design (OED) is of great significance in efficient Bayesian inversion. A popular choice of OED methods is based on maximizing the expected information gain (EIG), where expensive likelihood functions are typically involved. To reduce the computational cost, in this work, a novel double-loop Bayesian Monte Carlo (DLBMC) method is developed to efficiently compute the EIG, and a Bayesian optimization (BO) strategy is proposed to obtain its maximizer only using a small number of samples. For Bayesian Monte Carlo posed on uniform and normal distributions, our analysis provides explicit expressions for the mean estimates and the bounds of their variances. The accuracy and the efficiency of our DLBMC and BO based optimal design are validated and demonstrated with numerical experiments.


Introduction
As acquiring data in experiments is generally expensive and time-consuming, maximizing the informativeness of experimental data is of crucial importance. For example, in the area of climate science, a complex system with various stochastic inputs is integrated to represent the real climate situation. Usually, the locations and times at which sensors are placed to collect climate observations are optional, and with limited resources, careful selection of sensor placements is required (see Reference [1] for a detailed discussion). Many works focus on finding experimental data carrying more information, and this topic is usually referred to as optimal experimental design (OED) [2]. In this paper, we consider the OED problem in the context of the Bayesian inverse problem. Given a forward problem, the inverse problem is to infer parameters inherent to the forward model through a set of design points and the corresponding responses. In the Bayesian setting, the parameters of interest are viewed as random variables, and hence the posterior distribution of the parameters can be obtained via Bayes' rule [3][4][5][6]. In linear cases, the OED problem includes various criteria such as the D-optimal design criterion and the A-optimal design criterion. The D-optimal design criterion seeks to maximize the determinant of the information matrix of the design, whereas the A-optimal design criterion minimizes the trace of the inverse of the information matrix, which amounts to minimizing the average variance [7][8][9].
We focus on the decision theoretic approach that maximizes the expectation of the Kullback-Leibler (KL) divergence from the posterior distribution to the prior distribution [10]. This decision theoretic approach is a nonlinear generalization of the Bayesian D-optimal criterion [11]. The objective function we want to maximize, the expected KL divergence, is also referred to as the expected information gain (EIG) over the parameters. Computing the EIG is usually a challenging task.

Formulation of Experimental Design
In this section, we review the setting of the Bayesian optimal design problem following the presentation in [14]. In the Bayesian setting, the unknown parameters are viewed as random variables. Let (Ω, F, P) be a probability space, where Ω is a sample space, F is a σ-field, and P is the probability measure on (Ω, F). Let θ : Ω → R^{n_θ} denote the parameters of interest, where n_θ is the dimension of the unknown parameters. Assume that θ is associated with a prior measure µ on R^{n_θ} satisfying µ(A) = P(θ^{-1}(A)) for measurable A ⊂ R^{n_θ}. Throughout this paper, we assume that all the random variables have densities with respect to the Lebesgue measure. Let d ∈ D ⊂ R^{n_d} denote the design variable, where n_d is the number of design variables and D denotes the design space. Let y ∈ R^{n_y} denote the response associated with d, where n_y is the dimension of the response. The inference of θ is obtained from the prior distribution and observations via Bayes' rule,

p(θ|y, d) = p(y|θ, d) p(θ) / p(y|d) .   (1)

The likelihood function is often determined by a deterministic forward model together with a statistical model for the measurement noise. Here we model the relation between the design variable and the observation by a deterministic model G(θ, d) and additive Gaussian noise,

y = G(θ, d) + ε ,  ε ∼ N(0, σ² I) ,

where G is the forward model. In many practical problems, the forward model is computationally expensive and its explicit form is not given, so we view it as a black box whose internal structure is unknown but which returns noisy responses for given design variables and parameters. Following the decision theoretic approach [10], we set the utility function to the KL divergence from the posterior distribution to the prior distribution,

u(d, y) := D_KL( p(θ|y, d) ‖ p(θ) ) = ∫ p(θ|y, d) log [ p(θ|y, d) / p(θ) ] dθ ,

which is independent of θ, since θ is integrated out.
Noting that u(d, y) is a function of both d and y, we further take the expectation of u over y to define the expected information gain (EIG):

U(d) := ∫_Y u(d, y) p(y|d) dy .

The optimal experimental design problem is then to find a design point that maximizes the expected utility, that is,

d* = arg max_{d∈D} U(d) .   (4)

A double-loop Monte Carlo (DLMC) estimator of the EIG is proposed in [30]. Rewriting U(d) as

U(d) = ∫_Y ∫_Θ p(y|θ, d) p(θ) [ log p(y|θ, d) − log p(y|d) ] dθ dy ,

and noting that p(θ|d) = p(θ), since the specification of d alone provides no further information about θ, the DLMC method approximates U(d) as

U(d) ≈ (1/n_out) Σ_{i=1}^{n_out} [ log p(y^(i)|θ^(i), d) − log p(y^(i)|d) ] ,   (5)

where θ^(i) are drawn from the prior p(θ) and y^(i) are drawn from the conditional distribution p(y|θ = θ^(i), d) (i.e., the likelihood). The evidence p(y^(i)|d) can then be estimated via the importance sampling technique,

p(y^(i)|d) ≈ (1/n_in) Σ_{j=1}^{n_in} p(y^(i)|θ^(i,j), d) ,   (6)

where θ^(i,j) are drawn from the prior. Combining (6) and (5) yields a biased estimator of U(d). However, if we sample a fresh set {θ^(i,j)}_{j=1}^{n_in} for every y^(i) (i = 1, . . . , n_out), the complexity of this method is O(n_out n_in). To reduce the computational cost, a sample reuse technique is employed in [14]. That is, for every d, we draw a single batch {θ^(k)}_{k=1}^{n_out} from the prior and use this set for both the outer and the inner Monte Carlo loops (i.e., θ^(i,j) = θ^(j)). Consequently, the number of forward model evaluations is reduced to O(n_out).
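The DLMC estimator with sample reuse can be sketched as below. This is a minimal illustration, not the paper's implementation; the forward model, prior, and the linear Gaussian sanity check (whose EIG is known in closed form for this toy model) are chosen here for demonstration only.

```python
import numpy as np

def dlmc_eig(forward, d, prior_sampler, sigma, n=2000, rng=None):
    """DLMC estimate of the EIG with the sample reuse technique of [14]:
    one batch of prior samples serves both the outer and inner loops."""
    rng = rng or np.random.default_rng(0)
    theta = prior_sampler(rng, n)              # theta^(k) ~ p(theta)
    g = forward(theta, d)                      # forward model outputs
    y = g + sigma * rng.standard_normal(n)     # y^(i) ~ p(y | theta^(i), d)
    # log p(y^(i) | theta^(i), d); the Gaussian normalizing constant
    # cancels against the evidence term below, so it is omitted
    log_lik = -0.5 * ((y - g) / sigma) ** 2
    # evidence p(y^(i) | d) by reusing the same theta batch in the inner loop
    diff = (y[:, None] - g[None, :]) / sigma
    log_ev = np.log(np.mean(np.exp(-0.5 * diff ** 2), axis=1))
    return np.mean(log_lik - log_ev)

# Sanity check on a linear Gaussian toy model y = d*theta + eps,
# theta ~ N(0,1), eps ~ N(0, sigma^2), whose exact EIG is 0.5*log(1 + d^2/sigma^2)
sigma, d = 0.1, 0.5
est = dlmc_eig(lambda t, d: d * t, d, lambda rng, n: rng.standard_normal(n), sigma)
exact = 0.5 * np.log1p(d ** 2 / sigma ** 2)
```

Reusing one batch makes the inner-loop samples correlated with the outer samples, which introduces the bias discussed above but removes the O(n_out n_in) cost.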

GP Based Framework for Bayesian Optimal Design
Our main framework for Bayesian optimal design is based on two powerful tools built on Gaussian processes: the Bayesian Monte Carlo (BMC) method and the Bayesian optimization (BO) method. In this section, we first review BMC and analyze it for the normal and the uniform distributions, and then present our novel double-loop Bayesian Monte Carlo (DLBMC) method for the EIG. After that, we present our BO strategy for finding the maximizer of the approximated EIG. Finally, we review the classical Markov chain Monte Carlo (MCMC) method for Bayesian parameter inference.

Bayesian Monte Carlo
Consider the integral I := ∫_X f(x) p(x) dx, where f is the integrand, p is a probability density, and X is the support of x, that is, the integration domain. The idea of BMC is to formulate the integration problem as a Bayesian inference problem by placing a prior over the integrand f and obtaining the posterior distribution of f conditioned on the collected data. A natural way of putting a prior over a function is through a Gaussian process, which is completely characterized by its mean function µ(x) and kernel function k(x, x'), that is, f ∼ GP(µ(·), k(·, ·)). A commonly used choice for the kernel function is the Gaussian kernel (also known as the squared exponential kernel),

k(x, x') = σ_f² exp( −‖x − x'‖²₂ / (2l²) ) ,

where both σ_f and l are hyperparameters of the kernel function. The choice of hyperparameters affects the result to a large extent, and therefore the hyperparameters need to be determined carefully. Having collected noisy observations f := [f^(1), . . . , f^(N)]^T at inputs {x^(i)}_{i=1}^N, a Gaussian process provides a posterior distribution for f at an arbitrary new input. In some special cases (for example, when x is Gaussian and the kernel is a Gaussian kernel [25]), the GP allows us to estimate the integral in closed form, with posterior mean

E[I] = z^T K^{-1} f ,   (7)

where z := ∫_X k_N(x) p(x) dx with k_N(x) := [k(x, x^(1)), . . . , k(x, x^(N))]^T and K is the N × N Gram matrix with K_{ij} = k(x^(i), x^(j)), and with posterior variance

Var[I] = ∫_X ∫_X k(x, x') p(x) p(x') dx dx' − z^T K^{-1} z .   (8)

Note that in the above expressions for the posterior mean and variance, once the hyperparameters and the kernel function are given, z and K are determined by {x^(i)}_{i=1}^N alone and are independent of the observations {f^(i)}_{i=1}^N. Therefore, the computation of the posterior mean and variance of the BMC estimator proceeds in two steps: first, an input sample set {x^(i)}_{i=1}^N is generated, and z and K are computed; second, the corresponding observation set {f^(i)}_{i=1}^N is collected, and the posterior mean and variance are computed through (7) and (8) respectively. Next, when x is a uniform or Gaussian random vector, we provide detailed derivations for the posterior mean and variance of the BMC estimators.
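The Gaussian-input case can be sketched as follows. This is a 1-D illustration with standard normal input, where the kernel mean z has the closed form noted in [25]; the test integrand cos(x) and the hyperparameter values are illustrative choices, not the paper's.

```python
import numpy as np

def bmc_mean_gaussian(x, f, sf=1.0, ell=1.0, jitter=1e-6):
    """Posterior mean (7) of BMC for I = ∫ f(x) N(x; 0, 1) dx in 1-D,
    zero-mean GP prior with Gaussian kernel; z is available in closed form."""
    K = sf**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    # z_i = ∫ k(x, x_i) N(x; 0, 1) dx  (Gaussian kernel times Gaussian density)
    z = sf**2 * np.sqrt(ell**2 / (ell**2 + 1.0)) \
        * np.exp(-x**2 / (2.0 * (ell**2 + 1.0)))
    # a small diagonal jitter keeps the Gram matrix numerically invertible
    return z @ np.linalg.solve(K + jitter * np.eye(len(x)), f)

rng = np.random.default_rng(1)
x = rng.standard_normal(30)
est = bmc_mean_gaussian(x, np.cos(x))
# exact value for comparison: ∫ cos(x) N(x; 0, 1) dx = exp(-1/2)
```

Because z and K depend only on the inputs, they can be precomputed before any integrand values are observed, exactly as the two-step procedure above describes.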
We note that Theorem 1 is presented in References [22,25], but we give the above detailed proof for completeness. Theorem 2 (Bayesian Monte Carlo for the uniform distribution). Consider the integral I = ∫_X f(x) p(x) dx, where x is a random vector uniformly distributed in the hypercube X = [l_1, r_1] × · · · × [l_n, r_n]. The prior mean function is assumed to be the zero function, and the kernel is assumed to be the Gaussian kernel k(x, x') = σ_f² exp(−‖x − x'‖²₂/(2l²)) with predetermined hyperparameters σ_f and l. Having collected noisy observations f = [f^(1), . . . , f^(N)]^T, the posterior mean of BMC is z^T K^{-1} f and the variance of BMC is bounded above by σ_f² − z^T K^{-1} z, where the components of z are given by

z_i = σ_f² Π_{j=1}^n ( √(2π) l / (r_j − l_j) ) [ Φ(r_j; x_j^(i), l) − Φ(l_j; x_j^(i), l) ] ,

for i = 1, . . . , N, with Φ(t; µ, σ) denoting the cumulative distribution function (CDF) of the Gaussian distribution N(µ, σ²) evaluated at t.
Proof. The components of z can be computed analytically: for i = 1, . . . , N,

z_i = ∫_X k(x, x^(i)) p(x) dx = σ_f² Π_{j=1}^n (1/(r_j − l_j)) ∫_{l_j}^{r_j} exp( −(x_j − x_j^(i))² / (2l²) ) dx_j = σ_f² Π_{j=1}^n ( √(2π) l / (r_j − l_j) ) [ Φ(r_j; x_j^(i), l) − Φ(l_j; x_j^(i), l) ] ,

where Φ(t; µ, σ) is the CDF of the Gaussian distribution N(µ, σ²) and v_j denotes the j-th component of the vector v. However, since the double integral of the Gaussian kernel over the hypercube, ∫_X ∫_X k(x, x') p(x) p(x') dx dx', has no closed form, the variance of the estimator cannot be obtained analytically, and we instead give an upper bound of the variance: since k(x, x') ≤ σ_f² for all x, x',

Var[I] ≤ σ_f² − z^T K^{-1} z .
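The kernel-mean components of Theorem 2 can be computed with the standard normal CDF, as sketched below for the 1-D case X = [0, 1]. The integrand x² and the hyperparameter values are illustrative assumptions.

```python
import math

import numpy as np

def std_normal_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def bmc_mean_uniform(x, f, sf=1.0, ell=0.2, jitter=1e-8):
    """Posterior mean of BMC for I = ∫_0^1 f(x) dx (uniform density on [0,1]),
    with z_i computed via Gaussian CDFs as in Theorem 2 (1-D case)."""
    K = sf**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    # z_i = sf^2 * sqrt(2*pi)*ell * [Phi((1 - x_i)/ell) - Phi((0 - x_i)/ell)]
    z = np.array([
        sf**2 * math.sqrt(2.0 * math.pi) * ell
        * (std_normal_cdf((1.0 - xi) / ell) - std_normal_cdf((0.0 - xi) / ell))
        for xi in x
    ])
    return z @ np.linalg.solve(K + jitter * np.eye(len(x)), f)

x = np.linspace(0.0, 1.0, 10)
est = bmc_mean_uniform(x, x**2)
# exact value for comparison: ∫_0^1 x^2 dx = 1/3
```

In higher dimensions the same per-coordinate CDF differences are multiplied across the n coordinates, as in the theorem.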

Estimating the Expected Information Gain Using Double-Loop BMC
In general, the value of the EIG has no closed form and has to be approximated via numerical methods. Based on the idea of BMC for efficiently evaluating integrals, we develop a double-loop BMC (DLBMC) scheme to approximate the EIG.
Letting e(y, d) := p(y|d) denote the evidence, we first consider the straightforward calculation of U(d) for any fixed d. To compute U(d), samples {y^(i)}_{i=1}^{n_out} are needed, where n_out denotes the sample size of the outer layer. To compute h^(i) := e(y^(i), d), samples {θ^(i,j)}_{j=1}^{n_in} are needed, where θ^(i,j) are generated from the prior. We propose using the BMC method to evaluate the integrals defining h^(i) and U. So far, there are two problems. First, the computational complexity for the forward model is O(n_in n_out), which grows fast as the problem dimension increases. Second, since we usually have no prior knowledge of the support Y of y, we cannot sample {y^(i)}_{i=1}^{n_out} uniformly. To overcome these obstacles, we employ the sample reuse technique [14] that sets θ^(i,j) = θ^(j), which reduces the computational complexity to O(n_in). Besides, this allows us to generate the samples {θ^(j)}_{j=1}^{n_in} in advance and use the corresponding forward model outputs to estimate Y: supposing the forward model values are given by {G^(j) = G(d, θ^(j))}_{j=1}^{n_in}, Y can be approximated by the (componentwise) range spanned by these outputs. Intuitively speaking, the approximated Y is slightly smaller than the actual support Y, and consequently a bias is induced in the estimator. With increased sample size n_in, Y can be captured more accurately and the bias is reduced.
In the process of estimating U, we propose using the BMC method to compute both the inner evidence integrals h^(i) and the outer integral U. Since two layers of integration are involved, let the hyperparameters of BMC for the inner layer and the outer layer be {l_in, (σ_f)_in} and {l_out, (σ_f)_out} respectively. As discussed for BMC in (7)-(8), z and K can be computed once the input sample set {x^(i)}_{i=1}^N is given. Therefore, for computational efficiency, after {θ^(j)}_{j=1}^{n_in} and {y^(i)}_{i=1}^{n_out} are generated, we can compute {z_in, K_in} and {z_out, K_out} ahead of time. Taking the prior to be the standard normal distribution for example, the components of z_in and K_in are given by, for j, k = 1, . . . , n_in,

(z_in)_j = (σ_f)_in² ( l_in² / (l_in² + 1) )^{n_θ/2} exp( −‖θ^(j)‖²₂ / (2(l_in² + 1)) ) ,  (K_in)_{jk} = (σ_f)_in² exp( −‖θ^(j) − θ^(k)‖²₂ / (2 l_in²) ) .

Details of our DLBMC method for estimating U are summarized in Algorithm 1. Note that we only use the mean estimates of the DLBMC estimator in the following. The variance of DLBMC is potentially useful, but as discussed in Section 3.1, the variance of BMC typically does not have a closed form, and we are not able to derive a closed form for the variance of DLBMC in this work; we will consider it in our future work. It is also possible to consider other numerical integration methods to compute the EIG, for example, sparse grid quadrature rules [31,32] and their combination with physical model reduction techniques [33], but these are out of the scope of this paper.

Bayesian Optimization
The ultimate goal of the optimal experimental design problem is to find the optimizer d* in (4). Since computing the function value U(d) and the gradient ∇U is prohibitively expensive, it is challenging to apply function-value-based or gradient-based optimization methods to this problem. As the Bayesian optimization (BO) method [27,28,34,35] typically requires only a small objective function evaluation budget [36] and does not require any gradient information, it is well suited to this problem. In this section, we give a brief review of BO and apply it to obtain the maximizer of the EIG (4).
To compute the maximizer of the EIG U : R^{n_d} → R (see (4)), for a given maximum number of iterations t_max (the evaluation budget), Bayesian optimization begins by putting a GP prior on U, U ∼ GP(µ_0(d), k(d, d')), then randomly chooses an initial point d_1 and collects the corresponding response U_1 = U(d_1). Next, the posterior mean function µ_1(d) and variance function σ_1(d) are updated via the collected data set S_1 = {(d_1, U_1)}. Usually d_1 alone is inadequate to find the maximum, and therefore we need a strategy to choose the next design point. Typically, the next point is determined by maximizing an acquisition function A, that is, at the t-th iteration, d_{t+1} = arg max_d A(d|S_{1:t}). After the next point d_2 is obtained, we sample the objective function to get U_2 = U(d_2). The collected data set is then augmented as S_{1:2} = {(d_1, U_1), (d_2, U_2)}, and the posterior mean function µ_2(d) and variance function σ_2(d) are updated. This procedure repeats until the given maximum budget t_max is reached.
In the case that D is infinite, finding the next design point d_{t+1} = arg max_{d∈D} A(d|S_{1:t}) is demanding. However, performing a global search over a discretized space is usually effective [27,37], since we assume that evaluating U is more costly than computing the GP surrogate. Therefore, the design space D is discretized over equidistant grids, and we denote the discretized design space by D̄. Supposing we have collected the data set S_{1:t}, the posterior of U is a GP with mean µ_t(d) and variance σ_t²(d),

µ_t(d) = k_t(d)^T ( K_t + σ_n² I )^{-1} U_{1:t} ,   (11)
σ_t²(d) = k(d, d) − k_t(d)^T ( K_t + σ_n² I )^{-1} k_t(d) ,   (12)

where k_t(d) := [k(d, d_1), . . . , k(d, d_t)]^T, K_t is the t × t Gram matrix with (K_t)_{ij} = k(d_i, d_j), U_{1:t} := [U_1, . . . , U_t]^T, and σ_n² is a small noise (jitter) parameter. We note that the design space considered in this paper is assumed to be bounded, such that it can be directly discretized. For unbounded design spaces, an unbounded Bayesian optimization approach is developed in [38] through gradually extending regions with regularization. Choosing a proper acquisition function is crucial for the Bayesian optimization algorithm, since it guides the search for the optimum. Popular choices include maximizing the probability of improvement (PI) [39,40] and maximizing the expected improvement (EI) as in the efficient global optimization (EGO) algorithm [41,42]; a review of the selection of acquisition functions is given in [27]. As suggested by [37], we apply the GP-UCB algorithm to choose the next point: the acquisition function is set to a linear combination of the posterior mean function and the posterior standard deviation,

A(d | S_{1:t−1}) = µ_{t−1}(d) + β_{t−1} σ_{t−1}(d) ,

which can be considered as an upper confidence bound of the current Gaussian process. Maximizing this acquisition function trades off exploiting points with potentially high function values (large µ_{t−1}) against exploring points with high uncertainty (large σ_{t−1}), and β_{t−1} is the parameter balancing exploitation and exploration. Details of our BO strategy for optimal design are shown in Algorithm 2.
We set the prior mean function to µ_0(d) = 0 and the kernel to the Gaussian kernel k(d, d') = σ_f² exp(−‖d − d'‖²₂/(2l²)). In the step of maximizing the acquisition function, d_t is located through a grid search over the discretized design space D̄.
A natural measure of the performance of the Bayesian optimization method is the average cumulative regret. Supposing the maximum U(d*) is known, the instantaneous regret at iteration t is defined as r_t = U(d*) − U(d_t), and the cumulative regret after T iterations is the sum of the instantaneous regrets, R_T = Σ_{t=1}^T r_t, so that the average cumulative regret is R_T/T. It should be noted that neither r_t nor R_T can be obtained directly from Algorithm 2. In [37], it is proven that, for a finite design space D̄, setting δ ∈ (0, 1) and β_t = 2 log(|D̄| t² π² / (6δ)), the Bayesian optimization method is no-regret with high probability, that is, lim_{T→∞} R_T/T = 0.

Algorithm 2 Bayesian optimization (BO) for optimal design
1: Input: Design space D and its discretized design space D̄, prior µ_0(d) = 0, hyperparameters l, σ_f of the Gaussian kernel, hyperparameter δ, and maximum number of iterations t_max.
2: for t = 1, . . . , t_max do
3:   Find the maximizer of the acquisition function: d_t = arg max_{d∈D̄} µ_{t−1}(d) + β_{t−1} σ_{t−1}(d).
4:   Sample the objective function U_t = U(d_t) using Algorithm 1.
5:   Augment the data set S_{1:t} = {S_{1:t−1}, (d_t, U_t)}.
6:   Perform Bayesian update to obtain µ_t and σ_t over D̄ using (11) and (12) respectively.
7: end for
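A minimal GP-UCB loop on a discretized 1-D design space can be sketched as follows. Here a cheap toy objective stands in for the DLBMC estimate of U(d) produced by Algorithm 1, and the grid, kernel hyperparameters, and budget are illustrative assumptions.

```python
import numpy as np

def gp_posterior(grid, d_obs, u_obs, sf=1.0, ell=0.1, jitter=1e-6):
    """GP posterior mean and std over the grid, cf. (11)-(12); zero prior mean."""
    k = lambda a, b: sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(d_obs, d_obs) + jitter * np.eye(len(d_obs))
    Ks = k(grid, d_obs)
    sol = np.linalg.solve(K, Ks.T)                # K^{-1} k_t(d) for all grid d
    mu = Ks @ np.linalg.solve(K, u_obs)
    var = sf**2 - np.sum(Ks * sol.T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def gp_ucb(objective, grid, t_max=25, delta=0.1):
    """GP-UCB iteration of Algorithm 2 over a finite grid."""
    d_obs = np.array([grid[0]])                   # arbitrary initial point
    u_obs = np.array([objective(grid[0])])
    for t in range(1, t_max + 1):
        beta = 2.0 * np.log(len(grid) * t**2 * np.pi**2 / (6.0 * delta))
        mu, sd = gp_posterior(grid, d_obs, u_obs)
        d_next = grid[np.argmax(mu + beta * sd)]  # acquisition: mu + beta*sigma
        d_obs = np.append(d_obs, d_next)
        u_obs = np.append(u_obs, objective(d_next))
    return d_obs[np.argmax(u_obs)]

grid = np.linspace(0.0, 1.0, 51)
best = gp_ucb(lambda d: -(d - 0.3)**2, grid)      # maximum at d = 0.3
```

With a large β_t the loop first explores high-uncertainty points and only later concentrates near the maximizer, which is the exploration-exploitation trade-off described above.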

Bayesian Parameter Inference
After the optimal design points are selected and the corresponding noisy observations D = {d^(i), y^(i)}_{i=1}^N are collected, we can conduct Bayesian inference for the system parameters, that is, assess the posterior distribution p(θ|D). The posterior p(θ|D) can be calculated via Bayes' rule (1). However, as there is no closed form for the evidence in (1) in many practical problems, the Markov chain Monte Carlo (MCMC) method is often used to generate samples of the posterior distribution. Next, we give a brief review of the MCMC algorithm.
The basic idea of MCMC is to construct a Markov chain over the state space whose stationary distribution is the target distribution; here the target distribution is the posterior p(θ|D). Although there are many variants of MCMC, we focus on the Metropolis-Hastings MCMC (MH-MCMC) method [43][44][45]. The Markov chain in the MH-MCMC algorithm is constructed as follows: at each step, given the current state θ, a candidate state θ_cand is proposed with probability q(θ_cand|θ), where q(·|·) is referred to as the proposal distribution. The proposal distribution can be arbitrary; a commonly used proposal is the symmetric Gaussian distribution, q(θ_cand|θ) = N(θ, λI), where λ denotes the step size. Whether to accept the candidate state is determined by the acceptance probability

α = min( 1, p(θ_cand|D) / p(θ|D) ) ,

where this simplified formula holds only when the proposal distribution is symmetric. If p(θ_cand|D)/p(θ|D) > 1, then θ_cand is more probable than the current state θ and the proposal is accepted with probability α = 1; otherwise, the proposal is accepted with probability α. The detailed MH-MCMC algorithm is summarized in Appendix B.
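A random-walk MH-MCMC step with a symmetric Gaussian proposal can be sketched as below. The toy target (a 1-D Gaussian posterior) and the step size are illustrative choices, not the paper's setup.

```python
import numpy as np

def mh_mcmc(log_post, theta0, n_iter=8000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal,
    so the acceptance ratio reduces to p(theta_cand|D) / p(theta|D)."""
    rng = rng or np.random.default_rng(0)
    theta, lp = theta0, log_post(theta0)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        cand = theta + step * rng.standard_normal()   # symmetric q(cand|theta)
        lp_cand = log_post(cand)
        # accept with probability alpha = min(1, exp(lp_cand - lp))
        if np.log(rng.uniform()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        chain[i] = theta
    return chain

# Toy target: posterior N(2, 0.5^2); the chain mean should approach 2
chain = mh_mcmc(lambda t: -0.5 * ((t - 2.0) / 0.5)**2, theta0=0.0)
post_mean = chain[1000:].mean()                       # discard burn-in
```

Only the unnormalized log posterior is needed, since the evidence cancels in the acceptance ratio; this is exactly why MH-MCMC sidesteps the intractable evidence in (1).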

Numerical Experiments
In this section, we consider three numerical examples: a standard linear Gaussian model, a nonlinear simple model, and a partial differential equation (PDE) model. Since the first problem has analytical expressions, we examine the performance of our method by comparing the numerical result with the exact solution. The second problem is a commonly-used test problem, and we demonstrate the efficiency of our method through it. In the third test problem, a physical system governed by the diffusion equation is considered, in which a contaminant source inversion problem is studied.

Test Problem 1: Linear Gaussian Problem
We consider the standard linear Gaussian problem in the following form,

y = d^T θ + ε ,

where the noise and the prior are assumed to be Gaussian, ε ∼ N(0, σ²), θ ∼ N(0, I_{n_θ×n_θ}), and n_θ = n_d. The posterior distribution is a multivariate Gaussian distribution whose mean and covariance are

µ_post = σ^{−2} Σ_post d y ,  Σ_post = ( σ^{−2} d d^T + I )^{−1} .

The expected information gain (EIG) for θ can then be given in closed form (the detailed derivation is shown in Appendix A),

U(d) = (1/2) log det( I + σ^{−2} d d^T ) .

Note that maximizing the EIG is equivalent to minimizing the determinant of the posterior covariance matrix [11] (the Bayesian D-optimal design). We consider the one-dimensional case where n_θ = 1, d ∈ [0, 1], and set the standard deviation of the noise to σ = 0.1. The hyperparameters of DLBMC are set to l_in = 0.5, (σ_f)_in = 1, and l_out = 0.2, (σ_f)_out = 1. Figure 1 shows the exact EIG, the estimated EIG using DLMC with 300 samples, and the estimated EIG using DLBMC with 300 samples. It can be seen that the estimated EIG of DLBMC is more accurate and more stable than that of DLMC. Besides, for d = 0.3, the relationship between the sample size and the bias of the DLMC and DLBMC estimators is studied: as the sample size increases, we compute the bias of the DLBMC estimator and the DLMC estimator, each averaged over 20 trials. Figure 2(Left) shows the results, where it can be seen that the DLBMC estimator converges to the true value faster than the DLMC estimator, and DLBMC is more accurate than DLMC.
As discussed in Section 3.2, the variance of the BMC estimator for the uniform distribution is not explicitly given. We therefore use the sample variance to compare the spread of the DLMC estimator and the DLBMC estimator. Fixing the design point d = 0.5, for different sample sizes of DLMC and DLBMC, we repeatedly compute the estimator Û(d) n times, denoting the results by Û^(1), . . . , Û^(n), and define the sample variance estimator as

s² = (1/(n−1)) Σ_{k=1}^n ( Û^(k) − Ū )² ,  Ū = (1/n) Σ_{k=1}^n Û^(k) .

As the number of repetitions n increases, we compare the sample variance of the two estimators with different sample sizes for DLMC and DLBMC in Figure 2(Right). It is clear that the DLBMC estimator outperforms the DLMC estimator.

Test Problem 2: Nonlinear Simple Problem
In this section, a simple nonlinear model is tested, which is also studied in [14]. The model is written as

y = θ³ d² + θ exp(−|0.2 − d|) + ε ,

the prior is set to θ ∼ U(0, 1), and the standard deviation of the observation noise is set to σ = 0.01. The hyperparameters of DLBMC are set to l_in = 0.01, (σ_f)_in = 0.2, and l_out = 0.05, (σ_f)_out = 0.2. Figure 3a,b show the estimated EIG using DLMC and DLBMC with 300 and 500 samples respectively. A reference solution using DLMC with 10^5 samples is also shown in Figure 3a,b. Here, 20 trials of the DLMC estimator and the DLBMC estimator are generated, and Figure 3 shows the mean estimates and the intervals containing 80% of the trials. It is clear that, compared to DLMC, our DLBMC estimator has smaller variance. Compared to the reference solution, DLBMC gives a biased estimate; as expected, the bias is reduced as the sample size increases. To illustrate the effect of optimal design, we compare the posterior distributions given by three design points: design A (d = 1) is the optimal design point, design B (d = 0.2) is a local maximizer of the EIG, and design C (d = 0) is the least informative since it has the smallest EIG value. The MH-MCMC algorithm (Algorithm A1) with N_iter = 1000 and γ = 0.2 is used to generate samples of the posterior distribution, and kernel density estimation is used to obtain the posterior density functions from these samples. For this test problem, the ground-truth θ is set to 0.75. From Figure 4, it can be seen that the posterior distribution obtained through design A is the most accurate, and it has the smallest variance.

Test Problem 3: Source Inversion for the Diffusion Problem
Letting D ⊂ R² be a bounded and connected domain with a polygonal boundary ∂D, the governing equation of the diffusion problem studied in this test problem is: find a random function u(x, ω) : D × Ω → R such that, P-a.e. in Ω,

−∇²u(x, ω) = f(x, ω) ,  in D ,   (14)
u(x, ω) = 0 ,  on ∂D ,   (15)

where (Ω, F, P) is a probability space. We consider a square physical domain D = [−1, 1] × [−1, 1] ⊂ R², and u(x, ω) in (14)-(15) denotes the concentration of a contaminant at the point x ∈ D. Let f(x, ω) denote the contaminant source field. As f is usually strictly positive, following the setting in [13,46,47,48], we set the prior distribution of f(x, ω) to a log-normal random field, that is, f(x, ω) = exp(a(x, ω)), where a(x, ω) is a Gaussian random field. In this study, the experimental goal is to infer the underlying contaminant field f given several observations {x_i, y_i}_{i=1}^K, where the design variable x_i denotes the i-th sensor placement, the response is the corresponding numerical PDE solution u(x_i) with additive noise, that is, y_i = u(x_i) + ε_i, ε_i ∼ N(0, σ²), and K is the number of sensors. In this test problem, the hyperparameters of DLBMC are set to l_in = 0.02, (σ_f)_in = 0.01, and l_out = 0.005, (σ_f)_out = 0.005.
We parameterize the source field by a truncated Karhunen-Loève (KL) expansion of log[f(x, ω)]. Consider the random field a(x, ω) = log[f(x, ω)] with mean function a_0(x), standard deviation σ, and covariance function C(x, y), where c is the correlation length. The truncated KL expansion of f is then expressed as

f(x, ω) ≈ exp( a_0(x) + Σ_{n=1}^M √λ_n a_n(x) ξ_n(ω) ) ,

where a_n(x) and λ_n are the eigenfunctions and eigenvalues of the covariance operator in (16), and {ξ_n}_{n=1}^M are uncorrelated random variables. The priors of {ξ_n}_{n=1}^M are set to independent standard normal distributions, ξ_n ∼ N(0, 1) for n = 1, . . . , M.
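A discrete truncated KL expansion can be sketched via an eigendecomposition of the covariance matrix on a grid. For brevity this illustration is 1-D on [0, 1] with an assumed exponential covariance; the paper's field is 2-D and its covariance function C(x, y) is not reproduced here.

```python
import numpy as np

# Discrete truncated KL expansion of a log-normal field f = exp(a), sketched
# in 1-D for illustration; exponential covariance with correlation length c
# is an assumption of this sketch
n, c, sigma, M = 101, 2.0, 1.0, 5
x = np.linspace(0.0, 1.0, n)
C = sigma**2 * np.exp(-np.abs(x[:, None] - x[None, :]) / c)
lam, phi = np.linalg.eigh(C)                 # eigenvalues in ascending order
lam, phi = lam[::-1], phi[:, ::-1]           # reorder to descending

rng = np.random.default_rng(0)
xi = rng.standard_normal(M)                  # prior xi_n ~ N(0, 1)
a0 = 1.0                                     # constant mean a_0(x) = 1
log_f = a0 + (phi[:, :M] * np.sqrt(lam[:M])) @ xi
f = np.exp(log_f)                            # one strictly positive realization
```

Because the correlation length exceeds the domain size, the leading few eigenvalues capture most of the variance, which is what justifies a small truncation order M.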
In the numerical experiment, we set a_0(x) = 1, c = 2 and σ = 1. Fixing the hyperparameters in the truncated KL expansion, the response depends on the random variables {ξ_i}_{i=1}^M. Therefore we define the parameter of interest as θ := [ξ_1, . . . , ξ_M] and rewrite the governing diffusion equation in terms of θ. We use the bilinear finite element method (FEM) to discretize the diffusion equation over a 17 × 17 square grid, and let the standard deviation of the noise be 1% of the mean observed value. Supposing K sensors are placed over the design space, we can in general perform a batch design with the altered forward model

G̃(d, θ) = [ G(d_1, θ), . . . , G(d_K, θ) ]^T ,

where d = [d_1; . . . ; d_K] and the subscript i denotes the i-th design variable for i = 1, . . . , K. Directly maximizing the EIG with the altered forward model gives the optimal design in the context of batch design. Let f_truth denote the underlying true source field. Suppose we have collected data at K sensors; we then perform MCMC with Algorithm A1 to obtain samples {θ^(i)}_{i=1}^{N_iter} of the posterior distribution. In this test problem, we set N_iter = 4000. Two useful statistics can be obtained from the samples: the maximum a posteriori (MAP) estimate θ_MAP and the mean estimate θ_MEAN. Let f_MAP and f_MEAN be the source fields generated by θ_MAP and θ_MEAN respectively. To test the accuracy of the inversion, we introduce the relative errors

E_MAP := ‖f_MAP − f_truth‖₂ / ‖f_truth‖₂ ,  E_MEAN := ‖f_MEAN − f_truth‖₂ / ‖f_truth‖₂ ,

where f_truth, f_MAP and f_MEAN are discretized over the FEM grid for computational simplicity and ‖·‖₂ denotes the l₂ vector norm.
Noting that performing a grid search over the (17 × 17)^K grid is computationally expensive, we utilize the Bayesian optimization method to efficiently find the optimal designs within a few iterations. Three cases, K = 1, 2, 3, are considered in the following.
First, for K = 1, we perform Bayesian optimization over the 17 × 17 grid. The performance of Bayesian optimization is shown in Figure 5. In Figure 5(Left), we can see that Bayesian optimization finds the maximum of the EIG within a few iterations. The optimal design found by Bayesian optimization is [0, 0], which is expected due to the symmetry of the forward model. In this case, as the number of grid points (289) is not too large, we can verify through an exhaustive grid search that the optimal design is indeed [0, 0]. Figure 5(Right) shows that the average cumulative regret converges to zero. For comparison, we randomly generate 20 different design points and use the MH-MCMC algorithm to generate 4000 samples of the posterior distribution. Figure 6 shows the locations of the optimal design and the 20 random designs. Figure 7a,b show the relative errors of the MAP and mean estimates of the source field (averaged over 20 trials) versus the values of the EIG, where it is clear that as the value of the EIG becomes larger, the errors of both the MAP and mean estimates decrease. In addition, the optimal design point gives a large EIG value and relatively small errors. Although the errors associated with the optimal design point are typically smaller than those associated with the random design points, they are still large, and the estimated source fields are not accurate enough; therefore, we next consider more design points. For the multiple design cases (K = 2, 3), given that the computational cost of batch design increases exponentially with K, we perform Bayesian optimization over a uniform 9 × 9 coarse grid. For K ≥ 2, the optimal solution is not unique due to the symmetry of the forward model; for example, [x_1; x_2] and [−x_1; −x_2] share the same EIG value.
After performing Bayesian optimization several times with BO budget t_max = 100, the sets of optimal designs for the K = 2 case are shown in Figure 8, where each line connecting a blue circle and a red circle represents a pair of optimal designs. The numerical results show that, compared with a pair of design points that are symmetric with respect to [0; 0], a pair of slightly skewed design points can provide more information. Again, we randomly generate 20 sets of design points and compare them with the optimal design (we choose [0.5, 0.5; −0.5, −0.25]). Figure 9 shows the errors of the MAP and mean estimates (averaged over 20 trials) of the source field versus the values of the EIG, where it is clear that the optimal design leads to the largest EIG value and the smallest error. Besides, the comparison of the true source field and the fields generated by the MAP and mean estimates is presented in Figure 10 (comparison of the true source field and estimated source fields by MAP and mean estimates for K = 2, test problem 3), where it can be seen that the estimated source field associated with the optimal design matches the true source field well. For K = 3, with BO budget t_max = 100, the set of optimal designs found by Algorithm 2 is [0.75, 0.25; 0.5, −0.25; −0.75, −0.25]. Figure 11 shows the values of the EIG and the relative errors in the MAP and mean estimates (averaged over 20 trials) for the optimal design and twenty random designs, where it can be seen that the optimal design has the largest EIG value and the smallest errors, consistent with the results for K = 1, 2. Figure 12 shows that the estimated source fields generated by the MAP and mean estimates match the true source field well.
To further quantify the performance of the optimal designs, we compute the ratio of the E_MAP of random designs to the E_MAP of optimal designs (denoted E_MAP^(random) and E_MAP^(opt) respectively) for the K = 1, 2, 3 cases. Figure 13 presents the histograms of the ratios of relative errors (i.e., E_MAP^(random)/E_MAP^(opt)), where the green lines are the corresponding kernel smoothing density estimates. We can see that, in all three cases, with high probability the optimal design gives better posterior performance than the random designs. In particular, from Figure 13c, the error of random designs can be ten times greater than that of the optimal design.

Conclusions and Discussion
Efficiently using a small number of samples to reduce the cost of computing the expected information gain (EIG) is central to solving the challenging Bayesian optimal experimental design problem. Based on the Bayesian Monte Carlo (BMC) method, a novel double-loop Bayesian Monte Carlo (DLBMC) estimator is proposed in this work for evaluating the EIG. To obtain an efficient overall optimization procedure for finding the maximizer of the EIG, a Bayesian optimization (BO) procedure for the EIG is developed. In addition, our analysis gives explicit expressions for the mean estimate of the BMC estimator and bounds on its variance for the uniform and normal distributions. Detailed numerical studies show that our DLBMC method provides accurate mean estimates with small variances, and that the overall BO procedure leads to optimal designs which give efficient Bayesian inference results.
As our novel DLBMC estimator for the EIG is based on Gaussian processes, it is currently limited to low-dimensional problems where the number of design variables is not large. In this work, the classical BO and MCMC approaches are used, and it is not straightforward to apply them to high-dimensional problems. For high-dimensional Bayesian optimal design problems with a large number of design variables, a possible solution is to conduct a sequential design procedure and apply DLBMC at each step of the sequential procedure. Conducting a systematic DLBMC based on sequential design will be the focus of our future work.

Appendix A

By the definition of the expected information gain,

U(d) = ∫_Y ∫_Θ p(θ|y, d) log [ p(θ|y, d) / p(θ) ] p(y|d) dθ dy .

Next, we consider the integration over Y and Θ