Kernel-based active subspaces with application to computational fluid dynamics parametric problems using the discontinuous Galerkin method

Nonlinear extensions to the active subspaces method have brought remarkable results for dimension reduction in the parameter space and response surface design. We further develop a kernel-based nonlinear method. In particular, we introduce it in a broader mathematical framework that also contemplates the reduction in parameter space of multivariate objective functions. The implementation is thoroughly discussed and tested on benchmarks more challenging than the ones already present in the literature, for which dimension reduction with active subspaces already produces good results. Finally, we show a whole pipeline for the design of response surfaces with the new methodology in the context of a parametric CFD application solved with the Discontinuous Galerkin method.


Introduction
Nowadays, in many industrial settings the simulation of complex systems requires a huge amount of computational power. Problems involving high-fidelity simulations are usually large-scale; moreover, the number of solutions required increases with the number of parameters. In this context we mention optimization tasks, inverse problems, optimal control problems, and uncertainty quantification; they all suffer from the curse of dimensionality, that is, in this case, the computational time grows exponentially with the dimension of the input parameter space. Data-driven reduced order methods (ROMs) [11,52,54] have been developed to deal with such costly outer-loop applications for parametric PDEs, but the limitation for high-dimensional parameter spaces remains.
One approach to alleviate the curse of dimensionality is to identify and exploit some notion of low-dimensional structure of the model or objective function that maps the inputs to the outputs of interest. A possible linear input coordinate transformation technique is the Sliced Inverse Regression (SIR) [34] approach and its extensions [16,35,74]. Sharing some characteristics with SIR, there is the active subspaces (AS) property [53,12,13,75] which, in recent years, has emerged as a powerful linear data-driven technique to construct ridge approximations using gradients of the model function. AS has been successfully applied to quantify uncertainty in the numerical simulation of the HyShot II scramjet [15], and for sensitivity analysis of an integrated hydrologic model [29]. Reduction in parameter space has been coupled with model order reduction techniques [27,46,51] to enable more complex numerical studies without increasing the computational load. We mention the use of AS in cardiovascular applications with POD-Galerkin [60], in nonlinear structural analysis [25], in nautical and naval engineering [66,62,63,61], coupled with POD with interpolation for structural and computational fluid dynamics (CFD) analysis [19,65], and with Dynamic Mode Decomposition in [64]. Applications in automotive engineering within a multi-fidelity setting can be found in [48], for turbomachinery see [57], while for results in chemistry see [30,70]. Advances in efficient global design optimization with surrogate modeling are presented in [38,37] and applied to the shape design of the N + 2 Supersonic Passenger Jet. Applications to enhance optimization methods have been developed in [68,23,20,18]. AS has also been successfully used to reduce the memory consumption of highly parametrized systems such as artificial neural networks [17,39].
Possible extensions and variants of the active subspaces property are the Local Active Subspace method [50], the Active Manifold method [10] which reduces the problem to the analysis of a 1D manifold by traversing the level sets of the model function at the expense of high online costs, the shared Active Subspace method [31], the active subspaces property for multivariate functions [75], and more recently an extension of AS to dynamical systems [5]. Another method is Nonlinear Level set Learning (NLL) [77] which exploits RevNets to reduce the input parameter space with a nonlinear transformation.
The search for low-dimensional structures is also investigated in machine learning with manifold learning algorithms. In this context the Active Subspaces methodology can be seen as a supervised dimension reduction technique, along with Kernel Principal Component Analysis (KPCA) [59] and Supervised Kernel Principal Component Analysis (SKPCA) [7]. Other methods in the context of kernel-based ROMs are [26,32,40]. In [43] a nonlinear extension of the active subspaces property based on Random Fourier Features [47,36] is introduced and compared with manifold learning algorithms from machine learning for the construction of Gaussian process regressions (GPR) [73].
From the preliminary work [43] in the context of supervised dimension reduction algorithms in machine learning, we develop the kernel-based active subspaces (KAS) method. The novelties of our contribution are the following:
• regarding the AS theoretical background, we provide an upper bound of the ridge approximation error (2) for vector-valued objective functions and for a wide collection of probability distributions (see Assumption 3).
• we extend kernel-based AS to vector-valued model functions and develop a detailed algorithmic procedure for the optimization of the feature map. We also test different spectral measures (see Equation (3.2) for the definition), differently from [43] where only the Gaussian measure is employed.
• the application to several test problems of increasing complexity. In particular, we mainly test KAS on problems where the active subspace is not present or the behaviour is not linear, differently from [43], where the comparison is made with KPCA and its variants on datasets with linear trends in the reduced parameter space, apart from the hyperparaboloid test case that we have also included among our toy problems.
• the KAS method is finally applied to a computational fluid dynamics problem and compared with the standard AS technique. We study the evolution of the fluid flow past a NACA 0012 airfoil in a duct composed of an initialization channel and a chamber. The motion is modelled with the unsteady incompressible Navier-Stokes equations, discretized with the Discontinuous Galerkin (DG) method [28]. Physical and geometrical parameters are introduced, and a sensitivity analysis of the lift and drag coefficients with respect to these parameters is provided.
The work is organized as follows: in section 2 we briefly present the active subspaces property of a model function, with a focus on the construction of Gaussian process response surfaces. Then, section 3 illustrates the novel method called kernel-based active subspaces for both scalar and vector-valued model functions. Several tests comparing AS and KAS are provided in section 4, where we start from scalar functions with radial symmetry, then analyze an epidemiological model and a vector-valued output generated from a stochastic elliptic PDE. A parametric CFD test case for the study of the flow past a NACA airfoil using the Discontinuous Galerkin method is presented in section 5. Finally, we outline some perspectives and future studies in section 6.

Active Subspaces for parameter space reduction
The Active Subspaces (AS) approach, proposed in [53] and developed in [12], is a technique for dimension reduction in parameter space. In brief, the active subspace is defined as the leading eigenspace of the second moment matrix of the model function's gradient (for scalar model functions), and it constitutes a global sensitivity index [75]. In the context of ridge approximation, the choice of the active subspace corresponds to the minimizer of an upper bound of the mean square error obtained through Poincaré-type inequalities [75]. After performing dimension reduction in the parameter space through AS, the method can be applied to reduce the computational costs of different parameter studies, such as inverse problems, optimization tasks, and numerical integration. In this work we focus on the construction of response surfaces with Gaussian process regression.
Definition 1 (Hypotheses on input and output spaces). The quantities related to the input space are:
• m ∈ N, the dimension of the input space,
• (Ω, F, P), the probability space,
• X : (Ω, F, P) → R^m, the absolutely continuous random vector representing the parameters,
• ρ : R^m → R, the probability density of X, with support X ⊂ R^m.
The quantities related to the output are:
• d ∈ N, the dimension of the output space,
• f : X ⊂ R^m → V, the quantity/function of interest, also called objective function in optimization tasks.
Let B(R^m) be the Borel σ-algebra of R^m. We will consider the Hilbert space H^1(R^m, B(R^m), ρ; V) of functions f ∈ L^2_ρ(R^m; V) with weak derivative ∇f ∈ L^2_ρ, where ∥∇f∥_{L^2} =: |f|_{H^1}. We briefly recall how dimension reduction in parameter space is achieved in the construction of response surfaces. The first step involves the approximation of the model function with ridge approximation. We will follow [75,44] for a review of the method.
The ridge approximation problem can be stated in the following way.

Definition 2 (Ridge approximation). Let B(R^m) be the Borel σ-algebra of R^m. Given r ∈ N, r ≪ m, and a tolerance ϵ ≥ 0, find the profile h : (R^m, B(R^m), ρ) → V and the r-rank projection P_r : R^m → R^m such that

E_ρ[∥f − h ∘ P_r∥^2_V]^{1/2} ≤ ϵ.    (2)

In particular we are interested in the minimization problem

argmin_{P_r ∈ M(m×m)} E_ρ[∥f − h̄∥^2_V],

where h̄ = E_ρ[f | σ(P_r)] is the conditional expectation of f under the distribution ρ given the σ-algebra σ(P_r). The range of the projector P_r, R^r ≅ Im(P_r) ⊂ R^m, is the reduced parameter space. The kernel of the projector P_r, R^{m−r} ≅ ker(P_r) ⊂ R^m, is the inactive subspace. The existence of h is guaranteed by the Doob-Dynkin lemma [9]. The function h̄ is proven to be the optimal profile for each fixed P_r, as a consequence of the definition of the conditional expectation of a random variable with respect to a σ-algebra. Dimension reduction is effective if inequality (2) is satisfied for a specific tolerance. The choice of r is certainly of central importance. The dimension of the reduced parameter space can be chosen a priori for a specific parameter study (for example r-dimensional regression), it can be chosen in order to satisfy inequality (2), or it can be determined so as to guarantee a good accuracy of the numerical method used to evaluate it [Corollary 3.10, [14]].
Dividing the left-hand side of inequality (2) by the norm of the centered model function we obtain the Relative Root Mean Square Error (RRMSE); since it is a normalized quantity, we will use it to make comparisons between different models. We remark that P_r is not unique: it can be shown that if h̄ is the optimal profile, then P_r is not uniquely defined and can be chosen arbitrarily from the set {Q_r : R^m → R^m | ker Q_r = ker P_r}, see [Proposition 2.2, [75]]. The following lemma is the key ingredient in the proof of the existence of an active subspace. It is inherently linked to probability Poincaré inequalities of the kind

E_ρ[h^2] ≤ C_P(X, ρ) E_ρ[∥∇h∥^2]

for zero-mean functions h in the Sobolev space H^1(X), where C_P(X, ρ) is the Poincaré constant, dependent on the domain X and on the probability density function (p.d.f.) ρ. We need the following assumption to prove the next lemma and the next theorem.
Assumption 3. The probability density function ρ : X → R belongs to one of the following classes:
1. X is convex and bounded, and there exists δ > 0 such that ρ(x) ≥ δ for all x ∈ X;
2. ρ(x) ∝ exp(−V(x)) with Hess(V(x)) ⪰ αI for some α > 0, where Hess(V(x)) is the Hessian of V(x);
3. ρ(x) ∝ exp(−V(x)), where V is a convex function. In this case we also require f to be Lipschitz continuous.
In particular, the uniform distribution belongs to the first class, the multivariate Gaussian distribution N(m, Σ) to the second with α = 1/σ_max(Σ), and the exponential and Laplace distributions to the third. A complete analysis of the various cases can be found in [44].
Proposition 1. Let (Ω, F, P) be a probability space, X : (Ω, F, P) → R^m an absolutely continuous random vector with probability density function ρ belonging to one of the classes of Assumption 3. Then the following inequality is satisfied for all scalar functions h ∈ H^1(X) and for all r-rank orthogonal projectors P_r:

E_ρ[(h − E_ρ[h | σ(P_r)])^2] ≤ C_P(P_r, ρ) E_ρ[∥(I − P_r)∇h∥^2],

where C_P(P_r, ρ) is the Poincaré constant depending on P_r and on the p.d.f. ρ.
A summary of the values of the Poincaré constant in relationship with the choice of the probability density function ρ is reported in [44].
In the next theorem the projection P_r will depend on the output function f, so the Poincaré constant C_P(P_r, ρ) will in fact also depend on f. We introduce the following notation for the matrix that substitutes the uncentered covariance matrix of the gradient ∇f used when AS is applied to scalar model functions [14]:

H = E_ρ[(D_x f)(X)^T R_V (D_x f)(X)] = ∫_X (D_x f(x))^T R_V D_x f(x) ρ(x) dx,

where D_x f(x) ∈ M(d × m) is the Jacobian matrix of f. The matrix R_V(ρ) depends on the class to which ρ belongs, see Appendix A.
Theorem 1 (Existence of an active subspace). Under the hypotheses of Definition 1, let f ∈ H^1(R^m, B(R^m), ρ; V) and let the p.d.f. ρ satisfy Lemma 1 and Assumption 3. Then the solution P̂_r of the ridge approximation problem of Definition 2 is the orthogonal projector onto the eigenspace of the first r eigenvalues of H ordered by magnitude, with r ∈ N chosen such that

E_ρ[∥f − h̄∥^2_V] ≤ C(C_P, τ) (λ_{r+1} + · · · + λ_m) ≤ ϵ^2,

with C(C_P, τ) a constant depending on τ > 0, related to the choice of ρ, and on the Poincaré constant from Lemma 1, and h̄ = E_ρ[f | σ(P̂_r)] is the conditional expectation of f given the σ-algebra generated by the random variable P̂_r X.
Proof. This theorem summarizes the results from Proposition 2.5 and Proposition 2.6 of [75], and from Lemma 3.1, Lemma 4.2, Lemma 4.3, Lemma 4.4 and Theorem 4.5 of [44]. The proof is expanded in Appendix A.
The eigenspace span{v_1, . . . , v_r} ⊂ R^m is the active subspace, and the remaining eigenvectors generate the inactive subspace span{v_{r+1}, . . . , v_m}. A sufficiently fast decay of the eigenvalues of H is necessary for f to satisfy the error bound (2).
For the explicit procedure to compute the active subspace given its dimension r, see Algorithm 1: from W_1 and W_2 we define the approximation of the projector P_r as P̂_r = W_1 W_1^T.
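The gradient-based construction of the active subspace can be sketched as follows; the function name and the toy model f(x) = (a·x)^2 are illustrative, not from the paper. The eigenvectors of the Monte Carlo estimate of H are obtained as the right singular vectors of the scaled gradient matrix.

```python
import numpy as np

def active_subspace(gradients, r):
    """Approximate the active subspace from M gradient samples.

    gradients: (M, m) array of sampled gradients of f.
    r: dimension of the active subspace.
    Returns W1 (m, r), the active eigenvectors, and the eigenvalues of H.
    """
    M = gradients.shape[0]
    # Monte Carlo estimate of H = E[grad f grad f^T]: the eigenvectors of
    # H are the right singular vectors of the scaled gradient matrix.
    _, s, Vt = np.linalg.svd(gradients / np.sqrt(M), full_matrices=False)
    evals = s**2                 # eigenvalues of H, sorted in decreasing order
    return Vt.T[:, :r], evals

# Toy check: f(x) = (a . x)^2 has an exactly one-dimensional active subspace,
# since every gradient 2 (a . x) a is parallel to a.
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 0.5])
X = rng.uniform(-1, 1, size=(500, 3))
grads = 2 * (X @ a)[:, None] * a
W1, evals = active_subspace(grads, r=1)
```

In this toy case the leading eigenvector recovers the direction a up to sign, and all remaining eigenvalues are numerically zero.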

Response surfaces
The term response surface refers to the general procedure of finding the values of a model function f for new inputs without computing it directly, but exploiting regression or interpolation from a training set {x_i, f(x_i)}. The procedure for constructing a Gaussian process response surface is reported in Algorithm 2, while in Algorithm 3 we show how to exploit it to predict the model function at new input parameters. Directly applying the simple Monte Carlo method with N samples we get a reduced approximation of f as

ĥ_{ϵ,N}(P̂_r x) = (1/N) Σ_{i=1}^N f(P̂_r x + (I − P̂_r) Y_i),

where we have made explicit the dependence of the optimal profile h̄_ϵ on ϵ, Y_1, . . . , Y_N are independent and identically distributed samples of Y ∼ ρ, and P̂_r is an approximation of P_r obtained with the simple Monte Carlo method from H, see Algorithm 1. An intermediate approximation error is obtained employing the Poincaré inequality and the central limit theorem for the Monte Carlo approximation,

E_ρ[∥f − ĥ_{ϵ,N}(P̂_r X)∥^2_V] ≤ C_1 (1 + N^{−1/2})^2 (λ_{r+1} + · · · + λ_m),

where C_1 is a constant, and λ_{r+1}, . . . , λ_m are the eigenvalues of the inactive subspace of H [Theorem 4.4, [14]]. In practice ĥ_{ϵ,N}(P̂_r X) is approximated with a regression or an interpolation such that the resulting response surface R satisfies

E_ρ[∥ĥ_{ϵ,N}(P̂_r X) − R(P̂_r X)∥^2_V] ≤ C_2 δ^2,

where C_2 is a constant, and δ depends on the chosen method. An estimate for the successive approximations is given by

E_ρ[∥f − R(P̂_r X)∥^2_V] ≤ C_1 (1 + N^{−1/2})^2 [τ (λ_1 + · · · + λ_r)^{1/2} + (λ_{r+1} + · · · + λ_m)^{1/2}]^2 + C_2 δ^2,

where dist(Im(P_r), Im(P̂_r)) ≤ τ, and λ_i are the eigenvalues of H [Theorem 4.8, [14]]. In our numerical simulations we will build the response surface R with Gaussian process regression (GPR) [73].
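The simple Monte Carlo estimate of the optimal profile can be sketched as below; the helper name and the toy projector are hypothetical, and the construction assumed here is the standard one: the inactive component (I − P) y_i of each sample is added to the projected point P x before evaluating f.

```python
import numpy as np

def conditional_expectation_profile(f, x, P, Y):
    """Monte Carlo estimate of the optimal profile h(P x) = E[f | sigma(P)].

    f: callable on R^m, x: (m,) input point, P: (m, m) projector,
    Y: (N, m) i.i.d. samples from the input distribution.
    """
    m = P.shape[0]
    shifts = ((np.eye(m) - P) @ Y.T).T      # inactive components of the samples
    points = P @ x + shifts                  # (N, m) evaluation points
    return np.mean([f(p) for p in points])

# Toy check: f depends only on the direction a, so the estimate coincides
# with f evaluated at the projected point (the inactive shift is invisible).
a = np.array([1.0, 2.0, 0.5])
P = np.outer(a, a) / (a @ a)                 # rank-1 orthogonal projector
f = lambda v: (a @ v) ** 2
rng = np.random.default_rng(0)
Y = rng.uniform(-1, 1, size=(100, 3))
x = np.array([0.2, -0.4, 0.9])
est = conditional_expectation_profile(f, x, P, Y)
```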
Algorithm 2 Response surface construction with Gaussian process regression over the active subspace.
Require: input samples {x_i}, output samples {y_i}, active eigenvectors W_1
1: Map the input samples onto the active subspace: x̃_i = W_1^T x_i.
2: Choose a kernel and initialize its hyperparameters θ.
3: Optimize θ by maximizing the marginal log-likelihood of the training set {x̃_i, y_i}.
4: return trained Gaussian process
Algorithm 3 Prediction phase using the Gaussian process response surface over the active subspace.
Require: trained response surface, test samples x
1: Map the test samples x onto the active subspace: x̃ = W_1^T x.
2: Evaluate the Gaussian process on x̃ and return the prediction t ∼ N(E[t], σ^2(t)).

Keeping the notation of section 2, X : (Ω, F, P) → R^m is the absolutely continuous random vector representing the m-dimensional inputs with density ρ : X ⊂ R^m → R, and f : X ⊂ R^m → (V, R_V) is the model function, which we assume to be continuously differentiable and Lipschitz continuous.

Kernel-based Active Subspaces extension
One drawback of sufficient dimension reduction with AS applied to ridge approximation is that, if a clear linear trend is missing, projecting the inputs as P_r X entails a loss of accuracy in the approximation of the model f that may not be compensated even by the choice of the optimal profile h̄ ∘ P_r = E_ρ[f | σ(P_r)]. In order to overcome this, nonlinear dimension reduction to a one-dimensional parameter space could be achieved by discovering a curve in the space of parameters that cuts transversely the level sets of f; this variation is presented in [10] as Active Manifold. Another approach could consist in finding a diffeomorphism ϕ that reshapes the level sets, such that subsequently applying AS dimension reduction to the new model function f̃, with f̃ ∘ ϕ = f, could be more profitable. Unfortunately, constructing the Active Manifold or finding the right diffeomorphism ϕ can be a complicated matter. If we renounce a backward map and weaken the bond of the method with the model, we can consider an immersion ϕ from the space of parameters X to an infinite-dimensional Hilbert space H, and apply AS to the new model function f̃ : ϕ(X) ⊂ H → V defined by f̃(ϕ(x)) = f(x). This is a common procedure in machine learning in order to increase the number of features [73]. A response surface can then be built as in subsection 2.1, remembering to replace every occurrence of the inputs x with their images ϕ(x). A synthetic scheme of the procedure is represented in Figure 1.
In practice we consider a discretization R^D ≃ H of the infinite-dimensional Hilbert space, with D > m. Dimension reduction with AS then results in the choice of an r-rank projection from the much broader set of r-rank projections of R^D.
Since AS employs only samples of the Jacobian matrix of the model function, we can ignore the explicit definition of the new map f̃ : ϕ(X) ⊂ H → (V, R_V) and focus only on the computation of the Jacobian matrix of f̃ with respect to the new input variable z := ϕ(x). The uncentered covariance matrix becomes

H̃ = ∫_H (D_z f̃(z))^T R_V D_z f̃(z) dµ(z),

where µ := ϕ_#(L_X) is the pushforward probability measure of L_X (the law of X) with respect to the map ϕ. Simple Monte Carlo can be applied, sampling from the distribution ρ in the input space X:

H̃ ≈ (1/M) Σ_{i=1}^M (D_z f̃(ϕ(x_i)))^T R_V D_z f̃(ϕ(x_i)),   x_i ∼ ρ.

The gradients of f̃ with respect to the new input variable z are computed from the known values D_x f with the chain rule.
The chain rule for the composition f̃ ∘ ϕ : R^m → H → V is applicable if f̃ is defined in an open set U ⊃ ϕ(X). If ϕ is non-singular and injective, the new input space is an m-dimensional submanifold of H. If ϕ is also smooth, there exists a smooth extension of f̃ : ϕ(X) ⊂ H → V to the whole domain H, see Proposition 1.36 from [71].
If the Hilbert space H has finite dimension, H ≃ R^D, this procedure leaves us with an underdetermined linear system to solve for D_z f̃:

D_z f̃(ϕ(x)) Dϕ(x) = D_x f(x)   ⟹   D_z f̃(ϕ(x)) = D_x f(x) (Dϕ(x))^†,

where † stands for the Moore-Penrose pseudo-inverse of the matrix Dϕ(x), that is,

(Dϕ(x))^† = V Σ^† U^T,

with the usual notation Dϕ(x) = U Σ V^T for the singular value decomposition (SVD), and Σ^† equal to the diagonal matrix with the inverses of the nonzero singular values as diagonal elements.
As anticipated, if f is smooth enough and ϕ is an embedding, so that Dϕ has full rank, the previous system has a unique solution. The most crucial part is the evaluation of the gradients D_x f(x) from the input-output pairs, when they are not available analytically or from the adjoint method applied to PDE models: different approaches are present in the literature, such as local polynomial regressions and Gaussian process regression on the whole domain to approximate the gradients; both are available in the ATHENA package [49]. For an estimate of the ridge approximation error due to inaccurate gradients see [12]. Finally, we remark that in the AS method we approximate the random variable X as P̂_r X = Σ_{i=1}^r (v_i^T X) v_i, with {v_i} ⊂ R^m the active eigenvectors, whereas with KAS the reduced input space is contained in the span of the active eigenvectors of H̃, obtained in Algorithm 4 by solving the eigenvalue problem for H̃, with v_i ∈ R^D and ordered eigenvalues (λ_1, . . . , λ_D).
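The gradient-recovery step can be sketched with hypothetical names as follows; NumPy's pinv plays the role of the Moore-Penrose pseudo-inverse computed via the SVD:

```python
import numpy as np

def feature_space_jacobian(dxf, dphi):
    """Solve D_z f~ . Dphi = D_x f for the lifted Jacobian D_z f~ (d x D).

    dxf:  (d, m) Jacobian of f at x.
    dphi: (D, m) Jacobian of the feature map at x, assumed full rank m.
    """
    # pinv(dphi) is (m, D), so the result has shape (d, D).
    return dxf @ np.linalg.pinv(dphi)

# Consistency check: composing the recovered Jacobian with Dphi must
# reproduce D_x f, by the chain rule.
rng = np.random.default_rng(2)
dphi = rng.normal(size=(10, 3))    # feature-map Jacobian, D = 10, m = 3
dxf = rng.normal(size=(2, 3))      # Jacobian of a vector-valued f, d = 2
dzf = feature_space_jacobian(dxf, dphi)
```

Since dphi has full column rank, pinv(dphi) is a left inverse of dphi, so dzf @ dphi recovers dxf exactly.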

Choice of the Feature Map
The choice of the map ϕ is linked to the theory of Reproducing Kernel Hilbert Spaces (RKHS) [8], and it is defined as

ϕ(x) = σ_f √(2/D) cos(Wx + b),    (16)

where σ_f is a hyperparameter corresponding to the empirical variance of the model, W ∈ M(D × m) is the projection matrix whose rows are sampled from a probability distribution µ on R^m, and b ∈ R^D is a bias term whose components are sampled independently and uniformly in the interval [0, 2π]. We remark that its Jacobian can be computed analytically as

(Dϕ(x))_{ji} = −σ_f √(2/D) sin(W_j · x + b_j) W_{ji},

for all i ∈ {1, . . . , m} and for all j ∈ {1, . . . , D}.
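The feature map and its analytic Jacobian can be sketched and verified against central finite differences; the function names are illustrative, and the √(2/D) normalization is the standard random Fourier features convention assumed here:

```python
import numpy as np

def phi(x, W, b, sigma_f=1.0):
    """Feature map: sigma_f * sqrt(2/D) * cos(W x + b)."""
    D = W.shape[0]
    return sigma_f * np.sqrt(2.0 / D) * np.cos(W @ x + b)

def dphi(x, W, b, sigma_f=1.0):
    """Analytic Jacobian: -sigma_f * sqrt(2/D) * sin(W_j . x + b_j) * W_ji."""
    D = W.shape[0]
    return -sigma_f * np.sqrt(2.0 / D) * np.sin(W @ x + b)[:, None] * W

# Central finite-difference check of the analytic Jacobian
rng = np.random.default_rng(0)
m, D = 3, 50
W = rng.normal(size=(D, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
x = rng.normal(size=m)
eps = 1e-6
J_fd = np.column_stack([
    (phi(x + eps * e, W, b) - phi(x - eps * e, W, b)) / (2 * eps)
    for e in np.eye(m)
])
```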
We remark that, in order to guarantee the correctness of the procedure for evaluating the gradients, we would need the feature map to be injective and non-singular. In general, however, the feature map (16) cannot be injective, due to the periodicity of the cosine, but it is at least almost surely non-singular if the dimension of the feature space is high enough.
The feature map (16) is not the only effective immersion that provides a kernel-based extension of the active subspaces. For example, an alternative is the following composition of a linear map with a sigmoid,

ϕ(x) = C (1 + exp(−α W x))^{−1},

applied componentwise, where C is a constant, α is a hyperparameter to be tuned, and W ∈ M(D, m) is, as before, a matrix whose rows are sampled from a probability distribution on R^m.
Other choices involve the use of Deep Neural Networks to learn the profile h and the projection function P r of the ridge approximation problem [67].
The tuning of the hyperparameters of the spectral measure consists in a global optimization problem whose domain dimension can vary between 1 and the dimension of the input space m. The objective function to optimize is the relative root mean square error (RRMSE)

RRMSE(T_test, Y_test) = √( Σ_{i=1}^N (t_i − y_i)^2 / Σ_{i=1}^N (y_i − ȳ)^2 ),

where T_test = (t_i)_{i∈{1,...,N}} are the predictions obtained from the response surface built with KAS and associated to the test set, Y_test = (y_i)_{i∈{1,...,N}} are the targets associated to the test set, and ȳ is the mean value of the targets. We implemented a logarithmic grid-search, see Algorithm 5, making use of the SciPy library [69]. Another choice could be the Bayesian stochastic optimization implemented in the open-source library GPyOpt [3].

The tuning of the hyperparameters of the chosen spectral measure is the most computationally expensive part of the procedure. We report the computational complexity of the algorithms introduced, to better understand the additional cost implied by the implementation of response surface design with KAS. Let us assume that the number of random Fourier features D, the number of input, output, and gradient samples M, and the dimension of the parameter space m are ordered as D > M > m, as is usually the case, and that the quantity of interest f is a scalar function. The cost of computing an active subspace is O(Mm^2), that is, the cost of the SVD of the gradient matrix dY used to get the active and inactive eigenvectors in Algorithm 1. The cost of training a response surface with Gaussian process regression in Algorithm 2 depends on the cost of minimizing the log-likelihood: each evaluation of the log-likelihood involves the computation of the determinant and the inverse of the regularized Gram matrix K(θ) + σI_M, that is O(M^3).
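The tuning score can be sketched as follows; the function name and the grid range are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def rrmse(predictions, targets):
    """Relative root mean square error: the prediction error normalized by
    the spread of the targets around their mean."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num = np.sum((predictions - targets) ** 2)
    den = np.sum((targets - np.mean(targets)) ** 2)
    return float(np.sqrt(num / den))

# Logarithmic grid for a scalar spectral-measure hyperparameter
grid = np.logspace(-3, 3, num=25)

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
perfect = rrmse(y_demo, y_demo)                 # exact predictions
baseline = rrmse(np.full(4, y_demo.mean()), y_demo)  # predicting the mean
```

Note that RRMSE = 0 for exact predictions and RRMSE = 1 for the trivial predictor that always returns the target mean, which is why values well below 1 indicate a useful response surface.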
Finally, the cost of evaluating the kernel-based active subspace is associated to the SVD of dỸ, that is O(DM^2) in Algorithm 4, and to the resolution of the underdetermined linear systems giving the gradients dỸ, that is M times O(Dm^2), since it is related to the evaluation of the pseudo-inverse of Dϕ. So, the computational complexity for the response surface design with AS and GPR is O(n_GPR M^3), while for the response surface design with KAS it is O(n_grid-search n (DM^2 + n_GPR M^3)), where n_GPR is the maximum number of steps of the optimization algorithm used to minimize the log-likelihood, n_grid-search is the number of hyperparameter instances γ ∈ G to try in Algorithm 5, and n is the number of batches in the n-fold cross-validation procedure. In particular, for each grid-search hyperparameter the main cost is associated to the GPR training, since n_GPR usually satisfies Dn < n_GPR M when the optimizer chosen is L-BFGS-B from SciPy [69], accounting also for the number of restarts of the optimizer: in the numerical tests we performed, the number of restarts of the GPR training is problem-dependent but always less than 10. In general, the number n_grid-search depends on the chosen application, and the multiplicative factor between the computational complexity of the response surface design procedure with KAS and with AS is lower than 3 n_grid-search n.
Algorithm 5 Tuning the feature map with logarithmic grid-search.
Require: active subspace dimension r
Require: tolerance for the tuning procedure tol ≈ 0.8
1: Create the grid G and set the variable BEST to 1.
2: for each hyperparameter instance γ ∈ G do
3: Sample the projection matrix W from the spectral measure with hyperparameter γ.
4: Sample the bias b uniformly in [0, 2π]^D.
5: Compute score with n-fold cross validation:
6: for i = 1 to n do
7: Divide input, output, and gradients in train and test datasets.
8: Compute the kernel-based active subspace with Algorithm 4 on the training data.
9: Build the Gaussian process response surface with Algorithm 2.
10: Predict the values T_test using Algorithm 3 with input X_test.
11: Set score[i] to RRMSE(T_test, Y_test).
12: if score[i] > tol then
13: Stop cross validation and pass to the next value of γ.
14: end if
15: end for
16: if mean(score) < BEST then
17: Save W and b, and set BEST to mean(score).
18: end if
19: end for
Random Fourier Features
The motivation behind the choice of this map, Equation (16), comes from the theory of Reproducing Kernel Hilbert Spaces. The infinite-dimensional Hilbert space (H, ⟨·, ·⟩) is assumed to be an RKHS with a real shift-invariant kernel k : X × X → R with k(0) = 1 and feature map ϕ.
In order to get a discrete approximation of ϕ : X ⊂ R^m → H, random Fourier features are employed [47,36]. Bochner's theorem [41] guarantees the existence of a spectral probability measure µ such that

k(x − y) = ∫_{R^m} e^{iω·(x−y)} dµ(ω).

From this identity we can get a discrete approximation of the scalar product ⟨·, ·⟩ with the Monte Carlo method, exploiting the fact that the kernel is real:

k(x − y) = E_µ[cos(ω·(x − y))] ≈ (2/D) Σ_{i=1}^D cos(ω_i·x + b_i) cos(ω_i·y + b_i) = z(x)^T z(y),   z(x) := √(2/D) (cos(ω_i·x + b_i))_{i=1,...,D},

and from this relation we obtain the approximation ϕ ≈ z. The sampled vectors {ω_i}_{i=1,...,D} are called random Fourier features. The scalars {b_i}_{i=1,...,D} are bias terms, introduced since in the approximation we have excluded some trigonometric terms from the initial expression

cos(ω·(x − y)) = cos(ω·x)cos(ω·y) + sin(ω·x)sin(ω·y).

Random Fourier features are frequently used to approximate kernels. We consider only spectral probability measures which have a probability density, usually named spectral density. In the approximation of the kernel with random Fourier features, under some regularity conditions on the kernel, an explicit probabilistic bound depending on the dimension of the feature space D can be proved [41]. This technique is used to scale up Kernel Principal Component Analysis [56,55] and Supervised Kernel Principal Component Analysis [7]; in the case of kernel-based AS, instead, the underdetermined linear system employed to compute the Jacobian matrix of the new model function increases in dimension.
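A small numerical check of the random Fourier feature approximation; the spectral measure ω ∼ N(0, I/l^2) corresponds to the RBF kernel under the convention k(δ) = E_µ[e^{iω·δ}], an assumption made here for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, l = 2, 20000, 1.0
# For the RBF kernel exp(-||x-y||^2 / (2 l^2)), the spectral measure in the
# convention k(delta) = E[exp(i omega . delta)] is omega ~ N(0, I / l^2).
omega = rng.normal(scale=1.0 / l, size=(D, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    """Random Fourier feature vector, z(x).z(y) approximates k(x - y)."""
    return np.sqrt(2.0 / D) * np.cos(omega @ x + b)

x, y = np.array([0.3, -0.7]), np.array([1.0, 0.2])
approx = float(z(x) @ z(y))
exact = float(np.exp(-np.sum((x - y) ** 2) / (2 * l**2)))
```

With D = 20000 features the Monte Carlo error of the inner product is on the order of 1/√D, so the approximation agrees with the exact kernel to a few digits.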
The best-known kernel is the squared exponential kernel, also called Radial Basis Function (RBF) kernel,

k(x, y) = exp(−∥x − y∥^2 / (2l^2)),

where l is the characteristic length-scale; its spectral density is the Gaussian N(0, 1/(4π^2 l^2)) [73]. Thanks to Bochner's theorem, to every probability distribution that admits a probability density function there corresponds a stationary positive definite kernel. So, having in mind the definition of the feature map ϕ from Equation (16), we can choose any probability distribution for sampling the random projection matrix W ∈ M(D × m) without focusing on the corresponding kernel, since it is not needed by the numerical procedure.
After the choice of the spectral measure, the corresponding hyperparameters have to be tuned. This is linked to the choice of the hypothesis model in machine learning, and it is usually carried out on the hyperparameters of the employed kernel. The choice of the kernel and of the corresponding hyperparameters implicitly assumes some regularity properties of the model [73].

Benchmark test problems
In this section we present some benchmarks to demonstrate the potential gain of KAS over standard linear AS, for both scalar and vectorial model functions. In particular, we test KAS on radially symmetric functions with 2-dimensional and 8-dimensional parameter spaces, on the approximation of the reproduction number R_0 of a SEIR model, and finally on a vectorial output function, that is, the solution of a Poisson problem.
One-dimensional response surfaces are built following the algorithm described in subsection 2.1. The tuning of the hyperparameters of the feature map is carried out with a logarithmic grid-search and 5-fold cross-validation for the Ebola test case, while for the other cases we employed the Bayesian stochastic optimization implemented in [24] with 3-fold cross-validation. The score function chosen is the Relative Root Mean Square Error (RRMSE). The spectral measure for each test case is chosen by brute force among the Laplace, Gaussian, Beta, and multivariate Gaussian distributions. The number of Fourier features is not established based on a criterion, but we have seen experimentally that, above a certain threshold, the number of features is high enough to at least reproduce the accuracy of the AS method. Since the part most sensitive for the final accuracy of the response surface is the tuning of the hyperparameters of the spectral measure, we suggest choosing an affordable number of features between 1000 and 2000, and focusing on the tuning of said hyperparameters instead. We remark that the number of samples employed is problem-dependent: some heuristics to determine it can be found in [12], but the crucial point is that the novel KAS method does not need additional training samples with respect to the ones used for the AS method. Moreover, the CPU time of the hyperparameter tuning procedure is usually negligible with respect to the time required to obtain input-output pairs from the numerical simulation of PDE models: in our applications the tuning procedure's computational time is on the order of minutes (usually around 10-15 minutes for most test cases), while for the CFD application of Section 5 it is on the order of days, and for the stochastic elliptic partial differential equation of Subsection 4.3 on the order of hours.
We also remark that the tuning Algorithm 5, the GPR training restarts, and the choice of the spectral measure can be easily parallelized.
For the radially symmetric and Ebola test cases the inputs are sampled from a uniform distribution with problem-dependent ranges. For the stochastic elliptic partial differential equation the inputs are the coefficients of a Karhunen-Loève expansion, sampled from a normal distribution. All the computations regarding AS and KAS are done using the open-source Python package ATHENA [49].

Radial symmetric functions
Radially symmetric functions represent a class of model functions for which AS is not able to unveil any low-dimensional behaviour: for these functions, any rotation of the parameter space produces the same model representation. Kernel-based AS, instead, is able to overcome this problem thanks to the mapping onto the feature space.
We present two benchmarks: an 8-dimensional hyperparaboloid, defined as f(x) = Σ_{i=1}^8 x_i^2, and a surface of revolution in R^3 with generatrix g(x) = sin(·). The gradients are computed analytically. For the hyperparaboloid we use N_s = 500 independent, uniformly distributed training samples in [−1, 1]^8, while for the sine case the training samples are N_s = 800 in [−3, 3]^2. In both cases the test samples are 500, and the feature space has dimension 1000. The spectral distributions chosen are a multivariate normal with a uniform variance λI_d as hyperparameter, and a product of Laplace distributions with γ and b as hyperparameters, for the first and the second case, respectively. The tuning is carried out with 3-fold cross-validation. The results are summarized in Table 1. Looking at the eigenvalues of the uncentered covariance matrices of the gradients for the hyperparaboloid case in Figure 2, we can clearly see how the decay for AS is almost absent, while for KAS the decay after the first eigenvalue is pronounced, suggesting the presence of a kernel-based active subspace of dimension 1. The one-dimensional sufficient summary plots, which plot f(x) against W_1^T x in the AS case, or against W_1^T ϕ(x) in the KAS case, are shown in Figure 3 and Figure 4, respectively. On the left panels we present the Gaussian process response surfaces obtained from the active subspaces reduction, while on the right panels the ones obtained with the kernel-based AS extension. As we can see, AS fails to properly reduce the parameter spaces, since there are no preferred directions along which the model functions vary the most. The KAS approach, on the contrary, is able to unveil the corresponding generatrices. This results in a reduction of the RMS error by a factor of at least 3 (see Table 1).
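A sketch of the hyperparaboloid benchmark, assuming the canonical form f(x) = Σ_i x_i^2 (the exact expression is an assumption here); it illustrates numerically that the spectrum of the AS matrix is essentially flat, so no linear active subspace exists:

```python
import numpy as np

def hyperparaboloid(X):
    """Radially symmetric benchmark, f(x) = ||x||^2 (assumed canonical form)."""
    return np.sum(X ** 2, axis=1)

def hyperparaboloid_grad(X):
    """Analytic gradient, 2x."""
    return 2.0 * X

# Monte Carlo estimate of the uncentered covariance of the gradients:
# by symmetry it is close to a multiple of the identity, with no decay.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 8))
grads = hyperparaboloid_grad(X)
H = grads.T @ grads / X.shape[0]
evals = np.sort(np.linalg.eigvalsh(H))[::-1]
```

The ratio between the largest and the smallest eigenvalue stays close to 1, in contrast with the pronounced decay obtained after mapping through the feature space.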
Figure 2: Eigenvalues of the covariance matrix Ĥ ∈ R^{8×8} for the AS procedure applied to the hyperparaboloid case on the left, and the first 10 eigenvalues of the covariance matrix Ĥ ∈ R^{1000×1000} for the KAS procedure applied to the same case on the right.

SEIR model for the spread of Ebola
In most engineering applications the output of interest presents a monotonic behaviour with respect to the parameters: an increase in the inputs produces a corresponding monotone response in the outputs. Rarely does the model function have radial symmetry, and in such cases the parameter space can be divided into subdomains, which are analyzed separately. In this section we present a test case without radial symmetry, showing that even in this case kernel-based AS outperforms AS. For the Ebola test case, the output of interest is the basic reproduction number R_0 of the SEIR model described in [21], with parameters distributed uniformly in Ω ⊂ R^8. The parameter space Ω is a hypercube defined by the lower and upper bounds summarized in Table 2. We can compare the two one-dimensional response surfaces obtained with Gaussian process regression. The training samples are N_s = 800, and we use 1000 features. As spectral measure we use again the multivariate Gaussian distribution N(0, Σ), with the elements of the diagonal of the covariance matrix as hyperparameters. The tuning is carried out with 5-fold cross validation. Even in this case, the KAS approach results in a smaller RMS error with respect to AS (around 60% less), as reported in Table 1. In Figure 5 we report the comparison of the two approaches over an active subspace of dimension 1.
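The k-fold cross-validation tuning mentioned above can be sketched as follows. This is a simplified stand-in, not the paper's Algorithm 5: the hyperparameter being tuned is the lengthscale of a kernel ridge regression (whose mean coincides with a GP posterior mean), on toy one-dimensional data standing in for the projected samples; all names and the grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D regression data standing in for the projected samples (W_1^T x, f(x)).
t = rng.uniform(-3.0, 3.0, size=120)
y = np.sin(t) + 0.05 * rng.standard_normal(t.size)

def krr_predict(t_tr, y_tr, t_te, ell, reg=1e-4):
    """Kernel ridge regression with an RBF kernel of lengthscale ell."""
    K = np.exp(-(t_tr[:, None] - t_tr[None, :])**2 / (2*ell**2))
    Ks = np.exp(-(t_te[:, None] - t_tr[None, :])**2 / (2*ell**2))
    alpha = np.linalg.solve(K + reg*np.eye(t_tr.size), y_tr)
    return Ks @ alpha

def kfold_rrmse(ell, k=5):
    """Mean relative root mean square error over k folds."""
    idx = rng.permutation(t.size)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate(folds[:i] + folds[i+1:])
        pred = krr_predict(t[tr], y[tr], t[te], ell)
        errs.append(np.sqrt(np.mean((pred - y[te])**2)) / np.std(y[te]))
    return float(np.mean(errs))

grid = np.logspace(-2, 1, 8)                   # logarithmic grid over the hyperparameter
scores = [kfold_rrmse(ell) for ell in grid]
best = grid[int(np.argmin(scores))]            # selected hyperparameter
```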

Elliptic Partial Differential Equation with random coefficients
In our last benchmark we apply kernel-based AS to a vector-valued model function, namely the solution of a Poisson problem with heterogeneous diffusion coefficient. We refer to [75] for an application of the AS approach to the same problem.
We consider the following stochastic Poisson problem on the unit square, x = (x, y) ∈ Ω := [0, 1]^2: with homogeneous Neumann boundary conditions on the right side of the domain ∂Ω_right, Neumann boundary conditions on the left side ∂Ω_left, and Dirichlet boundary conditions on the remaining part of ∂Ω. The diffusion coefficient κ : (Ω, A, P) × Ω → R, where A is a σ-algebra, is such that log(κ) is a Gaussian random field, with covariance function C(x, y), where β = 0.03 is the correlation length. This random field is approximated with the truncated Karhunen-Loève decomposition, where (X_i)_{i=1,...,m} are independent standard normal random variables, and (γ_i, ψ_i)_{i=1,...,m} are the eigenpairs of the Karhunen-Loève decomposition of the zero-mean random field log(κ).
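A truncated Karhunen-Loève sample of a log-normal coefficient can be sketched as follows. This is a one-dimensional analogue for brevity, with an assumed exponential covariance with correlation length β; the discrete KL eigenpairs come from the eigendecomposition of the covariance matrix weighted by the grid spacing.

```python
import numpy as np

rng = np.random.default_rng(2)

n, m, beta = 200, 10, 0.03
s = np.linspace(0.0, 1.0, n)
# assumed exponential covariance with correlation length beta
C = np.exp(-np.abs(s[:, None] - s[None, :]) / beta)

# Discrete KL decomposition: eigenpairs of the covariance, weighted by the spacing.
h = s[1] - s[0]
gamma, psi = np.linalg.eigh(C * h)
gamma, psi = gamma[::-1][:m], psi[:, ::-1][:, :m]   # keep the m largest modes

Xi = rng.standard_normal(m)                 # independent standard normals X_i
log_kappa = psi @ (np.sqrt(gamma) * Xi)     # truncated KL of the zero-mean field
kappa = np.exp(log_kappa)                   # log-normal diffusion coefficient
```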
In our simulation the domain Ω is discretized with a triangular unstructured mesh T with 3194 triangles. The parameter space has dimension m = 10. The simulations are carried out with the finite element method (FEM) with polynomial order one, and for each simulation the parameters (X_i)_{i=1,...,m} are sampled from a standard normal distribution. The solution u is evaluated at d = 1668 degrees of freedom, thus (V, R_V) ≈ (R^d, S + M), where the metric R_V is approximated with the sum of the stiffness matrix S ∈ R^{d×d} and the mass matrix M ∈ R^{d×d}. This sum is a discretization of the norm of the Sobolev space H^1(Ω). The number of features used in the KAS procedure is D = 1500, and the number of independent simulations is M = 1000.
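The role of S + M as a discrete H^1 norm can be checked on a one-dimensional sketch. Assuming P1 elements on a uniform grid over [0, 1], u^T (S + M) u should reproduce ‖u′‖²_{L²} + ‖u‖²_{L²}; for u(x) = x this is 1 + 1/3.

```python
import numpy as np

n = 100                       # number of elements
h = 1.0 / n
k = n + 1                     # number of nodes

S = np.zeros((k, k))          # stiffness matrix
M = np.zeros((k, k))          # mass matrix
for e in range(n):            # assemble local P1 contributions element by element
    i, j = e, e + 1
    S[np.ix_([i, j], [i, j])] += (1.0/h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    M[np.ix_([i, j], [i, j])] += (h/6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])

x = np.linspace(0.0, 1.0, k)
u = x                         # u(x) = x: ||u'||^2 = 1, ||u||^2 = 1/3
h1_sq = u @ (S + M) @ u
print(h1_sq)                  # ~ 1 + 1/3
```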
Three outputs of interest are considered. The first target function f : R^m → R is the mean value of the solution at the right boundary ∂Ω_right, and it is used to tune the feature map by minimizing the RRMSE of the Gaussian process regression, as described in Algorithm 5. A summary of the results for the first output is reported in Table 1. The plots of the regression are reported in Figure 6. Also in this case, both from a qualitative and a quantitative point of view, the kernel-based approach achieves the best results. The second output we consider is the solution function itself. This output can be employed as a surrogate model to predict the solution u given the parameters X that define the diffusion coefficient, instead of carrying out the numerical simulation. The surrogate model should be constructed over the span of the modes identified by the chosen reduction strategy, after projecting the data. The AS and KAS modes are distinct, but they detect some common regions of interest, as shown in Table 3. The third output is the evaluation of the solution at a specific degree of freedom with index î; in this case the dimension of the input space is m = 100. Since we use a Lagrangian basis in the finite element formulation and the polynomial order is 1, the node of the mesh associated with the chosen degree of freedom has coordinates [0.27, 0.427] ∈ Ω. Qualitatively we can see from Table 3 that the AS modes locate features in the domain which are relatively more regular with respect to the KAS modes. To obtain this result we increased the dimension of the input space, since otherwise not even the AS modes could properly locate the position in Ω of the chosen degree of freedom.
Figure 6: Comparison between the sufficient summary plots obtained from the application of the AS and KAS methods for the stochastic PDE model, defined in Equations (27) and (30). The left plot refers to AS, the right plot to KAS. With the blue solid line we depict the posterior mean of the GP, with the shaded area the 68% confidence intervals, and with the blue dots the testing points.
In the second and third case the diffusion coefficient is expressed in terms of the KAS modes, where v_j ∈ R^D, j ∈ {1, . . . , D}, is the j-th active eigenvector from the KAS procedure, the functions Ψ̃ := (ψ̃_1, . . . , ψ̃_D) are defined through the feature map φ of Equation (16), with projection matrix W and bias b, and Ψ := (γ_1 ψ_1, . . . , γ_m ψ_m). The gradients of the three outputs of interest are evaluated with the adjoint method.

A CFD parametric application of KAS solved with the DG method
We want to test the kernel-based extension of the active subspaces in a computational fluid dynamics context. The lift and drag coefficients of a NACA 0012 airfoil are considered as model functions. Numerical simulations are carried out with different input parameters for the quantities that describe the geometry and the physical conditions of the problem. The simulation is advanced in time until a periodic regime is reached. Once the simulation data have been collected, sensitivity analysis is performed searching for an active subspace, and response surfaces are then built with GPR from the application of the AS and KAS techniques. The fluid motion is modelled through the unsteady incompressible Navier-Stokes equations, approximated with the Chorin-Temam operator-splitting method implemented in HopeFOAM [4]. HopeFOAM is an extension of OpenFOAM [1,72], an open source software for the solution of complex fluid flow problems, to variable high-order element methods, and it adopts a Discontinuous Galerkin method based on the formulation proposed by Hesthaven and Warburton [28].
The Discontinuous Galerkin (DG) method is a high-order method with appealing features such as low artificial viscosity and a convergence rate which is optimal also on the unstructured grids commonly used in industrial frameworks. In addition, DG is naturally suited for the solution of problems described by conservative governing equations (Navier-Stokes equations, Maxwell's equations, and so on) and for parallel computing. All these properties are due to the fact that, differently from formulations based on standard finite elements, no continuity is imposed on the cell boundaries and neighboring elements only exchange a common flux. The major drawback of DG is its high computational cost with respect to continuous Galerkin methods, due to the duplication of the degrees of freedom at the element interfaces. Nowadays efforts are aimed at applying DG to problems which involve deformable domains [76] and at improving its computational efficiency with techniques based on hybridization methods, matrix-free implementations, and massive parallelization [42,45].

Domain and mesh description
The domain Ω of the fluid dynamic simulation is a two-dimensional duct with a sudden area expansion, and a NACA 0012 airfoil is placed in the larger section. The inflow boundary ∂Ω_I is placed at the beginning of the narrower part of the duct, where the fluid velocity is set constant along the whole inlet boundary. The outlet, on the right hand side, is denoted with ∂Ω_O. We denote with ∂Ω_W := ∂Ω \ {∂Ω_O ∪ ∂Ω_I} the boundary of the airfoil and the walls of the duct, where no slip boundary conditions are applied. The horizontal lengths of the two sections of the channel are 0.6 m and 1.35 m, respectively. The vertical length of the duct after the area expansion is 0.4 m, while the width of the first section depends on two distinct parameters. The airfoil has a chord length equal to 0.1 m, but its position with respect to the duct and its angle of attack are described by geometric parameters. Further details about the geometric parametrization are provided in the following section. A proper triangulation is designed with the aid of the gmsh [2] tool, and the domain is discretized with 4445 unstructured elements.
The evaluation of the dimensionless magnitudes commonly used to characterize the fluid flow field requires the definition of some reference quantities. For the problem at hand we consider the equivalent diameter of the channel at the inlet as the reference length scale, while the reference velocity is the one imposed at the inlet.

Parameter space description
We chose 7 heterogeneous parameters for the model: 2 physical, and 5 geometrical, which describe the width of the channel and the position of the airfoil. Table 4 reports the ranges of the geometrical and physical parameters of the simulation. U is the first component of the initial velocity, ν is the kinematic viscosity, x_0 and y_0 are the horizontal and vertical components of the translation of the airfoil with respect to its reference position (see Figure 7), α is the angle of the counterclockwise rotation, whose center is located in the middle of the airfoil, and y_+ and y_− are the magnitudes of the vertical displacements of the upper and lower sides of the initial duct from a prescribed position.

Governing equations
The CFD problem is modeled through the incompressible Navier-Stokes equations, and the open source solver HopeFOAM [4] has been employed to solve them [28].
Let Ω ⊂ R^2 be the two-dimensional domain introduced in subsection 5.1, and let us consider the incompressible Navier-Stokes equations. Omitting the dependence on (x, t) ∈ Ω × R^+ in the first two equations for the sake of compactness, the governing equations are

∂u/∂t + (u · ∇)u = −∇p + ν∆u,
∇ · u = 0,
u(x, 0) = u_0,

where p is the scalar pressure field, u = (u, v) is the velocity field, ν is the viscosity constant, and u_0 is the initial velocity. In conservative form, the first equation can be rewritten with the advection term expressed through the flux F(u) = u ⊗ u. From now on, in order to have a more compact notation, the advection term is written as N(u) = ∇ · F(u).
Figure 7: In order are represented the maximum angle of attack α, the ranges for the horizontal translation x_0, the ranges for the vertical translation y_0, and the minimum opening of the channel, which depends on the parameters y_+ and y_− in Table 4.
For each timestep the procedure is split into three stages, according to the algorithm proposed by Chorin and adapted to a DG framework by Hesthaven et al. [28]: the solution of the advection-dominated conservation law component, the pressure-correction projection onto weakly divergence-free velocities, and the viscosity update. The nonlinear advection term is treated explicitly in time through a second order Adams-Bashforth method [22], while the diffusion term is treated implicitly. The Chorin algorithm is reported in Algorithm 6.
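The pressure-correction stage of the Chorin splitting can be sketched as follows. This is a minimal illustration on a periodic 2D grid with an FFT-based Poisson solve, assumed for self-containedness; the duct problem in the text uses DG with wall/inlet/outlet conditions instead.

```python
import numpy as np

n, L, dt = 64, 2*np.pi, 1e-3
x = np.linspace(0.0, L, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

# tentative velocity after the explicit advection stage (made non-solenoidal on purpose)
u = np.sin(X)*np.cos(Y) + 0.3*np.sin(2*X)
v = -np.cos(X)*np.sin(Y) + 0.3*np.cos(Y)

kf = 2*np.pi*np.fft.fftfreq(n, d=L/n)      # angular wavenumbers
KX, KY = np.meshgrid(kf, kf, indexing="ij")
K2 = KX**2 + KY**2
K2[0, 0] = 1.0                             # guard the mean mode against division by zero

uh, vh = np.fft.fft2(u), np.fft.fft2(v)
div_h = 1j*KX*uh + 1j*KY*vh                # divergence of the tentative velocity
ph = -div_h / (dt * K2)                    # solve lap(p) = div(u~) / dt
ph[0, 0] = 0.0
uh -= dt * 1j*KX*ph                        # u = u~ - dt grad(p): projection step
vh -= dt * 1j*KY*ph
u_new = np.real(np.fft.ifft2(uh))
v_new = np.real(np.fft.ifft2(vh))
```

After the projection the corrected velocity is divergence-free up to round-off, which is the defining property of this stage.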
In order to recover the Discontinuous Galerkin formulation, the equations introduced by the Chorin method are projected onto the solution space by introducing a proper set of test functions, and the variables are then approximated over each element as a linear combination of local shape functions. DG does not impose the continuity of the solution between neighboring elements, and therefore it requires methods for the evaluation of the flux exchange between neighboring elements. In the present work the convective fluxes are treated according to the Lax-Friedrichs scheme, while the viscous ones are handled through the Interior Penalty method [6,58].
The aerodynamic quantities we are interested in are the lift and drag coefficients in the incompressible case, computed from the quantities u, p, ν, A_ref, and u_0 with a contour integral along the airfoil Γ,

F = ∮_Γ (p n − ν (∇u + ∇u^T) n) dΓ.

The vector n is the outward normal along the airfoil surface. The circulation in Γ is affected by both the pressure and stress distributions around the airfoil. The projection of the force along the horizontal and vertical directions gives the drag and lift coefficients, respectively,

C_D = F · e_x / ((1/2) |u_0|^2 A_ref),   C_L = F · e_y / ((1/2) |u_0|^2 A_ref),

where the reference area A_ref is the chord of the airfoil times a length of 1 m. For the aerodynamic analysis of the fluid flow past an airfoil see [33].

Algorithm 6 Chorin Algorithm.
Require: state variables u and p at t = 0, mesh, and boundary conditions
1: while t < t_final do
2:   Update state variables u^{n−1} = u^n, u^n = u^{n+1}.
3:   Find a guess value for the velocity ũ by solving the explicit advection step.
4:   Solve the pressure Poisson equation for p^{n+1} and project ũ onto a weakly divergence-free field.
5:   Solve the implicit viscous step for u^{n+1}.
6:   Update t^n.
7: end while
8: return state variables u and p at t = t_final
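The pressure part of the contour integral above can be sketched as a discrete sum over boundary panels. This is an illustrative sketch: the viscous stress term is omitted, the panel geometry and function names are hypothetical, and ρ = 1 is assumed (kinematic units), consistently with the incompressible setting.

```python
import numpy as np

def pressure_force(pts, p):
    """Pressure contribution to the force, F = -sum(p * n * ds) over the panels.

    pts: (n, 2) vertices of a closed counterclockwise contour; p: pressure per panel.
    """
    d = np.roll(pts, -1, axis=0) - pts                       # panel edge vectors
    ds = np.linalg.norm(d, axis=1)                           # panel lengths
    nrm = np.stack([d[:, 1], -d[:, 0]], axis=1) / ds[:, None]  # outward normals (ccw contour)
    return -np.sum(p[:, None] * nrm * ds[:, None], axis=0)

def coefficients(F, u_inf, a_ref):
    """Drag and lift coefficients for a flow along x, with rho = 1."""
    q = 0.5 * np.dot(u_inf, u_inf) * a_ref
    return F[0] / q, F[1] / q
```

A quick sanity check: a uniform pressure on a closed contour produces zero net force, since the panel normals sum to zero on a closed polygon.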

Numerical results
In this section a brief review of the procedure and some details about the numerical method and the computational domain are presented, along with the obtained results. Regarding the DG discretization, the chosen polynomial order is 3, and the total number of degrees of freedom is 133350. Small variations of the mesh are present in each of the 285 simulations due to the different configurations of the domain. Each simulation is carried out until a periodic behaviour is reached; for this reason the final times range between 3.5 and 5 s, depending on the specific configuration. The integration time intervals are variable and are updated at the end of each step in order to satisfy the CFL condition. The 7 physical and geometrical parameters of the simulation are sampled uniformly from the intervals in Table 4. In total we consider a dataset of 285 samples. With the purpose of qualitatively visualizing the results, 4 different simulations are reported in Figure 8, for the magnitude of the velocity field and the scalar pressure field, both evaluated at the last time instant. These simulations were chosen from the 285 collected in order to show significant differences in the evolution of the fluid flow. Table 5 reports the corresponding parameters. Depending on the position of the airfoil and on the other physical parameters, different fluid flow patterns can be qualitatively observed.
Table 5: Parameters associated with the simulations plotted in Figure 8.
The lift (C_L) and drag (C_D) coefficients are evaluated when stationary or periodic regimes are reached, starting from the values of the pressure and of the viscous stresses evaluated on the nodes close to the airfoil. After this, sensitivity analysis is carried out. First the AS method is applied: the gradients necessary for its application are obtained from the Gaussian process regression of the model functions C_L and C_D on the whole parameter domain. The eigenvalues of the uncentered covariance matrix for the lift and drag coefficients suggest the presence of a one-dimensional active subspace in both cases.
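The detection of a one-dimensional active subspace from sampled gradients can be sketched as follows. Here the gradients are analytic for an illustrative ridge function (in the text they come from the GP regression of C_L and C_D); the direction a and the sample sizes are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N = 7, 285
a = np.array([0.2, -1.0, 0.0, 0.5, 1.5, 0.1, -0.3])    # illustrative ridge direction

X = rng.uniform(-1.0, 1.0, size=(N, m))
grads = np.cos(X @ a)[:, None] * a                     # gradient of f(x) = sin(a^T x)

H = grads.T @ grads / N                                # uncentered covariance of the gradients
evals, evecs = np.linalg.eigh(H)
evals, evecs = evals[::-1], evecs[:, ::-1]             # descending order
w1 = evecs[:, 0]                                       # first active eigenvector

# for a ridge function the spectrum is rank-1: a one-dimensional active subspace
print(evals[1] / evals[0])                             # close to machine zero
```

The components of w1 then serve as the sensitivity measures plotted in Figure 9.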
The plots of the first active eigenvector components are useful as sensitivity measures, see Figure 9: the greater the absolute value of a component, the greater its influence on the model function. We observe that the lift coefficient is influenced mainly by the vertical position of the airfoil and by the angle of attack, while the drag coefficient depends mainly on the initial velocity, and secondarily on the viscosity and on the angle of attack.
As one could expect from physical considerations, the angle of attack affects both the drag and lift coefficients, while the viscosity, which governs the wall stresses, is relevant for the evaluation of C_D. The vertical position of the airfoil with respect to the symmetry axis of the duct section after the area expansion also greatly affects both coefficients; this is mainly due to the fact that the fluid flow conditions change drastically between the core, where the speed is higher, and the region close to the wall of the duct, where the speed tends to zero. On the other hand, the horizontal translation has almost no impact on the results, given the regularity of the fluid flow along the x-axis for the considered range of x_0. Moreover, the non-symmetric behaviour of the upper and lower parameters which determine the opening of the channel is due to the non-symmetric choice of the range considered for the angle of attack.
Figure 9: Components of the first active eigenvector for the lift coefficient (on the left), and for the drag coefficient (on the right). Values near 0 suggest little sensitivity of the target function.
The KAS method was applied with 1500 features. In order to compare the AS and KAS methods, 5-fold cross validation was implemented; the cross-validation score is the relative root mean square error (RRMSE) defined in Equation 19. The Gaussian process regressions for the two methods are shown in Figure 10 for the lift coefficient, and in Figure 11 for the drag coefficient. They were obtained as a single step of 5-fold cross validation, with one fifth of the 285 samples used as test set. The spectral distribution of the feature map is the Gaussian distribution for the lift, and the Beta distribution for the drag, respectively. The RRMSE mean and standard deviation from 5-fold cross validation are reported for different active dimensions in Table 6. The feature map from Equation (16) was adopted. The hyperparameters of the spectral distributions were tuned with a logarithmic grid search with 5-fold cross validation, as described in Algorithm 5.
Figure 11: Comparison between the sufficient summary plots obtained from the application of the AS and KAS methods for the drag coefficient C_D defined in Equation 39. The left plot refers to AS, the right plot to KAS. With the blue solid line we depict the posterior mean of the GP, with the shaded area the 68% confidence intervals, and with the blue dots the testing points.
Regarding the drag coefficient, the relative gain using the KAS method reaches 19.2% on average when employing the Beta spectral measure for the definition of the feature map. The relative gain of the one-dimensional response surface built with GPR from the KAS method is 7% on average for the lift coefficient; this could be due to the higher noise in the evaluation of C_L. In this case the relative gain increases when the dimension of the response surface is increased to 2, with a gain of 14.6%. A slight reduction of the AS RRMSE for the drag coefficient is observed when increasing the dimension of the response surface.

Conclusions and perspectives
In this work we presented a new nonlinear extension of the active subspaces property, called Kernel-based Active Subspaces (KAS). The method exploits random Fourier features to find active subspaces in high-dimensional feature spaces. We tested the new method on 5 different benchmarks of increasing complexity, and we provided pseudo-codes for every aspect of the proposed kernel extension. The tested model functions range from scalar to vector-valued. We also provided a CFD application discretized with the Discontinuous Galerkin method. We compared kernel-based active subspaces to standard linear active subspaces, and in all cases we observed an increase in the accuracy of the Gaussian process response surfaces built over the reduced parameter spaces. The most interesting result regards the possibility of applying the KAS method when an active subspace does not exist, as shown for radially symmetric model functions. Future developments will involve the study of more efficient procedures for tuning the hyperparameters of the spectral distribution. Other possible advances concern finding an effective back-mapping from the reduced coordinates to the actual parameters in the full original space; this could promote the implementation of optimization algorithms and other parameter studies enhanced by the kernel-based active subspaces extension.

A Appendix -Proof details
In this section we provide an expanded version of the proof of Theorem 1.
Proof of Theorem 1: existence of an active subspace. The proof is remodeled from [44,75], and it is developed in five steps:
1. Since R_V ∈ M(d, d) is symmetric positive definite, there exists a basis of eigenvectors (w_i)_{i∈{1,...,d}} and a corresponding set of positive eigenvalues (β_i)_{i∈{1,...,d}} such that R_V = Σ_{i=1}^d β_i w_i ⊗ w_i.
2. Let us define the ridge approximation error as e(X) := f(X) − h(P_r(X)). Then we can decompose the error analysis for each component employing the spectral decomposition of step 1, so we can define e_i(X) = w_i · e(X) = f_i(X) − h_i(P_r(X)), ∀i ∈ {1, . . . , d}, and treat each component separately.
4. The spectral decomposition of step 1 is employed again, and the covariance matrix H is introduced in the last equation:

β_i C(C_p(ρ, P_r(X))) tr((I − P_r^T) E_P[((∇f(X))^T w_i) ⊗ ((∇f(X))^T w_i)] (I − P_r))^{1/q}
  = C(C_p(ρ, P_r(X))) tr((I − P_r^T) E_P[(∇f(X))^T (Σ_{i=1}^d β_i^q w_i ⊗ w_i) ∇f(X)] (I − P_r))^{1/q}
  = C(C_p(ρ, P_r(X))) tr((I − P_r^T) E_P[(∇f(X))^T R_V(ρ) ∇f(X)] (I − P_r))^{1/q}
  = C(C_p(ρ, P_r(X))) tr((I − P_r^T) H (I − P_r))^{1/q},

where R_V(ρ) is the original metric matrix if ρ belongs to the first or second class of Assumption 3, and is equal to Σ_{i=1}^d β_i^q w_i ⊗ w_i if ρ belongs to the third class.
5. Finally, the bound in the statement of the theorem is recovered by solving the following minimization problem with classical model reduction arguments employing the singular value decomposition (SVD):

P̂_r = argmin_{P_r ∈ O(m,m)} tr((I − P_r^T) H (I − P_r)). (48)
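The minimization in (48) can be checked numerically: over rank-r orthogonal projectors, tr((I − P^T) H (I − P)) is minimized by the projector onto the r leading eigenvectors of H, and the optimal value equals the sum of the trailing eigenvalues. The dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m, r = 6, 2
A = rng.standard_normal((m, m))
H = A @ A.T                                  # symmetric positive semi-definite matrix

evals, V = np.linalg.eigh(H)                 # eigenvalues in ascending order
P_opt = V[:, -r:] @ V[:, -r:].T              # projector onto the r leading eigenvectors

def residual(P):
    """Residual tr((I - P) H (I - P)) for an orthogonal projector P."""
    R = np.eye(m) - P
    return np.trace(R @ H @ R)

# the optimal residual equals the sum of the m - r trailing eigenvalues of H
print(residual(P_opt), evals[:-r].sum())
```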