Biometrics. Author manuscript; available in PMC 2009 Apr 6.
Published in final edited form as:
PMCID: PMC2665800

Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models


We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.

Keywords: BLUPs, Kernel function, Model/variable selection, Nonparametric regression, Penalized likelihood, REML, Score test, Smoothing parameter, Support vector machines

1. Introduction

Analysis of microarray data has mainly focused on detecting individually significantly expressed genes (Efron et al., 2001; Tusher, Tibshirani, and Chu, 2001). This approach has some major limitations: (1) a long list of individually significant genes without a single encompassing theme is difficult to interpret; (2) cellular processes often affect sets of genes and individually highly ranked genes are often downstream genes, so moderate changes in many genes may give more insight into biological mechanisms than a dramatic change in a single gene (Mootha et al., 2003); (3) individually highly ranked genes can be poorly annotated and are often not reproducible across studies (Fortunel et al., 2003). Researchers have now become more interested in knowledge-based studies on gene sets, for example, genetic pathways that are more biologically interpretable and reproducible (Goeman et al., 2005; Subramanian et al., 2005).

A data example motivating the proposed research is the data from the Michigan prostate cancer study (Dhanasekaran et al., 2001). Prostate-specific antigen (PSA) has been routinely used as a biomarker for screening prostate cancer. Recently there have been significant breakthroughs in the effort of finding candidate genes related to prostate cancer. The early results of Dhanasekaran et al. (2001) indicate that certain functional genetic pathways seemed dysregulated in prostate cancer relative to noncancerous tissues. One is interested in studying the genetic pathway effects on PSA after adjusting for effects of clinical and demographic covariates. Due to the complicated unknown relationships between genes and PSA, we propose a flexible framework to model the genetic pathway effect parametrically or nonparametrically.

There is a vast literature on multidimensional nonparametric modeling. Methods such as multivariate kernel smoothing (Wand and Jones, 1995), projection pursuit regression (Friedman and Stuetzle, 1981), and multivariate adaptive regression splines (MARS) (Friedman, 1991), are usually computationally expensive. Popular spline-based methods include generalized additive models (GAMs) (Hastie and Tibshirani, 1990), thin-plate splines (Wahba, 1990; Green and Silverman, 1994), penalized regression splines (Ruppert, Wand, and Carroll, 2004), and smoothing spline ANOVA (Gu, 2002). These methods require the specification of the smoothness condition of an unknown function using differentiability conditions, which is much more involved and awkward in multidimensional settings.

In the past decade, the kernel machine method has been developed in machine learning as a powerful learning technique for multidimensional data (Vapnik, 1998; Schölkopf and Smola, 2002; Suykens et al., 2002; Rasmussen and Williams, 2006). Popular examples of kernel machine methods include support vector machine (SVM) (Vapnik, 1998) and Bayesian Gaussian process (Rasmussen and Williams, 2006). In the context of function approximation, kernel machine methods and spline-based methods share a similar theoretical foundation, but their model-fitting philosophies are different. Kernel machine methods start with a kernel function that implicitly determines the smoothness property of the unknown function. By contrast, spline-based methods start with the smoothness conditions of the unknown function and a corresponding kernel function can usually be derived from these conditions (Wahba, 1990). Kernel machine methods hence greatly simplify specification of a nonparametric model, especially for multidimensional data.

In this article, we propose a semiparametric model for covariate and genetic pathway effects on a continuous outcome (e.g., PSA), where covariates effects are modeled parametrically and genetic pathway effect is modeled parametrically or nonparametrically using least-squares kernel machine (LSKM). We establish a connection between LSKM and linear mixed models, and show that the LSKM estimator of the regression coefficients and the pathway effect can be obtained by fitting a linear mixed model. This connection provides a unified framework for inference of parameters in models with multidimensional covariates, including the regression coefficients, the nonparametric function, and smoothing parameters. Our work extends the connection between univariate smoothing splines and linear mixed models (Speed, 1991; Wang, 1998; Zhang et al., 1998) to multivariate smoothing with an arbitrary kernel function. We also propose a score test to test for the nonparametric genetic pathway effect, and a model/variable selection method within the LSKM framework.

The rest of the article is organized as follows. In Section 2, we present the semiparametric model for Gaussian outcomes. In Section 3, we describe the LSKM method. In Section 4, we establish a connection between LSKMs and linear mixed models and propose a score test for the genetic pathway effect. We discuss the variable selection problem in LSKM in Section 5. The proposed method is illustrated using the prostate cancer microarray data in Section 6, and its performance is evaluated by simulations in Section 7. The article ends with a discussion in Section 8.

2. Semiparametric Model for Multidimensional Data

2.1 The Model

Suppose the data consist of n subjects. For subject i (i = 1, … , n), yi is a normally distributed continuous outcome, xi is a q × 1 vector of clinical covariates and zi is a p × 1 vector of gene expressions within a pathway. We assume an intercept is included in xi. The outcome yi depends on xi and zi through the following partial linear model

$$y_i = x_i^T \beta + h(z_i) + e_i, \tag{1}$$

where β is a q × 1 vector of regression coefficients, h(zi) is an unknown centered smooth function, and the errors ei are assumed to be independent and follow N(0, σ2).

Model (1) models covariate effects parametrically and the pathway effect parametrically or nonparametrically. When h(·) = 0, (1) reduces to the standard linear regression model. When xi = 1, it reduces to LSKM regression (Suykens et al., 2002).

2.2 Specifications of a Function Space of h(z) Using a Kernel

We assume the nonparametric function h(z) lies in a function space 𝒦 generated by a positive definite kernel function K(·,·). From Mercer's theorem (Cristianini and Shawe-Taylor, 2000), under some regularity conditions, a kernel function K(·,·) implicitly specifies a unique function space spanned by a particular set of orthogonal basis functions (features) $\{\phi_j(z)\}_{j=1}^{J}$. In other words, any $h(z) \in \mathcal{K}$ can be represented using a set of bases as $h(z) = \sum_{j=1}^{J} \omega_j \phi_j(z) = \phi(z)^T \omega$ (the primal representation), where ω is a vector of coefficients. Equivalently, h(z) can also be represented using a kernel function K(·,·) as $h(z) = \sum_{l=1}^{L} \alpha_l K(z_l, z; \rho)$ (the dual representation), for some integer L, some constants $\alpha_l$ and some $\{z_1, \ldots, z_L\} \subset R^p$. For a multidimensional z, it is more convenient to specify h(z) using the dual representation, because explicit basis functions or features might be complicated to specify, and the number of features might be high or even infinite.

Two popular kernel functions and the corresponding function spaces are as follows: (1) The dth Polynomial Kernel: $K(z_1, z_2) = (z_1^T z_2 + \rho)^d$, where ρ and d are tuning parameters. The dth polynomial kernel generates the function space 𝒦 spanned by all possible dth-order monomials of the components of z. For example, if d = 1, the first polynomial kernel generates the linear function space with basis functions {ϕj(z)} = {z1, … , zp}. If d = 2, the second polynomial kernel corresponds to the quadratic function space with basis functions {ϕj(z)} = {zk , zk zk′} (k, k′ = 1, … , p), that is, the main effects, all two-way interactions and quadratic main effects of the zk's. (2) The Gaussian Kernel: $K(z_1, z_2) = \exp\{-\|z_1 - z_2\|^2/\rho\}$, where $\|z_1 - z_2\|^2 = \sum_{k=1}^{p}(z_{1k} - z_{2k})^2$. The Gaussian kernel generates the function space spanned by radial basis functions. See Buhmann (2003) for their mathematical properties and desirable features. Examples of other choices of kernel functions include the sigmoid and neural network kernels, and the B-spline kernel (Schölkopf and Smola, 2002). The choice of a kernel function determines which function space one would like to use to approximate h(z).
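As a rough illustration of the two kernels just described, the following NumPy sketch builds the n × n kernel matrix for each. The function names, the default parameter values, and the simulated matrix `Z` are illustrative assumptions and not part of the original analysis; the Gaussian kernel is written with the squared distance divided by ρ.

```python
import numpy as np

def polynomial_kernel(Z1, Z2, rho=1.0, d=2):
    """d-th polynomial kernel matrix with entries (z1^T z2 + rho)^d."""
    return (Z1 @ Z2.T + rho) ** d

def gaussian_kernel(Z1, Z2, rho=1.0):
    """Gaussian kernel matrix with entries exp(-||z1 - z2||^2 / rho)."""
    sq_dist = (
        np.sum(Z1 ** 2, axis=1)[:, None]
        + np.sum(Z2 ** 2, axis=1)[None, :]
        - 2.0 * (Z1 @ Z2.T)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / rho)

# Hypothetical data: 6 subjects, 5 gene expressions each.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(6, 5))
K_poly = polynomial_kernel(Z, Z, rho=1.0, d=2)
K_gauss = gaussian_kernel(Z, Z, rho=2.0)
```

Both matrices are symmetric, and the Gaussian kernel matrix has ones on its diagonal because each point is at distance zero from itself.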

3. LSKM Estimation in the Semiparametric Model

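Assume h(·) ∈ 𝒦, the function space generated by a kernel function K(·,·). Estimation of β and h(·) in (1) proceeds by maximizing the scaled penalized likelihood function

$$J(h, \beta) = -\frac{1}{2}\sum_{i=1}^{n}\{y_i - x_i^T\beta - h(z_i)\}^2 - \frac{1}{2}\lambda\|h\|_{\mathcal{K}}^2, \tag{2}$$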

where λ is a tuning parameter which controls the tradeoff between goodness of fit and complexity of the model. When λ = 0, the model interpolates the gene expression data, whereas when λ = ∞, the model reduces to a simple linear model without h(·).

By the Representer theorem (Kimeldorf and Wahba, 1970), the general solution for the nonparametric function h(·) in (2) can be expressed as

$$h(\cdot) = \sum_{i=1}^{n}\alpha_i K(\cdot, z_i), \tag{3}$$

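where α = (α1, …, αn)T are unknown parameters. Substituting (3) back into (2) we have

$$J(\beta, \alpha) = -\frac{1}{2}\sum_{i=1}^{n}\Big\{y_i - x_i^T\beta - \sum_{j=1}^{n}\alpha_j K(z_i, z_j)\Big\}^2 - \frac{1}{2}\lambda\,\alpha^T K \alpha, \tag{4}$$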

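where K is an n × n matrix whose (i, j)th element is K(zi, zj). Setting the derivatives of J(β, α) with respect to β and α to zero, some calculations give

$$\hat\beta = \{X^T(I + \lambda^{-1}K)^{-1}X\}^{-1}X^T(I + \lambda^{-1}K)^{-1}y, \tag{5}$$
$$\hat\alpha = \lambda^{-1}(I + \lambda^{-1}K)^{-1}(y - X\hat\beta), \tag{6}$$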

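where $X = (x_1^T, \ldots, x_n^T)^T$ and y = (y1, …, yn)T. Plugging (6) into (3), we have that the function h(·) evaluated at the design points (z1, …, zn)T is estimated as

$$\hat h = K\hat\alpha = \lambda^{-1}K(I + \lambda^{-1}K)^{-1}(y - X\hat\beta). \tag{7}$$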

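Using (3) and (6), ĥ(·) at an arbitrary z is

$$\hat h(z) = \lambda^{-1}\{K(z, z_1), \ldots, K(z, z_n)\}(I + \lambda^{-1}K)^{-1}(y - X\hat\beta). \tag{8}$$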

Equivalently, if h(z) = ϕ(z)Tω, where {ϕj(z)} are orthogonal basis functions, the corresponding LSKM regression coefficients ω̂ are


The kernel function K(·,·) usually depends on an unknown parameter ρ, such as the scale parameter in the Gaussian kernel. Inference on β̂ and ĥ(z) depends on λ, ρ and the residual variance σ2, which need to be estimated. Cross-validation can be used to estimate λ; however, its computation is often intensive. Little literature is available on the systematic estimation of ρ and σ2. In the machine learning literature, ρ is often preset at some fixed values. Further, estimation of σ2 needs to properly account for the loss of degrees of freedom from estimating β and h(·). Hence it is desirable to develop a systematic method to estimate these parameters simultaneously. We accomplish this by establishing a connection between LSKM and linear mixed models.
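The closed-form estimators of this section can be sketched in a few lines of NumPy for fixed λ and ρ. The sketch assumes the penalized least-squares criterion ||y − Xβ − Kα||²/2 + λ αᵀKα/2, whose stationary point is β̂ = {Xᵀ(I + λ⁻¹K)⁻¹X}⁻¹Xᵀ(I + λ⁻¹K)⁻¹y and α̂ = λ⁻¹(I + λ⁻¹K)⁻¹(y − Xβ̂). The function name `lskm_fit`, the simulated data, and the chosen λ are hypothetical.

```python
import numpy as np

def lskm_fit(y, X, K, lam):
    """Closed-form LSKM estimates for a fixed kernel matrix K and lambda > 0."""
    n = len(y)
    W = np.linalg.inv(np.eye(n) + K / lam)       # (I + lam^{-1} K)^{-1}
    XtW = X.T @ W
    beta = np.linalg.solve(XtW @ X, XtW @ y)     # weighted least squares for beta
    alpha = W @ (y - X @ beta) / lam             # dual coefficients
    h = K @ alpha                                # h evaluated at the design points
    return beta, alpha, h

# Simulated example (not the prostate cancer data).
rng = np.random.default_rng(1)
n, p = 40, 3
Z = rng.uniform(size=(n, p))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
K = np.exp(-np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2))  # Gaussian, rho = 1
y = X @ np.array([0.5, 1.0]) + np.sin(3 * Z[:, 0]) + 0.1 * rng.normal(size=n)

lam = 0.1
beta_hat, alpha_hat, h_hat = lskm_fit(y, X, K, lam)
```

To predict at a new point z, one would form the vector of kernel values between z and the n design points and take its inner product with `alpha_hat`.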

4. LSKMs and Linear Mixed Models

4.1 Connection Between LSKMs and Linear Mixed Models

Linear mixed models have commonly been used for analyzing longitudinal and hierarchical data (Harville, 1977; Laird and Ware, 1982). A connection between smoothing splines and linear mixed models has been established (Speed, 1991; Wang, 1998; Zhang et al., 1998). We show here that the LSKM estimator in model (1) corresponds to the best linear unbiased predictor (BLUP) estimator from a linear mixed model, and the regularization parameters (τ, ρ) and the residual variance σ2 can be treated as variance components and estimated simultaneously using restricted maximum likelihood (REML).

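To see this connection, simple calculations show that β̂ and ĥ from equations (5) and (7) can be equivalently obtained from the equations

$$\begin{bmatrix} X^T R^{-1} X & X^T R^{-1} \\ R^{-1} X & R^{-1} + (\tau K)^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ h \end{bmatrix} = \begin{bmatrix} X^T R^{-1} y \\ R^{-1} y \end{bmatrix}, \tag{10}$$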

where $R = \sigma^2 I$ and $\tau = \lambda^{-1}\sigma^2$. Equation (10) corresponds exactly to the normal equations of the linear mixed model

$$y = X\beta + h + e, \tag{11}$$

where β is a q × 1 vector of regression coefficients, h is an n × 1 vector of random effects with distribution N(0, τK), and $e \sim N(0, \sigma^2 I)$. A comparison of (11) with model (1) indicates that they have exactly the same form except that h is now treated as random effects. It follows that the BLUPs of the regression coefficients β̂ and the random effects ĥ under the linear mixed model (11) correspond to the LSKM estimator given in Section 3. In fact, one can easily see that the regression coefficient estimator β̂ in (5) is the weighted least-squares estimator under the linear mixed model representation (11) using the marginal covariance of y under (11), $V = \sigma^2 I + \tau K$, that is, $\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y$.
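The equivalence can be checked numerically. The sketch below computes the estimates twice on simulated data: once through the LSKM formulas with λ = σ²/τ, and once through the mixed model with V = σ²I + τK, using the standard BLUP expression ĥ = τKV⁻¹(y − Xβ̂). The data and the values of σ² and τ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
Z = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + np.cos(4 * Z[:, 0]) + 0.2 * rng.normal(size=n)
K = np.exp(-np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2))

sigma2, tau = 0.04, 0.5      # illustrative variance components
lam = sigma2 / tau           # because tau = sigma2 / lambda

# Route 1: LSKM closed form with smoothing parameter lambda.
W = np.linalg.inv(np.eye(n) + K / lam)
beta_lskm = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
h_lskm = K @ W @ (y - X @ beta_lskm) / lam

# Route 2: linear mixed model with marginal covariance V.
V = sigma2 * np.eye(n) + tau * K
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # weighted least squares
h_blup = tau * K @ Vinv @ (y - X @ beta_gls)                 # BLUP of random effects

max_beta_gap = np.max(np.abs(beta_lskm - beta_gls))
max_h_gap = np.max(np.abs(h_lskm - h_blup))
```

Both gaps are zero up to floating-point error, since V = σ²(I + λ⁻¹K) and the factor σ² cancels in each estimator.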

The linear mixed model representation of the LSKM in the semiparametric model (1) can also be considered as a Bayesian Gaussian process regression (Schölkopf and Smola, 2002). Note that this Bayesian correspondence is finite-dimensional (Wahba, 1990; Green and Silverman, 1994). It is not strictly equivalent to a continuous Bayesian Gaussian process (Rasmussen and Williams, 2006), because the finite-dimensional representation of h(·) does not lead to a coherent Bayesian model (Green and Silverman, 1994; Tipping, 2001; Sollich, 2002). To see the Bayesian representation, we can treat {h(z)} as a random vector with a Gaussian process (GP) prior, with mean 0 and covariance cov{h(z1), h(z2)} = τK(z1, z2). Note that the positive definiteness of the kernel function K(·,·) ensures it is a proper covariance function. Now we assume


One can easily see that under this Bayesian model, the semiparametric model (1) becomes the linear mixed model representation (11). This connection extends the connection between scalar smoothing splines and mixed models and their Bayesian formulations (Wang, 1998; Zhang et al., 1998) to multidimensional regression problems under the kernel machine framework.

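The covariances of β̂ and ĥ(·) can be calculated in two ways. The first approach is to treat the true h(·) as a fixed unknown function and the variance of yi as σ2. Using (5) and (7), the covariances of β̂ and ĥ(·) are

$$\mathrm{cov}_F(\hat\beta) = \sigma^2 (X^T V^{-1} X)^{-1} X^T V^{-1} V^{-1} X (X^T V^{-1} X)^{-1}, \tag{12}$$
$$\mathrm{cov}_F(\hat h) = \sigma^2 (\tau K) P^2 (\tau K), \qquad \mathrm{cov}_F\{\hat h(z)\} = \sigma^2 (\tau K_z^T) P^2 (\tau K_z) \ \text{for arbitrary } z, \tag{13}$$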

where $P = V^{-1} - V^{-1}X(X^T V^{-1} X)^{-1}X^T V^{-1}$ and $K_z = \{K(z, z_1), \ldots, K(z, z_n)\}^T$ for an arbitrary z. We term these covariances frequentist covariances.

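The second approach is to use the linear mixed model representation (11) and treat the true h(·) as a random function following the mean zero Gaussian process with covariance τK(·,·). The covariances of β̂ and ĥ(·) can then be calculated as a byproduct of the covariance of the fixed and random effects of the linear mixed model (11) and are

$$\mathrm{cov}_B(\hat\beta) = (X^T V^{-1} X)^{-1}, \tag{14}$$
$$\mathrm{cov}_B(\hat h) = \mathrm{cov}(\hat h - h) = \tau K - (\tau K) P (\tau K), \qquad \mathrm{cov}_B\{\hat h(z)\} = \tau K(z, z) - (\tau K_z^T) P (\tau K_z). \tag{15}$$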

We term these covariances as Bayesian covariances.

4.2 Estimation of the Regularization Parameters and the Residual Variance

We discuss in this section estimation of the regularization parameter τ, the residual variance σ2 and the scale parameter ρ in K(·,·). Using the mixed model representation of LSKM, we propose to estimate (τ, ρ, σ2) simultaneously by treating them as variance components in the linear mixed model (11) and estimating them using REML.

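Specifically, the REML log-likelihood under the linear mixed model (11) can be written, up to an additive constant, as

$$\ell_R(\theta) = -\frac{1}{2}\log|V(\theta)| - \frac{1}{2}\log|X^T V^{-1}(\theta) X| - \frac{1}{2}(y - X\hat\beta)^T V^{-1}(\theta)(y - X\hat\beta), \tag{16}$$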

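where θ = (τ, ρ, σ2)T. The score equations of (τ, ρ, σ2) are

$$\frac{\partial \ell_R}{\partial \theta_l} = -\frac{1}{2}\mathrm{tr}\Big\{P\frac{\partial V(\theta)}{\partial \theta_l}\Big\} + \frac{1}{2}(y - X\hat\beta)^T V^{-1}\frac{\partial V(\theta)}{\partial \theta_l}V^{-1}(y - X\hat\beta) = 0, \quad l = 1, 2, 3, \tag{17}$$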

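where $P = V^{-1} - V^{-1}X(X^T V^{-1} X)^{-1}X^T V^{-1}$. Let A denote the hat matrix so that $X\hat\beta + \hat h = Ay$. Using the identities $V^{-1}(y - X\hat\beta) = \sigma^{-2}(y - X\hat\beta - \hat h)$ and $P = \sigma^{-2}(I - A)$ (Harville, 1977), one can show using equation (17) that $\hat\sigma^2 = \{n - \mathrm{tr}(A)\}^{-1}\sum_{i=1}^{n}\{y_i - x_i^T\hat\beta - \hat h(z_i)\}^2$. Hence tr(A) represents the loss of degrees of freedom from estimating β and h(·) when estimating σ2. The covariance of θ̂ = (τ̂, ρ̂, σ̂2) can be estimated using the information matrix of the REML likelihood, whose $(l, l')$th element is $I_{\theta_l\theta_{l'}} = \frac{1}{2}\mathrm{tr}\{P\,\frac{\partial V(\theta)}{\partial\theta_l}\,P\,\frac{\partial V(\theta)}{\partial\theta_{l'}}\}$.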
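A minimal sketch of REML estimation of (τ, ρ, σ²) is given below. It assumes the standard REML criterion for a linear mixed model with V(θ) = σ²I + τK(ρ) and a Gaussian kernel, and it replaces a proper optimizer with a coarse grid search to stay dependency-free. The function names, the grid values, and the simulated data are illustrative assumptions; in practice mixed model software or a numerical optimizer would be used.

```python
import numpy as np

def gaussian_kernel(Z, rho):
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / rho)

def neg_reml(theta, y, X, Z):
    """Negative REML log-likelihood (up to a constant) at theta = (tau, rho, sigma2)."""
    tau, rho, sigma2 = theta
    n = len(y)
    V = sigma2 * np.eye(n) + tau * gaussian_kernel(Z, rho)
    Vinv_X = np.linalg.solve(V, X)
    XtVinvX = X.T @ Vinv_X
    beta = np.linalg.solve(XtVinvX, Vinv_X.T @ y)      # GLS estimate of beta
    resid = y - X @ beta
    quad = resid @ np.linalg.solve(V, resid)
    logdet_V = np.linalg.slogdet(V)[1]
    logdet_X = np.linalg.slogdet(XtVinvX)[1]
    return 0.5 * (logdet_V + logdet_X + quad)

# Simulated data for illustration only.
rng = np.random.default_rng(3)
n = 40
Z = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + 2 * np.sin(3 * Z[:, 0]) * Z[:, 1] + 0.3 * rng.normal(size=n)

# Coarse grid over (tau, rho, sigma2); a stand-in for a real optimizer.
grid = [(t, r, s) for t in (0.1, 1.0, 10.0)
        for r in (0.1, 1.0, 10.0)
        for s in (0.05, 0.1, 0.5)]
values = [neg_reml(th, y, X, Z) for th in grid]
theta_hat = grid[int(np.argmin(values))]
```

Because the criterion depends on y only through the residual from the generalized least-squares fit, adding any vector of the form Xc to y leaves it unchanged, which is the defining property of a restricted likelihood.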

4.3 Test for the Nonparametric Function

Because we are interested in the effect of a whole genetic pathway rather than individual genes, it is of significant practical interest to test H0: h(z) = 0. In the PSA microarray example, this tests for a genetic pathway effect on PSA controlling for the effects of covariates. Assuming h(z) ∈ 𝒦, one can easily see from the linear mixed model representation (11) that H0: h(z) = 0 is equivalent to testing the variance component τ as H0: τ = 0 versus H1 : τ > 0. Note that the null hypothesis places τ on the boundary of the parameter space. Because the kernel matrix K is not block diagonal, unlike the standard case considered by Self and Liang (1987), the likelihood ratio statistic for H0 : τ = 0 does not follow a mixture of $\chi_0^2$ and $\chi_1^2$ distributions. We consider a score test in this article.

Zhang and Lin (2002) proposed a score test for H0: τ = 0 to compare a polynomial model with a smoothing spline. For smoothing splines, K(·,·) does not depend on any unknown parameter. By contrast, a general kernel function K(·,·) in LSKM might depend on an unknown scale parameter ρ. One can easily see from the linear mixed model (11) that under H0 : τ = 0, the kernel matrix K disappears, and hence the scale parameter ρ disappears and becomes inestimable.

Davies (1987) studied the problem of a parameter disappearing under H0 and proposed a score test by treating the score statistic as a Gaussian process indexed by the nuisance parameter and then obtaining an upper bound to approximate the p-value of the score test. This approach, however, does not work for our setting due to the unboundedness of the parameter space.

We here propose to test for H0 : τ = 0 using the score test by fixing ρ, varying its value, and examining the sensitivity of the score test for H0 : τ = 0 with respect to ρ. The REML version of the score statistic of τ under H0 : τ = 0 can be written, up to a positive scale factor, as $Q_\tau(\hat\beta, \hat\sigma^2; \rho) - \mathrm{tr}\{P_0 K(\rho)\}/2$, where β̂ and σ̂2 are the MLEs of β and σ2 under the linear model $y_i = x_i^T\beta + e_i$, the model under H0, $P_0 = I - X(X^TX)^{-1}X^T$, and

$$Q_\tau(\beta, \sigma^2; \rho) = \frac{1}{2\sigma^2}(y - X\beta)^T K(\rho)(y - X\beta),$$

which is a quadratic function of y and follows a mixture of chi-squares under H0.

Following Zhang and Lin (2002), for each fixed ρ, we use the Satterthwaite method to approximate the distribution of Qτ(·; ρ) by a scaled chi-square distribution $\kappa\chi_\nu^2$, where the scale parameter κ and the degrees of freedom ν are calculated by equating the mean and variance of Qτ(·; ρ) with those of $\kappa\chi_\nu^2$. Specifically, one can show that $\kappa = \tilde I_{\tau\tau}/(2\tilde e)$ and $\nu = 2\tilde e^2/\tilde I_{\tau\tau}$, where $\tilde I_{\tau\tau} = I_{\tau\tau} - I_{\tau\sigma^2} I_{\sigma^2\sigma^2}^{-1} I_{\tau\sigma^2}^T$, $I_{\tau\tau} = \mathrm{tr}[\{P_0 K(\rho)\}^2]/2$, $I_{\tau\sigma^2} = \mathrm{tr}\{P_0 K(\rho) P_0\}/2$, $I_{\sigma^2\sigma^2} = \mathrm{tr}(P_0^2)/2$, and $\tilde e = \mathrm{tr}\{P_0 K(\rho)\}/2$. Computation of the proposed score test is quite simple, because one only needs to fit the simple linear model $y_i = x_i^T\beta + e_i$. We evaluate the performance of the score test using simulations.
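The quantities entering the score test can be computed from the null linear model alone. The sketch below assumes that the statistic takes the form Q = (y − Xβ̂)ᵀK(y − Xβ̂)/(2σ̂²), which is the form whose null mean equals ẽ = tr(P₀K)/2, and that κ = Ĩ/(2ẽ) and ν = 2ẽ²/Ĩ as obtained by matching the mean and variance of κχ²_ν. The function name, the use of the MLE for σ̂², and the simulated data are illustrative assumptions.

```python
import numpy as np

def score_test_components(y, X, K):
    """Score statistic and Satterthwaite constants for H0: tau = 0 at a fixed rho."""
    n = len(y)
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # null projection matrix
    resid = P0 @ y                                       # y - X beta_hat under H0
    sigma2_hat = resid @ resid / n                       # MLE of sigma^2 under H0
    Q = resid @ K @ resid / (2.0 * sigma2_hat)

    P0K = P0 @ K
    I_tt = np.trace(P0K @ P0K) / 2.0
    I_ts = np.trace(P0K @ P0) / 2.0
    I_ss = np.trace(P0 @ P0) / 2.0
    I_tilde = I_tt - I_ts ** 2 / I_ss                    # efficient information for tau
    e_tilde = np.trace(P0K) / 2.0                        # null mean of Q

    kappa = I_tilde / (2.0 * e_tilde)
    nu = 2.0 * e_tilde ** 2 / I_tilde
    return {"Q": Q, "kappa": kappa, "nu": nu, "e_tilde": e_tilde, "I_tilde": I_tilde}

# Simulated data generated under H0 (no pathway effect), Gaussian kernel with rho = 5.
rng = np.random.default_rng(4)
n = 60
Z = rng.uniform(size=(n, 5))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)
K = np.exp(-np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2) / 5.0)
out = score_test_components(y, X, K)
# The approximate p-value is P(chi2_nu > Q / kappa); with SciPy installed this
# could be obtained as scipy.stats.chi2.sf(out["Q"] / out["kappa"], out["nu"]).
```

Repeating the call over a grid of ρ values gives the sensitivity analysis described above.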

5. Model Selection within the Kernel Machine Framework

The kernel machine method requires a kernel function to be explicitly specified. Section 2.2 provides wide choices of kernel functions. A question of substantial interest is which kernel function to choose. This kernel selection problem has much broader implications. We consider two types of kernel selection problems. The first is to choose between different parametric and nonparametric models with different smoothness properties. The second problem involves variable selection.

As stated in Section 2.2, a kernel function fully specifies a function space 𝒦 where the unknown function h(·) resides. Hence this function space determines the type of models used to fit h(·). For example, a dth-degree polynomial kernel specifies a parametric model with dth order monomials; the kernel $K(s, u) = \int_0^1 (s - t)_+ (u - t)_+ \, dt$ specifies a cubic smoothing spline model (Wahba, 1990); and the Gaussian kernel assumes an infinitely smooth function. It is therefore clear that model selection within the kernel machine framework is in fact a special case of kernel selection.

Variable selection can also be treated as a kernel selection problem within the kernel machine framework. For example, let zp be a p-dimensional vector and zp′ a p′ dimensional sub-vector of zp with p′ < p. Then two kinds of kernel functions can be specified: one based on zp and another one based on zp′. The unknown function can then be fitted separately based on each kernel. If the fitted curves are not “far away” from each other, then the model using zp′ provides an equally good but more parsimonious fit than that using zp. This demonstrates that variable selection is also a special case of kernel selection.

These discussions show that model selection is an important topic within the kernel machine framework. However, little work has been done in this area. We propose AIC and BIC as kernel selection criteria within the kernel machine framework. Equations (5) and (7) show that the estimated response ŷ can be expressed as ŷ = Ay, where $A = (I + \lambda^{-1}K)^{-1}[\lambda^{-1}K + X\{X^T(I + \lambda^{-1}K)^{-1}X\}^{-1}X^T(I + \lambda^{-1}K)^{-1}]$ is the LSKM smoothing matrix. Let r = trace(A) be the degrees of freedom of the kernel machine smoother A. We define the least-squares kernel machine (KM) AIC and BIC as

$$\mathrm{KM\_AIC} = n\log(\mathrm{RSS}) + 2r, \qquad \mathrm{KM\_BIC} = n\log(\mathrm{RSS}) + r\log(n),$$

where $\mathrm{RSS} = (y - \hat y)^T(y - \hat y)$. Models with smaller KM_AIC/KM_BIC values are preferred.
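The smoothing matrix A and its trace are simple to compute for a given kernel matrix and λ. The sketch below uses the expression for A given above and, as an assumption, takes the criteria to be n log(RSS) + 2r and n log(RSS) + r log(n). The function name `km_aic_bic`, the fixed λ, and the simulated data are hypothetical; in the proposed method λ and ρ would come from REML rather than being fixed.

```python
import numpy as np

def km_aic_bic(y, X, K, lam):
    """Kernel machine AIC/BIC for a fixed kernel matrix K and lambda > 0."""
    n = len(y)
    W = np.linalg.inv(np.eye(n) + K / lam)                   # (I + lam^{-1} K)^{-1}
    XtW = X.T @ W
    A = W @ (K / lam + X @ np.linalg.solve(XtW @ X, XtW))    # LSKM smoothing matrix
    y_hat = A @ y
    r = np.trace(A)                                          # degrees of freedom
    rss = np.sum((y - y_hat) ** 2)
    aic = n * np.log(rss) + 2.0 * r          # assumed form of KM_AIC
    bic = n * np.log(rss) + r * np.log(n)    # assumed form of KM_BIC
    return aic, bic, r, A

# Simulated data; compare a Gaussian kernel with a linear (ridge) kernel.
rng = np.random.default_rng(5)
n = 50
Z = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + np.cos(2 * Z[:, 0]) * Z[:, 1] + 0.3 * rng.normal(size=n)

K_gauss = np.exp(-np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2))
K_linear = Z @ Z.T
aic_g, bic_g, r_g, A_g = km_aic_bic(y, X, K_gauss, lam=0.5)
aic_l, bic_l, r_l, A_l = km_aic_bic(y, X, K_linear, lam=0.5)
```

The smoother reproduces the columns of X exactly (AX = X), so r lies between the number of covariates q and the sample size n.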

6. Application to the Prostate Cancer Genetic Pathway Data

We applied the proposed semiparametric model to the analysis of the prostate cancer genetic pathway data described in Section 1. The data set contained 59 patients who were clinically diagnosed with local or advanced prostate cancer. The objective of the study was to evaluate whether a genetic pathway has an overall effect on PSA after adjusting for covariates. We focus in this article on the cell growth pathway, which contains five genes. The outcome was pre-surgery PSA level. A log transformation was performed to make the normality assumption plausible. The two covariates were age and Gleason score, a well-established histological grading system for prostate cancer.

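The semiparametric model (1) provides a convenient framework to evaluate the effect of the cell growth pathway on PSA by allowing for complicated interactions among the genes within the pathway. Specifically, we consider the model

$$\log(\mathrm{PSA}) = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{gleason} + h(\mathrm{gene}_1, \ldots, \mathrm{gene}_5) + e,$$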

where h(·) is a nonparametric function and $e \sim N(0, \sigma^2)$. We fit this model using the LSKM method via the linear mixed model representation (11) and using the Gaussian kernel in estimating h(·). Under the linear mixed model representation, we estimated (β0, β1, β2) and h(·) using BLUPs, and estimated the smoothing parameter τ, the kernel parameter ρ and the residual variance σ2 simultaneously using REML. The results are presented in Table 1, indicating that Gleason score was highly significant, while age was not.

Table 1
Parameter estimates of the semiparametric model and the score test for the genetic pathway effect for the PSA data using the LSKM via the linear mixed model representation

We tested for the cell growth pathway effect on PSA, H0 : h(z) = 0 versus H1 : h(z) ∈ 𝒦, using the score test described in Section 4.3. Table 1 gives the score test statistics and p-values for a range of ρ values. The p-values are not sensitive to the choice of ρ and range from 0.0007 to 0.0085, suggesting a strong cell growth pathway effect on PSA.

Even though the five genes are believed to function together biologically, it is of interest to investigate whether there are a small number of relatively important genes in the cell growth pathway that most affect PSA. We investigated this problem using the proposed variable selection method. An all-possible-subset selection procedure of genes was performed using the Gaussian kernel. The kernel machine AIC and BIC proposed in Section 5 were used as the model selection criteria. The result shows that the model with the lowest AIC and BIC values is the one containing genes FGF2 and IGFBP1. The detailed results are given in Web Table 1 in the Supplementary Materials. These two genes can be studied further in laboratory settings to explore their detailed relationship with PSA.
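An all-possible-subset search of this kind can be sketched as follows. This is a simplified illustration on simulated data, not the analysis reported here: ρ and λ are held fixed for every subset instead of being re-estimated by REML, the criterion is assumed to be n log(RSS) + 2 tr(A), and the helper name `km_aic` is hypothetical.

```python
import itertools
import numpy as np

def km_aic(y, X, Z, rho, lam):
    """Assumed kernel machine AIC, n*log(RSS) + 2*tr(A), with a Gaussian kernel."""
    n = len(y)
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / rho)
    W = np.linalg.inv(np.eye(n) + K / lam)
    XtW = X.T @ W
    A = W @ (K / lam + X @ np.linalg.solve(XtW @ X, XtW))   # smoothing matrix
    rss = np.sum((y - A @ y) ** 2)
    return n * np.log(rss) + 2.0 * np.trace(A)

# Simulated data with p = 4 candidate genes; only genes 0 and 2 affect y.
rng = np.random.default_rng(6)
n, p = 60, 4
Z = rng.uniform(size=(n, p))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + 3 * np.sin(4 * Z[:, 0]) * Z[:, 2] + 0.3 * rng.normal(size=n)

# Evaluate every non-empty subset of genes and rank by the criterion.
results = []
for size in range(1, p + 1):
    for subset in itertools.combinations(range(p), size):
        aic = km_aic(y, X, Z[:, list(subset)], rho=1.0, lam=0.05)
        results.append((aic, subset))
results.sort()
best_aic, best_subset = results[0]
```

With p genes the search visits 2^p − 1 subsets, which is feasible for a five-gene pathway but grows quickly for larger gene sets.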

7. Simulation Studies

7.1 Simulation Study for the Parameter Estimates

We conducted a simulation study to evaluate the performance of the proposed LSKM estimation method for the semiparametric model (1) by fitting the linear mixed model (11). We considered the following model

$$y_i = x_i\beta + h(z_{i1}, \ldots, z_{ip}) + e_i, \tag{19}$$

where $e_i \sim N(0, 1)$. To allow for $x_i$ and $(z_{i1}, \ldots, z_{ip})$ to be correlated, $x_i$ was generated as $x_i = 3\cos(z_{i1}) + 2u_i$ with $u_i$ being independent of $z_{i1}$ and following N(0, 1), and $z_{ij}$ (j = 1, …, p) were generated from Uniform(0, 1). The nonparametric function h(·) was allowed to have a complex form with nonlinear functions of the z's and interactions among the z's. In our simulations, we first fit the model using the same set of z's as that in the true model. In practice, without advance knowledge, the true set of z's is often unknown and the set of z's that is used might be larger than the true set and contain some noisy z's that are irrelevant to the outcome y. To mimic such a scenario, in the second set of simulations, we added some noisy z's to the set of z's and fit (19).

We considered four configurations by varying n (the sample size) and p (the number of covariates z's). For each setting, only the Gaussian kernel is used and 300 simulations were run.

Setting 1: n = 60, p = 5, true $h(z) = 10\cos(z_1) - 15z_2^2 + 10\exp(-z_3)z_4 - 8\sin(z_5)\cos(z_3) + 20z_1z_5$. Fit the model with the five true z's. This setting mimics the PSA data.

Setting 2: n = 100, p = 8, h(·) is the same as setting 1. Fit the model (19) by including 3 additional irrelevant z6, z7, z8 besides the true z1, …, z5.

Setting 3: n = 200, p = 10, true $h(z_1, \ldots, z_{10}) = 10\cos(z_1) - 15z_2^2 + 10\exp(-z_3)z_4 - 8\sin(z_5)\cos(z_3) + 20z_1z_5 + 9z_6\sin(z_7) - 8\cos(z_6)z_7 + 20z_8\sin(z_9)\sin(z_{10}) - 15z_8^3 - 10z_8z_9 - \exp(z_{10})\cos(z_{10})$. Fit the model assuming these 10 true z's are used.

Setting 4: n = 300, p = 15, h(·) is the same as that in setting 3. Fit the model with additional 5 irrelevant noisy predictors z11, …, z15 besides the true z1, …, z10.

The point estimate results are presented in Table 2. Because it is difficult to graphically display the fitted value of h(·) as a function of z, we summarized the goodness of fit of h(·) in the following way. For each simulated data set, we regressed the true h on the fitted ĥ, both evaluated at the design points. We then empirically summarized the goodness of fit of ĥ(·) by reporting the average intercepts, slopes, and R2's obtained from these regressions over the 300 simulations. An intercept close to zero, a slope close to one, and an R2 close to one provide empirical evidence that the estimated multidimensional function ĥ(·) is close to the true function.
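The goodness-of-fit summary described above amounts to a simple linear regression of the true h on the fitted ĥ. A minimal sketch, with a hypothetical helper name and made-up vectors standing in for one simulated data set:

```python
import numpy as np

def fit_summary(h_true, h_hat):
    """Intercept, slope and R^2 from regressing the true h on the fitted h."""
    design = np.column_stack([np.ones_like(h_hat), h_hat])
    coef, *_ = np.linalg.lstsq(design, h_true, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((h_true - fitted) ** 2)
    ss_tot = np.sum((h_true - h_true.mean()) ** 2)
    return coef[0], coef[1], 1.0 - ss_res / ss_tot

# Illustrative vectors: a fitted function that tracks the truth with small error.
rng = np.random.default_rng(7)
h_true = rng.normal(size=60)
h_hat = h_true + 0.1 * rng.normal(size=60)
intercept, slope, r2 = fit_summary(h_true, h_hat)
```

Averaging the three returned values over all simulated data sets gives the summaries of the kind reported in Table 2.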

Table 2
Simulation results of estimated regression coefficients β and the nonparametric function h(·) in model y = xβ + h(z) + e based on 300 runs. True β = 1 and true σ2 = 1

The results in Table 2 show that, when the true set of z's was included in fitting h(·) and all the model parameters {β, h(·), τ, ρ, σ2} were estimated simultaneously, the LSKM method via the mixed model framework performed well in estimating β, h(·) and σ2. However, if the scale parameter ρ in the Gaussian kernel was fixed, which is often done in traditional machine learning, the model estimators could be subject to considerable bias, especially for the estimate of σ2. When ρ was fixed at values close to the estimated one, the bias was small. Because in practice, ρ is unknown, our results suggest it is useful to estimate the scale parameter ρ using the data. When extra irrelevant covariates z's besides the true set of z's were used in fitting h(·), the proposed method still performed well if all model parameters were estimated.

Table 3 compares the estimated standard errors of β̂ using the frequentist method (12) and the Bayesian method (14) with the empirical ones. The results show that both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. Table 3 also compares the estimated standard errors of ĥ (including intercept) using the frequentist method (13) and the Bayesian method (15) with the empirical standard errors. For the ease of presentation, for each setting, we averaged the SE estimates across all the grid points and presented these averages. The results show that when the scale parameter ρ was estimated, both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. When the scale parameter was fixed, the Bayesian and frequentist SEs were still close but could be quite different from the empirical SEs. These results further indicate that it is useful to estimate the scale parameter ρ in practice.

Table 3
Simulation study results of standard error estimates of β̂ and ĥ(·) in model y = xβ + h(z) + e based on 300 simulations

7.2 The Simulation Study for the Score Test

We next conducted a simulation study to evaluate the performance of the proposed variance component score test for H0 : h(·) = 0 versus H1 : h(·) ∈ 𝒦. The true model is the same as (19), where x and the z's were generated in the same way as in Section 7.1, and $h(z) = a\,h_1(z)$ with $h_1(z) = 2\cos(z_1) - 3z_2^2 + 2e^{-z_3}z_4 - 1.6\sin(z_5)\cos(z_3) + 4z_1z_5$ and a = 0, 0.2, 0.4, 0.6, 0.8, 1. We studied the size of the test by generating data under a = 0, and studied the power by increasing a. The kernel parameter ρ was fixed at a wide range of values: 0.5, 1, 5, 10, 25, 50, 100, 200. The sample size was 60, mimicking the PSA data example. For the size calculations, the number of simulations was 2000, whereas for the power calculations, the number of runs was 1000.

Table 4 reports the empirical size (a = 0) and power (a > 0) of the variance component score test for H0. The results show that the size of the test was very close to the nominal value 0.05 and was not sensitive to the choice of the scale parameter ρ. As a increased, the power quickly approached 1. The power was not much affected by the value of ρ if a moderate ρ was specified, but was more affected if a large value of ρ was specified.

Table 4
Simulation results for the score test for H0: h(z) = 0

7.3 The Simulation Study for Kernel Selection

A simulation study was also conducted to assess the performance of kernel selection using the kernel machine AIC and BIC criteria. The true model we considered is


where $e \sim N(0, 1)$, and x was generated as x = 3 cos(z1) + 2u with u being independent of z1. All u and zj (j = 1, …, 5) were generated from N(0, 1). The sample size was 50, and the number of runs was 300. Three types of kernel functions were used in the simulation: the Gaussian kernel $K(u, v) = \exp(-\|u - v\|^2)$, the second-degree polynomial kernel $K(u, v) = (u^Tv + 1)^2$, and the first-degree polynomial kernel that corresponds to ridge regression, $K(u, v) = u^Tv$. For each simulated data set, the AIC and the BIC were calculated based on the model with each of the three kernels.

The mean AIC and BIC across 300 simulations for the Gaussian kernel are 190.79 (51.31) and 284.21 (50.21), respectively (the numbers in parentheses are standard deviations), those for the second-degree polynomial kernel are 269.07 (10.00) and 308.91 (9.58), respectively, and those for the ridge regression are 363.67 (2.63) and 371.61 (2.51), respectively. The AIC and BIC values from each simulated data set are plotted in Figures 1 and 2. These results show that the kernel machine AIC and BIC of the model with the Gaussian kernel are the smallest, whereas those of ridge regression are the largest. Hence the Gaussian kernel is preferred to both the second-degree polynomial kernel and the ridge regression kernel, which is desired in light of the complicated functional forms of the x's.

Figure 1
Simulation result of model selection using KMAIC.
Figure 2
Simulation result of model selection using KMBIC.

8. Discussion

In this article, we have developed the LSKM method for semiparametric regression with Gaussian outcomes, where we model the covariate effects parametrically and the genetic pathway effect parametrically or nonparametrically. The kernel machine method does not require an explicit analytical specification of the smoothness conditions on the nonparametric function and unifies the model building procedure in both one- and multiple-dimensional settings. Therefore, it is a more general and flexible method for multi-dimensional smoothing.

A key contribution of this article is that we have established a close connection between kernel machine methods and linear mixed models and all the model parameters can be estimated within the unified linear mixed model framework. This mixed model connection greatly facilitates the estimation and inference for multi-dimensional nonparametric regressions and can be easily implemented using familiar statistical software such as SAS PROC MIXED or Splus NLME.
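The mixed model connection can be illustrated numerically. The sketch below assumes the usual representation in which the pathway effect h is a random effect with covariance τK and the errors are independent with variance σ²; the regression coefficients are then the generalized least-squares estimates and the pathway effect is the BLUP. The function name is hypothetical, and the variance components are supplied by hand here rather than estimated by REML as in the article.

```python
import numpy as np

def lskm_blup(y, X, K, tau, sigma2):
    """Return (beta_hat, h_hat) for y = X beta + h + e, h ~ N(0, tau K), e ~ N(0, sigma2 I)."""
    n = y.shape[0]
    V = tau * K + sigma2 * np.eye(n)            # marginal covariance of y
    V_inv_X = np.linalg.solve(V, X)
    V_inv_y = np.linalg.solve(V, y)
    beta_hat = np.linalg.solve(X.T @ V_inv_X, X.T @ V_inv_y)  # GLS estimate
    resid = y - X @ beta_hat
    h_hat = tau * (K @ np.linalg.solve(V, resid))             # BLUP of h
    return beta_hat, h_hat

# Illustrative data (hypothetical, not the data of the article).
rng = np.random.default_rng(1)
n = 50
Z = rng.standard_normal((n, 5))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dists)                            # Gaussian kernel Gram matrix
y = X @ np.array([1.0, 2.0]) + np.cos(Z[:, 0]) + rng.standard_normal(n)

beta_hat, h_hat = lskm_blup(y, X, K, tau=1.0, sigma2=1.0)
```

With λ = σ²/τ, the BLUP of h equals K(K + λI)⁻¹ applied to the residuals y − Xβ̂, which is the form a penalized least-squares kernel machine fit would take; in practice τ and σ² would come from mixed model software rather than being fixed.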

We proposed a score test for the genetic pathway effect. This can be easily implemented using existing software. Although it requires fixing the scale parameter ρ, our results show that the test is not sensitive to the choice of ρ and has good performance. Alternatively, a Bayesian approach, such as the one proposed by Chen and Dunson (2003), might be used. This method has the advantage that, with proper prior specifications, there is no need to fix the scale parameter. However, its theoretical properties are unknown. It is of further research interest to study the performance of this Bayesian method and to develop better frequentist methods of testing τ in the kernel machine setting.

Kernel selection within the kernel machine framework is an important and complicated problem. It includes model selection and variable selection as special cases. In this article we propose to use the kernel machine AIC and BIC as kernel selection criteria. Our simulation results show that both criteria perform well. Further research is still needed to examine their theoretical properties in detail before they can be adopted as universal criteria.

We have considered in this article a single nonparametric function of multi-dimensional covariates. One could generalize the proposed semiparametric model to incorporate multiple multi-dimensional nonparametric functions. For example, if one is interested in modeling multiple genetic pathway effects, one could consider a semiparametric additive model

$$y = x^T \beta + h_1(z_1) + \cdots + h_m(z_m) + e,$$

where $z_j$ ($j = 1, \ldots, m$) denotes a $p_j \times 1$ vector of genes in the $j$th pathway and $h_j(\cdot)$ denotes the nonparametric function associated with the $j$th genetic pathway.

Machine learning is an emerging area of research in statistics. The field has developed rapidly in the past decade, driven mainly by computer scientists dealing with multi-dimensional data. It has shown increasing promise and wide applications in biomedical research, especially in bioinformatics. These techniques, however, are somewhat disconnected from well-established biostatistical methods. Our effort to establish a close connection between LSKMs and linear mixed models is an attempt to build a bridge between kernel machines, which are familiar to computer scientists but less familiar to biostatisticians, and well-established biostatistical methods. This connection opens a door for adopting other well-established statistical techniques used in mixed models, such as Bayesian approaches, to handle multi-dimensional data via the machine learning framework. It also opens a new research direction for model/variable selection methods within the kernel machine framework. Such an interface is still in its infancy and has a lot of room for further development.


DL and XL's research was supported by a grant from the National Cancer Institute (CA–76404). DG's research was supported by a grant from the National Institutes of Health (GM072007). We thank the associate editor and three reviewers for their helpful comments that have improved the article.


9. Supplementary Materials

The kernel machine AIC and BIC estimates of models containing all the subsets of genes in the cell growth pathway for the analysis of the prostate cancer data are given in Web Table 1 at the Biometrics website http://www.tibs.org/biometrics.


  • Buhmann MD. Radial Basis Functions. Cambridge University Press; Cambridge, U.K.: 2003.
  • Chen Z, Dunson DB. Random effects selection in linear mixed models. Biometrics. 2003;59:762–769.
  • Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. Cambridge University Press; Cambridge, U.K.: 2000.
  • Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1987;74:33–43.
  • Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM. Delineation of prognostic biomarkers in prostate cancer. Nature. 2001;412:822–826.
  • Efron B, Tibshirani R, Storey J, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
  • Fortunel NO, Otu HH, Ng HH, Chen J, Mu X, Chevassut T, Li X, Joseph M, et al. Comment on “ ‘Stemness’: Transcriptional Profiling of Embryonic and Adult Stem Cells” and “A Stem Cell Molecular Signature.” Science. 2003;302:393.
  • Friedman JH. Multivariate adaptive regression splines (with discussion). Annals of Statistics. 1991;19:1–141.
  • Friedman JH, Stuetzle W. Projection pursuit regression. Journal of the American Statistical Association. 1981;76:817–823.
  • Goeman JJ, Oosting J, Cleton-Jansen A-M, Anninga JK, van Houwelingen HC. Testing association of a pathway with survival using gene expression data. Bioinformatics. 2005;21:1950–1957.
  • Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. Chapman and Hall; London: 1994.
  • Gu C. Smoothing Spline ANOVA Models. Springer; New York: 2002.
  • Harville D. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72:320–340.
  • Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman and Hall; London: 1990.
  • Kimeldorf GS, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. 1970;33:82–95.
  • Laird N, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
  • Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34:267–273.
  • Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. MIT Press; Cambridge, Massachusetts: 2006.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; Cambridge, U.K.: 2004.
  • Schölkopf B, Smola AJ. Learning with Kernels. MIT Press; Cambridge, Massachusetts: 2002.
  • Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions. Journal of the American Statistical Association. 1987;82:605–610.
  • Sollich P. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning. 2002;46:21–52.
  • Speed T, Robinson GK. BLUP is a good thing: The estimation of random effects. Statistical Sciences. 1991;6:15–51. Discussion to.
  • Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102:15545–15550.
  • Suykens JAK, Van Gestel T, De Brabanter J, De Moor J, Vandewalle J. Least Squares Support Vector Machines. World Scientific; Singapore: 2002.
  • Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
  • Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences. 2001;98:5116–5124.
  • Vapnik V. Statistical Learning Theory. Wiley; New York: 1998.
  • Wahba G. Spline Models for Observational Data. SIAM Press; Philadelphia: 1990.
  • Wand MP, Jones MC. Kernel Smoothing. Chapman and Hall; London: 1995.
  • Wang Y. Smoothing spline models with correlated random errors. Journal of the American Statistical Association. 1998;93:341–348.
  • Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2002;4:57–74.
  • Zhang D, Lin X, Raz J, Sowers M. Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.