
- NIHPA Author Manuscripts
- PMC2665800

# Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models

^{1}Center for Statistical Sciences, Brown University, Providence, Rhode Island 02912, U.S.A

^{2}Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, U.S.A

^{3}Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A

## Summary

We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.

**Keywords:** BLUPs, Kernel function, Model/variable selection, Nonparametric regression, Penalized likelihood, REML, Score test, Smoothing parameter, Support vector machines

## 1. Introduction

Analysis of microarray data has mainly focused on the detection of individually significantly expressed genes (Efron et al., 2001; Tusher, Tibshirani, and Chu, 2001). This approach has some major limitations: (1) a long list of individually significant genes without a single encompassing theme is difficult to interpret; (2) cellular processes often affect sets of genes, and individually highly ranked genes are often downstream genes, so moderate changes in many genes may give more insight into biological mechanisms than a dramatic change in a single gene (Mootha et al., 2003); (3) individually highly ranked genes can be poorly annotated and are often not reproducible across studies (Fortunel et al., 2003). Researchers have now become more interested in knowledge-based studies of gene sets, for example, genetic pathways, which are more biologically interpretable and reproducible (Goeman et al., 2005; Subramanian et al., 2005).

A data example motivating the proposed research is the data from the Michigan prostate cancer study (Dhanasekaran et al., 2001). Prostate-specific antigen (PSA) has been routinely used as a biomarker for screening prostate cancer. Recently there have been significant breakthroughs in the effort of finding candidate genes related to prostate cancer. The early results of Dhanasekaran et al. (2001) indicate that certain functional genetic pathways seemed dysregulated in prostate cancer relative to noncancerous tissues. One is interested in studying the genetic pathway effects on PSA after adjusting for effects of clinical and demographic covariates. Due to the complicated unknown relationships between genes and PSA, we propose a flexible framework to model the genetic pathway effect parametrically or nonparametrically.

There is a vast literature on multidimensional nonparametric modeling. Methods such as multivariate kernel smoothing (Wand and Jones, 1995), projection pursuit regression (Friedman and Stuetzle, 1981), and multivariate adaptive regression splines (MARS) (Friedman, 1991), are usually computationally expensive. Popular spline-based methods include generalized additive models (GAMs) (Hastie and Tibshirani, 1990), thin-plate splines (Wahba, 1990; Green and Silverman, 1994), penalized regression splines (Ruppert, Wand, and Carroll, 2004), and smoothing spline ANOVA (Gu, 2002). These methods require the specification of the smoothness condition of an unknown function using differentiability conditions, which is much more involved and awkward in multidimensional settings.

In the past decade, the kernel machine method has been developed in machine learning as a powerful learning technique for multidimensional data (Vapnik, 1998; Schölkopf and Smola, 2002; Suykens et al., 2002; Rasmussen and Williams, 2006). Popular examples of kernel machine methods include support vector machine (SVM) (Vapnik, 1998) and Bayesian Gaussian process (Rasmussen and Williams, 2006). In the context of function approximation, kernel machine methods and spline-based methods share a similar theoretical foundation, but their model-fitting philosophies are different. Kernel machine methods start with a kernel function that implicitly determines the smoothness property of the unknown function. By contrast, spline-based methods start with the smoothness conditions of the unknown function and a corresponding kernel function can usually be derived from these conditions (Wahba, 1990). Kernel machine methods hence greatly simplify specification of a nonparametric model, especially for multidimensional data.

In this article, we propose a semiparametric model for covariate and genetic pathway effects on a continuous outcome (e.g., PSA), where the covariate effects are modeled parametrically and the genetic pathway effect is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). We establish a connection between LSKMs and linear mixed models, and show that the LSKM estimators of the regression coefficients and the pathway effect can be obtained by fitting a linear mixed model. This connection provides a unified framework for inference on parameters in models with multidimensional covariates, including the regression coefficients, the nonparametric function, and the smoothing parameters. Our work extends the connection between univariate smoothing splines and linear mixed models (Speed, 1991; Wang, 1998; Zhang et al., 1998) to multivariate smoothing with an arbitrary kernel function. We also propose a score test for the nonparametric genetic pathway effect, and a model/variable selection method within the LSKM framework.

The rest of the article is organized as follows. In Section 2, we present the semiparametric model for Gaussian outcomes. In Section 3, we describe the LSKM method. In Section 4, we establish the connection between LSKMs and linear mixed models and propose a score test for the genetic pathway effect. We discuss variable selection in LSKM in Section 5. The methods are illustrated using the prostate cancer microarray data in Section 6 and evaluated by simulations in Section 7. The article ends with a discussion in Section 8.

## 2. Semiparametric Model for Multidimensional Data

### 2.1 The Model

Suppose the data consist of *n* subjects. For subject *i* (*i* = 1, …, *n*), *y*_{i} is a normally distributed continuous outcome, **x**_{i} is a *q* × 1 vector of clinical covariates, and **z**_{i} is a *p* × 1 vector of gene expressions within a pathway. We assume an intercept is included in **x**_{i}. The outcome *y*_{i} depends on **x**_{i} and **z**_{i} through the following partial linear model

$${y}_{i}={\mathit{x}}_{i}^{T}\beta +h({\mathit{z}}_{i})+{e}_{i},\qquad (1)$$

where **β** is a *q* × 1 vector of regression coefficients, *h*(**z**_{i}) is an unknown centered smooth function, and the errors *e*_{i} are assumed to be independent and follow *N*(0, *σ*^{2}).

Model (1) models covariate effects parametrically and the pathway effect parametrically or nonparametrically. When *h*(·) = 0, (1) reduces to the standard linear regression model. When *x*_{i} = 1, it reduces to LSKM regression (Suykens et al., 2002).

### 2.2 Specifications of a Function Space of h(z) Using a Kernel

We assume the nonparametric function *h*(**z**) lies in a function space ${\mathcal{H}}_{K}$ generated by a positive definite kernel function *K*(·,·). From Mercer's theorem (Cristianini and Shawe-Taylor, 2000), under some regularity conditions, a kernel function *K*(·,·) implicitly specifies a unique function space spanned by a particular set of orthogonal basis functions (features) ${\left\{{\varphi}_{j}\left(\mathit{z}\right)\right\}}_{j=1}^{J}$. In other words, any *h*(**z**) ∈ ${\mathcal{H}}_{K}$ can be represented using a set of bases as $h\left(\mathit{z}\right)={\Sigma}_{j=1}^{J}{\omega}_{j}{\varphi}_{j}\left(\mathit{z}\right)=\varphi {\left(\mathit{z}\right)}^{T}\omega$ (the primal representation), where **ω** is a vector of coefficients. Equivalently, *h*(**z**) can also be represented using a kernel function *K*(·,·) as $h\left(\mathit{z}\right)={\Sigma}_{l=1}^{L}{\alpha}_{l}K({\mathit{z}}_{l}^{\ast},\mathit{z};\rho )$ (the dual representation), for some integer *L*, some constants *α*_{l}, and some $\{{\mathit{z}}_{1}^{\ast},\dots ,{\mathit{z}}_{L}^{\ast}\}\in {R}^{p}$. For a multidimensional **z**, it is more convenient to specify *h*(**z**) using the dual representation, because explicit basis functions or features might be complicated to specify, and the number of features might be high or even infinite.

Two popular kernel functions and the corresponding function spaces are as follows. (1) *The dth Polynomial Kernel*: $K({\mathit{z}}_{1},{\mathit{z}}_{2})={({\mathit{z}}_{1}^{T}{\mathit{z}}_{2}+\rho )}^{d}$, where *ρ* and *d* are tuning parameters. The *d*th polynomial kernel generates the function space ${\mathcal{H}}_{K}$ spanned by all possible *d*th-order monomials of the components of **z**. For example, if *d* = 1, the first polynomial kernel generates the linear function space with basis functions {*ϕ*_{j}(**z**)} = {*z*_{1}, …, *z*_{p}}. If *d* = 2, the second polynomial kernel corresponds to the quadratic function space with basis functions {*ϕ*_{j}(**z**)} = {*z*_{k}, *z*_{k}*z*_{k′}} (*k*, *k′* = 1, …, *p*), that is, the main effects, all two-way interactions, and the quadratic main effects of the *z*_{k}'s. (2) *The Gaussian Kernel*: $K({\mathit{z}}_{1},{\mathit{z}}_{2})=\mathrm{exp}\{-\Vert {\mathit{z}}_{1}-{\mathit{z}}_{2}{\Vert}^{2}/\rho \}$, where $\Vert {\mathit{z}}_{1}-{\mathit{z}}_{2}{\Vert}^{2}={\Sigma}_{k=1}^{p}{({z}_{1k}-{z}_{2k})}^{2}$. The Gaussian kernel generates the function space spanned by radial basis functions; see Buhmann (2003) for their mathematical properties and desirable features. Examples of other choices of kernel functions include the sigmoid and neural network kernels, and the B-spline kernel (Schölkopf and Smola, 2002). The choice of a kernel function determines which function space one would like to use to approximate *h*(**z**).
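As a concrete illustration (ours, not the authors'), both kernels reduce to a few lines of array code; the toy data and the values of `rho` and `d` below are arbitrary choices:

```python
import numpy as np

def gaussian_kernel(Z1, Z2, rho):
    """Gaussian kernel: K(z1, z2) = exp(-||z1 - z2||^2 / rho)."""
    sq_dist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / rho)

def polynomial_kernel(Z1, Z2, rho, d):
    """d-th polynomial kernel: K(z1, z2) = (z1^T z2 + rho)^d."""
    return (Z1 @ Z2.T + rho) ** d

rng = np.random.default_rng(0)
Z = rng.uniform(size=(6, 3))  # 6 subjects, p = 3 gene expressions
K_gauss = gaussian_kernel(Z, Z, rho=1.0)
K_poly = polynomial_kernel(Z, Z, rho=1.0, d=2)
```

Both kernel matrices are symmetric and positive semidefinite, which is what makes them valid covariance functions in the mixed-model representation of Section 4.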

## 3. LSKM Estimation in the Semiparametric Model

Assume *h*(·) ∈ ${\mathcal{H}}_{K}$, the function space generated by a kernel function *K*(·,·). Estimation of **β** and *h*(·) in (1) proceeds by maximizing the scaled penalized likelihood function

$$J(h,\beta )=-\frac{1}{2}\sum_{i=1}^{n}{\left\{{y}_{i}-{\mathit{x}}_{i}^{T}\beta -h({\mathit{z}}_{i})\right\}}^{2}-\frac{\lambda}{2}{\Vert h\Vert}_{{\mathcal{H}}_{K}}^{2},\qquad (2)$$

where *λ* is a tuning parameter which controls the tradeoff between goodness of fit and complexity of the model. When *λ* = 0, the model interpolates the gene expression data, whereas when *λ* = ∞, the model reduces to a simple linear model without *h*(·).

By the Representer theorem (Kimeldorf and Wahba, 1970), the general solution for the nonparametric function *h*(·) in (2) can be expressed as

$$h(\cdot )=\sum_{i=1}^{n}{\alpha}_{i}K(\cdot ,{\mathit{z}}_{i}),\qquad (3)$$

where **α** = (*α*_{1}, …, *α*_{n})^{T} are unknown parameters. Substituting (3) back into (2) we have

$$J(\beta ,\alpha )=-\frac{1}{2}{(\mathit{y}-\mathit{X}\beta -\mathit{K}\alpha )}^{T}(\mathit{y}-\mathit{X}\beta -\mathit{K}\alpha )-\frac{\lambda}{2}{\alpha}^{T}\mathit{K}\alpha ,\qquad (4)$$

where *K* is an *n* × *n* matrix whose (*i*, *j*)th element is *K*(**z**_{i}, **z**_{j}). Differentiating *J*(**β**, **α**) with respect to **β** and **α**, some calculations give

$$\widehat{\beta}={\left\{{\mathit{X}}^{T}{(\mathit{I}+{\lambda}^{-1}\mathit{K})}^{-1}\mathit{X}\right\}}^{-1}{\mathit{X}}^{T}{(\mathit{I}+{\lambda}^{-1}\mathit{K})}^{-1}\mathit{y},\qquad (5)$$

$$\widehat{\alpha}={\lambda}^{-1}{(\mathit{I}+{\lambda}^{-1}\mathit{K})}^{-1}(\mathit{y}-\mathit{X}\widehat{\beta}),\qquad (6)$$

where $\mathit{X}={({\mathit{x}}_{1}^{T},\dots ,{\mathit{x}}_{n}^{T})}^{T}$ and **y** = (*y*_{1}, …, *y*_{n})^{T}. Plugging (6) into (3), we have that the function *h*(·) evaluated at the design points (**z**_{1}, …, **z**_{n})^{T} is estimated as

$$\widehat{h}={\lambda}^{-1}\mathit{K}{(\mathit{I}+{\lambda}^{-1}\mathit{K})}^{-1}(\mathit{y}-\mathit{X}\widehat{\beta}).\qquad (7)$$

Using (3) and (6), *ĥ*(·) at an arbitrary **z** is

$$\widehat{h}(\mathit{z})=\sum_{i=1}^{n}{\widehat{\alpha}}_{i}K(\mathit{z},{\mathit{z}}_{i}).\qquad (8)$$

Equivalently, if *h*(**z**) = *ϕ*(**z**)^{T}**ω**, where {*ϕ*_{j}(**z**)} are orthogonal basis functions, the corresponding LSKM regression coefficients are

$$\widehat{\omega}=\sum_{i=1}^{n}{\widehat{\alpha}}_{i}\varphi ({\mathit{z}}_{i}).\qquad (9)$$

The kernel function *K*(·,·) usually depends on an unknown parameter *ρ*, such as the scale parameter in the Gaussian kernel. Inference on $\widehat{\beta}$ and *ĥ*(**z**) depends on *λ*, *ρ*, and the residual variance *σ*^{2}, which need to be estimated. Cross-validation can be used to estimate *λ*; however, its computation is often intensive. Little literature is available on the systematic estimation of *ρ* and *σ*^{2}. In the machine learning literature, *ρ* is often preset at some fixed values. Further, estimation of *σ*^{2} needs to properly account for the loss of degrees of freedom from estimating **β** and *h*(·). Hence it is desirable to develop a systematic method to estimate these parameters simultaneously. We accomplish this by establishing a connection between LSKMs and linear mixed models.

## 4. LSKMs and Linear Mixed Models

### 4.1 Connection Between LSKMs and Linear Mixed Models

Linear mixed models have commonly been used for analyzing longitudinal and hierarchical data (Harville, 1977; Laird and Ware, 1982). A connection between smoothing splines and linear mixed models has been established (Speed, 1991; Wang, 1998; Zhang et al., 1998). We show here that the LSKM estimator in model (1) corresponds to the best linear unbiased predictor (BLUP) estimator from a linear mixed model, and the regularization parameters (*τ*, *ρ*) and the residual variance *σ*^{2} can be treated as variance components and estimated simultaneously using restricted maximum likelihood (REML).

To see this connection, simple calculations show that $\widehat{\beta}$ and $\widehat{h}$ from equations (5) and (7) can be equivalently obtained from the equations

$${\mathit{X}}^{T}{\mathit{R}}^{-1}\mathit{X}\beta +{\mathit{X}}^{T}{\mathit{R}}^{-1}\mathit{h}={\mathit{X}}^{T}{\mathit{R}}^{-1}\mathit{y},\qquad {\mathit{R}}^{-1}\mathit{X}\beta +\left\{{\mathit{R}}^{-1}+{(\tau \mathit{K})}^{-1}\right\}\mathit{h}={\mathit{R}}^{-1}\mathit{y},\qquad (10)$$

where **R** = *σ*^{2}**I** and *τ* = *λ*^{−1}*σ*^{2}. Equation (10) corresponds exactly to the normal equation of the linear mixed model

$$\mathit{y}=\mathit{X}\beta +\mathit{h}+\mathit{e},\qquad (11)$$

where **β** is a *q* × 1 vector of regression coefficients, **h** is an *n* × 1 vector of random effects with distribution *N*(**0**, *τ***K**), and **e** ~ *N*(**0**, *σ*^{2}**I**). A comparison of (11) with model (1) indicates that they have exactly the same form except that **h** is now treated as random effects. It follows that the BLUPs of the regression coefficients $\widehat{\beta}$ and the random effects $\widehat{h}$ under the linear mixed model (11) correspond to the LSKM estimator given in Section 3. In fact, one can easily see that the regression coefficient estimator $\widehat{\beta}$ in (5) is the weighted least-squares estimator under the linear mixed model representation (11), using the marginal covariance of **y** under (11), **V** = *σ*^{2}**I** + *τ***K**, i.e., $\widehat{\beta}={({\mathit{X}}^{T}{\mathit{V}}^{-1}\mathit{X})}^{-1}{\mathit{X}}^{T}{\mathit{V}}^{-1}\mathit{y}$.

*y*The linear mixed model representation of the LSKM in the semiparametric model (1) can also be considered as a Bayesian Gaussian process regression (Schölkopf and Smola, 2002). Note that this Bayesian correspondence is finite-dimensional (Wahba, 1990; Green and Silverman, 1994). It is not strictly equivalent to a continuous Bayesian Gaussian process (Rasmussen and Williams, 2006), because the finite-dimensional representation of *h*(·) does not lead to a coherent Bayesian model (Green and Silverman, 1994; Tipping, 2001; Sollich, 2002. To see the Bayesian representation, we can treat {*h*(** z**)} as a random vector with a Gaussian process (GP) prior, with mean 0 and covariance cov{

*h*(

*z*_{1}),

*h*(

*z*_{2})} =

*τK*(

*z*_{1},

*z*_{2}). Note that the positive definiteness of the kernel function

*K*(·,·) ensures it is a proper covariance function. Now we assume

One can easily see that under this Bayesian model, the semiparametric model (1) becomes the linear mixed model representation (11). This connection extends the connection between scalar smoothing splines and mixed models and their Bayesian formulations (Wang, 1998; Zhang et al., 1998) to multidimensional regression problems under the kernel machine framework.
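The LSKM/BLUP equivalence can be checked numerically. The sketch below is ours: toy data, σ² treated as known and equal to 1 for illustration, and λ = 2, so that τ = λ⁻¹σ² = 0.5. It computes the estimator both ways and compares:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.uniform(size=(n, 3))
K = np.exp(-((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))  # Gaussian, rho = 1
y = X @ np.array([2.0, 1.0]) + np.sin(3 * Z[:, 0]) + 0.2 * rng.normal(size=n)

sigma2, lam = 1.0, 2.0
tau = sigma2 / lam  # tau = lambda^{-1} sigma^2

# LSKM form (Section 3): weighted least squares with weight (I + lam^{-1} K)^{-1}
W = np.linalg.inv(np.eye(n) + K / lam)
beta_lskm = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
h_lskm = (K / lam) @ W @ (y - X @ beta_lskm)

# mixed-model form: V = sigma^2 I + tau K; beta_hat is GLS, h_hat is the BLUP
V = sigma2 * np.eye(n) + tau * K
Vinv = np.linalg.inv(V)
beta_mm = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
h_mm = tau * K @ Vinv @ (y - X @ beta_mm)
```

The two routes give identical answers because V = σ²(I + λ⁻¹K) when τ = λ⁻¹σ², so the GLS weight matrix is proportional to the LSKM weight matrix.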

The covariances of $\widehat{\beta}$ and *ĥ*(·) can be calculated in two ways. The first approach is to treat the true *h*(·) as a fixed unknown function and the variance of *y*_{i} as *σ*^{2}. Using (5) and (7), the covariances of $\widehat{\beta}$ and *ĥ*(·) are

$$\mathrm{cov}(\widehat{\beta})={\sigma}^{2}{({\mathit{X}}^{T}{\mathit{V}}^{-1}\mathit{X})}^{-1}{\mathit{X}}^{T}{\mathit{V}}^{-2}\mathit{X}{({\mathit{X}}^{T}{\mathit{V}}^{-1}\mathit{X})}^{-1},\qquad (12)$$

$$\mathrm{cov}\{\widehat{h}(\mathit{z})\}={\sigma}^{2}{\tau}^{2}{\mathit{K}}_{z}^{T}{\mathit{P}}^{2}{\mathit{K}}_{z},\qquad (13)$$

where **P** = **V**^{−1} − **V**^{−1}**X**(**X**^{T}**V**^{−1}**X**)^{−1}**X**^{T}**V**^{−1} and **K**_{z} = {*K*(**z**, **z**_{1}), …, *K*(**z**, **z**_{n})}^{T} for an arbitrary **z**. We term these covariances frequentist covariances.

*z*The second approach is to use the linear mixed model representation (11) and treat the true *h*(·) as a random function following the mean zero Gaussian process with covariance *τK*(·,·). The covariances of ** and ***ĥ*(·) can then be calculated as a byproduct of the covariance of the fixed and random effects of the linear mixed model (11) and are

We term these covariances as Bayesian covariances.

### 4.2 Estimation of the Regularization Parameters and the Residual Variance

We discuss in this section estimation of the regularization parameter *τ*, the residual variance *σ*^{2} and the scale parameter *ρ* in *K*(·,·). Using the mixed model representation of LSKM, we propose to estimate (*τ*, *ρ*, *σ*^{2}) simultaneously by treating them as variance components in the linear mixed model (11) and estimating them using REML.

Specifically, the REML log likelihood under the linear mixed model (11) can be written as

$$\ell_R(\theta )=-\frac{1}{2}\mathrm{log}\left|\mathit{V}(\theta )\right|-\frac{1}{2}\mathrm{log}\left|{\mathit{X}}^{T}\mathit{V}{(\theta )}^{-1}\mathit{X}\right|-\frac{1}{2}{(\mathit{y}-\mathit{X}\widehat{\beta})}^{T}\mathit{V}{(\theta )}^{-1}(\mathit{y}-\mathit{X}\widehat{\beta}),\qquad (16)$$

where **θ** = (*τ*, *ρ*, *σ*^{2})^{T}. The score equations of (*τ*, *ρ*, *σ*^{2}) are

$$\frac{\partial \ell_R}{\partial {\theta}_{l}}=-\frac{1}{2}\mathrm{tr}\left\{\mathit{P}\frac{\partial \mathit{V}(\theta )}{\partial {\theta}_{l}}\right\}+\frac{1}{2}{\mathit{y}}^{T}\mathit{P}\frac{\partial \mathit{V}(\theta )}{\partial {\theta}_{l}}\mathit{P}\mathit{y}=0,\qquad (17)$$

where **P** = **V**^{−1} − **V**^{−1}**X**(**X**^{T}**V**^{−1}**X**)^{−1}**X**^{T}**V**^{−1}. Let **A** denote the hat matrix such that $\mathit{X}\widehat{\beta}+\widehat{h}=\mathit{A}\mathit{y}$. Using the identities ${\mathit{V}}^{-1}(\mathit{y}-\mathit{X}\widehat{\beta})={({\sigma}^{2})}^{-1}(\mathit{y}-\mathit{X}\widehat{\beta}-\widehat{h})$ and $\mathit{P}={({\sigma}^{2})}^{-1}(\mathit{I}-\mathit{A})$ (Harville, 1977), one can show using equation (17) that ${\widehat{\sigma}}^{2}={\{n-\mathrm{tr}\left(\mathit{A}\right)\}}^{-1}{\Sigma}_{i=1}^{n}{\{{y}_{i}-{\mathit{x}}_{i}^{T}\widehat{\beta}-\widehat{h}\left({\mathit{z}}_{i}\right)\}}^{2}$. Hence tr(**A**) represents the loss of degrees of freedom from estimating **β** and *h*(·) when estimating *σ*^{2}. The covariance of $\widehat{\theta}=(\widehat{\tau},\widehat{\rho},{\widehat{\sigma}}^{2})$ can be estimated using the information matrix of the REML likelihood ${\mathcal{I}}_{{\theta}_{l}{{\theta}_{l}}^{\prime}}=\frac{1}{2}\mathrm{tr}\left\{\mathit{P}\frac{\partial \mathit{V}\left(\theta \right)}{\partial {\theta}_{l}}\mathit{P}\frac{\partial \mathit{V}\left(\theta \right)}{\partial {\theta}_{{l}^{\prime}}}\right\}$.
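A direct way to carry out this step is to minimize the negative REML log likelihood over θ = (τ, ρ, σ²) numerically. The sketch below is ours: the toy data, the log-scale parameterization (to keep the components positive), and the derivative-free optimizer are implementation choices, not the paper's; β is profiled out at each θ by generalized least squares:

```python
import numpy as np
from scipy.optimize import minimize

def neg_reml(log_theta, y, X, Z):
    """Negative REML log likelihood of theta = (tau, rho, sigma^2),
    parameterized on the log scale; beta is profiled out by GLS."""
    tau, rho, sigma2 = np.exp(log_theta)
    n = len(y)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    V = sigma2 * np.eye(n) + tau * np.exp(-sq / rho)  # sigma^2 I + tau K(rho)
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    r = y - X @ beta
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(X.T @ Vinv @ X)
    return 0.5 * (logdetV + logdetXVX + r @ Vinv @ r)

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.uniform(size=(n, 3))
y = X @ np.array([1.0, 0.5]) + np.cos(2 * Z[:, 0]) + 0.3 * rng.normal(size=n)
fit = minimize(neg_reml, x0=np.zeros(3), args=(y, X, Z),
               method="Nelder-Mead", options={"maxiter": 300})
tau_hat, rho_hat, sigma2_hat = np.exp(fit.x)
```

In practice one would use the analytic score equations (17) or a dedicated mixed-model routine; this brute-force version is only meant to show that (τ, ρ, σ²) are estimated jointly from a single criterion.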

### 4.3 Test for the Nonparametric Function

Because we are interested in the effect of a whole genetic pathway rather than individual genes, it is of significant practical interest to test H_{0}: *h*(**z**) = 0. In the PSA microarray example, this tests for a genetic pathway effect on PSA controlling for the effects of covariates. Assuming *h*(**z**) ∈ ${\mathcal{H}}_{K}$, one can easily see from the linear mixed model representation (11) that testing H_{0}: *h*(**z**) = 0 is equivalent to testing the variance component *τ* as H_{0}: *τ* = 0 versus H_{1}: *τ* > 0. Note that the null hypothesis places *τ* on the boundary of the parameter space. Because the kernel matrix **K** is not block diagonal, unlike the standard case considered by Self and Liang (1987), the likelihood ratio for H_{0}: *τ* = 0 does not follow a mixture of ${\chi}_{0}^{2}$ and ${\chi}_{1}^{2}$ distributions. We consider a score test in this article.

Zhang and Lin (2002) proposed a score test for H_{0}: *τ* = 0 to compare a polynomial model with a smoothing spline. Unlike the smoothing spline case, where *K*(·,·) does not depend on any unknown parameter, a general kernel function *K*(·,·) in LSKM might depend on an unknown scale parameter *ρ*. One can easily see from the linear mixed model (11) that under H_{0}: *τ* = 0, the kernel matrix **K** disappears, and hence the scale parameter *ρ* disappears and becomes inestimable.

Davies (1987) studied the problem of a parameter disappearing under H_{0} and proposed a score test by treating the score statistic as a Gaussian process indexed by the nuisance parameter and then obtaining an upper bound to approximate the *p*-value of the score test. This approach, however, does not work for our setting due to the unboundedness of the parameter space.

We propose here to test H_{0}: *τ* = 0 using the score test by fixing *ρ* at a given value, then varying that value and examining the sensitivity of the score test with respect to *ρ*. The REML version of the score statistic of *τ* under H_{0}: *τ* = 0 can be written as ${Q}_{\tau}(\widehat{\beta},{\widehat{\sigma}}^{2};\rho )-\mathrm{tr}\{{\mathit{P}}_{0}\mathit{K}(\rho )\}/2$, where $\widehat{\beta}$ and ${\widehat{\sigma}}^{2}$ are the MLEs of **β** and *σ*^{2} under the linear model ${y}_{i}={\mathit{x}}_{i}^{T}\beta +{e}_{i}$, the model under H_{0}, **P**_{0} = **I** − **X**(**X**^{T}**X**)^{−1}**X**^{T}, and

$${Q}_{\tau}(\widehat{\beta},{\widehat{\sigma}}^{2};\rho )=\frac{{(\mathit{y}-\mathit{X}\widehat{\beta})}^{T}\mathit{K}(\rho )(\mathit{y}-\mathit{X}\widehat{\beta})}{2{\widehat{\sigma}}^{2}},$$

which is a quadratic function of **y** and follows a mixture of chi-squares under H_{0}.

Following Zhang and Lin (2002), for each fixed *ρ*, we use the Satterthwaite method to approximate the distribution of *Q*_{τ}(·; *ρ*) by a scaled chi-square distribution $\kappa {\chi}_{\nu}^{2}$, where the scale parameter *κ* and the degrees of freedom *ν* are calculated by equating the mean and variance of *Q*_{τ}(·; *ρ*) with those of $\kappa {\chi}_{\nu}^{2}$. Specifically, one can show that *κ* = *Ĩ*_{ττ}/(2*ẽ*) and *ν* = 2*ẽ*^{2}/*Ĩ*_{ττ}, where ${\stackrel{~}{I}}_{\tau \tau}={I}_{\tau \tau}-{I}_{\tau {\sigma}^{2}}{I}_{{\sigma}^{2}{\sigma}^{2}}^{-1}{I}_{\tau {\sigma}^{2}}^{T}$, ${I}_{\tau \tau}=\mathrm{tr}{\left\{{\mathit{P}}_{0}\mathit{K}\left(\rho \right)\right\}}^{2}/2$, ${I}_{\tau {\sigma}^{2}}=\mathrm{tr}\left\{{\mathit{P}}_{0}\mathit{K}\left(\rho \right){\mathit{P}}_{0}\right\}/2$, ${I}_{{\sigma}^{2}{\sigma}^{2}}=\mathrm{tr}\left({\mathit{P}}_{0}^{2}\right)/2$, and $\stackrel{~}{e}=\mathrm{tr}\left({\mathit{P}}_{0}\mathit{K}\right)/2$. Computation of the proposed score test is quite simple, because one only needs to fit the simple linear model ${y}_{i}={\mathit{x}}_{i}^{T}\beta +{e}_{i}$. We evaluate the performance of the score test using simulations.
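Under the construction above, and assuming the ML estimates of β and σ² under the null linear model, the test reduces to a few matrix traces. This sketch is ours; it fixes one ρ (through the precomputed kernel matrix `K`) and returns Q_τ with its Satterthwaite p-value:

```python
import numpy as np
from scipy.stats import chi2

def score_test(y, X, K):
    """Score test of H0: tau = 0 for a fixed-rho kernel matrix K,
    using the Satterthwaite scaled chi-square approximation."""
    n = X.shape[0]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    P0 = np.eye(n) - H                    # P0 = I - X (X'X)^{-1} X'
    r = y - H @ y                         # null-model residuals
    sigma2 = r @ r / n                    # ML estimate of sigma^2 under H0
    Q = r @ K @ r / (2 * sigma2)
    e_tilde = np.trace(P0 @ K) / 2        # mean of Q under H0
    I_tt = np.trace(P0 @ K @ P0 @ K) / 2
    I_ts = np.trace(P0 @ K @ P0) / 2
    I_ss = np.trace(P0 @ P0) / 2
    I_tt_tilde = I_tt - I_ts ** 2 / I_ss  # efficient information for tau
    kappa = I_tt_tilde / (2 * e_tilde)    # Satterthwaite scale
    nu = 2 * e_tilde ** 2 / I_tt_tilde    # Satterthwaite degrees of freedom
    return Q, chi2.sf(Q / kappa, nu)

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.uniform(size=(n, 5))
K = np.exp(-((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))  # Gaussian, rho = 1
y0 = X @ np.array([1.0, 0.3]) + rng.normal(size=n)  # data with no pathway effect
Q0, p0 = score_test(y0, X, K)
```

Repeating the call over a grid of ρ values (recomputing `K` each time) gives the sensitivity analysis described above.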

## 5. Model Selection within the Kernel Machine Framework

The kernel machine method requires a kernel function to be explicitly specified. Section 2.2 provides wide choices of kernel functions. A question of substantial interest is which kernel function to choose. This kernel selection problem has much broader implications. We consider two types of kernel selection problems. The first is to choose between different parametric and nonparametric models with different smoothness properties. The second problem involves variable selection.

As stated in Section 2.2, a kernel function fully specifies a function space ${\mathcal{H}}_{K}$ where the unknown function *h*(·) resides. Hence this function space determines the type of model used to fit *h*(·). For example, a *d*th-degree polynomial kernel specifies a parametric model with *d*th-order monomials; the kernel $K(s,u)={\int}_{0}^{1}{(s-t)}_{+}{(t-u)}_{+}\mathit{dt}$ specifies a cubic smoothing spline model (Wahba, 1990); and the Gaussian kernel assumes an infinitely smooth function. It is therefore clear that model selection within the kernel machine framework is in fact a special case of kernel selection.

Variable selection can also be treated as a kernel selection problem within the kernel machine framework. For example, let **z**_{p} be a *p*-dimensional vector and **z**_{p′} a *p′*-dimensional subvector of **z**_{p} with *p′* < *p*. Then two kinds of kernel functions can be specified: one based on **z**_{p} and another based on **z**_{p′}. The unknown function can then be fitted separately based on each kernel. If the fitted curves are not "far away" from each other, then the model using **z**_{p′} provides an equally good but more parsimonious fit than that using **z**_{p}. This demonstrates that variable selection is also a special case of kernel selection.

**z**_{p}These discussions show that model selection is a very interesting and important topic within the kernel machine framework. However, little work has been done in this area. We propose AIC and BIC as kernel selection criteria within the kernel machine framework. Equations (5) and (7) show that the estimated response ** ŷ** can be expressed as

**=**

*ŷ***, where**

*Ay***= (**

*A***+**

*I**λ*

^{−1}

**)**

*K*^{−1}[

*λ*

^{−1}

**+**

*K***{**

*X**(*

**X**^{T}**+**

*I**λ*

^{−1}

**)**

*K*^{−1}

**}**

*X*^{−1}

*(*

**X**^{T}**+**

*I**λ*

^{−1}

**)**

*K*^{−1}] is the LSKM smoothing matrix. Let

*r*= trace(

**) be the degree-of-freedom of the kernel machine smoother**

*A***. We define the least squares kernel machine (KM) AIC and BIC as**

*A* where RSS = (** y** −

**)**

*ŷ*^{T}(

**−**

*y***). Models with smaller KM_AIC/KM_BIC values are preferred.**

*ŷ*## 6. Application to the Prostate Cancer Genetic Pathway Data

We applied the proposed semiparametric model to the analysis of the prostate cancer genetic pathway data described in Section 1. The data set contained 59 patients clinically diagnosed with local or advanced prostate cancer. The objective of the study was to evaluate whether a genetic pathway has an overall effect on PSA after adjusting for covariates. We focus in this article on the cell growth pathway, which contains five genes. The outcome was the presurgery PSA level; a log transformation was performed to make the normality assumption plausible. The two covariates were age and Gleason score, a well-established histological grading system for prostate cancer.

The semiparametric model (1) provides a convenient framework to evaluate the effect of the cell growth pathway on PSA by allowing for complicated interactions among the genes within the pathway. Specifically, we consider the model

$$\mathrm{log}(\mathrm{PSA})={\beta}_{0}+{\beta}_{1}\cdot \mathrm{age}+{\beta}_{2}\cdot \mathrm{Gleason}+h({z}_{1},\dots ,{z}_{5})+e,$$

where *h*(·) is a nonparametric function of the five gene expressions in the cell growth pathway and *e* ~ *N*(0, *σ*^{2}). We fit this model using the LSKM method via the linear mixed model representation (11), using the Gaussian kernel in estimating *h*(·). Under the linear mixed model representation, we estimated (*β*_{0}, *β*_{1}, *β*_{2}) and *h*(·) using BLUPs, and estimated the smoothing parameter *τ*, the kernel parameter *ρ*, and the residual variance *σ*^{2} simultaneously using REML. The results are presented in Table 1, indicating that Gleason score was highly significant while age was not.

We tested for the cell growth pathway effect on PSA, H_{0}: *h*(**z**) = 0 versus H_{1}: *h*(**z**) ∈ ${\mathcal{H}}_{K}$, using the score test described in Section 4.3. Table 1 gives the score test statistics and *p*-values for a range of *ρ* values. The *p*-values are not sensitive to the choice of *ρ* and range from 0.0007 to 0.0085, suggesting a strong cell growth pathway effect on PSA.

Even though the five genes are believed to function together biologically, it is of interest to investigate whether there are a small number of relatively important genes in the cell growth pathway that most affect PSA. We investigated this problem using the proposed variable selection method. An all-possible-subset selection procedure of genes was performed using the Gaussian kernel. The kernel machine AIC and BIC proposed in Section 5 were used as the model selection criteria. The result shows that the model with the lowest AIC and BIC values is the one containing genes FGF2 and IGFBP1. The detailed results are given in Web Table 1 in the Supplementary Materials. These two genes can be studied further in laboratory settings to explore their detailed relationship with PSA.

## 7. Simulation Studies

### 7.1 Simulation Study for the Parameter Estimates

We conducted a simulation study to evaluate the performance of the proposed LSKM estimation method for the semiparametric model (1) by fitting the linear mixed model (11). We considered the following model

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i}+h({z}_{i1},\dots ,{z}_{\mathit{ip}})+{e}_{i},\qquad (19)$$

where *e*_{i} ~ *N*(0, 1). To allow *x*_{i} and (*z*_{i1}, …, *z*_{ip}) to be correlated, *x*_{i} was generated as *x*_{i} = 3cos(*z*_{i1}) + 2*u*_{i}, with *u*_{i} independent of *z*_{i1} and following *N*(0, 1), and the *z*_{ij} (*j* = 1, …, *p*) were generated from Uniform(0, 1). The nonparametric function *h*(·) was allowed to have a complex form with nonlinear functions of the *z*'s and interactions among the *z*'s. In our simulations, we first fit the model using the same set of *z*'s as in the true model. In practice, without advance knowledge, the true set of *z*'s is often unknown, and the set of *z*'s used might be larger than the true set and contain some noisy *z*'s that are irrelevant to the outcome *y*. To mimic such a scenario, in the second set of simulations, we added some noisy *z*'s to the set of *z*'s and fit (19).

We considered four configurations by varying *n* (the sample size) and *p* (the number of covariates *z*). For each setting, only the Gaussian kernel was used and 300 simulations were run.

Setting 1: *n* = 60, *p* = 5, true $h\left(\mathit{z}\right)=10\mathrm{cos}\left({z}_{1}\right)-15{z}_{2}^{2}+10\mathrm{exp}(-{z}_{3}){z}_{4}-8\mathrm{sin}\left({z}_{5}\right)\mathrm{cos}\left({z}_{3}\right)+20{z}_{1}{z}_{5}$. Fit the model with the five true *z*'s. This setting mimics the PSA data.

Setting 2: *n* = 100, *p* = 8, *h*(·) is the same as setting 1. Fit the model (19) by including 3 additional irrelevant *z*_{6}, *z*_{7}, *z*_{8} besides the true *z*_{1}, …, *z*_{5}.

Setting 3: *n* = 200, *p* = 10, true $h({z}_{1},\dots ,{z}_{10})=10\mathrm{cos}\left({z}_{1}\right)-15{z}_{2}^{2}+10\mathrm{exp}(-{z}_{3}){z}_{4}\phantom{\rule{thickmathspace}{0ex}}-\phantom{\rule{thickmathspace}{0ex}}8\mathrm{sin}\left({z}_{5}\right)\mathrm{cos}\left({z}_{3}\right)\phantom{\rule{thickmathspace}{0ex}}+\phantom{\rule{thickmathspace}{0ex}}20{z}_{1}{z}_{5}\phantom{\rule{thickmathspace}{0ex}}+\phantom{\rule{thickmathspace}{0ex}}9{z}_{6}\mathrm{sin}\left({z}_{7}\right)\phantom{\rule{thickmathspace}{0ex}}-\phantom{\rule{thickmathspace}{0ex}}8\mathrm{cos}\left({z}_{6}\right){z}_{7}\phantom{\rule{thickmathspace}{0ex}}+\phantom{\rule{thickmathspace}{0ex}}20{z}_{8}\mathrm{sin}\left({z}_{9}\right)\mathrm{sin}\left({z}_{10}\right)\phantom{\rule{thickmathspace}{0ex}}-\phantom{\rule{thickmathspace}{0ex}}15{z}_{8}^{3}-10{z}_{8}{z}_{9}\phantom{\rule{thickmathspace}{0ex}}-\phantom{\rule{thickmathspace}{0ex}}\mathrm{exp}\left({z}_{10}\right)\mathrm{cos}\left({z}_{10}\right)$. Fit the model assuming these 10 true *z*'s are used.

Setting 4: *n* = 300, *p* = 15, *h*(·) is the same as that in setting 3. Fit the model with additional 5 irrelevant noisy predictors *z*_{11}, …, *z*_{15} besides the true *z*_{1}, …, *z*_{10}.
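For reference, setting 1 can be generated in a few lines. This sketch is ours, and the intercept and slope of the parametric part (2.0 and 1.5) are illustrative placeholders, since the coefficient values of model (19) are not restated here:

```python
import numpy as np

def h_true(Z):
    """True h(z) of simulation setting 1."""
    z1, z2, z3, z4, z5 = Z.T
    return (10 * np.cos(z1) - 15 * z2 ** 2 + 10 * np.exp(-z3) * z4
            - 8 * np.sin(z5) * np.cos(z3) + 20 * z1 * z5)

def simulate_setting1(n=60, p=5, seed=0):
    """One data set from model (19) under setting 1."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(size=(n, p))            # z_ij ~ Uniform(0, 1)
    u = rng.normal(size=n)
    x = 3 * np.cos(Z[:, 0]) + 2 * u         # x correlated with z1
    e = rng.normal(size=n)                  # e_i ~ N(0, 1)
    # intercept/slope values below are illustrative assumptions, not the paper's
    y = 2.0 + 1.5 * x + h_true(Z) + e
    return y, x, Z

y, x, Z = simulate_setting1()
```

Settings 2–4 follow the same pattern with larger *n*, more *z*'s (including irrelevant ones), and the richer true *h*(·) of setting 3.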

The point estimate results are presented in Table 2. Because it is difficult to graphically display the fitted value of *h*(·) as a function of **z**, we summarized the goodness of fit of *h*(·) in the following way. For each simulation data set, we regressed the true *h* on the fitted *ĥ*, both evaluated at the design points. We then empirically summarized the goodness of fit of *ĥ*(·) by reporting the average intercepts, slopes, and *R*^{2}'s obtained from these regressions over the 300 simulations. If the intercept from this regression is close to zero, the slope is close to one, and *R*^{2} is close to one, this provides empirical evidence that the estimated multi-dimensional function *ĥ*(·) is close to the true manifold.
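The goodness-of-fit summary just described amounts to an ordinary least-squares regression of the true *h* on *ĥ*; a minimal sketch (the helper name `fit_summary` is ours, not the paper's):

```python
import numpy as np

def fit_summary(h_true, h_hat):
    """Regress the true h on the fitted h-hat at the design points and
    return (intercept, slope, R^2), as used to summarize goodness of fit."""
    X = np.column_stack([np.ones_like(h_hat), h_hat])
    coef, *_ = np.linalg.lstsq(X, h_true, rcond=None)
    resid = h_true - X @ coef
    r2 = 1 - resid @ resid / np.sum((h_true - h_true.mean()) ** 2)
    return coef[0], coef[1], r2

# A perfect fit gives intercept ~0, slope ~1, R^2 ~1:
h = np.array([1.0, 2.0, 3.5, -0.5])
intercept, slope, r2 = fit_summary(h, h)
```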

The results in Table 2 show that, when the true set of *z*'s was included in fitting *h*(·) and all the model parameters {*β*, *h*(·), *τ*, *ρ*, *σ*^{2}} were estimated simultaneously, the LSKM method via the mixed model framework performed well in estimating *β*, *h*(·), and *σ*^{2}. However, if the scale parameter *ρ* in the Gaussian kernel was fixed, which is often done in traditional machine learning, the model estimators could be subject to considerable bias, especially the estimate of *σ*^{2}. When *ρ* was fixed at values close to the estimated one, the bias was small. Because *ρ* is unknown in practice, our results suggest it is useful to estimate the scale parameter *ρ* from the data. When extra irrelevant covariates *z*'s besides the true set of *z*'s were used in fitting *h*(·), the proposed method still performed well if all model parameters were estimated.

Table 3 compares the estimated standard errors of using the frequentist method (12) and the Bayesian method (14) with the empirical ones. The results show that both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. Table 3 also compares the estimated standard errors of *ĥ* (including intercept) using the frequentist method (13) and the Bayesian method (15) with the empirical standard errors. For ease of presentation, for each setting we averaged the SE estimates across all the grid points and present these averages. The results show that when the scale parameter *ρ* was estimated, both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. When the scale parameter was fixed, the Bayesian and frequentist SEs were still close to each other but could be quite different from the empirical SEs. These results further indicate that it is useful to estimate the scale parameter *ρ* in practice.

### 7.2 The Simulation Study for the Score Test

We next conducted a simulation study to evaluate the performance of the proposed variance component score test for H_{0} : *h*(·) = 0 versus H_{1} : *h*(·) ∈ ℋ_{*K*}. The true model is the same as (19), where *x* and the *z*'s were generated in the same way as in Section 6.1, with $h(\mathbf{z}) = a h_1(\mathbf{z})$, $h_1(\mathbf{z}) = 2\cos(z_1) - 3z_2^2 + 2e^{-z_3}z_4 - 1.6\sin(z_5)\cos(z_3) + 4z_1z_5$, and *a* = 0, 0.2, 0.4, 0.6, 0.8, 1. We studied the size of the test by generating data under *a* = 0, and studied the power by increasing *a*. The kernel parameter *ρ* was fixed at a wide range of values: 0.5, 1, 5, 10, 25, 50, 100, 200. The sample size was 60, mimicking the PSA data example. For the size calculations the number of simulations was 2000, whereas for the power calculations the number of runs was 1000.

Table 4 reports the empirical size (*a* = 0) and power (*a* > 0) of the variance component score test for H_{0}. The results show that the size of the test was very close to the nominal value 0.05 and was not sensitive to the choice of the scale parameter *ρ*. As *a* increased, the power quickly approached 1. The power was not much affected by the value of *ρ* if a moderate *ρ* was specified, but was more affected if a large value of *ρ* was specified.
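As an illustration of how such a test can be carried out, the sketch below uses the standard variance-component score statistic $Q = (\mathbf{y} - X\hat{\boldsymbol{\beta}})^{\mathrm{T}} K (\mathbf{y} - X\hat{\boldsymbol{\beta}})/(2\hat{\sigma}^2)$ under the null, with a Satterthwaite-type scaled chi-square approximation to its null distribution. The paper's exact construction is given in equations outside this excerpt and may differ in details; treat this as a sketch under those assumptions.

```python
import numpy as np
from scipy.stats import chi2

def score_test(y, X, K):
    """Score-type test of H0: h(.) = 0 (i.e., tau = 0) in
    y = X beta + h + e, with h ~ N(0, tau*K) and e ~ N(0, sigma2*I).
    Uses a Satterthwaite scaled chi-square approximation."""
    n = len(y)
    # Projection onto the orthogonal complement of the column space of X
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    resid = P0 @ y
    sigma2 = resid @ resid / (n - X.shape[1])   # null variance estimate
    Q = resid @ K @ resid / (2 * sigma2)
    # Match the null mean and variance of Q to kappa * chi2(nu)
    PKP = P0 @ K @ P0
    e = np.trace(PKP) / 2
    v = np.trace(PKP @ PKP) / 2
    kappa, nu = v / (2 * e), 2 * e**2 / v
    return chi2.sf(Q / kappa, df=nu)            # approximate p-value
```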

### 7.3 The Simulation Study for Kernel Selection

A simulation study was also conducted to assess the performance of kernel selection using the kernel machine AIC and BIC criteria. The true model we considered is

where *e* ~ *N*(0, 1) and *x* was generated as *x* = 3 cos(*z*_{1}) + 2*u*, with *u* independent of *z*_{1}. All *u* and *z _{j}* (*j* = 1, …, 5) were generated from *N*(0, 1). The sample size was 50, and the number of runs was 300. Three types of kernel functions were used in the simulation: the Gaussian kernel *K*(**u**, **v**) = exp(−‖**u** − **v**‖^{2}/*ρ*), the second-degree polynomial kernel *K*(**u**, **v**) = (**u**^{T}**v** + 1)^{2}, and the first-degree polynomial kernel *K*(**u**, **v**) = **u**^{T}**v**, which corresponds to ridge regression. For each simulated data set, the AIC and the BIC were calculated for the model under each of the three kernels.

The mean AIC and BIC across the 300 simulations for the Gaussian kernel were 190.79 (51.31) and 284.21 (50.21), respectively (the numbers in parentheses are standard deviations); those for the second-degree polynomial kernel were 269.07 (10.00) and 308.91 (9.58); and those for ridge regression were 363.67 (2.63) and 371.61 (2.51). The AIC and BIC values from each simulated data set are plotted in Figures 1 and 2. These results show that the kernel machine AIC and BIC of the model with the Gaussian kernel are the smallest, whereas those of ridge regression are the largest. Hence the Gaussian kernel is preferred to both the second-degree polynomial kernel and the ridge regression kernel, which is desirable in light of the complicated functional form of *h*(·).
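The three candidate kernels compared by the kernel machine AIC/BIC can be written as Gram-matrix builders; here ‖·‖ is the Euclidean norm and `rho` is the Gaussian scale parameter:

```python
import numpy as np

def gaussian_kernel(Z, rho):
    """Gaussian kernel K(u, v) = exp(-||u - v||^2 / rho), over all row pairs."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

def poly2_kernel(Z):
    """Second-degree polynomial kernel K(u, v) = (u'v + 1)^2."""
    return (Z @ Z.T + 1) ** 2

def linear_kernel(Z):
    """First-degree polynomial kernel K(u, v) = u'v (ridge regression)."""
    return Z @ Z.T
```

Each function returns an *n* × *n* symmetric kernel matrix from an *n* × *p* design matrix `Z`; the selection step then fits the model under each matrix and compares the resulting AIC/BIC values.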

## 8. Discussion

In this article, we have developed the LSKM method for semiparametric regression with Gaussian outcomes, where we model the covariate effects parametrically and the genetic pathway effect parametrically or nonparametrically. The kernel machine method does not require an explicit analytical specification of the smoothness conditions on the nonparametric function and unifies the model building procedure in both one- and multiple-dimensional settings. Therefore, it is a more general and flexible method for multi-dimensional smoothing.

A key contribution of this article is that we have established a close connection between kernel machine methods and linear mixed models and all the model parameters can be estimated within the unified linear mixed model framework. This mixed model connection greatly facilitates the estimation and inference for multi-dimensional nonparametric regressions and can be easily implemented using familiar statistical software such as SAS PROC MIXED or Splus NLME.
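A minimal sketch of this connection, under the simplifying assumption that the variance components *τ* and *σ*^{2} (and the kernel parameter *ρ* inside *K*) are held fixed: with *h* treated as a random effect, *h* ~ *N*(0, *τK*), the estimates of *β* and *h* are the usual mixed-model BLUPs. In the paper the variance components are estimated by REML rather than fixed, so this is an illustration of the estimating equations, not the full procedure.

```python
import numpy as np

def lskm_blup(y, X, K, tau, sigma2):
    """BLUP estimates of beta and h in y = X beta + h + e, with
    h ~ N(0, tau*K) and e ~ N(0, sigma2*I); tau and sigma2 are treated
    as known here, whereas the paper estimates them by REML."""
    n = len(y)
    V = tau * K + sigma2 * np.eye(n)          # marginal covariance of y
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)   # GLS estimate
    h = tau * K @ np.linalg.solve(V, y - X @ beta)        # BLUP of h
    return beta, h
```

As a sanity check, setting `tau = 0` removes the pathway effect, so `h` is identically zero and `beta` reduces to ordinary least squares.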

We proposed a score test for the genetic pathway effect. This can be easily implemented using existing software. Although it requires fixing the scale parameter *ρ*, our results show that the test is not sensitive to the choice of *ρ* and has good performance. Alternatively, a Bayesian approach, such as the one proposed by Chen and Dunson (2003), might be used. With proper prior specifications, this method has the advantage that there is no need to fix the scale parameter. However, its theoretical properties are unknown. It is of further research interest to study the performance of this Bayesian method and to develop better frequentist methods of testing *τ* in the kernel machine setting.

Kernel selection within the kernel machine framework is an important and complicated problem. It includes model selection and variable selection as special cases. In this article we propose to use the kernel machine AIC/BIC as kernel selection criteria. Our simulation results show that the AIC/BIC perform well. Further research is still needed to examine their theoretical properties in detail before they can be adopted as universal criteria.

We have considered in this article a single nonparametric function of multi-dimensional covariates. One could generalize the proposed semiparametric model to incorporate multiple multi-dimensional nonparametric functions. For example, if one is interested in modeling multiple genetic pathway effects, one could consider a semiparametric additive model

$y = \mathbf{x}^{\mathrm{T}}\boldsymbol{\beta} + h_1(\mathbf{z}_1) + \cdots + h_m(\mathbf{z}_m) + e,$

where **z**_{*j*} (*j* = 1, …, *m*) denotes a *p _{j}* × 1 vector of genes in the *j*th pathway and *h _{j}*(·) denotes the nonparametric function associated with the *j*th genetic pathway.

Machine learning is an emerging area of research in statistics. The field has experienced rapid development in the past decade, mainly by computer scientists dealing with multi-dimensional data. It has shown increasing promise and wide application in biomedical research, especially in bioinformatics. These techniques, however, are somewhat disconnected from well-established biostatistical methods. Our effort to establish a close connection between LSKMs and linear mixed models is an attempt to build a bridge between kernel machines, which are familiar to computer scientists but less so to biostatisticians, and linear mixed models, which are well established in biostatistics. This connection opens a door for adopting other well-established statistical techniques used in mixed models, such as Bayesian approaches, to handle multi-dimensional data via the machine learning framework. It also opens a new research direction for model/variable selection methods within the kernel machine framework. Such an interface is still in its infancy and has a lot of room for further developments.

## Acknowledgements

DL and XL's research was supported by a grant from the National Cancer Institute (CA–76404). DG's research was supported by a grant from the National Institutes of Health (GM072007). We thank the associate editor and three reviewers for their helpful comments that have improved the article.

## Footnotes

The kernel machine AIC and BIC estimates of models containing all the subsets of genes in the cell growth pathway for the analysis of the prostate cancer data are given in Web Table 1 at the *Biometrics* website http://www.tibs.org/biometrics.

## References

- Buhmann MD. Radial Basis Functions. Cambridge University Press; Cambridge, U.K.: 2003.
- Chen Z, Dunson DB. Random effects selection in linear mixed models. Biometrics. 2003;59:762–769.
- Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. Cambridge University Press; Cambridge, U.K.: 2000.
- Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1987;74:33–43.
- Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM. Delineation of prognostic biomarkers in prostate cancer. Nature. 2001;412:822–826.
- Efron B, Tibshirani R, Storey J, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
- Fortunel NO, Otu HH, Ng HH, Chen J, Mu X, Chevassut T, Li X, Joseph M, et al. Comment on "'Stemness': Transcriptional profiling of embryonic and adult stem cells" and "A stem cell molecular signature." Science. 2003;302:393.
- Friedman JH. Multivariate adaptive regression splines (with discussion). Annals of Statistics. 1991;19:1–141.
- Friedman JH, Stuetzle W. Projection pursuit regression. Journal of the American Statistical Association. 1981;76:817–823.
- Goeman JJ, Oosting J, Cleton-Jansen A-M, Anninga JK, van Houwelingen HC. Testing association of a pathway with survival using gene expression data. Bioinformatics. 2005;21:1950–1957.
- Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. Chapman and Hall; London: 1994.
- Gu C. Smoothing Spline ANOVA Models. Springer; New York: 2002.
- Harville D. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72:320–340.
- Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman and Hall; London: 1990.
- Kimeldorf GS, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. 1970;33:82–95.
- Laird N, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
- Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34:267–273.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. MIT Press; Cambridge, Massachusetts: 2006.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; Cambridge, U.K.: 2004.
- Schölkopf B, Smola AJ. Learning with Kernels. MIT Press; Cambridge, Massachusetts: 2002.
- Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions. Journal of the American Statistical Association. 1987;82:605–610.
- Sollich P. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning. 2002;46:21–52.
- Speed T. Discussion of "BLUP is a good thing: The estimation of random effects" by G. K. Robinson. Statistical Science. 1991;6:15–51.
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102:15545–15550.
- Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J. Least Squares Support Vector Machines. World Scientific; Singapore: 2002.
- Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
- Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences. 2001;98:5116–5124.
- Vapnik V. Statistical Learning Theory. Wiley; New York: 1998.
- Wahba G. Spline Models for Observational Data. SIAM Press; Philadelphia: 1990.
- Wand MP, Jones MC. Kernel Smoothing. Chapman and Hall; London: 1995.
- Wang Y. Smoothing spline models with correlated random errors. Journal of the American Statistical Association. 1998;93:341–348.
- Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74.
- Zhang D, Lin X, Raz J, Sowers M. Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.
