
# A Powerful and Flexible Multilocus Association Test for Quantitative Traits

^{1} Dawei Liu,^{3} Xihong Lin,^{4} Debashis Ghosh,^{5} and Michael P. Epstein^{2}

^{1}Department of Biostatistics, Emory University, Atlanta, GA 30322, USA

^{2}Department of Human Genetics, Emory University, Atlanta, GA 30322, USA

^{3}Center for Statistical Sciences, Brown University, Providence, RI 02912, USA

^{4}Department of Biostatistics, Harvard University, Boston, MA 02115, USA

^{5}Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA

This document may be redistributed and reused, subject to certain conditions.

## Abstract

Association mapping of complex traits typically employs tagSNP genotype data to identify a trait locus within a region of interest. However, considerable debate exists regarding the most powerful strategy for utilizing such tagSNP data for inference. A popular approach tests each tagSNP within the region individually, but such tests could lose power as a result of incomplete linkage disequilibrium between the genotyped tagSNP and the trait locus. Alternatively, one can jointly test all tagSNPs simultaneously within the region (by using genotypes or haplotypes), but such multivariate tests have large degrees of freedom that can also compromise power. Here, we consider a semiparametric model for quantitative-trait mapping that uses genetic information from multiple tagSNPs simultaneously in analysis but produces a test statistic with reduced degrees of freedom compared to existing multivariate approaches. We fit this model by using a dimension-reducing technique called least-squares kernel machines, which we show is identical to analysis using a specific linear mixed model (which we can fit by using standard software packages like SAS and R). Using simulated SNP data based on real data from the International HapMap Project, we demonstrate that our approach often has superior performance for association mapping of quantitative traits compared to the popular approach of single-tagSNP testing. Our approach is also flexible, because it allows easy modeling of covariates and, if interest exists, high-dimensional interactions among tagSNPs and environmental predictors.

## Introduction

The arrival of improved high-throughput genotyping technology has accelerated the use of association methods for dissection of the genetic mechanisms of complex traits. Using panels of single-nucleotide polymorphisms (SNPs), association methods seek to identify those genetic markers that either are a trait locus or are in linkage disequilibrium (LD) with a trait locus. In the process of association mapping of a complex trait, interest will eventually focus on regions or genes that are identified either from interesting signals from previous gene-mapping work or from perceived biological relevance to the trait of interest. To examine whether such a region harbors a trait locus, a study could genotype and subsequently analyze all polymorphic SNPs in the genetic interval. However, the probable existence of LD in the region will induce correlation among such SNPs such that many of the genetic markers provide redundant information for association analysis. Therefore, many association studies instead genotype a reduced set of SNPs—called tagSNPs—within the region that effectively captures the genetic variation from all SNPs within the region but substantially reduces the genotype cost. Studies can identify relevant tagSNPs by applying existing selection algorithms1–3 to SNP genotype data from existing public databases of human genetic variation, such as the International HapMap Project.4

In this article, we focus on the use of tagSNP data to identify genetic regions that influence a quantitative trait of interest by using samples collected under a population-based study design. Currently, considerable debate exists regarding the most powerful manner by which to utilize such tagSNP data in association analysis. A simple and popular approach considers association testing of each individual tagSNP with the quantitative trait of interest (via regression or ANOVA methods) followed by inference on the maximum of the resulting single-tagSNP statistics. Because of the testing of multiple correlated tagSNPs within a region, one must implement an appropriate multiple-testing procedure to ensure appropriate significance levels. Such multiple-testing corrections may include permutation procedures, efficient Monte Carlo procedures,5 or a Bonferroni correction based on the effective number of independent tests within the region.6,7

Although the testing of individual tagSNPs is simple to implement, such methods may have low power if each tested tagSNP is in incomplete LD with the (untyped) quantitative-trait locus (QTL). This potential liability of single-tagSNP approaches led to the development of novel statistical approaches that consider the joint effects of tagSNPs simultaneously within analysis. Such multivariate tagSNP analyses of quantitative traits typically apply multilinear regression to model a subject's trait as a function of a vector of covariates corresponding to either the subject's genotypes at the various tagSNPs or the subject's pair of tagSNP-based haplotypes.8–10 Such regression procedures produce omnibus test statistics that follow a *χ*^{2} distribution with degrees of freedom equal to either the number of modeled tagSNPs (for a genotype-based analysis) or the number of observed haplotypes minus one (for a haplotype-based analysis).

Because these multivariate approaches combine genetic information from multiple tagSNPs simultaneously into analysis, they intuitively should provide greater power to detect QTLs than do tests of individual tagSNPs. However, many simulation studies have found the opposite result to be true: Multivariate approaches typically have similar or reduced power relative to single-SNP procedures11–13 unless the trait originates from the effect of a specific haplotype rather than a specific SNP.13 An explanation for this surprising finding is that multivariate procedures produce test statistics with degrees of freedom that will increase substantially (particularly in the situation of haplotype analysis) with the number of modeled tagSNPs within the region.10 As the degrees of freedom of the test statistic increases, it follows that the power of the omnibus test will decrease. Therefore, it is likely that any information gained from joint consideration of multiple tagSNPs in association analysis of a quantitative trait will subsequently be lost by dealing with test statistics with large degrees of freedom.
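The degrees-of-freedom argument can be illustrated numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: it assumes the association signal (the noncentrality parameter, set here to an arbitrary value of 9) stays fixed while the degrees of freedom of the omnibus test grow. The function name `chi2_power` and all numeric settings are illustrative assumptions.

```python
import numpy as np

def chi2_power(df, noncentrality, alpha=0.05, n_draws=200_000, seed=0):
    """Monte Carlo power of a chi-square test with `df` degrees of freedom."""
    rng = np.random.default_rng(seed)
    # Critical value: empirical (1 - alpha) quantile of the central chi-square.
    critical = np.quantile(rng.chisquare(df, n_draws), 1.0 - alpha)
    # Under the alternative, the statistic is noncentral chi-square.
    alt = rng.noncentral_chisquare(df, noncentrality, n_draws)
    return float(np.mean(alt > critical))

# Same signal (noncentrality = 9, an arbitrary choice), increasing df.
powers = {df: chi2_power(df, noncentrality=9.0) for df in (1, 5, 10, 20)}
for df, power in powers.items():
    print(f"df={df:2d}  estimated power={power:.3f}")
```

Under these assumptions the estimated power falls as the degrees of freedom rise, which is the behavior that motivates a test with fewer degrees of freedom.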

Given these results, we seek to develop a novel statistical approach for association mapping of quantitative traits that incorporates all tagSNPs (and, hence, all valuable genetic information) within a region into the association analysis but produces test statistics with smaller degrees of freedom than the multivariate approaches described earlier. Existing statistical work in this area generally approaches the problem in one of two broad ways. The first strategy applies a dimension-reduction procedure such as a Fourier transformation14 or principal components15 to the tagSNP data in the region to produce a reduced set of orthogonal genetic predictors that contain the majority of information found in the original tagSNPs. One then models this reduced set of genetic predictors within a multilinear regression framework and constructs appropriate omnibus tests for inference (which should have smaller degrees of freedom than a standard multivariate test). The second strategy calculates a measure of average tagSNP similarity for each pair of subjects and compares the pairwise genetic similarity with the pairwise trait similarity.16,17 One can measure such tagSNP similarity by using a “kernel” function that reduces a comparison of multiple tagSNPs for a pair of subjects into a single scalar factor. Because of this phenomenon, resulting statistics using kernel functions typically have small degrees of freedom; for example, Schaid et al.16 constructed a kernel-based U-statistic for case-control association analysis that has only 1 degree of freedom. In addition, the use of a kernel function is appealing because it allows for the inclusion of prior information (such as bioinformatic relevance or association signals from tagSNPs in an independent study) in the form of weights to assist in the evaluation of the tagSNP similarity. 
One drawback of these existing similarity-based approaches is that they do not easily allow for covariates and sometimes require computationally intensive permutation procedures to establish significance.17

In this article, we propose a novel approach for association mapping of quantitative traits that uses all tagSNP data simultaneously in analysis but produces test statistics with smaller degrees of freedom than multivariate tagSNP approaches. We base our approach on a semiparametric-regression framework18 that regresses the quantitative trait of interest on a smooth nonparametric function of the tagSNP genotypes within the region, adjusting for the parametric effects of any covariates of interest. As we will show, we can model this nonparametric function of the tagSNP data in a reduced-dimension space that is induced by a user-defined kernel function. As a result, statistics that test for association between the trait and the nonparametric function of the tagSNP effects should have reduced degrees of freedom compared to existing multivariate tests and, hence, should have improved power to detect QTLs. Unlike existing dimension-reduction techniques, we will show that our approach permits us to incorporate valuable prior information in the analysis via the kernel function. Unlike existing similarity approaches, we will show that our approach can easily allow for covariates and interaction terms. Further, we can rely on asymptotic theory to establish significance of the resulting tests, avoiding computationally intensive permutation procedures.

We estimate the parameters in our proposed semiparametric model by using a flexible high-dimensional technique called least-squares kernel machines (LSKM).19,20 Previously, LSKM methods have been applied to continuous variables, such as expression data from microarray analysis.20 Here, we propose the novel use of kernel functions that are designed for categorical tagSNP data. The kernels we discuss incorporate relevant weights as well as appropriate measures of genetic similarity between subjects. Although LSKM fitting of a semiparametric model appears complicated, Liu et al.20 noted that one can represent the LSKM procedure by using a specific form of a linear mixed model, such that one can estimate and test the nonparametric function of the tagSNP data by using simple restricted-maximum-likelihood procedures that are typically applied to mixed models and are available in common statistical software packages such as SAS and R.

In subsequent sections, we develop our semiparametric model and show how we can estimate model parameters by using the LSKM maximization approach of Liu et al.20 We then show how one can represent the LSKM approach in terms of a linear mixed model that facilitates testing of the nonparametric function of the tagSNP genotype data. Using simulated tagSNP data based on real data from the International HapMap Project,4 we show that our proposed semiparametric approach often has improved power to detect an association between a genetic region and a quantitative trait compared to the popular single-tagSNP testing approach. We also describe a variety of valuable gene-mapping extensions of our semiparametric approach in the Discussion.

## Material and Methods

### Notation

Using a population-based study design, we assume a sample of *N* unrelated subjects. Let *Y*_{j} denote the quantitative trait value for subject *j* (*j* = 1, …, *N*). We assume that each subject is genotyped at *S* tagSNPs within the region of interest. We let *G*_{j,s} denote the genotype of subject *j* at tagSNP *s* (*s* = 1, …, *S*) and let *G*_{j} = (*G*_{j,1}, *G*_{j,2}, …, *G*_{j,S}) denote an (*S* × 1) vector of all tagSNP genotypes for subject *j*. For tagSNP *s*, we code *G*_{j,s} as the number of copies of the minor allele that subject *j* possesses at the tagSNP, so that the predictor takes the value 0, 1, or 2. These values correspond to an additive model of allelic effect; we can consider alternative coding scenarios for *G*_{j,s} under dominant and recessive models, if desired. Finally, we let *X*_{j} denote a (*p* × 1) vector of measured environmental covariates for subject *j*.

### Semiparametric-Regression Model

We propose the use of semiparametric regression to model the relationship between the outcome *Y*_{j} and the tagSNPs *G*_{j}, adjusting for potential covariates in *X*_{j}. We can write this semiparametric model as the following:

$$Y_j = X_j^{T}\beta + h(G_j) + e_j \qquad (1)$$

Here, *h*(*G*_{j}) denotes a nonparametric function of the tagSNP genotype data *G*_{j} that resides in some function space *κ*. The term *β* is a (*p* × 1) vector of regression coefficients describing the effects of *X*_{j}, which are modeled parametrically. Finally, *e*_{j} is a random subject-specific environmental effect, which we assume to be normally distributed with mean 0 and variance *σ*^{2}.

Within the model in Equation 1, interest focuses primarily on the estimation of the nonparametric function *h* of the tagSNP data and its relationship to the trait outcome *Y*_{j}. Secondary interest focuses on the estimation and testing of *β* to assess the effects of the covariates in *X*_{j} on *Y*_{j}. Because we are using a semiparametric framework in Equation 1, traditional maximization procedures for linear regression models are not applicable in this setting. To estimate *h* and *β*, we instead propose the use of the flexible LSKM procedure to analyze our high-dimensional data (which, in our context, refers to the tagSNP genotype data in *G*_{j}). Using the LSKM approach of Liu et al.,20 we obtain the following estimates of *h* and *β* in Equation 1:

$$\widehat{h} = \lambda^{-1} K \left(I + \lambda^{-1} K\right)^{-1} \left(Y - X\widehat{\beta}\right) \qquad (2)$$

$$\widehat{\beta} = \left\{X^{T}\left(I + \lambda^{-1} K\right)^{-1} X\right\}^{-1} X^{T}\left(I + \lambda^{-1} K\right)^{-1} Y \qquad (3)$$

Here, $\mathit{Y}={({\mathit{Y}}_{1},\dots ,{\mathit{Y}}_{\mathit{N}})}^{\mathit{T}}$ is an (*N* × 1) vector of the trait values for all subjects and *X* is an (*N* × *p*) matrix of environmental covariates for all subjects. Further, *I* denotes an (*N* × *N*) identity matrix. Finally, there are two additional terms in Equations 2 and 3 that are important to discuss. The first term is the parameter *λ*, which denotes a scalar smoothing parameter. As we will show in subsequent sections, *λ* plays an important role in constructing appropriate test statistics to assess whether the nonparametric function *h* of the tagSNP genotype data influences *Y*.

The second important term in Equations 2 and 3 is *K*, which denotes an (*N* × *N*) kernel matrix that is a function of the tagSNP genotype data in the region. In particular, the (*j*, *l*)^{th} element of *K* denotes a kernel *k*(*G*_{j}, *G*_{l}) that is a scalar function of the tagSNP genotypes of subjects *j* and *l*. Broadly speaking, *k*(*G*_{j}, *G*_{l}) will often be a measure of pairwise tagSNP-genotype similarity across the region. Because *k*(*G*_{j}, *G*_{l}) is scalar, the kernel intuitively serves as a dimension-reducing function: it collapses the comparison of the multidimensional tagSNP vectors *G*_{j} and *G*_{l} into a single scalar factor. A variety of choices exist for the kernel function *k*(*G*_{j}, *G*_{l}). However, the choice of kernel is not arbitrary. In particular, the kernel function in *K* within Equations 2 and 3 must satisfy the conditions of Mercer's Theorem,21 which include the condition that the *K* matrix must be positive semidefinite (i.e., the eigenvalues of *K* must be nonnegative).

For this article, we focus on kernel functions that are based on the number of alleles shared identical by state (IBS) by subjects *j* and *l* at the tagSNPs within the region.17 The IBS kernel takes the form

$$k(G_j, G_l) = \frac{\sum_{s=1}^{S} IBS(G_{j,s}, G_{l,s})}{2S} \qquad (4)$$

where *IBS*(*G*_{j,s}, *G*_{l,s}) denotes the number of alleles shared IBS (0, 1, or 2) by subjects *j* and *l* at tagSNP *s*. An appealing feature of the IBS kernel is that we can augment it to include tagSNP-specific weights that can incorporate valuable prior information into analysis to potentially improve performance. Define *w*_{s} as a scalar weight for tagSNP *s*. We can then define a weighted-IBS kernel based on Equation 4 as the following:

$$k(G_j, G_l) = \frac{\sum_{s=1}^{S} w_s\, IBS(G_{j,s}, G_{l,s})}{2\sum_{s=1}^{S} w_s} \qquad (5)$$

We focus on two potentially valuable weights for use in the IBS kernel in Equation 5. First, we consider a weight that upweights tagSNPs with a rare minor-allele frequency (MAF) and downweights tagSNPs with more common MAFs. Such a weight could be valuable because of the potential for the information from tagSNPs with rare MAFs to be smoothed over by the information from surrounding tagSNPs with more common MAFs. To upweight tagSNPs with rare MAFs, we apply the weight ${\mathit{w}}_{\mathit{s}}=1/\sqrt{{\mathit{q}}_{\mathit{s}}}$, where *q*_{s} denotes the MAF of tagSNP *s* (*s* = 1, …, *S*). Other MAF weights are certainly possible, such as *w*_{s} = 1/*q*_{s}, but there is concern that such stronger weights may substantially diminish the information provided by those tagSNPs with common MAF.

In addition to weights based on MAF, we can use weights based on prior evidence of association between the tagSNP and the trait (or a related trait of interest) in an independent dataset. Here, we let *w*_{s} = −log_{10}(*p*_{s}), where *p*_{s} is the p value for the test of association of tagSNP *s* with the trait in the independent dataset. Intuitively, such weights will upweight SNPs showing stronger prior evidence of association and downweight SNPs that demonstrate weaker prior evidence of association. As noted in the Discussion, we feel that such weights are, or will be, readily available from relevant genetic literature or public release of data from whole-genome association studies.
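The IBS kernels and the two weighting schemes described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes biallelic tagSNPs coded as minor-allele counts, for which the number of alleles shared IBS equals 2 − |*G*_{j,s} − *G*_{l,s}|, and it normalizes by twice the sum of the weights so that the diagonal of *K* equals 1. The function names and the toy genotype matrix are hypothetical.

```python
import numpy as np

def ibs_kernel(G, weights=None):
    """Weighted IBS kernel for an (N x S) matrix of minor-allele counts (0, 1, 2).

    With weights=None every tagSNP gets weight 1 (the unweighted kernel).
    """
    G = np.asarray(G, dtype=float)
    n_snps = G.shape[1]
    w = np.ones(n_snps) if weights is None else np.asarray(weights, dtype=float)
    # For biallelic SNPs coded 0/1/2, alleles shared IBS = 2 - |g_j - g_l|.
    ibs = 2.0 - np.abs(G[:, None, :] - G[None, :, :])  # shape (N, N, S)
    return ibs @ w / (2.0 * w.sum())

def maf_weights(G):
    """w_s = 1 / sqrt(q_s), with q_s estimated from the sample (assumes q_s > 0)."""
    q = np.asarray(G, dtype=float).mean(axis=0) / 2.0
    return 1.0 / np.sqrt(q)

def pvalue_weights(p_values):
    """w_s = -log10(p_s) from single-tagSNP p values in an independent dataset."""
    return -np.log10(np.asarray(p_values, dtype=float))

# Toy data: 4 subjects, 3 tagSNPs (hypothetical values).
G = np.array([[0, 1, 2],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]])
K_unweighted = ibs_kernel(G)
K_maf = ibs_kernel(G, maf_weights(G))
```

Estimating *q*_{s} from the analyzed sample is a simplification; frequencies from a reference panel could be passed as weights instead.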

### Relationship to Linear Mixed Models

Inspection of $\widehat{\mathit{h}}$ in Equation 2 shows that the nonparametric function in Equation 1 models the tagSNP genotype data in a reduced-dimension space κ induced by the chosen kernel function in *K*. Next, we focus on constructing an appropriate test statistic to evaluate whether the function *h* of the tagSNP genotype data is associated with the trait of interest. That is, we wish to construct a test statistic to evaluate the null hypothesis *H*_{0}: *h* = 0, where we model *h* by using Equation 1. To facilitate the construction of such a test statistic, Liu et al.20 noted that LSKM-based estimation of $\widehat{\mathit{h}}$ and $\widehat{\mathit{\beta}}$ is analogous to the estimation of random and fixed effects, respectively, within a specific linear mixed model. Therefore, rather than employ complicated procedures to directly test *H*_{0}: *h* = 0, we can exploit the LSKM relationship with a mixed model to apply a likelihood framework to construct an appropriate test statistic for inference. Additionally, the use of a linear mixed model for inference is appealing because it allows implementation of our approach with any common software package for mixed-model analysis (e.g., SAS PROC MIXED).

To apply the results from Liu et al.20 and develop the mixed-model representation of the LSKM analysis by using the semiparametric model in Equation 1, we consider the following linear mixed model:

$$Y = X\beta + h + E \qquad (6)$$

where *Y* denotes the earlier trait vector and *X* denotes the earlier matrix of fixed environmental covariates with related regression-coefficient vector *β*. Within Equation 6, we denote *h* as an (*N* × 1) vector of random effects attributable to the tagSNP genotype data and denote *E* as an (*N* × 1) vector of random effects due to subject-specific environment.

Suppose we assume that the random tagSNP effects in *h* follow a multivariate normal distribution with mean 0 and variance-covariance matrix $\frac{{\mathit{\sigma}}^{2}}{\lambda}\mathit{K}$, where *K* is our kernel matrix, *λ* denotes the smoothing parameter discussed earlier, and *σ*^{2} denotes the variance due to subject-specific environment. Further, suppose we assume that *E* also follows a multivariate normal distribution with mean vector 0 and variance-covariance matrix *σ*^{2}*I*, where *I* denotes the identity matrix. Under these assumptions, we can use restricted maximum likelihood (REML) procedures commonly applied to linear mixed models to estimate (*β*, *λ*, *σ*^{2}). After applying REML procedures, we can show, following Liu et al.,20 that the best linear unbiased estimators of the random effects *h* and the fixed effects *β* in the linear mixed model are

$$\widehat{h} = \lambda^{-1} K \left(I + \lambda^{-1} K\right)^{-1} \left(Y - X\widehat{\beta}\right) \qquad (7)$$

$$\widehat{\beta} = \left\{X^{T}\left(I + \lambda^{-1} K\right)^{-1} X\right\}^{-1} X^{T}\left(I + \lambda^{-1} K\right)^{-1} Y \qquad (8)$$

where *λ* can be estimated with REML procedures. One can see that the estimates of $\widehat{\mathit{h}}$ and $\widehat{\mathit{\beta}}$ in Equations 7 and 8 are exactly the same as the estimates of $\widehat{\mathit{h}}$ and $\widehat{\mathit{\beta}}$ in Equations 2 and 3, respectively, derived via LSKM estimation of the semiparametric model in Equation 1. The equivalence of these estimates shows that we can perform our LSKM multilocus analysis by using a straightforward linear mixed model that is easy to implement with existing statistical software packages for mixed models.
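To make the mixed-model estimates concrete, the following sketch computes the generalized-least-squares estimate of *β* and the predictor of *h* for a fixed, user-supplied value of *λ*. It is an illustration under stated assumptions, not the authors' code: REML estimation of *λ* is not shown (in practice it would come from mixed-model software), the function name `lskm_fit` is hypothetical, the data are randomly generated, and an intercept column is added to *X* as an assumption.

```python
import numpy as np

def lskm_fit(Y, X, K, lam):
    """Return (beta_hat, h_hat) for a fixed smoothing parameter lam > 0."""
    n = len(Y)
    V = np.eye(n) + K / lam                      # I + K / lambda
    V_inv_X = np.linalg.solve(V, X)
    V_inv_Y = np.linalg.solve(V, Y)
    # Generalized least squares for the fixed (covariate) effects.
    beta_hat = np.linalg.solve(X.T @ V_inv_X, X.T @ V_inv_Y)
    # Predictor of the random tagSNP effects: (K / lambda) V^{-1} (Y - X beta_hat).
    h_hat = (K / lam) @ np.linalg.solve(V, Y - X @ beta_hat)
    return beta_hat, h_hat

# Toy data (all values are arbitrary illustrations).
rng = np.random.default_rng(0)
n, n_snps = 50, 4
G = rng.binomial(2, 0.3, size=(n, n_snps))
# Unweighted IBS kernel for 0/1/2 genotype codes.
K = (2.0 - np.abs(G[:, None, :] - G[None, :, :])).sum(axis=2) / (2.0 * n_snps)
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept is an assumption
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta_hat, h_hat = lskm_fit(Y, X, K, lam=2.0)
```

As *λ* grows, the tagSNP effect is shrunk toward zero and the covariate estimates approach ordinary least squares, which is one way to sanity-check an implementation.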

### Testing the Nonparametric Function

The relationship between LSKM and the linear mixed model implies that we can test *H*_{0}: *h* = 0 in the semiparametric model by appropriate testing of the existence of the random tagSNP effect *h* in the linear mixed model in Equation 6. As noted earlier, we assume that *h* follows a multivariate normal distribution with mean vector 0 and covariance matrix $\frac{{\mathit{\sigma}}^{2}}{\mathit{\lambda}}\mathit{K}$. Define $\mathit{\tau}={\mathit{\sigma}}^{2}/\mathit{\lambda}$ so that we can rewrite the covariance matrix as *τK*. If *τ* = 0, then this directly implies that *h* = 0. Because *K* must be positive semidefinite under the LSKM model21 (with diagonal elements equaling 1 with any of the suggested kernel functions), it also follows that *h* = 0 only when *τ* = 0. Therefore, a test of *H*_{0}: *τ* = 0 in the linear mixed model (Equation 6) is equivalent to testing *H*_{0}: *h* = 0 in the semiparametric model (Equation 1).

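To test *H*_{0}: *τ* = 0, we propose the use of the score statistic of Liu et al.20 The score statistic takes the form

$$S_\tau = \frac{\left(Y - X\widehat{\beta}\right)^{T} K \left(Y - X\widehat{\beta}\right)}{2\widehat{\sigma}^{2}} \qquad (9)$$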

where $\widehat{\mathit{\beta}}$ and ${\widehat{\mathit{\sigma}}}^{2}$ are the maximum-likelihood estimates of *β* and *σ*^{2} under *H*_{0}, which are obtained from the linear-regression model *Y* = *Xβ* + *E*. Because *τ* ≥ 0, we are testing the parameter of interest on its boundary value. As a result, *S*_{τ} does not follow a standard ${\chi}_{1}^{2}$ distribution under *H*_{0} and, instead, follows a complicated mixture of ${\chi}_{1}^{2}$ distributions. To simplify inference, we use a Satterthwaite procedure (described in Appendix A) to approximate the distribution of *S*_{τ}.
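The testing procedure can be sketched as follows. Because Appendix A is not reproduced in this section, the moment matching below is an assumption: it takes the statistic to be the quadratic form (*Y* − *X*β̂)^{T}*K*(*Y* − *X*β̂)/(2σ̂^{2}), matches its approximate null mean and variance to a scaled chi-square distribution κχ²_{ν}, and reduces the variance to account for estimating *σ*^{2}. The function names, the toy data, and the intercept column in *X* are hypothetical, and the details may differ from the authors' procedure.

```python
import numpy as np

def score_test_moments(Y, X, K):
    """Score-type statistic for H0: tau = 0 plus Satterthwaite constants."""
    n, p = X.shape
    # Null fit: ordinary least squares of Y on X.
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / n               # ML estimate under H0
    stat = resid @ K @ resid / (2.0 * sigma2_hat)
    # Projection onto the orthogonal complement of the columns of X.
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    PK = P0 @ K
    mean = np.trace(PK) / 2.0
    # Variance of the quadratic form, reduced for estimating sigma^2.
    var = np.trace(PK @ PK) / 2.0 - 2.0 * mean**2 / (n - p)
    kappa = var / (2.0 * mean)                   # scale of the chi-square
    nu = 2.0 * mean**2 / var                     # effective degrees of freedom
    return stat, kappa, nu

def satterthwaite_p_value(stat, kappa, nu):
    """Upper-tail p value from the kappa * chi-square(nu) approximation."""
    from scipy.stats import chi2                 # SciPy assumed to be installed
    return float(chi2.sf(stat / kappa, df=nu))

# Toy null data (arbitrary illustrative values).
rng = np.random.default_rng(3)
n, n_snps = 60, 5
G = rng.binomial(2, 0.3, size=(n, n_snps))
K = (2.0 - np.abs(G[:, None, :] - G[None, :, :])).sum(axis=2) / (2.0 * n_snps)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

stat, kappa, nu = score_test_moments(Y, X, K)
```

Calling `satterthwaite_p_value(stat, kappa, nu)` then gives an approximate p value without permutation; the effective degrees of freedom `nu` is generally not an integer.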

### Simulations

We used simulations to assess the performance of our semiparametric approach in a typical candidate-gene study. For genetic data, we used simulated tagSNP data based on the Centre d'Etude du Polymorphisme Humain (CEU) genotypes from build 35 of the International HapMap Project.4 We based our simulations on the LD structure of two genes: *CHI3L2* (MIM 601526) and *NAT2* (MIM 243400). *CHI3L2* is 15.8 kb long, with 37 polymorphic SNPs in the CEU sample. *NAT2* spans 9.9 kb, with 20 polymorphic SNPs in the same sample. Within each gene, we selected tagSNPs by using the Tagger program.3 We allowed for multimarker tagging and captured all polymorphic markers in each gene with *R*^{2} > 0.8, regardless of the marker's minor-allele frequency. Using these criteria, we identified ten tagSNPs for *CHI3L2* and seven tagSNPs for *NAT2*. We show the LD structure of the tagged and nontagged SNPs within *CHI3L2* and *NAT2* in Figures 1 and 2, respectively. Within each gene, we applied PHASE22–24 to the genetic data to estimate haplotype frequencies for the encompassed SNPs. We then generated relevant SNP genotype data at each gene for each subject by using these estimated haplotype frequencies under the assumption of Hardy-Weinberg equilibrium.

To ensure that our semiparametric approach had appropriate size, we first considered simulations under null models where none of the SNPs within the gene had an effect on our trait of interest. However, we did allow for trait-influencing effects from environmental predictors. Therefore, we simulated trait data under the following null model:

$$Y_j = X_{E_j}^{T}\beta_E + e_j \qquad (10)$$

Here, ${\mathit{X}}_{{\mathit{E}}_{\mathit{j}}}$ denotes the coding vector of environmental covariates for subject *j* with respective effect-size vector *β*_{E}. We assumed that ${\mathit{X}}_{{\mathit{E}}_{\mathit{j}}}$ contained both a binary covariate (with frequency of exposure of 0.506) and a continuous covariate (assumed to be normally distributed with mean 29.2 and variance 21.1). The assumed parameterization for the covariates closely mirrored those of relevant covariates in the FUSION study of type 2 diabetes.25 We assumed that the effect size was 0.50 for the binary covariate and 0.03 for the continuous covariate. Finally, we let *e*_{j} denote a random subject-specific error term for subject *j*, which we generated under a normal distribution with mean 0 and variance 1.

We next considered simulations under alternative models where we selected one of the SNPs within the gene to serve as the QTL. We allowed the QTL to be either a typed tagSNP or an untyped SNP but required the variant to have MAF greater than 0.05 (as done elsewhere10,12,14). Within *CHI3L2*, 30 of the 37 polymorphic SNPs fulfilled this criterion, with six of these 30 polymorphisms being tagSNPs. Within *NAT2*, 17 of the 20 polymorphic SNPs fulfilled this criterion, with three of the 17 polymorphisms being tagSNPs. Denoting the QTL as $S^{\ast}$, we generated the trait outcome for subject *j* with the following model:

$$Y_j = X_{G_{j,S^{\ast}}}\beta_{S^{\ast}} + X_{E_j}^{T}\beta_E + e_j \qquad (11)$$

Here, ${\mathit{X}}_{{\mathit{G}}_{\mathit{j},{\mathit{S}}^{\ast}}}$ denotes the coding of the genotype at QTL $S^{\ast}$ for subject *j* with respective effect size ${\mathit{\beta}}_{{\mathit{S}}^{\ast}}$. We considered additive, dominant, and recessive effects of the minor allele and chose ${\mathit{\beta}}_{{\mathit{S}}^{\ast}}$ in each case such that the QTL $S^{\ast}$ explained 3% of the trait variation, which is reasonable given that many complex traits originate from the effects of multiple genes each with small effect. We assumed values for ${\mathit{X}}_{{\mathit{E}}_{\mathit{j}}}$ and *β*_{E} that were the same as those used in the null simulations.
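The choice of QTL effect size can be illustrated with a short calculation. The sketch below rests on assumptions that the text does not spell out: it treats "3% of the trait variation" as a fraction of the total variance (QTL, covariates, and error combined), assumes Hardy-Weinberg genotype frequencies at the QTL, and draws QTL genotypes independently rather than from estimated haplotype frequencies. The function names are hypothetical; the covariate settings are those stated for the simulations.

```python
import numpy as np

# Genotype codings (0, 1, 2 minor alleles) under each genetic model.
CODINGS = {"additive": (0.0, 1.0, 2.0),
           "dominant": (0.0, 1.0, 1.0),
           "recessive": (0.0, 0.0, 1.0)}

def qtl_effect_size(maf, model, other_variance, fraction=0.03):
    """Effect size such that the QTL explains `fraction` of total trait variance."""
    probs = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])  # HWE
    x = np.array(CODINGS[model])
    var_x = probs @ x**2 - (probs @ x) ** 2
    return float(np.sqrt(fraction / (1.0 - fraction) * other_variance / var_x))

def simulate_traits(qtl_genotypes, model, beta_qtl, rng):
    """Draw trait values from the alternative model for given QTL genotypes."""
    n = len(qtl_genotypes)
    binary = rng.binomial(1, 0.506, size=n)
    continuous = rng.normal(29.2, np.sqrt(21.1), size=n)
    x_g = np.array(CODINGS[model])[qtl_genotypes]
    return x_g * beta_qtl + 0.50 * binary + 0.03 * continuous + rng.normal(size=n)

# Variance from covariates and error (assumed to count toward the total).
other_var = 0.50**2 * 0.506 * (1 - 0.506) + 0.03**2 * 21.1 + 1.0
beta = qtl_effect_size(maf=0.085, model="additive", other_variance=other_var)

rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.085, size=300)   # simplification: single SNP under HWE
Y = simulate_traits(genotypes, "additive", beta, rng)
```

If the 3% figure were instead defined relative to the error variance alone, `other_variance` would simply be set to 1.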

For a given simulation design, we generated either 5000 datasets (for null models) or 1000 datasets (for alternative models), each consisting of 300 unrelated subjects. Each dataset contained trait data on all subjects, genotype data for the tagSNPs in the candidate gene, and environmental data on the covariates mentioned earlier. We assumed that we did not observe genotypes at untyped SNPs (even though such untyped SNPs may be QTLs). We analyzed each dataset by using both our proposed semiparametric approach and, as a benchmark, traditional single-tagSNP statistics (modeled under an additive model of allelic effect).

For our semiparametric approach, we analyzed the data three times. First, we used the unweighted IBS kernel in Equation 4. Next, we used the weighted IBS kernel in Equation 5 with weights based on the MAF of the tagSNP. Finally, we used a weighted IBS kernel with weights based on single-tagSNP p values from an independently generated dataset. We wished to evaluate the performance of this last kernel when we simulated the independent dataset both under the same genetic model as and under a different genetic model than that used in our dataset under study. The primary purpose of an independent-dataset simulation under a different genetic model than the one used for the dataset of interest was to address whether inappropriate prior p value weights from an independent dataset affected the size of our semiparametric approach. We investigated this issue by generating the dataset under study with the null model in Equation 10 but generating the independent dataset with the alternative model (Equation 11) assuming a particular SNP as the QTL.

For the single-tagSNP tests, we performed least-squares regression at each tagSNP in the gene under an additive model (allowing for the binary and continuous covariates) and tested the effect of the tagSNP by using a Wald statistic. We retained the largest Wald statistic across the tested tagSNPs and used 5000 permutations of the data to establish the significance of this maximum statistic. We examined type I error and power of the semiparametric and single-tagSNP approaches assuming a nominal significance level of α = 0.05.
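The benchmark procedure can be sketched as below. The text does not state which part of the data was permuted, so this sketch assumes one common choice: shuffling whole genotype rows across subjects, which preserves LD among the tagSNPs and the trait-covariate relationship. The function names and the toy data are hypothetical, and the example uses 200 permutations only to keep the run short.

```python
import numpy as np

def max_wald(Y, covariates, G):
    """Largest single-tagSNP Wald statistic under an additive model.

    Assumes every tagSNP is polymorphic in the sample.
    """
    n = len(Y)
    best = 0.0
    for s in range(G.shape[1]):
        design = np.column_stack([np.ones(n), covariates, G[:, s]])
        coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
        resid = Y - design @ coef
        sigma2 = resid @ resid / (n - design.shape[1])
        cov = sigma2 * np.linalg.inv(design.T @ design)
        best = max(best, coef[-1] ** 2 / cov[-1, -1])
    return best

def permutation_p_value(Y, covariates, G, n_perm=5000, seed=0):
    """Permutation p value for the maximum single-tagSNP Wald statistic."""
    rng = np.random.default_rng(seed)
    observed = max_wald(Y, covariates, G)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle genotype rows: breaks any trait-genotype link only.
        shuffled = G[rng.permutation(len(Y))]
        exceed += int(max_wald(Y, covariates, shuffled) >= observed)
    return (exceed + 1) / (n_perm + 1)

# Toy data with a strong effect at the first tagSNP (arbitrary values).
rng = np.random.default_rng(2)
n = 100
G = rng.binomial(2, 0.3, size=(n, 4))
covariates = np.column_stack([rng.binomial(1, 0.5, size=n), rng.normal(size=n)])
Y = 1.0 * G[:, 0] + 0.5 * covariates[:, 0] + rng.normal(size=n)

p_value = permutation_p_value(Y, covariates, G, n_perm=200)
```

Using the maximum statistic inside each permutation is what accounts for testing several correlated tagSNPs at once.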

## Results

Table 1 provides the empirical type I error results at nominal α = 0.05 for our semiparametric method assuming the different IBS-based kernels described in the Material and Methods. These results suggest our semiparametric approach has appropriate size regardless of the choice of kernel. In particular, we note that our semiparametric approach using p value weights has appropriate size when we select weights by using a dataset that is generated under a different model (i.e., is genetically heterogeneous) from that used for the dataset under study. This result is important because it suggests that the choice of inappropriate p value weights does not affect the size of our score statistic and, hence, does not affect the validity of our semiparametric approach. For comparison, we analyzed the same datasets by using the maximum of the single-tagSNP statistics, which also had appropriate size.

Figure 3 shows power results for simulations based on the *CHI3L2* gene. The x axis of the figure shows the *CHI3L2* SNP used as the QTL in the simulation, as well as the SNP's MAF. The y axis shows the power of our semiparametric approach using IBS kernels weighted by either the tagSNPs' MAFs or the tagSNPs' p values from an independently generated dataset. The y axis also shows the power of the maximum of the single-tagSNP statistics, which serves as a benchmark for our proposed semiparametric approaches. The plots show that our proposed semiparametric approach using a weighted IBS kernel based on tagSNPs' p values clearly has optimal performance relative to the other approaches shown in the figure, regardless of the genetic model used to simulate the data, the nature of the SNP used as the QTL (i.e., tagSNP or untyped SNP), and the SNP's MAF. This increased power is hardly surprising, given that the approach using a kernel weighted by p values is the only one of the three shown that uses additional information from an independent dataset to assist in inference.

Although the IBS kernel weighted by MAF displays lower power than the IBS kernel weighted by p values, Figure 3 shows that the former kernel is still generally more powerful than the maximum of the single-tagSNP statistics across QTLs and genetic models. There are a few situations where this condition does not hold, however. In particular, under an additive model, results show that the maximum of single-tagSNP statistics is more powerful than the weighted IBS kernel based on MAF for QTL SNPs with MAF < 0.10 (e.g., SNP rs2182115, MAF = 0.085). However, this power difference between the two approaches substantially decreases for dominant and recessive genetic models.

Figure 4 shows analogous power results for simulations based on the *NAT2* gene. Overall, we observed similar power results for this gene compared to that of the *CHI3L2* gene. Our semiparametric method using the IBS kernel weighted by p values substantially outperformed the other competing approaches across all genetic models tested, although the difference was most pronounced under a dominant model. The semiparametric approach weighted by MAF generally exhibited greater power than the maximum of the single-tagSNP statistics across the tested SNPs and genetic models. The differences in power were most pronounced under dominant and recessive models. We anticipate this finding because the semiparametric approach uses a nonparametric approximation of the tagSNP effect [via *h*(·) in Equation 1] that makes the approach robust to the effects of model misspecification (unlike traditional tag-SNP tests that typically assume a parametric additive model). We also note the low power observed for all methods at one particular marker, rs1961456. As seen in Figure 2, this marker displays comparatively weak LD with the other SNPs in the gene, which leads to relatively low power by all methods to detect the association between the trait and this particular SNP.

To simplify presentation, we did not show power results for the unweighted IBS kernel (Equation 4) in Figures 3 and 4. Overall, the performance of the unweighted IBS kernel was similar to that of the IBS kernel weighted by MAF with a few notable differences. For QTL SNPs with MAF > 0.10, we found that the unweighted IBS kernel had equivalent or slightly improved power compared to the IBS kernel weighted by MAF. However, for QTL SNPs with MAF < 0.10, we found that the unweighted IBS kernel could have substantially reduced power relative to the IBS kernel weighted by MAF. For example, assuming an additive model where the QTL SNP was rs2182115 (MAF = 0.085) in *CHI3L2*, we found that the power of the unweighted IBS kernel was 0.327 compared to 0.498 for the IBS kernel weighted by MAF. This result suggests that, without weighting, the effects of QTL SNPs with rare MAFs may be smoothed over by information from surrounding SNPs with more common MAFs. Because the IBS kernel weighted by MAF appears to have better performance averaged across the range of MAF compared to the unweighted IBS kernel, we recommend the use of the former kernel over the latter in association analysis.
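To make the role of the weights concrete, the following minimal Python sketch computes a weighted IBS similarity between two subjects. It assumes that genotypes are coded as minor-allele counts (0, 1, or 2), so that the number of alleles shared IBS at a SNP is 2 − |g_i − g_j|, and it assumes that the kernel is a weight-normalized average of IBS sharing across tagSNPs. The function name `weighted_ibs` and the example weights are hypothetical and are not taken from the kernel equations of this article.

```python
from typing import Optional, Sequence


def weighted_ibs(g_i: Sequence[int], g_j: Sequence[int],
                 weights: Optional[Sequence[float]] = None) -> float:
    """Weighted IBS similarity in [0, 1] between two genotype vectors.

    Assumption of this sketch: genotypes are minor-allele counts (0/1/2)
    and the kernel is sum_s w_s * IBS_s / (2 * sum_s w_s).
    """
    if len(g_i) != len(g_j):
        raise ValueError("genotype vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(g_i)  # unweighted IBS kernel
    if len(weights) != len(g_i):
        raise ValueError("need one weight per tagSNP")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Alleles shared IBS at each SNP: 2 - |g_i - g_j|.
    shared = sum(w * (2 - abs(a - b)) for w, a, b in zip(weights, g_i, g_j))
    return shared / (2.0 * total)


# Two subjects who differ only at the first (hypothetically rare) SNP.
subject_a = [0, 0]
subject_b = [2, 0]
unweighted = weighted_ibs(subject_a, subject_b)            # 0.5
upweighted = weighted_ibs(subject_a, subject_b, [3.0, 1.0])  # 0.25
```

In this toy example, upweighting the first SNP makes the difference at that SNP count more heavily, so the two subjects appear less similar than under the unweighted kernel. This is the sense in which weighting can keep a rare QTL SNP from being smoothed over by more common neighbors.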

Although primary interest focuses on the testing of the nonparametric function *h*, secondary interest may focus on the estimation and testing of environmental covariate effects. Table 2 shows estimates of the mean and standard deviation, along with the empirical standard deviation, of the regression parameters related to the binary and continuous covariates used in our simulations. Because of the large number of SNPs and models examined, we display results only for one representative configuration of both the *NAT2* and *CHI3L2* genes. These examples show that the semiparametric-regression method produces unbiased estimates of the covariate effects with empirical standard deviations that closely match the LSKM-based standard deviations. We observed similar results for other simulation models (results not shown).

## Discussion

In this article, we have proposed a flexible semiparametric-regression framework for association mapping of quantitative traits that uses genotype data from multiple tagSNPs within a region of interest. Using simulated genetic data based on real data from the International HapMap Project,^{4} we demonstrated that our approach often has superior performance compared to tests of individual tagSNPs, which is the most common approach for association mapping of complex traits. Our method's improved performance results from modeling the effects of multiple tagSNPs within a reduced-dimension function, thereby using more genetic information in analysis but producing test statistics (based on the function) with smaller degrees of freedom than typical multivariate methods. In addition to improved power, our approach is also quite flexible because it can easily adjust for the effects of potential confounders (such as subpopulation assignment in a stratified population) and, further, can evaluate interaction effects among tagSNPs and environmental factors (by modeling such interactions parametrically or nonparametrically with the function *h* in Equation 1). By fitting the semiparametric model with LSKM, we showed that we can fit the model easily by using common maximization procedures for linear mixed models, which are available in a variety of software packages. The approach is computationally efficient to implement; analysis of 1000 replicates of simulated data (with the design described in the Simulations section) took only 5 min to run on a Dell Latitude D810 with a 2.26 GHz processor. We provide SAS and Fortran code for implementing the approach on our website (Epstein Software).

We applied our semiparametric approach to the problem of testing whether a specific region influenced a quantitative trait of interest. However, with some effort, we can extend our approach to create a multilocus association test for genome-wide association studies. Specifically, we can implement our approach by using a sliding-window process that considers overlapping or nonoverlapping sets of tagSNPs across each chromosome. Within a particular window, we can apply our approach to the genotype data from the multiple tagSNPs and produce a statistic for testing whether the tagSNPs within the given window are associated with the trait of interest. After constructing test statistics for each window across the genome, we can establish significance of a particular statistic (taking into account the adjustment for multiple correlated tests) by using either permutations or a more computationally efficient approach based on adjustment of correlated p values.^{26,27} We will investigate this latter approach in a subsequent paper.
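The windowing step can be sketched as follows. This is only an illustration of how overlapping or nonoverlapping sets of tagSNP indices might be enumerated along a chromosome; the function name `sliding_windows` and its `size` and `step` parameters are hypothetical, and the per-window association test itself is not shown.

```python
from typing import List, Tuple


def sliding_windows(n_snps: int, size: int, step: int) -> List[Tuple[int, int]]:
    """Return half-open index ranges (start, end) covering n_snps tagSNPs.

    step < size gives overlapping windows; step == size gives
    nonoverlapping windows. The last window is truncated at n_snps.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    start = 0
    while start < n_snps:
        end = min(start + size, n_snps)
        windows.append((start, end))
        if end == n_snps:  # reached the end of the chromosome
            break
        start += step
    return windows


overlapping = sliding_windows(10, size=4, step=2)
nonoverlapping = sliding_windows(10, size=4, step=4)
```

Each returned range would select the genotype columns passed to the multilocus test for that window, with the resulting window statistics then adjusted for multiple correlated tests as described above.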

As with traditional multilocus genotype and haplotype analyses, we were primarily interested in applying our semiparametric approach to regions of modest size containing tagSNPs in various degrees of LD with one another and, presumably, the QTL of interest. Nevertheless, we conducted additional simulations examining the stability and performance of our semiparametric approach in situations where the region of interest (and the number of modeled tagSNPs) was considerably larger. For example, using the HapMap CEU sample, we conducted simulations using 33 tagSNPs contained within the 74 kb HNF4α gene (MIM 600281) and found that our approach always converged properly and had appropriate type I error (results not shown). Regarding power, we found that the performance of our semiparametric approach using p value weights was still improved over the single-locus approach as the number of tagSNPs and the length of region considered increased. However, using MAF weights, we found that the performance of our method became quite similar to the single-locus method as the length of the region of interest (and the number of tagSNPs) increased. We explain this result by noting that, as the size of the region of interest increases, the chance of including tagSNPs that are uncorrelated with the true QTL also increases. Such uncorrelated tagSNPs only introduce noise into our method, which makes the true signal from the QTL more challenging to find. In these situations, we recommend applying our approach within a sliding-window framework, described in the previous paragraph, that considers smaller sets of tagSNPs and thereby decreases the chance of including tagSNPs uncorrelated with the QTL within analysis.

An appealing feature of our semiparametric approach is that it can utilize prior information (in the form of weights) to improve one's ability to detect trait-influencing regions. Within this article, we considered both MAF weights and p value weights for inference. Other weights are certainly possible (e.g., when gene information is used) and, further, such weights could actually be composite weights that combine information from different sources (e.g., MAF and p values). In this situation, we would first normalize the separate weights to be on the same scale and then develop the composite weight as an average of these scaled weights. We could further modify these composite weights to emphasize one particular source (e.g., p values) over the others in analysis, if so desired.
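A minimal sketch of such a composite weight is given below. It assumes that "the same scale" means rescaling each weight vector to sum to 1 and that emphasis on a particular source is expressed through a nonnegative multiplier per source; both choices, and the name `composite_weights`, are assumptions of this illustration rather than a prescription from the article.

```python
from typing import List, Optional, Sequence


def composite_weights(sources: Sequence[Sequence[float]],
                      emphasis: Optional[Sequence[float]] = None) -> List[float]:
    """Average several per-tagSNP weight vectors after rescaling each to sum to 1.

    sources:  one weight vector per information source (e.g., MAF-based, p-value-based).
    emphasis: optional nonnegative multipliers, one per source (default: equal).
    """
    if not sources:
        raise ValueError("need at least one weight source")
    n = len(sources[0])
    if any(len(s) != n for s in sources):
        raise ValueError("all sources must cover the same tagSNPs")
    if emphasis is None:
        emphasis = [1.0] * len(sources)
    if len(emphasis) != len(sources):
        raise ValueError("need one emphasis value per source")
    total_emphasis = float(sum(emphasis))
    if total_emphasis <= 0:
        raise ValueError("emphasis values must sum to a positive value")
    # Put every source on a common scale (assumption: unit sum).
    scaled = []
    for s in sources:
        total = float(sum(s))
        if total <= 0:
            raise ValueError("each weight vector must sum to a positive value")
        scaled.append([w / total for w in s])
    # Emphasis-weighted average of the rescaled weights, SNP by SNP.
    return [
        sum(e * s[k] for e, s in zip(emphasis, scaled)) / total_emphasis
        for k in range(n)
    ]


maf_based = [1.0, 3.0]      # hypothetical weights from one source
pvalue_based = [2.0, 2.0]   # hypothetical weights from another source
equal_mix = composite_weights([maf_based, pvalue_based])
favor_first = composite_weights([maf_based, pvalue_based], emphasis=[3.0, 1.0])
```

Because each source is rescaled before averaging, no single source dominates merely because its raw weights happen to be numerically larger.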

Of the weights we considered, the most appealing choice is to use the strength of evidence for association between each tagSNP and the trait of interest (or a correlated trait) from an independent study. We quantify this strength on the basis of the −log_{10} of the relevant p value. To obtain such p values, one could conduct an exhaustive literature search of relevant genetic studies of interest. However, we note that such p value weights will become increasingly available with the public release of tagSNP genotype and phenotype data from whole-genome association studies into free databases (often a requirement for National Institutes of Health [NIH] funding of such projects). An example of such a database is the NIH-sponsored dbGaP, which will eventually contain information on at least ten whole-genome association studies of complex traits. Also, if a study happens to have p value weights available for certain tagSNPs but not others, then one can apply imputation procedures^{28,29} to obtain p values for these untyped variants by using information from nearby SNPs coupled to LD patterns from references sampled from the HapMap project.^{4} Finally, we strongly recommend against using p value weights based on single-tagSNP analysis of the same dataset upon which one intends to apply the proposed semiparametric approach. Such an application will lead to anticonservative tests (results not shown).
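As a small illustration of the −log_{10} transformation, the sketch below converts p values from an independent study into per-tagSNP weights. The function name `pvalue_weights` and the example p values are hypothetical.

```python
import math
from typing import List, Sequence


def pvalue_weights(p_values: Sequence[float]) -> List[float]:
    """Convert independent-study p values into -log10(p) weights."""
    weights = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p values must lie in (0, 1]")
        weights.append(-math.log10(p))
    return weights


# Hypothetical p values for three tagSNPs from an independent dataset.
weights = pvalue_weights([0.001, 0.05, 1.0])
```

Smaller p values yield larger weights, and a p value of 1 yields a weight of zero, so a tagSNP with no prior evidence of association would contribute nothing under this particular scaling.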

In implementing our approach, we assumed no missing genotype data for the tagSNPs in the region of interest. Although our approach does not naturally accommodate missing genotype data within the nonparametric function, we note that we can use existing statistical procedures for imputing genotype data for a given subject to resolve this issue. Such imputation procedures can rely on the LD structure of nearby SNPs to predict a subject's missing genotype by using either observed genotype data from the study sample^{30} or appropriate genotype data from the International HapMap project.^{31} Once we impute missing genotypes, we can then incorporate them within our nonparametric function and proceed with analysis as we previously described.

Although we have developed our approach for association analysis of quantitative trait data, we note that we can extend our approach to conduct similar multi-SNP association analysis in case-control studies of disease. For such analyses, we would consider a semiparametric logistic-regression model for a binary outcome ($Y_j$ = 1 and 0 for cases and controls, respectively) with the form $\log\{\mu_j/(1-\mu_j)\} = X_j^T\beta + h(G_j),$ where $\mu_j = P[Y_j = 1 \mid G_j, X_j]$ and $G_j$, $X_j$, $\beta$, and $h(\cdot)$ are defined as in Equation 1. Maximization of parameters in this semiparametric logistic-regression model requires the use of a modified LSKM algorithm that is similar to Liu et al.^{20} but correctly models the categorical nature of the disease data. As we will describe more thoroughly in a subsequent paper, we can conduct this LSKM analysis analogously by using a logistic mixed model with the form $\log\{\mu_j^h/(1-\mu_j^h)\} = X_j^T\beta + h,$ where $X_j$, $\beta$, and $h$ are defined as previously and $\mu_j^h = E[Y_j \mid X_j, h]$. We assume that the random tagSNP effects in $h$ follow a multivariate normal distribution with mean vector 0 and variance-covariance matrix $\lambda^{-1}K$, where $\lambda$ denotes the smoothing parameter and $K$ denotes the chosen kernel matrix. Under these conditions, we can maximize this nonlinear mixed model with a corrected penalized quasi-likelihood algorithm^{32} and estimate the nonparametric function $\widehat{h}$ in the LSKM model by $\widehat{h}$ in the logistic mixed model. We can then apply a score statistic similar to that of Liu et al.^{20} to test the nonparametric function of the genotype data. Although the iterative nature of the penalized quasi-likelihood algorithm will increase the numerical complexity of the semiparametric analysis, it should still be computationally efficient for candidate-gene or whole-genome association analysis.

Our approach fits a semiparametric regression model using LSKM, which we show corresponds to inference via a specific linear mixed model. Although mixed-modeling procedures often are connected to pedigree analysis,^{33–35} we note that their elegance and flexibility make them increasingly popular tools for association mapping in population-based or case-control studies. Tzeng and Zhang^{36} have proposed a powerful mixed model for SNP-based haplotype analysis of complex traits that models the covariance of the outcomes among a pair of subjects as a function of their (inferred) haplotype similarity along a region of interest. The distribution of the authors' random effect has similarity to the distribution of the random tagSNP effect in our linear mixed model, although the authors' approach is not based on the use of reproducing kernels in an LSKM framework. Further, their approach focuses primarily on use of SNP-based haplotypes in their covariance structure and does not consider the use of influential and valuable prior weights in analysis. Another mixed-model tool for such a study consists of a two-level hierarchical model.^{37,38} The first level of the hierarchical model regresses the trait outcome on the SNPs of interest (and potential confounders), whereas the second level models the SNP-related risk parameters as a function of influential covariates including the underlying haplotype structure^{39} or available pathway information.^{40,41} Such second-level information can improve the precision and accuracy of SNP-based risk estimates.

Because our semiparametric approach is implemented in a linear mixed model, we implicitly assume that the trait data follow or can be transformed to follow approximate normality. With mixed-model-based linkage analysis of quantitative traits,^{34} violation of this normality assumption can yield inflated type I error rates to detect linkage if the trait distribution is leptokurtic in nature.^{42} To examine whether our semiparametric approach is similarly sensitive to nonnormality of the trait outcome, we conducted additional type I error simulations that generated trait data under various nonnormal distributions (e.g., gamma and log-normal distributions) with large kurtosis values. In all trait simulations, we found that our semiparametric approach had appropriate type I error under the null hypothesis (results not shown) and hence does not appear to be sensitive to nonnormality of the trait data.

## Appendix A

### Approximate Distribution of the Score Statistic $S_\tau$ in Equation 9

We consider the linear mixed model described previously in Equation 6:

$$Y = X\beta + h + E,$$

where *Y* is the vector of quantitative trait values, *X* is the design matrix of covariates with fixed-effect coefficient vector *β*, *h* is the vector of random tagSNP effects and follows a multivariate normal distribution with mean 0 and variance-covariance matrix *τK*, and *E* is a vector of subject-specific random effects and follows a multivariate normal distribution with mean 0 and variance-covariance matrix *σ*^{2}*I*.
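As a short worked step, and assuming (as is usual for mixed models of this kind) that the random tagSNP effects *h* and the subject-specific effects *E* are independent, the distributional assumptions above imply the following marginal model for the trait vector:

$$Y \sim N\left(X\beta,\; \tau K + \sigma^{2} I\right).$$

Under this reading, the tagSNPs influence the trait only through the covariance term $\tau K$, so setting $\tau = 0$ reduces the model to an ordinary linear regression of the trait on the covariates. This is why the test of the nonparametric tagSNP effect can be framed as a test of a single variance component.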

Using the mixed model in Equation 6, we seek to determine the distribution of the score statistic in Equation 9 for testing $H_0$: $\tau = 0$. Zhang and Lin^{43} noted that, because $\tau \geq 0$, we are testing the parameter on its boundary value, and, as a result, the distribution of $S_\tau$ follows a mixture of $\chi_1^2$ distributions. To facilitate inference, the authors showed that one can approximate this complicated mixture distribution with a scaled $\chi^2$ distribution $\delta\chi_\nu^2$, where $\delta$ denotes the scale parameter and $\nu$ denotes the degrees of freedom. To estimate $\delta$ and $\nu$, the authors suggested the use of the Satterthwaite method, which equates the mean and variance of the score statistic $S_\tau$ in Equation 9 with the mean and variance of $\delta\chi_\nu^2$.

Let $e$ denote the mean of $S_\tau$ and let $I_{\tau\tau}$ denote the variance of the score statistic. When calculating the mean and variance of $S_\tau$, we must account for the fact that we use estimates of $\sigma^2$ and $\beta$ instead of the true values of these parameters in Equation 9. Therefore, we replace the mean $e$ with $\tilde{e} = \mathrm{tr}(P_0 K)/2$, where $P_0 = I - X(X^T X)^{-1} X^T$ is the projection matrix under the null hypothesis. Also, we replace the variance $I_{\tau\tau}$ with the efficient information $\tilde{I}_{\tau\tau}$ as follows:

$$\tilde{I}_{\tau\tau} = I_{\tau\tau} - I_{\tau\sigma^2}\, I_{\sigma^2\sigma^2}^{-1}\, I_{\tau\sigma^2}^{T},$$

where $I_{\tau\tau} = \mathrm{tr}\{(P_0 K)^2\}/2$, $I_{\tau\sigma^2} = \mathrm{tr}(P_0 K P_0)/2$, and $I_{\sigma^2\sigma^2} = \mathrm{tr}(P_0^2)/2$.

Once we obtain $\tilde{e}$ and $\tilde{I}_{\tau\tau}$, we can set the former equal to $\delta\nu$ (the mean of a $\delta\chi_\nu^2$ random variable) and the latter equal to $2\delta^2\nu$ (the variance of a $\delta\chi_\nu^2$ random variable). After solving the system of equations, we calculate the scale parameter for the approximate distribution as $\delta = \tilde{I}_{\tau\tau}/(2\tilde{e})$ and calculate the degrees of freedom as $\nu = 2\tilde{e}^2/\tilde{I}_{\tau\tau}$. We can then compare the value of the resulting scaled score statistic, $S_\tau/\delta$, to a chi-square distribution with $\nu$ degrees of freedom in order to assess significance of the test of $H_0$: $\tau = 0$.
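The moment-matching calculation in this appendix can be sketched numerically as follows. This is a minimal illustration that assumes NumPy is available. It uses the usual scalar form of the efficient information, $I_{\tau\tau} - I_{\tau\sigma^2}^2/I_{\sigma^2\sigma^2}$, which is an assumption of this sketch. The function name `satterthwaite_params` and the toy covariate and kernel matrices are hypothetical.

```python
import numpy as np


def satterthwaite_params(X, K):
    """Return (e_tilde, I_eff, delta, nu) for the scaled chi-square approximation.

    X: n x p covariate design matrix under the null model.
    K: n x n kernel matrix for the tagSNPs.
    """
    X = np.asarray(X, dtype=float)
    K = np.asarray(K, dtype=float)
    n = X.shape[0]
    # Null projection matrix P0 = I - X (X^T X)^{-1} X^T.
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    P0K = P0 @ K
    e_tilde = np.trace(P0K) / 2.0          # tr(P0 K) / 2
    I_tt = np.trace(P0K @ P0K) / 2.0       # tr{(P0 K)^2} / 2
    I_ts = np.trace(P0K @ P0) / 2.0        # tr(P0 K P0) / 2
    I_ss = np.trace(P0 @ P0) / 2.0         # tr(P0^2) / 2
    # Efficient information (scalar form; assumption of this sketch).
    I_eff = I_tt - I_ts ** 2 / I_ss
    delta = I_eff / (2.0 * e_tilde)        # scale parameter
    nu = 2.0 * e_tilde ** 2 / I_eff        # degrees of freedom
    return e_tilde, I_eff, delta, nu


# Toy example: intercept-only null model and a linear kernel from one SNP.
X = np.ones((4, 1))
g = np.array([[0.0], [1.0], [2.0], [1.0]])
K = g @ g.T  # illustrative kernel only, not the IBS kernel of the article
e_tilde, I_eff, delta, nu = satterthwaite_params(X, K)
```

By construction, the returned values satisfy $\delta\nu = \tilde{e}$ and $2\delta^2\nu = \tilde{I}_{\tau\tau}$. Given an observed score statistic, one would then refer $S_\tau/\delta$ to a chi-square distribution with $\nu$ degrees of freedom, for example with `scipy.stats.chi2.sf` if SciPy is available, which accepts noninteger degrees of freedom.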

## Web Resources

The URLs for data presented herein are as follows:

- Epstein Software, http://www.genetics.emory.edu/labs/epstein/software
- Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/

## Acknowledgments

This work was sponsored by NIH grants GM074909 (to L.C.K.), HG003618 (to L.C.K. and M.P.E.), and CA76404 (to X.L.).

## References

**American Society of Human Genetics**

- A Powerful and Flexible Multilocus Association Test for Quantitative Traits. American Journal of Human Genetics. Feb 8, 2008; 82(2):386.
