
# Haplotype-Based Association Analysis via Variance-Components Score Test

## Abstract

Haplotypes provide a more informative format of polymorphisms for genetic association analysis than do individual single-nucleotide polymorphisms. However, the practical efficacy of haplotype-based association analysis is challenged by a trade-off between the benefits of modeling abundant variation and the cost of the extra degrees of freedom. To reduce the degrees of freedom, several strategies have been considered in the literature. They include (1) clustering evolutionarily close haplotypes, (2) modeling the level of haplotype sharing, and (3) smoothing haplotype effects by introducing a correlation structure for haplotype effects and studying the variance components (VC) for association. Although the first two strategies enjoy a fair extent of power gain, empirical evidence showed that VC methods may exhibit only similar or less power than the standard haplotype regression method, even in cases of many haplotypes. In this study, we report possible reasons that cause the underpowered phenomenon and show how the power of the VC strategy can be improved. We construct a score test based on the restricted maximum likelihood or the marginal likelihood function of the VC and identify its nontypical limiting distribution. Through simulation, we demonstrate the validity of the test and investigate the power performance of the VC approach and that of the standard haplotype regression approach. With suitable choices for the correlation structure, the proposed method can be directly applied to unphased genotypic data. Our method is applicable to a wide-ranging class of models and is computationally efficient and easy to implement. The broad coverage and the fast and easy implementation of this method make the VC strategy an effective tool for haplotype analysis, even in modern genomewide association studies.

Haplotypes of multiple SNPs are considered a more informative format of polymorphisms for genetic association analysis than single SNPs.^{1} Haplotypes are more informative because they preserve the joint linkage disequilibrium (LD) structure among multiple adjacent markers.^{2} Even when only tag SNPs are used, haplotypes serve as a proxy for unobserved SNPs and increase the predictive power for the genomic variation.^{3}^{,}^{4} However, in terms of practical efficacy, the power of haplotype-based association analysis is challenged by a trade-off between the benefits of modeling abundant variation and the cost of the extra degrees of freedom for modeling the multimarker variations. To avoid the curse of dimensionality encountered in haplotype association analysis, various strategies have been proposed in the literature. They include (1) clustering evolutionarily close haplotypes,^{5}^{–}^{8} (2) modeling the level of haplotype sharing instead of the haplotypes themselves,^{9}^{–}^{11} and (3) smoothing haplotype effects by introducing a correlation structure for the effects of similar haplotypes.^{12}^{–}^{14} Although these strategies appear to be different, the fundamental principle is to use the evolutionary history of haplotypes to reduce the parameter space from individual haplotypes to haplotypes with similar ancestry. However, although the approaches of haplotype clustering and haplotype sharing enjoy a fair amount of power gain, empirical studies found that the smoothing approach may exhibit only similar or less power than the standard methods that regress trait values on haplotypes and impose no assumptions on haplotypes, even when there are many haplotypes.^{14}

In haplotype smoothing, a dependence structure is introduced to the effects of different haplotypes, according to the similarity between haplotypes, under a Bayesian hierarchical model or a mixed-model framework, and the overall gene-trait association can be studied via the variance components (VC).^{12}^{–}^{14} The idea of correlating haplotype effect is based on the assumption that the present mutation-bearing haplotypes have descended from a small number of ancestral haplotypes, and, as a result, the disease haplotypes tend to be correlated because of this shared ancestry. Without losing generality, in this work, we refer to these methods as “VC” approaches and discuss them under a mixed-model framework. We also refer to the standard haplotype regression method as a “fixed-effect” approach. Schaid^{14} first noted the underpowered phenomenon of the VC method, using the likelihood-ratio test (LRT), and explored potential reasons based on the noncentrality (NC) parameter of the distribution of the LRT statistics. The NC parameter reflects the distance between the alternative distribution and the null distribution of the test statistics, and the larger the null-to-alternative distance is, the higher the power a test possesses. By expressing the NC parameter as a function of heritability (*h*^{2}), it can be seen that, although the NC parameter of a fixed-effect model is proportional to , the NC parameter of a VC model is much smaller (proportional to *h*^{4}). As a result, the power gain brought by the low degrees of freedom can be compromised with the small NC parameter in a VC-LRT approach.

Here, we report other key factors that contribute to this underpowered phenomenon. In brief, unlike the usual VC model in which the VC represents the potential variability from a source that is independently distributed in the population (e.g., the family effect in the study of linkage or familial aggregation), in the population-based haplotype analysis, the source of variability is not independent. That is, the design matrix of the random haplotype effect does not have a diagonal or block-diagonal structure. Furthermore, the dimension of the random haplotype effect is fixed. Therefore, the data under the alternative hypothesis cannot be represented as a collection of independent data vectors. As a result, the distribution of the LRT statistic does not converge to the conventional 50:50 mixture of χ^{2}_{0} and χ^{2}_{1} (i.e., the limiting distribution predicted by the usual asymptotic theory^{15}). Instead, empirical evidence indicates that the distribution of the VC-LRT statistic places more weight on χ^{2}_{0}. Hence, the threshold value obtained from the 50:50 χ^{2} mixture is overly stringent and yields an overly conservative test. Similar overconservative behavior of the LRT was also observed by Crainiceanu and Ruppert^{16} in certain linear mixed models.

To overcome the problem of a lack of independence and also to generalize the VC approach to all types of trait values, we propose a score test under the generalized linear mixed-model (GLMM) framework. Specifically, we construct a score statistic based on the restricted maximum likelihood (REML) or the marginal likelihood function of the VC and identify its nontypical asymptotic distribution. The proposed test is easy to implement and computationally efficient yet is general enough to accommodate a broad class of phenotypes and correlation structures. It allows for covariate information and can be used for phase-unknown genotypic data. Through simulation, we demonstrate the validity of the test and investigate the power performance of the VC approach and the fixed-effect approach under general scenarios. We also apply the proposed method to a case-control data set from a genomewide association study of amyotrophic lateral sclerosis (ALS) conducted by Schymick et al.^{17} In the analysis, we test for gene-trait association on chromosome 10 with the 275 ALS cases and 271 controls and examine statistical significance at the genomewide level. We verify the findings from the proposed method by comparing them with the results reported by Schymick et al.^{17}

## Material and Methods

### VC Method for Association Analysis

We denote the data with the following notation. For individual *i* (*i*=1,2,…,*n*), we have trait value *Y*_{i}, environmental covariates *X*_{i} (a *K*×1 vector including the intercept term), and haplotype *H*_{i} (an *L*×1 vector, where *L* is the number of distinct haplotypes observed in the population). Vector *H*_{i} records individual *i*’s haplotype pair via a certain scoring rule, such as by setting its *h*th element as the number of copies of haplotype *h* that individual *i* carries. Throughout this article, we treat explanatory variables (e.g., *X*_{i} and *H*_{i}) as constants and will omit them in the lists of the conditional variables. This means that, for example, we will use *Var*(*Y*_{i}) instead of *Var*(*Y*_{i}|*X*_{i},*H*_{i}).

Assume that the trait value *Y*_{i} follows some distribution with conditional mean *E*(*Y*_{i}|β)=μ_{i} and conditional variance *Var*(*Y*_{i}|β)=φ*m*_{i}^{-1}*v*(μ_{i}), where *m*_{i} is a known prior weight (e.g., binomial denominator), φ is the dispersion parameter (e.g., measurement-error variance for a normal quantitative trait), and *v*(·) is the variance function. Then, the VC model can be expressed under the framework of GLMM as

$$g(\mu_i) = X_i^{T}\gamma + H_i^{T}\beta, \qquad \beta \sim MN(0,\ \tau R_\beta), \tag{1}$$

where *g*(·) is a link function that connects the conditional mean μ_{i} and the explanatory variables, γ_{K×1} represents the fixed effect of environmental covariates, and β_{L×1} is the random effect of haplotypes. The haplotype effect is assumed to have a multivariate normal (MN) prior. With model (1), the marginal phenotypic variance, *Var*(*Y*_{i}), can be partitioned into genetic components and environment components, and the association between haplotypes and traits can be detected by testing for zero genetic VC (i.e., τ=0). Intuitively, τ=0 implies that all β_{h} share the same value, and this is essentially the null hypothesis of the standard fixed-effect approaches.

The correlation structure of β_{h} is specified through the *L*×*L* matrix *R*_{β}. Here, we consider a general formulation for *R*_{β} by letting its (*h*,*k*) element, denoted by *r*_{hk}, depend on the similarity level between haplotypes *h* and *k*, which is quantified by a certain similarity metric, *s*(*h*,*k*). One simple choice of the correlation structure is to let *R*_{β}=*I*, where *I* is the identity matrix. This independence structure imposes no correlation among distinct haplotypes and reflects the “unstructured” variation among haplotypes. The independence prior may be reasonable if haplotype variants were created mainly by recombinations instead of mutations. In contrast, one can introduce local-dependence structures to account for the role of mutation and to reflect the conjecture that evolutionarily close haplotypes tend to have similar effects on traits. One convenient choice of such *R*_{β} is the conditional autoregressive (CAR) structure. The CAR structure assumes that all β_{h} are correlated but that the correlation diminishes as the haplotype similarity decays. With our representation, a CAR structure sets *R*_{β}=*C*, where *C*^{-1} has diagonal elements equal to 1 and off-diagonal elements equal to -*s*(*h*,*k*).^{18} Alternatively, to avoid choosing between an independence prior and a sole CAR prior, an intermediate option, in practice, is the convolution model that combines the two: τ*R*_{β}=τ_{1}*I*+τ_{2}*C*.^{12}^{,}^{13} In this work, we focus on the model that was considered by Schaid^{14} and set *r*_{hk}=*s*(*h*,*k*), with 0≤*s*(*h*,*k*)≤1. This model uses the haplotype similarity to reflect the correlation directly. It is more extreme than the convolution model but conceptually simpler, and it compromises between the dependence and the independence priors: it allows for correlation induced by partially similar haplotypes but assumes independence among haplotypes that share zero similarity.
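As a small worked illustration of using the similarity directly as the correlation, consider three hypothetical haplotypes over three SNPs, 000, 001, and 111 (invented for this example), and take *s*(*h*,*k*) to be the proportion of matching alleles, which is one metric that lies between 0 and 1. Under that assumption, the correlation matrix would be

$$R_\beta = \begin{pmatrix} 1 & 2/3 & 0 \\ 2/3 & 1 & 1/3 \\ 0 & 1/3 & 1 \end{pmatrix},$$

so that haplotypes 000 and 001, which differ at one site, have strongly correlated effects, whereas 000 and 111, which share no alleles, are treated as independent.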

### VC-Score Test for Haplotype-Phenotype Association

To motivate our VC-score test for haplotype-phenotype association, we illustrate the method, assuming a normally distributed trait (perhaps after some transformation, such as the logarithm transformation) with a known dispersion parameter, φ. We then present the VC-score test for general scenarios of unknown φ and trait values with an arbitrary distribution. We provide the derivation of the generalization in appendixes A and B.

#### Quantitative traits with known dispersion parameter

For quantitative traits that follow a normal distribution directly or after appropriate transformations, model (1) reduces to a linear mixed model in matrix notation:

$$Y = X\gamma + H\beta + \varepsilon, \tag{2}$$

where *X* is the design matrix for γ, whose *i*th row is *X*^{T}_{i}; *H* is the design matrix for β, whose *i*th row is *H*^{T}_{i}; β~*MN*(0,τ*R*_{β}) is the same as described in model (1); and ε~*N*(0,φ*I*) represents the uncertainty in measuring traits *Y*. Since our primary interest is to test *H*_{0}:τ=0, we consider the REML log-likelihood function of the VC (τ,φ). It is well known that the REML estimating equation for (τ,φ) is unbiased and will produce less biased estimates compared with the maximum-likelihood approach.^{19}

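Denote by ℓ_{REML}(τ,φ;*Y*) the REML log-likelihood function of τ and φ, which is given, up to an additive constant, by

$$\ell_{REML}(\tau,\phi;Y) = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}\log\left|X^{T}V^{-1}X\right| - \tfrac{1}{2}\,Y^{T}PY, \tag{3}$$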

where *V*=τ*HR*_{β}*H*^{T}+φ*I*≡τ*S*+φ*I* is the marginal variance of *Y* and where *P*=*V*^{-1}-*V*^{-1}*X*(*X*^{T}*V*^{-1}*X*)^{-1}*X*^{T}*V*^{-1} is the projection matrix for the linear mixed model (2). The REML log-likelihood function (3) can also be viewed as the marginal log-likelihood of (τ,φ) from the Bayesian perspective obtained by specifying a flat prior for γ and integrating out γ from *f*(*Y*;γ,τ,φ).

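Simple algebra^{20} shows that the score statistic of τ evaluated under *H*_{0} on the basis of the REML function (3) is equal to

$$U_\tau = \left.\frac{\partial \ell_{REML}}{\partial \tau}\right|_{\tau=0} = \tfrac{1}{2}\,Y^{T}P_0SP_0Y - \tfrac{1}{2}\,tr(P_0S), \tag{4}$$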

where *P*_{0}=φ^{-1}{*I*-*X*(*X*^{T}*X*)^{-1}*X*^{T}}=φ^{-1}*Q* is the projection matrix *P* evaluated under *H*_{0}:τ=0 and where *Q*=*I*-*X*(*X*^{T}*X*)^{-1}*X*^{T}. It is immediately seen from equation (4) that *E*(*U*_{τ})=0 under *H*_{0}:τ=0, and, when τ>0, *E*(*U*_{τ})=τ·*tr*(*QSQS*)/(2φ^{2}), which is a strictly increasing function of τ unless *QS*=0. Therefore, larger values of *U*_{τ} provide stronger evidence against *H*_{0}. This suggests that the testing procedure for *H*_{0}:τ=0 using *U*_{τ} should be one sided.
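The expectation quoted above can be checked in a few lines. The sketch below assumes that the score takes the form $U_\tau = Y^{T}QSQY/(2\phi^{2}) - tr(QS)/(2\phi)$, which is what $P_0=\phi^{-1}Q$ implies for a REML score; this explicit expression is our reading rather than a quotation. It uses $QX=0$, $Q^{2}=Q$, and $Var(Y)=\tau S+\phi I$:

$$\begin{aligned}
E\left(Y^{T}QSQY\right) &= tr\{QSQ(\tau S+\phi I)\} = \tau\,tr(QSQS) + \phi\,tr(QS),\\
E(U_\tau) &= \frac{\tau\,tr(QSQS) + \phi\,tr(QS)}{2\phi^{2}} - \frac{tr(QS)}{2\phi} = \frac{\tau\,tr(QSQS)}{2\phi^{2}}.
\end{aligned}$$

Because $tr(QSQS)=tr\{(QSQ)^{2}\}$ is the squared Frobenius norm of the symmetric matrix $QSQ$, it is positive unless $QSQ=0$, which for a positive semidefinite $S$ happens only when $QS=0$.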

In a situation where the VC τ represents the potential variability due to a source that is independently distributed in the population, such as the subject-specific effects in a longitudinal study, the score statistic *U*_{τ} given in equation (4) under *H*_{0}:τ=0 has an asymptotic normal distribution with zero mean and some variance when the number of independent clusters goes to infinity.^{21} However, this condition is not satisfied in our case. In model (1), the design matrix *H* for the random effects β is not block diagonal, and the dimension of β is fixed. Hence, Lin’s^{21} asymptotic result does not directly apply to *U*_{τ}.

Since φ is known, the second term in *U*_{τ} is a constant. Therefore, using the score statistic *U*_{τ} is equivalent to using the first term of *U*_{τ} (denoted by *T*_{τ}):

$$T_\tau = \tfrac{1}{2}\,Y^{T}P_0SP_0Y = \frac{Y^{T}QSQY}{2\phi^{2}}. \tag{5}$$

We show in appendix A that *T*_{τ} has the same distribution as the weighted χ^{2} random variable Σ_{i=1}^{c}λ_{i}χ^{2}_{1,i}, where the χ^{2}_{1,i} are independent χ^{2} random variables with 1 df and the λ_{i} are the ordered nonzero eigenvalues of the positive semidefinite matrix *QSQ*/(2φ), with λ_{1}≥λ_{2}≥⋯≥λ_{c}>0 (*c*≤*L*). If the (1-α)th quantile of this weighted χ^{2} distribution is denoted by *T*_{(α)}, then a level α score test will reject *H*_{0} if *T*_{τ}≥*T*_{(α)}.
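A minimal numerical sketch of this known-dispersion case is given below, assuming NumPy is available. It takes the statistic to be the quadratic form *Y*^{T}*QSQY*/(2φ^{2}) and the weights to be the nonzero eigenvalues of *QSQ*/(2φ); the function names, the toy haplotypes, and the Monte Carlo tail estimate are illustrative assumptions, not the authors' R implementation.

```python
import numpy as np

def vc_score_known_phi(y, X, S, phi):
    """Return T_tau and the nonzero eigenvalues of QSQ/(2*phi)."""
    n = len(y)
    # Q = I - X (X'X)^{-1} X' removes the fixed covariate effects.
    Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    resid = Q @ y
    T = float(resid @ S @ resid) / (2.0 * phi ** 2)
    lam = np.linalg.eigvalsh(Q @ S @ Q / (2.0 * phi))
    # Keep eigenvalues that are numerically nonzero, largest first.
    lam = np.sort(lam[lam > 1e-8 * max(lam.max(), 1.0)])[::-1]
    return T, lam

def simulated_pvalue(T, lam, n_draws=100_000, seed=0):
    """Monte Carlo tail probability of sum_i lam_i * chi2_1 at T."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws >= T))

# Toy data (all values invented for illustration).
rng = np.random.default_rng(1)
haps = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
# s(h, k): proportion of matching alleles between haplotypes h and k.
R_beta = (haps[:, None, :] == haps[None, :, :]).mean(axis=2)
n = 60
H = np.zeros((n, len(haps)))
for i, pair in enumerate(rng.integers(0, len(haps), size=(n, 2))):
    for h in pair:
        H[i, h] += 1  # haplotype counts per individual
S = H @ R_beta @ H.T
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(n)  # null data, phi = 1

T_tau, lam = vc_score_known_phi(y, X, S, phi=1.0)
p_value = simulated_pvalue(T_tau, lam)
```

Simulating the weighted χ^{2} variable is adequate for moderate significance levels in a toy setting like this one, but it becomes costly when very small tail probabilities are needed.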

#### General traits with unknown dispersion parameter

Here, we present the VC-score test for the general case in which the traits may not be normally distributed and the dispersion parameter may or may not be known. As indicated by the derivation given in appendix B, our test statistic can be defined as

$$T_\tau = \tfrac{1}{2}\,(Y-\hat{\mu})^{T}\hat{\Delta}\hat{W}S\hat{W}\hat{\Delta}\,(Y-\hat{\mu}), \tag{6}$$

where μ̂=*g*^{-1}(*X*γ̂), Δ̂=*diag*{*g*′(μ̂_{i})}, γ̂ is the maximum-likelihood estimate of γ under *H*_{0}, and φ̂ is the REML type of estimate (such as the one that uses Pearson residuals) of φ under *H*_{0}. Matrix *W*=*diag*{*w*_{i}}, with *w*_{i}={φ*m*_{i}^{-1}*v*(μ_{i})[*g*′(μ_{i})]^{2}}^{-1}. These quantities are readily available by fitting a standard generalized linear model, *g*(μ)=*X*γ. We derive in appendix B that *T*_{τ} also follows approximately the weighted χ^{2} distribution Σ_{i=1}^{c}λ_{i}χ^{2}_{1,i}, where λ_{1}≥λ_{2}≥⋯≥λ_{c}>0 (*c*≤*L*) are the nonzero eigenvalues of matrix *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2. We note that the conclusions given in the previous section are a special case of the results given here. For normally distributed traits, Δ=*I*, and *W*=*V*^{-1}, which equals φ^{-1}*I* under *H*_{0}. Hence, equation (6) reduces to equation (5), and the matrix *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2 reduces to *QSQ*/(2φ).
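To see how the general statistic specializes for a binary trait, the following worked example may help. It assumes the statistic has the form $T_\tau=\tfrac{1}{2}(Y-\hat\mu)^{T}\hat\Delta\hat W S\hat W\hat\Delta(Y-\hat\mu)$ with the usual GLMM working weights $w_i=\{\phi m_i^{-1}v(\mu_i)[g'(\mu_i)]^{2}\}^{-1}$; both expressions are stated here as assumptions. For a logistic model, $\phi=1$, $m_i=1$, $v(\mu)=\mu(1-\mu)$, and $g'(\mu)=1/\{\mu(1-\mu)\}$, so that

$$w_i = \mu_i(1-\mu_i), \qquad \Delta W = I, \qquad T_\tau = \tfrac{1}{2}\,(Y-\hat{\mu})^{T}S\,(Y-\hat{\mu}).$$

Under these assumptions, the statistic is a similarity-weighted sum of products of the residuals from an ordinary logistic regression of the trait on the covariates alone.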

#### Gamma approximation of the distribution of test statistic *T*_{τ}

Given the fact that *T*_{τ} follows a weighted χ^{2} distribution, one can obtain the significance threshold at level α from simulation. However, such a task may not be trivial when α is small. As an alternative, we introduce a Gamma approximation of the distribution of *T*_{τ}. Empirical evidence indicates that the eigenvalues λ_{1},λ_{2},…,λ_{c} of the matrix *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2 are dominated by the first few and decay rapidly to 0 (fig. 1). Following the work of Zhang and Lin,^{22} we use the Satterthwaite method to approximate the null distribution of *T*_{τ} by a Gamma distribution with shape and scale parameters (*a*,*b*). Let μ_{T} and σ^{2}_{T} denote the mean and variance of *T*_{τ}, respectively. We match the mean and the variance of the Gamma distribution to those of the test statistic by setting *ab*=μ_{T} and *ab*^{2}=σ^{2}_{T}, which gives *a*=μ_{T}^{2}/σ^{2}_{T} and *b*=σ^{2}_{T}/μ_{T}. We can then calculate the *P* value of the test statistic from the Gamma(*a*,*b*) distribution. Expressions for the mean and variance of *T*_{τ} are derived in appendixes A and B.
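The moment matching can be sketched in a few lines of standard-library Python. This sketch assumes that the mean and variance are taken directly from the weighted χ^{2} representation with the eigenvalues treated as fixed (mean Σλ_{i}, variance 2Σλ_{i}^{2}); the expressions in the appendixes may differ, for example when the dispersion is estimated. The function names are hypothetical, and the tail probability uses a textbook series and continued-fraction evaluation of the incomplete gamma function.

```python
import math

def gamma_params(lam):
    """Moment-match Gamma(shape a, scale b) to sum_i lam_i * chi2_1."""
    mean = sum(lam)                      # E[chi2_1] = 1
    var = 2.0 * sum(l * l for l in lam)  # Var[chi2_1] = 2
    return mean * mean / var, var / mean

def gamma_sf(x, a, b):
    """P(G > x) for G ~ Gamma(shape a, scale b)."""
    if x <= 0:
        return 1.0
    z = x / b
    log_pref = a * math.log(z) - z - math.lgamma(a)
    if z < a + 1.0:
        # Series for the lower tail, then take the complement.
        term = total = 1.0 / a
        k = a
        for _ in range(10000):
            k += 1.0
            term *= z / k
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_pref)
    # Modified Lentz continued fraction for the upper tail.
    tiny = 1e-300
    bb = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / bb
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        bb += 2.0
        d = an * d + bb
        if abs(d) < tiny:
            d = tiny
        c = bb + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_pref)

# Hypothetical eigenvalues dominated by the first one, and a statistic.
lam = [3.0, 1.0, 0.5]
a, b = gamma_params(lam)
p_value = gamma_sf(12.0, a, b)
```

If SciPy is available, `scipy.stats.gamma.sf(t, a, scale=b)` would be the more usual way to obtain the same tail probability.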

#### Phased haplotype data versus unphased genotype data

Although we have described our test assuming that the haplotype information *H* is observed, the phase information may not be crucial. From equations (5) and (6), we see that the haplotype information appears in *T*_{τ} only through *S*=*HR*_{β}*H*^{T}, whose (*i*,*j*) element, denoted by *S*_{ij}, can be rewritten as

$$S_{ij} = H_i^{T}R_\beta H_j = \sum_{h=1}^{L}\sum_{k=1}^{L} H_{ih}H_{jk}\,s(h,k),$$

where *H*_{ih} is the *h*th element of *H*_{i}. The right-hand side of the equation states that *S*_{ij} is simply the similarity score between the haplotype pair of person *i* and that of person *j* measured by metric *s*(·,·). As a result, by choosing those metrics that do not require phase information, we can calculate *S* without resorting to the phased data. One choice is to set *s*(*h*,*k*) as the proportion of matching alleles between two haplotypes, *h* and *k*. As demonstrated by Tzeng et al.^{11} and Schaid,^{14} such *S*_{ij} is equivalent to the proportion of matching alleles between the genotypes of individual *i* and individual *j* and hence can be calculated directly from genotypes with unknown phase.
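The phase-free property can be checked numerically. The sketch below, with hypothetical function names and invented haplotypes, computes the similarity between two individuals once from their phased haplotype pairs and once from unphased allele counts alone, using the per-locus identity that the number of matching allele pairs equals *g*_{i}*g*_{j}+(2-*g*_{i})(2-*g*_{j}) for genotypes coded as 0, 1, or 2 copies of one allele.

```python
from itertools import product

def hap_similarity(h, k):
    """Proportion of matching alleles between two haplotypes (0/1 tuples)."""
    return sum(x == y for x, y in zip(h, k)) / len(h)

def s_from_haplotypes(pair_i, pair_j):
    """S_ij as the sum of s(h, k) over the 2 x 2 haplotype pairings."""
    return sum(hap_similarity(h, k) for h, k in product(pair_i, pair_j))

def s_from_genotypes(g_i, g_j):
    """The same quantity from unphased genotypes (allele counts 0/1/2)."""
    m = len(g_i)
    return sum(a * b + (2 - a) * (2 - b) for a, b in zip(g_i, g_j)) / m

def genotype(pair):
    """Collapse a phased haplotype pair into per-locus allele counts."""
    return tuple(x + y for x, y in zip(*pair))

# Two hypothetical individuals typed at five SNPs.
person_i = ((0, 1, 0, 1, 1), (1, 1, 0, 0, 0))
person_j = ((0, 0, 1, 1, 0), (1, 1, 0, 1, 0))

phased = s_from_haplotypes(person_i, person_j)
unphased = s_from_genotypes(genotype(person_i), genotype(person_j))
```

The two values agree for any phasing of the same genotypes, which is why the similarity matrix *S* can be built without haplotype inference under this metric.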

### Simulation Studies

We conducted simulation studies to examine the performance of the proposed score test. In the simulation, we generated covariates *X*_{i}, haplotypes *H*_{i}, and trait values *Y*_{i}, given *X*_{i} and *H*_{i}, for each individual. The covariate *X*_{i} was drawn from a standard normal distribution, and the haplotype *H*_{i} was generated using a technique similar to those reported by Roeder et al.^{23} and Tzeng et al.^{8} Specifically, we simulated 100 haplotypes under the coalescent model,^{24} with an effective population size of 10^{4}, a scaled mutation rate of 5.6×10^{-4} per bp, and a scaled recombination rate of ~6×10^{-3} per bp for the cold spots and a rate 45 times greater for the hotspots. These parameters were chosen to roughly match the genes observed in the SeattleSNP database. We discarded SNPs with minor-allele frequencies <0.05. The hypothetical disease locus was selected on the basis of a predetermined minor-allele frequency, *q*, and the diversity of haplotypes flanking the SNP. In the simulation, we considered *q*=0.1, 0.3, and 0.5 and haplotype-diversity levels of high (11–16 distinct haplotypes), moderate (9–11 distinct haplotypes), and low (6–9 distinct haplotypes). We set a haplotype region to be a segment of five adjacent SNPs, including the two SNPs on the left and the three SNPs on the right of the disease locus. Given that the disease SNP is excluded, we also considered whether the disease SNP is “tagged” or “not tagged” by the surrounding five SNPs under each scenario. We defined the disease SNP as “tagged” if there is at least one SNP whose *R*^{2} with the disease SNP is >0.7 and as “not tagged” otherwise. We then randomly sampled 2 haplotypes, with replacement, from the 100 haplotypes to form an individual. The simulated haplotype data were then converted into unphased genotype data.

We next generated the trait values *Y*_{i} on the basis of *X*_{i} and the genotypes at the disease locus. We determined the trait value of individual *i* according to *X*_{i} and the number of disease alleles (*G*_{i}), using an additive-effect model. In the simulation study, we considered both quantitative traits and binary traits and adopted the same trait-generating scheme as did Lake et al.^{25} and Tzeng et al.^{8} For quantitative traits, we used a random-sampling scheme and generated 200 trait values from the normal conditional distribution of *Y*_{i} with mean γ_{0}+γ_{1}×*X*_{i}+(*G*_{i}-1) and variance . We set the heritability (*h*^{2}) at 0.1 and γ_{0}=γ_{1}=1. For binary traits, we used a case-control sampling scheme and generated trait values of 0 or 1, using the penetrance function logit *P*(*Y*=1|*G*_{i},*X*_{i})=γ_{0}+γ_{1}×*X*_{i}+θ×*G*_{i}. We set the odds ratio (OR) (*e*^{θ}) at 2.0 and set the disease prevalence at 0.01 by letting γ_{0}=-4.5 and γ_{1}=0. We repeated the process until we collected 100 cases and 100 controls.
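The case-control sampling scheme for binary traits can be sketched as follows. As a simplifying assumption, the disease genotype *G*_{i} is drawn here as Binomial(2, *q*) rather than read from coalescent-simulated haplotypes, and the function name and record layout are hypothetical; the penetrance parameters are those stated above.

```python
import math
import random

def simulate_case_control(q, odds_ratio=2.0, gamma0=-4.5, gamma1=0.0,
                          n_cases=100, n_controls=100, seed=0):
    """Draw individuals until the requested numbers of cases and controls
    are collected. Each record is (y, x, g)."""
    rng = random.Random(seed)
    theta = math.log(odds_ratio)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        x = rng.gauss(0.0, 1.0)                      # covariate X_i
        g = (rng.random() < q) + (rng.random() < q)  # disease-allele count
        eta = gamma0 + gamma1 * x + theta * g
        penetrance = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < penetrance else 0
        group, cap = (cases, n_cases) if y == 1 else (controls, n_controls)
        if len(group) < cap:                         # discard surplus draws
            group.append((y, x, g))
    return cases, controls

cases, controls = simulate_case_control(q=0.3)
```

Because the baseline penetrance is low, most simulated individuals are unaffected, so the loop typically discards many surplus controls before the case quota is met.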

We analyzed these simulated data to evaluate the power performance of the VC-score method. To compare, we also conducted haplotype analyses, using the fixed-effect method and, in addition, the VC method via regular LRT (VC-LRT) under some scenarios. These analyses were performed assuming unknown phases. For fixed-effect analysis, we used the haplotype-score test of Schaid et al.,^{26} as implemented in the R function “haplo.score,” and determined the *P* values by using the asymptotic χ^{2} distribution. The *P* values of our VC-score method were obtained from the approximated Gamma distribution, and the *P* values of the VC-LRT method were obtained from 50:50 mixtures of χ^{2}_{0} and χ^{2}_{1}.
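For reference, a *P* value under the 50:50 mixture of χ^{2}_{0} and χ^{2}_{1} is half the χ^{2}_{1} tail probability whenever the LRT statistic is positive, because χ^{2}_{0} is a point mass at zero. A minimal sketch with hypothetical function names, using only the standard library:

```python
import math

def chi2_1_sf(x):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_pvalue(lrt):
    """P value under the 50:50 mixture of chi2_0 and chi2_1."""
    if lrt <= 0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt)

def mixture_threshold(alpha, hi=100.0):
    """Critical value by bisection, for alpha below 0.5."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_pvalue(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

threshold_05 = mixture_threshold(0.05)
```

At α=0.05, the mixture threshold is the upper 10% point of χ^{2}_{1}, which is smaller than the usual 5% point of 3.84.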

### Data Application

We considered the data set from the genetic association study of ALS conducted by Schymick et al.^{17} The ALS data set consists of 276 patients with sporadic ALS and 271 neurologically normal control subjects^{27} and contains their genotypes at the 550K SNPs across the genome in the Illumina chip assays. The original genotyping was performed in the laboratory of Drs. Singleton and Hardy at the National Institute on Aging. The genotype data have been made publicly available in the SNP Database at the National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics DNA and Cell Line Repository. Schymick et al.^{17} performed a genomewide association analysis and reported 34 SNPs that have *P* values <.0001 on the basis of the single-SNP genotypic test with 2 df. They used the Bonferroni correction to adjust for multiple testing, and the threshold of significance at the nominal level of 0.05 is 9.1×10^{-8}. Although none of the 34 SNPs was significant after the Bonferroni correction, the most significant SNP (*rs4363506*) lay in close proximity to one of the actin cytoskeleton genes (the dedicator of cytokinesis 1 gene [*DOCK1* {MIM 601403}]) that are increasingly recognized as playing an important role in motor neuron disease. To assess the performance of our method, we applied the proposed VC-score method to part of this data set. We focused our analysis on chromosome 10, where the most significant SNP is located. We used the results reported by Schymick et al.^{17} as a benchmark to evaluate our findings.
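The quoted threshold follows from dividing the nominal level by the number of tests. Taking the number of SNPs as roughly 550,000 (the exact count is an assumption here),

$$\frac{0.05}{550{,}000} \approx 9.1\times10^{-8}.$$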

## Results

### Simulation Studies

In the comparison of our VC-score method with the fixed-effect method, we reported the results for quantitative traits and binary traits under 18 scenarios (3 values of allele frequency × 3 levels of haplotype diversity × 2 different tagging statuses of the disease SNP). Type I error rates were calculated on the basis of 2,000 replications, and power was calculated on the basis of 1,000 replications.

We listed the type I error rates of the VC-score test in table 1 for quantitative traits and in table 2 for binary traits. The values are close to the nominal levels of α=0.05 and α=0.01, indicating that the Gamma distribution approximates the null distribution of *T*_{τ} adequately. To confirm this conclusion, we also examined the null distribution of the test statistics; the results are displayed in figure 2. The left panels of figure 2 show the quantile-quantile plots (hereafter, “QQ-plots”) that compare the quantiles of the standardized *T*_{τ} under the null hypothesis with the quantiles of the standard normal distribution. The upper panels are for quantitative traits, and the lower panels are for binary traits. In both cases, it is apparent that the standardized *T*_{τ} does not follow the standard normal distribution that Lin’s^{21} asymptotic result would give. We then drew the QQ-plots of *T*_{τ} against the Gamma distribution (fig. 2, right panels). In the simulation, each replication *i* generated a *T*_{τ,i} that follows approximately a Gamma distribution with its own shape parameter, *a*_{i}, and its own scale parameter, *b*_{i}. To create a QQ-plot against these nonidentical Gamma variables, we first created a scaled *T*_{τ,i}, defined as *T*_{τ,i}/*b*_{i}, that follows Gamma(*a*_{i},1). Then, we used a single shape parameter to obtain the theoretical quantiles. Although use of a single shape parameter can cause some deviation in the QQ-plot (such as what can be observed in the right section of the graph), overall we see that the data points agree with the 45° line, indicating that the Gamma approximation works reasonably well.

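Figure 2. QQ-plots for the null distribution of *T*_{τ}. The left panels show the standardized *T*_{τ} against the standard normal distribution; the right panels show the scaled *T*_{τ,i}, defined as *T*_{τ,i}/*b*_{i}, against the Gamma distribution, where *b*_{i} is the scale parameter of the Gamma approximation. Upper panels, quantitative traits; lower panels, binary traits.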

The results of the power comparison are displayed in tables 3 and 4 for quantitative traits and in tables 5 and 6 for binary traits. We highlighted those cases in which the power gain is significant at the 0.05 level by use of McNemar’s test. We found that the correlation between the disease SNP and its nearby SNPs plays a key role in predicting the performance of the VC-score test compared with the fixed-effect test. When the disease SNP was tagged by at least one of the surrounding SNPs, we observed a systematic power improvement of the VC-score method over the fixed-effect method. This is consistent for both trait types across all scenarios (tables 3 and 5). If none of the adjacent SNPs was highly correlated (i.e., *R*^{2}>0.7) with the unobserved disease SNP, we saw a power drop compared with that seen for the tagged SNPs. In these cases, the fixed-effect method tends to retain a higher power than that of the VC-score method, although the pattern is not universal (tables 4 and 6).
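Because both methods are applied to the same simulated replicates, their rejection decisions are paired, which is the setting McNemar’s test addresses. The sketch below uses the asymptotic form without continuity correction; the text does not state which variant was used, so this choice, the function names, and the example counts are assumptions.

```python
import math

def discordant_counts(reject_1, reject_2):
    """Count replicates in which exactly one of the two methods rejects."""
    b = sum(r1 and not r2 for r1, r2 in zip(reject_1, reject_2))
    c = sum(r2 and not r1 for r1, r2 in zip(reject_1, reject_2))
    return b, c

def mcnemar(b, c):
    """Asymptotic McNemar test from the two discordant counts.

    Returns the chi-square statistic (1 df) and its P value."""
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: 30 replicates where only method 1 rejects,
# 10 where only method 2 rejects.
stat, p_value = mcnemar(30, 10)
```

Only the discordant replicates enter the statistic; replicates in which both methods agree carry no information about which one is more powerful.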

We also examined the performance of the VC-score test by varying the strength of genetic effects. We set the heritability *h*^{2} at 0.00, 0.05, and 0.10 for quantitative traits and set the OR at 1.0, 2.0, and 2.5 for binary traits. This simulation considered the scenario of high haplotype diversity, a tagged disease SNP, and an allele frequency of 0.1 and used a sample size of 200 individuals generated from random sampling for both trait types. The power is calculated on the basis of 500 replications, and the type I error rate (i.e., for *h*^{2}=0 and *OR*=1) is calculated on the basis of 1,000 replications. As a quick verification, we see from table 7 that the power of the VC-score method increases as the genetic effect becomes stronger.

As a comparison, we also conducted the fixed-effect analysis and the VC analysis with the regular LRT. We were unable to obtain the result of the VC-LRT for binary traits because of computational limitations, since one has to calculate a *c*-dimensional (*c* = the number of the nonzero eigenvalues) numerical integration to obtain the LRT statistics. Table 7 shows that, as expected, the VC-LRT produces the lowest power among the three methods. The analysis of type I error rate helps to explain the low power of the VC-LRT; the size determined from the regular 50:50 mixture of χ^{2}_{0} and χ^{2}_{1} is extremely small, and the overconservative threshold obtained from the χ^{2} mixture leads to a loss of power. When compared with the fixed-effect method, we noticed that the power loss of the VC-LRT method in our simulation is more substantial than seen in the results of Schaid.^{14} We think this is probably because Schaid simulated data from a VC model, which would favor the performance of a VC approach.

### Analysis of the ALS Data Set

Using unphased genotype data from the ALS study, we replicated the single-SNP genotypic test of Schymick et al.^{17} and performed two haplotype association analyses, one with the fixed-effect method and the other with the proposed VC-score method. There are a total of 28,818 SNPs genotyped on chromosome 10, and we removed the 26,258th SNP because of its ambiguous marker information. Following the same haplotype definition as that in the work of Schymick et al.,^{17} we also defined haplotypes by using a sliding window of three SNPs in the haplotype analyses.
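A sliding-window definition of haplotypes can be sketched as below, assuming that windows advance one SNP at a time (the step size is not stated explicitly); the function name and SNP labels are hypothetical.

```python
def sliding_windows(snp_ids, width=3):
    """Return every run of `width` consecutive SNPs, advancing by one SNP."""
    return [tuple(snp_ids[i:i + width])
            for i in range(len(snp_ids) - width + 1)]

# Toy example with six hypothetical SNP labels.
windows = sliding_windows(["snp1", "snp2", "snp3", "snp4", "snp5", "snp6"])
first_window = windows[0]
n_windows = len(windows)
```

Under this assumption, a chromosome with *M* retained SNPs yields *M*-2 three-SNP windows, each of which is tested separately.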

The results from the VC-score method showed that the most significant association signal, with a *P* value of 1.2×10^{-7}, is near *rs4363506.* The *P* values are presented in figure 3, with the location of *rs4363506* indicated by the arrows. The two adjacent windows around the most significant signal also have similar *P* values: 1.3×10^{-7} and 2.7×10^{-7}. These locations agree with the findings of Schymick et al.,^{17} who reported that *rs4363506* has a *P* value of 6.8×10^{-7} for the genotypic test and a *P* value of 4.8×10^{-6} for the three-marker haplotype test. Although our *P* values appeared smaller, they were not significantly different from the results of the single-SNP analyses and were not significant after Bonferroni correction with the threshold of 9.1×10^{-8}. We also compared the VC-score results with those of the fixed-effect method (fig. 3*B*). The fixed-effect method also indicated a peak signal around *rs4363506,* with the peak *P* value (4.9×10^{-6}) slightly larger than those of the VC-score test and the single-SNP test. In general, we observed that analyses at the haplotype level reduced the noisy association signals of single SNPs. Although the clearer association pattern of haplotypic analyses came with the cost of extra degrees of freedom used in multimarker variations, we see that the VC-score haplotype test achieved a level of significance that is comparable to that of single-SNP analyses.

Figure 3. *P* values from the ALS data analysis around the most promising SNP reported by Schymick et al.^{17} (i.e., SNP *rs4363506*, with location indicated by the arrows). The *P* values are presented on the scale of negative logarithm of base 10. *A*, *P* values of the ...

To ensure that the results are not sensitive to the definition of the haplotypes, we repeated our analysis with window sizes of three, four, and five SNPs. The *P* values for a window size of four SNPs for the fixed-effect and VC-score methods are shown in figure 3*C*, in which the *P* value curves retain a pattern similar to that of the *P* value curves in the three-SNP analysis (fig. 3*B*). Indeed, the *P* value curves of various window sizes for the same method are similar (data not shown), except that larger window sizes led to a stronger smoothing effect on the *P* values across SNPs. With window sizes of four and five SNPs, the *P* values at the windows containing *rs4363506* are 8.3×10^{-8} and 7.7×10^{-8}, respectively; both are significant at the threshold determined by the Bonferroni correction. Judging by the criteria of locating the signal around *rs4363506* and producing reasonable *P* values, we think that the VC-score method performed competitively with the standard analyses of these data.

## Discussion

The VC approaches have been considered as one common strategy to reduce the degrees of freedom required in haplotype association analyses. However, it is noticed that direct application of the VC-LRT to haplotype association often fails to increase power. In this article, we reported possible reasons that contribute to this phenomenon, which include that (a) the LRT statistic has a nontypical limiting distribution in a haplotype random-effect model and (b) none of the surrounding SNPs is highly correlated with the unobserved disease SNP. Although the latter reason naturally limits the performance of the type of VC approaches discussed in this article, the former can be overcome. In essence, our work tackles this limiting-distribution problem; we introduced a VC-score test based on the REML or the marginal likelihood function of the VC under the GLMM. We showed that the test statistic follows a weighted χ^{2} distribution and provided a Gamma approximation. We demonstrated the validity of the proposed method through simulation. Constructed under the GLMM framework, our VC method can be applied to a broad class of data, allowing for traits of various types, different choices of correlation structure, and a flexible range of model assumptions. Finally, by choosing suitable similarity metrics, the proposed method can be directly applied to unphased genotypic data.

We note that the LRT statistic in the VC haplotype model does not converge to the distribution derived from the typical asymptotic theory that assumes independent clusters. As a result, use of the conventional limiting distribution could lead to an overconservative testing result. Crainiceanu and Ruppert^{16} reported similar findings and provided a practical procedure to find the distribution of the LRT statistic for continuous *Y* variables. However, we still recommend the score test over the LRT, for several reasons. First, the correct LRT procedure of Crainiceanu and Ruppert^{16} is applicable only to continuous traits. Second, the LRT is generally a more difficult test to implement under the GLMM framework. For example, with binary traits, it is almost impractical to obtain the exact maximum-likelihood estimates of the VC and, hence, the LRT statistic. In contrast, our score test is applicable to a wide-ranging class of models, and it requires only the estimates under *H*_{0}, which can be easily obtained from standard statistical software. The broad coverage and the fast and easy implementation of the score test make the VC strategy an effective tool for haplotype analysis, even in modern genomewide association studies. We have implemented the VC-score test in R and have distributed the R code at our authors’ Web site.

An additional factor that would influence the power performance is the LD pattern between the observed SNPs and the unobserved disease SNP. Our power analysis showed that, if at least one marker is in high LD with the disease SNP (i.e., *R*^{2}>0.7), the VC-score method performs better than the fixed-effect method. If all markers are in low LD with the disease SNP, both methods suffer from power loss. This is understandable because low correlation indicates lack of information from neighboring SNPs to make the correct inference. However, we noticed that the VC-score test appears to have a larger drop in power. The VC method is developed from an evolutionary point of view, which implicitly assumes high correlation among adjacent SNPs with the disease locus and uses the correlation to reduce the degrees of freedom. On the other hand, the fixed-effect approach does not rely on this assumption. We conjecture that this is the explanation for the observation of different degrees of power loss under the “untagged” scenarios, as well as for the observation of the power gain of the VC method under the “tagged” scenarios. We plan to continue exploring this aspect of the problem.

In this work, we suggested the use of similarity metrics that do not require phase information. One candidate is the metric that counts the number of matching alleles between two haplotypes. Theoretically, metrics that make use of the phase information should be more powerful. The phase-dependent counterpart of our candidate is the metric that measures the length of the longest run of consecutive matching alleles. Metrics of this type are more likely to capture identity-by-descent sharing and to better reflect the decay of haplotype sharing. However, these metrics are not robust to genotyping errors and recent marker mutations, which often limits their power in practice. Our previous work found that phase-dependent and phase-independent metrics have similar performance.^{11} In addition, the use of phase-independent metrics bypasses the need to infer the phase information, which is often achieved under unrealistic assumptions. For example, the expectation-maximization algorithm is typically used to impute haplotype phases, and it assumes that the population has common haplotype frequencies and is in Hardy-Weinberg equilibrium (HWE). These assumptions tend not to hold in the presence of population substructure; the haplotype frequencies could vary across subpopulations, and HWE may hold only within the subpopulations but not for the entire population. Consequently, the phases are not inferred accurately, and the subsequent association inference suffers as a result. Use of phase-independent measures does not rely on imputing the haplotype information and, therefore, avoids these issues naturally.
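The two kinds of metric contrasted above can be sketched as follows. This is only an illustration under stated assumptions: haplotypes are encoded as equal-length strings of allele codes, and the function names are hypothetical rather than taken from the distributed R code.

```python
def matching_alleles(h, k):
    """Count the marker positions at which haplotypes h and k carry the same allele."""
    if len(h) != len(k):
        raise ValueError("haplotypes must span the same markers")
    return sum(a == b for a, b in zip(h, k))


def longest_matching_run(h, k):
    """Length of the longest stretch of consecutive matching alleles."""
    if len(h) != len(k):
        raise ValueError("haplotypes must span the same markers")
    best = run = 0
    for a, b in zip(h, k):
        run = run + 1 if a == b else 0  # a mismatch resets the current run
        best = max(best, run)
    return best


# Two hypothetical five-marker haplotypes.
h, k = "01101", "01011"
count = matching_alleles(h, k)
run_length = longest_matching_run(h, k)
```

A single mismatch (for example, from a genotyping error) lowers the count by one but can cut the longest run roughly in half, which is one way to see why run-based metrics are less robust.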

Although we have introduced a Gamma approximation of the distribution of the score test for the sake of simplicity, we note that there exist alternative approaches to estimate the distribution. One approach is to use a three-moment approximation,^{28}^{,}^{29} as opposed to the two-moment matching method described in this article. The three-moment method uses the information of the nonzero eigenvalues λ_{1},…,λ_{c} of the matrix *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2 and approximates the α-level significance threshold as *k*_{1}+(χ_{α}-*h*^{′})(*k*_{2}/*h*^{′})^{1/2}, where *k*_{j}=Σ^{c}_{i=1}λ^{j}_{i}, *h*^{′}=*k*^{3}_{2}/*k*^{2}_{3}, and χ_{α} is the αth quantile of χ^{2}_{h′} (i.e., a χ^{2} distribution with *h*^{′} df). The *P* value of the test statistic *T*_{τ} is then the right-tail probability of the χ^{2}_{h′} distribution beyond (*T*_{τ}-*k*_{1})(*h*^{′}/*k*_{2})^{1/2}+*h*^{′}. Another approach is to obtain directly the empirical distribution of *T*_{τ} through simulation based on the fact that *T*_{τ}~Σ^{c}_{i=1}λ_{i}χ^{2}_{1,i}. To do so, find eigenvalues λ_{1},…,λ_{c}, and generate *c* sets of random values from the χ^{2}_{1} distribution, each set with a certain sample size (say, 500,000). Then, the weighted sums of these random values form the empirical distribution of the VC-score statistics. The simplicity of the simulation carries over to genomewide association studies, in which case the sample size of the simulated χ^{2}_{1} values has to be reasonably large with respect to the stringent genomewide significance level. Fortunately, these simulated χ^{2}_{1} random values can be used repeatedly in every haplotype region, and all that needs to be recalculated is the nonzero eigenvalues of matrix *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2 obtained in each region.
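The simulation approach described above can be sketched in a few lines. This is a minimal illustration, assuming the eigenvalues for each region have already been computed; the function names, the eigenvalues, and the number of draws are hypothetical. The helper `three_moment_transform` uses the usual three-moment χ^{2} approximation for a weighted sum of χ^{2}_{1} variables, which we take to be the form intended here; treat it as an assumption.

```python
import random


def draw_chi2_1(n_sets, n_draws, seed=0):
    """Generate n_sets lists of chi-square(1) draws (squared standard normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) ** 2 for _ in range(n_draws)]
            for _ in range(n_sets)]


def empirical_p_value(t_obs, lambdas, draws):
    """Right-tail P value of t_obs under the simulated sum_j lambda_j * chi2_1."""
    n_draws = len(draws[0])
    exceed = 0
    for r in range(n_draws):
        total = sum(lam * draws[j][r] for j, lam in enumerate(lambdas))
        if total >= t_obs:
            exceed += 1
    return exceed / n_draws


def three_moment_transform(t_obs, lambdas):
    """Map t_obs to the chi-square scale; returns (transformed value, df h')."""
    k1 = sum(lambdas)
    k2 = sum(lam ** 2 for lam in lambdas)
    k3 = sum(lam ** 3 for lam in lambdas)
    h = k2 ** 3 / k3 ** 2
    return (t_obs - k1) * (h / k2) ** 0.5 + h, h


# The same simulated draws are reused for every region; only the
# eigenvalues (hypothetical here) change from region to region.
draws = draw_chi2_1(n_sets=3, n_draws=20000)
p_region_1 = empirical_p_value(6.0, [1.2, 0.5, 0.1], draws)
p_region_2 = empirical_p_value(6.0, [0.9, 0.4], draws)
```

For genomewide thresholds, the number of draws would need to be far larger than in this sketch, as noted in the text. The transformed value from `three_moment_transform` is meant to be passed to a χ^{2} tail function with `h` degrees of freedom from a statistics library.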

The VC-score method introduced here focuses on detecting the haplotype main effect, but the framework can be extended to consider interactions. For example, one can incorporate terms for gene-environment and gene-gene interactions in the model and can examine their significance by testing the corresponding VC. We are exploring these ideas in an ongoing work. We also note that the VC-score method has a direct connection to the strategy of reducing the degrees of freedom in haplotype association tests through haplotype sharing. The haplotype information appears in the formula of the score statistic *T*_{τ} as *S*=*HR*_{β}*H*^{T}, and the elements of *S* record the level of haplotype similarity between two subjects by a certain haplotype metric, *s*(*h*,*k*). On the one hand, in the VC methods, the selection of *R*_{β} (and hence the similarity metric) largely emphasizes the evolutionary relationship of haplotypes in history. On the other hand, in the haplotype-sharing approaches, we choose the similarity metric to quantify the similarity level in the current population. We believe that taking advantage of the implicit relationship between the two methods can offer more insights into both strategies. We plan to further investigate this issue.
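To make the role of *S*=*HR*_{β}*H*^{T} concrete, the sketch below builds a small *R*_{β} from a haplotype similarity metric and forms the subject-by-subject matrix *S*. It rests on assumptions that are not stated in this section: each row of *H* is taken to count the copies of each distinct haplotype carried by a subject, and *s*(*h*,*k*) is taken to be the proportion of matching alleles. All names and data are hypothetical.

```python
def similarity_matrix(haplotypes):
    """R[h][k] = proportion of matching alleles between haplotypes h and k."""
    m = len(haplotypes[0])
    return [[sum(a == b for a, b in zip(h, k)) / m for k in haplotypes]
            for h in haplotypes]


def subject_similarity(H, R):
    """Compute S = H R H^T with plain nested lists."""
    n, n_hap = len(H), len(R)
    HR = [[sum(H[i][a] * R[a][b] for a in range(n_hap)) for b in range(n_hap)]
          for i in range(n)]
    return [[sum(HR[i][b] * H[j][b] for b in range(n_hap)) for j in range(n)]
            for i in range(n)]


# Three distinct two-marker haplotypes and three subjects.
haplotypes = ["00", "01", "11"]
H = [
    [2, 0, 0],  # subject 1 carries two copies of "00"
    [0, 1, 1],  # subject 2 carries "01" and "11"
    [1, 0, 1],  # subject 3 carries "00" and "11"
]
R = similarity_matrix(haplotypes)
S = subject_similarity(H, R)
```

Each entry of `S` sums the pairwise haplotype similarities between two subjects, so a different choice of metric changes `R` and hence `S` without changing the rest of the test.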

## Acknowledgments

We are grateful to the NINDS Human Genetics DNA and Cell Line Repository at Coriell and the laboratory, for supplying the data of the ALS study used in this work. We also thank the reviewers, for their constructive feedback, and Dr. Andrew Allen, Dr. Silviu-Alin Bacanu, and Dr. Patrick Sullivan, for their helpful discussion and suggestions. J.-Y.T. was supported by National Science Foundation grant DMS-0504726 and National Institutes of Health (NIH) grant R01 MH074027. D.Z. was supported by NIH grant R01 CA85848-04.

## Appendix A: Distribution and Moments of the VC-Score Statistic for Normal Traits

From equation (5), we have for normal traits. Define vector , which follows a standard multivariate normal distribution. We then can rewrite *T*_{τ} as

where *Z*^{2}_{i}~χ^{2}_{1} distribution. Equation (A1) holds as μ^{T}*Q*=0 by the fact of *Q,* a projection matrix.

When φ is unknown, we replace φ in our test statistic *T*_{τ} with its REML estimate under *H*_{0}, which yields the regular mean squared error for the linear regression model *Y*=*X*γ+ε:

Since, under *H*_{0}:τ=0, traits **Y** reduce to independent data, the variance of φ̂ will be negligible when the total sample size, *n,* is large. Therefore, the exact distribution of *T*_{τ} with φ replaced by φ̂ can be approximated by the distribution of , where are the eigenvalues of , if such "exact" distribution is critically needed.

We can approximate the distribution of *T*_{τ} by a Gamma distribution by matching the first two moments. Following the work of Harville,^{20} the mean, , and variance, , of *T*_{τ} are

and

To take into account the fact that φ is estimated, we obtained the following result by Taylor expansion:

where *I*_{τφ}=*E*(∂*T*_{τ}/∂φ), *I*_{φφ}=*E*[∂^{2}ℓ_{REML}(τ,φ)/∂φ^{2}], and *U*_{φ}=∂ℓ_{REML}(τ,φ)/∂φ. Hence, we can estimate the mean of *T*_{τ} by

and can estimate the variance of *T*_{τ} by

where

and

## Appendix B: Derivation of the VC-Score Test for General Traits

Motivated by the nice properties of the REML or the marginal log-likelihood function of the VC for a normal trait, presented in equation (3), and the success of the score test for testing a polynomial covariate effect in semiparametric additive mixed models by use of mixed-model representation,^{22} we take a similar approach for a nonnormal trait, *Y,* such as a binary disease trait. Under model specification (1), the marginal log-likelihood function ℓ_{M}(τ,φ;*Y*) of τ and the possible dispersion parameter φ is given by

The marginal log-likelihood in (B1) usually involves a high-dimensional and mathematically intractable integration. To overcome this problem, Zhang and Lin^{22} used Laplace approximation to derive an approximate score statistic for testing *H*_{0}:τ=0 on the basis of a similar marginal log-likelihood function. For our model, their score statistic (eq. [16] in the work of Zhang and Lin^{22}) reduces to

where μ=*g*^{-1}(*X*γ) is the mean of **Y** under *H*_{0}:τ=0; Δ=*diag*{*g*^{′}(μ_{i})}; *W*=*diag*{*w*_{i}}, with *w*^{-1}_{i}=*m*^{-1}_{i}*v*(μ_{i}){*g*^{′}(μ_{i})}^{2}; *P*_{0}=*W*-*WX*(*X*^{T}*WX*)^{-1}*X*^{T}*W*; γ̂ is the maximum-likelihood estimate of γ under *H*_{0}; and φ̂ is the REML type of estimate of φ under *H*_{0}.

Under *H*_{0}:τ=0, γ̂ and φ̂ are consistent estimates of γ and φ, respectively. Hence, there is not much variability in the second term (relative to the first term) of *U*_{τ} when sample size *n* is large. Therefore, we can again use the first term for testing *H*_{0}:τ=0:

To derive the distribution of *T*_{τ}, first we show, through Taylor expansion, that, under *H*_{0}:τ=0,

and

Then, we have

As a result,

where the *i*th element of is defined by and where μ_{i} and *Var*(*Y*_{i}) are the true mean and variance, respectively, of *Y*_{i} under *H*_{0}. The result indicates that *T*_{τ} has approximately the same distribution as

Again, denote by λ_{1}≥λ_{2}≥…≥λ_{c}>0 (*c*≤*L*) the eigenvalues of *W*^{-1/2}*P*_{0}*SP*_{0}*W*^{-1/2}/2, and denote by *u*_{1},*u*_{2},…,*u*_{c} the corresponding orthonormal eigenvectors. Then, under *H*_{0}, *T*_{τ} will have approximately the same distribution as Σ^{c}_{j=1}λ_{j}*Z*^{2}_{j}, where *Z*_{j} is the inner product of *u*_{j} with the standardized vector defined above. Under the condition that each *u*_{j} is not dominated by a few elements, *Z*_{1},*Z*_{2},…,*Z*_{c} will be approximately independent standard normal random variables. So, *T*_{τ} has approximately the same distribution as that of Σ^{c}_{j=1}λ_{j}χ^{2}_{1,j}, which is similar to the case of normal traits.

We can approximate the distribution of *T*_{τ} by a Gamma distribution by matching the first two moments, as was done in appendix A. The approximate mean of *T*_{τ} is given by =*tr*(*P*_{0}*S*)/2, and the approximate variance by , where , , and .
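The two-moment matching step can be illustrated with a short sketch. Given a mean and a variance, a Gamma distribution with shape mean²/variance and scale variance/mean has the same first two moments. The helper `weighted_chi2_moments` uses the standard moments of a weighted sum of independent χ^{2}_{1} variables and deliberately ignores the correction for estimating the dispersion parameter, so it is a simplification of the variance formula referred to above. All names and eigenvalues are hypothetical.

```python
def weighted_chi2_moments(lambdas):
    """Mean and variance of sum_j lambda_j * chi2_1 with independent terms."""
    mean = sum(lambdas)
    variance = 2.0 * sum(lam ** 2 for lam in lambdas)
    return mean, variance


def gamma_moment_match(mean, variance):
    """Shape and scale of the Gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale


# Hypothetical eigenvalues for one haplotype region.
lambdas = [1.2, 0.5, 0.1]
mean, variance = weighted_chi2_moments(lambdas)
shape, scale = gamma_moment_match(mean, variance)
```

As a sanity check on the matching, equal weights of 1 over three terms give shape 1.5 and scale 2, which is exactly a χ^{2} distribution with 3 df. The *P* value would then be the right tail of the fitted Gamma distribution at the observed *T*_{τ}, computed with a statistics library.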

## Web Resources

The URLs for data presented herein are as follows:

*DOCK1*)

## References

**American Society of Human Genetics**

Haplotype-Based Association Analysis via Variance-Components Score Test. American Journal of Human Genetics. Nov 2007; 81(5):927.
