
# Association Mapping via Regularized Regression Analysis of Single-Nucleotide–Polymorphism Haplotypes in Variable-Sized Sliding Windows

## Abstract

Large-scale haplotype association analysis, especially at the whole-genome level, is still a very challenging task without an optimal solution. In this study, we propose a new approach for haplotype association analysis that is based on a variable-sized sliding-window framework and employs regularized regression analysis to tackle the problem of multiple degrees of freedom in the haplotype test. Our method can handle a large number of haplotypes in association analyses more efficiently and effectively than do currently available approaches. We implement a procedure in which the maximum size of a sliding window is determined by local haplotype diversity and sample size, an attractive feature for large-scale haplotype analyses, such as a whole-genome scan, in which linkage disequilibrium patterns are expected to vary widely. We compare the performance of our method with that of three other methods—a test based on a single-nucleotide polymorphism, a cladistic analysis of haplotypes, and variable-length Markov chains—with use of both simulated and experimental data. By analyzing data sets simulated under different disease models, we demonstrate that our method consistently outperforms the other three methods, especially when the region under study has high haplotype diversity. Built on the regression analysis framework, our method can incorporate other risk-factor information into haplotype-based association analysis, which is becoming an increasingly necessary step for studying common disorders to which both genetic and environmental risk factors contribute.

Rapid improvements in high-throughput genotyping technologies have greatly reduced the cost of genomewide analyses and are resulting in a boom of large-scale genetic association studies of common disorders. Involving either a group of candidate genes or the whole genome, these studies employ single SNP-based linkage disequilibrium (LD) mapping to systematically evaluate the role of common genetic variants in the risk of developing various complex disorders. By approaching comprehensive coverage of common genetic variants, these studies have a statistical power for detecting genetic risk factors with moderate effects that is much improved over that of previous studies.^{1} Meanwhile, the comprehensive coverage of common genetic variants has also greatly increased the number of polymorphisms that need to be tested within a study and thus poses a great challenge for statistical analysis. LD-based association analysis can be performed by analyzing either individual SNPs or multiple-SNP haplotypes. It is still debatable which of the two methods is more powerful for detecting common risk factors, and it is likely that one method will perform better than the other under certain disease models and certain LD patterns.^{2–8} In practice, both single-SNP and multiple-SNP haplotype analyses are performed in genetic association studies.

Strategies for performing haplotype analyses are still the subject of active debate and research. One of the important issues is how many adjacent SNPs should be included simultaneously in a particular haplotype analysis. Early suggestions were to perform the haplotype analysis within regions of high LD, often referred to as “LD blocks,” where most of the genetic variation can be captured by a limited number of haplotypes.^{9} To undertake such an analysis, LD blocks need to be defined before haplotype association tests are performed within each predefined LD block. Although this approach is simple and offers an appealing concept, the definition of haplotype blocks can be problematic. Several different criteria have been proposed,^{9–12} but it is still unclear which one is the most suitable. Frequently, the boundaries of LD blocks are not obvious. In addition, performance of haplotype analysis within predetermined LD blocks fails to consider possible correlations among LD blocks. Furthermore, it is almost inevitable that LD block–based haplotype analysis will result in “orphan” SNPs that fall outside any predetermined LD blocks and are therefore excluded from haplotype analysis. In such instances, the full information on genetic variability within a region will not be used in the haplotype analysis. Hence, the use of LD blocks as the fundamental units of association testing may not be the most efficient strategy for haplotype analyses.^{13}

Another strategy for performing haplotype analyses is based on the sliding-window framework, in which several neighboring SNPs, together called a “window,” are included in a haplotype analysis, and such a window-based analysis is performed in a stepwise fashion across the region under study. Initial approaches to sliding window–based haplotype analyses employed windows of uniform size.^{14–16} However, the determination of the fixed window size in such methods can be cause for concern. In theory, the optimal window size should be the one that results in a haplotype or haplotypes that maintain the highest LD with the genetic risk variant or variants to be detected. The optimal window size, therefore, should be influenced by the underlying LD pattern. Use of a fixed window size becomes more problematic when haplotype analyses are performed over a large genomic region or over the whole genome, where LD patterns are surely variable across the region. Therefore, it is impossible to predefine a single optimal window size for a sliding-window analysis of large-scale data.

Alternatively, sliding window–based haplotype analyses can be performed without fixing the window size. In this implementation, a range of window sizes are considered in the haplotype analysis. By analyzing both simulated and experimental data, Lin et al.^{17} argued that an exhaustive search of all the possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient for detection of common or rare genetic-risk alleles. However, such an exhaustive search followed by a massive correction for multiple testing inevitably caused a loss of power. In addition, given a fixed number of samples, the number of haplotype tests that can be afforded should be limited; hence, it is more reasonable that the maximum window size be determined on the basis of the local LD pattern and the available sample size, rather than being as large as the size of a big region (e.g., a chromosome) in the genome scan. Recently, Browning used variable-length Markov chains for association mapping,^{18} attempting to adapt haplotype analyses to the local LD pattern. The adaptation is made by inferring the structure of the graph that represents the variable-length Markov chains, and each merging edge of the inferred graph represents a cluster of haplotypes that will be tested for association with the disease. The number of tests is decreased if the inferred graph is parsimonious. During its merging (clustering) process, Browning’s method considers all the haplotypes of all lengths for a fixed set of SNPs in a region and uses a modified merging algorithm of Ron et al.^{19} to ensure that low-frequency haplotypes are continually merged and that the inferred graph is parsimonious. 
As a result, when the region under study exhibits a complex LD pattern (e.g., because of a high recombination rate) and thus contains many unique haplotypes (each with low frequency), the merging process of Browning’s method will tend to group a large number of unique haplotypes into a small number of haplotype clusters. The inferred graph, therefore, will contain a limited number of merging edges for association tests. Consequently, each resulting merging edge is likely to have high haplotype diversity and to fail to capture the true association between particular haplotype(s) and underlying disease risk allele(s).

Besides the issue of the number of SNPs to be considered in haplotype analyses, another challenge is how to handle the large number of haplotypes in association tests. Haplotype analyses are generally performed in two distinct ways. One is to test each haplotype by performing a series of 1-df tests, followed by a correction for multiple testing, usually Bonferroni correction. The other way is to analyze the whole set of haplotypes by performing a single multiple-df global test. However, for both approaches, the power of detection is seriously weakened because of either the massive correction for a large number of tests or the many degrees of freedom. Several approaches have been proposed to tackle this problem. One commonly employed approach is to ignore rare haplotypes by grouping them into a single pseudohaplotype and hence to reduce the total number of haplotypes to be tested. For this strategy to be applied, a frequency threshold to define rare haplotypes needs to be specified in advance, which is sometimes tricky to do in reality. Moreover, when all the rare haplotypes are lumped together, the risk association with any rare haplotype(s) is likely to be missed, and even if it is not missed, it is impossible to interpret the positive association of this heterogeneous group of rare haplotypes with the disease. Seltman et al.^{20} suggested an alternative approach that involved performing a series of 1-df tests guided by the cladogram, followed by Bonferroni correction. However, the sequence of the tests suggested cannot be optimal in all cases. For example, in their 14 simulated models, multiple-df global tests took the lead in 7 models, whereas the sequential tests took the lead in 6 models. So, the improvement of the detection power offered by this approach is marginal. A third approach is to cluster haplotypes by their similarity. 
For example, Durrant et al.^{21} designed an allele frequency–based haplotype-similarity measure, used standard hierarchical clustering to group haplotypes, and accepted the haplotype partition with the smallest association *P* value. The haplotype clustering in such an approach is independent of disease status, which creates an opportunity to increase the detection power if disease status is used to guide the haplotype grouping. Another appealing approach, called the “penalized log-likelihood method,” is to force similar haplotypes to have similar estimated effects by imposing a penalty on similar haplotypes with different estimated coefficients.^{22} The objective function to minimize for estimating the coefficients is the sum of the squared error plus a weighted penalty. To apply this method, a suitable haplotype similarity measure has to be selected from multiple existing ones. Moreover, the slow estimation of coefficients makes it time consuming to apply the cross-validation method for determining the weight that makes a trade-off between the sum of squared error and the penalty.

In this article, we propose a new method for performing variable-sized sliding window–based haplotype analysis. First, at each testing position (i.e., the beginning position of a sliding window), we determined the maximum window size for the haplotype analysis on the basis of local haplotype diversity as well as sample size. Subsequently, a joint analysis of all the haplotypes of different lengths (up to the maximum window size) at the same beginning position was performed using a regularized regression method. Guided by the disease status, the regularized regression shrinks the effects of noninformative haplotypes to zero, and hence the effective degrees of freedom of the regularized regression model is greatly reduced. Since the joint analysis in our method not only takes account of the dependency among haplotypes but also makes effective use of their complementariness, it is more efficient in managing a large number of haplotypes and thus is more powerful in association detection than are approaches employing either a large number of single-haplotype–based tests or a conventional global test of all the haplotypes. We evaluated the performance of our method, in terms of the power to detect the presence of a genetic risk allele, by comparing it with the performance of a single SNP–based test, cladistic analysis of haplotypes^{21} in a fixed-sized sliding-window framework, and an association-mapping method based on variable-length Markov chains, which is in the haplotype-clustering framework and makes use of local LD pattern.^{18} We have demonstrated that our current method provides better performance than these three methods.

## Methods

For simplicity of exposition, we assume that the genetic association study is performed for a case-control analysis of phase-known haplotype data (see the “Discussion” section for how to generalize to phase-unknown genotype data). Consider *M* unrelated case and control chromosomes, typed for *L* SNPs in a region. Denote *X*_{ij} ∈ {1,2}, *i*=1,…,*M* and *j*=1,…,*L*, as the allele configuration at SNP *j* in chromosome *i*, and denote *y*_{i} ∈ {0,1}, *i*=1,…,*M*, as the disease status of chromosome *i*.

In the sliding-window framework, a window is a set of neighboring SNPs. A window *W*^{s}_{l} denotes the set of SNPs {*s*,*s*+1,…,*s*+*l*−1}. The haplotype in chromosome *i*, composed of SNPs in a window *W*^{s}_{l}, is denoted *X*_{isl}. The set of distinct haplotypes in a window *W*^{s}_{l} is defined as {*X*_{isl} | *i*=1,…,*M*}. A variable-sized window that begins with SNP *s*, denoted as Ω^{s}, is a collection of windows *W*^{s}_{l}, with *l* ranging from 2 to *k*_{s}, where *k*_{s} is the largest *k* such that |∪^{k}_{l=2}{*X*_{isl} | *i*=1,…,*M*}| ≤ *M*/2. In other words, the maximum window size in our variable-sized window is based on the local haplotype diversity and the available sample size, and it is defined in such a way that the number of distinct haplotypes in a variable-sized window is, at most, half the number of observed chromosomes. We assume that *n* is the number of independent variables (i.e., unique haplotypes) that are included in the regression model, and *m* is the number of samples we are given. To accurately estimate the coefficients in the regression model, *n* should be upper bounded by a function of *m*. For an ordinary regression analysis that maximizes the likelihood (i.e., minimizes the sum of squared error), there is a rule of thumb for the ratio *n*:*m*. For the *l*_{1}-norm regularized regression that we use, although there is no theoretical proof yet, it has been suggested that, for the ratio *n*:*m*, *n* ≤ *m*.^{23} Under the assumption that phase-known haplotypes are given as input data, *M* chromosomes correspond to *M*/2 diploid individuals (i.e., *m*=*M*/2); we therefore choose *M*/2 as the maximum number of distinct haplotypes that can be accommodated in the regularized regression model.
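The rule for choosing the maximum window size can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `max_window_size` is hypothetical, SNP positions are 0-based (the text uses 1-based positions), and haplotypes of different lengths are counted as distinct items, which is one reading of the union in the definition.

```python
def max_window_size(haplotypes, s):
    """Return k_s for windows beginning at 0-based SNP index s.

    haplotypes: list of M equal-length allele sequences, one per chromosome.
    Returns None when no window of length >= 2 satisfies the M/2 bound.
    """
    M = len(haplotypes)
    L = len(haplotypes[0])
    distinct = set()
    k_s = None
    for l in range(2, L - s + 1):
        # Tag each haplotype with its length so that haplotypes from
        # windows of different lengths are counted separately.
        distinct |= {(l, tuple(h[s:s + l])) for h in haplotypes}
        if len(distinct) > M / 2:
            break  # adding this window would exceed half the chromosomes
        k_s = l
    return k_s
```

With identical chromosomes every window contributes one haplotype, so the window grows to the end of the region; with diverse chromosomes the bound is reached early and the window stays short.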

### Performing the Association Test in a Variable-Sized Window

For a given *s* ∈ {1,…,*L*−1}, suppose there are *J* distinct haplotypes in the variable-sized window Ω^{s}. In this article, we take account of the dependency and complementariness among the *J* haplotypes and test them in one model. We make use of the shrinkage techniques in the regression to deal with the problem of the many degrees of freedom. The main reason for turning to regression is its fast estimation of coefficients. Use of regression models instead of logistic regression models is not uncommon in practice.^{24,25} To work with a regression model, we introduce a new variable *y*^{*}_{i} for each *y*_{i}, and the former can be interpreted as a true underlying continuous phenotype represented by the latter. In our experiments, *y*^{*}_{i}=1 when *y*_{i}=1, and *y*^{*}_{i}=−1 when *y*_{i}=0.

There are two steps in performing the association test in a variable-sized window. In step 1, we estimate the haplotype effect differences for the *J* haplotypes, using *l*_{1}-norm regularized regression, which is described below. Those haplotypes whose estimated effect difference (with respect to the reference haplotype) is not equal to zero are taken as informative haplotypes. If there are no informative haplotypes, we claim that there is no association between the haplotypes in the window and the disease of interest; otherwise, we proceed to the next step. In step 2, we test the statistical significance of the informative haplotypes selected in the first step by the *F* test. Below we describe how to make use of the generalized degrees of freedom (GDF) to correct the selection bias in the first step and to calculate an unbiased *P* value for association in each variable-sized window.

*l*_{1}-Norm Regularized Regression

Suppose there are *J* distinct haplotypes in the variable-sized window Ω^{s}. Let *D*^{s}_{ij}, *i*=1,…,*M* and *j*=1,…,*J*, be a {0,1} variable, representing whether chromosome *i* contains haplotype *j*. The regularized regression model is parameterized with β^{s}={β^{s}_{0},β^{s}_{1},…,β^{s}_{J}}, where β^{s}_{0} is the haplotype effect of a reference haplotype, which is one of the *J* haplotypes, but is unknown before the fitting of the model; β^{s}_{j} is the haplotype effect difference between the *j*th haplotype and the reference haplotype. Unlike ordinary regression, which aims to minimize the sum of squared error ($\sum_{i=1}^{M}(y^{*}_{i}-\beta^{s}_{0}-\sum_{j=1}^{J}\beta^{s}_{j}D^{s}_{ij})^{2}$) between *y*^{*} and its estimation, the regularized regression has the joint objective of using the simplest model to obtain the least squared error. There is a hyperparameter α_{s} that makes a trade-off between these two contradicting objectives. The *l*_{1}-norm regularized regression uses $\sum_{j=1}^{J}|\beta^{s}_{j}|$ as the model-complexity measure and estimates β^{s} by minimizing $\sum_{i=1}^{M}(y^{*}_{i}-\beta^{s}_{0}-\sum_{j=1}^{J}\beta^{s}_{j}D^{s}_{ij})^{2}+\alpha_{s}\sum_{j=1}^{J}|\beta^{s}_{j}|$. The second term in the objective function forces the *l*_{1}-norm regularized regression to use as small a number of haplotypes as possible to predict *y*^{*} accurately.^{26} For a known α_{s}, β^{s} can be found using quadratic programming techniques that are computationally intensive; LARS^{23} was proposed to estimate β^{s} in time similar to that of standard linear regression for a series of α. The model obtained with a given α_{s} is called “α_{s}-indexed,” and its corresponding parameters are represented as β^{s}(α_{s}). We decided on the best value of α_{s} (equivalently, the best value of β^{s}) by the adaptive model-selection method,^{27} which was observed to perform better than cross-validation methods.^{28}
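To make the shrinkage behavior concrete, the following Python sketch fits an *l*_{1}-penalized least-squares model for a single value of α. It is not the LARS procedure used in this study; it swaps in plain cyclic coordinate descent, which is easier to show in a few lines. The function names and the assumption that the intercept is left unpenalized are illustrative choices.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; return 0.0 when |x| <= t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(D, y, alpha, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j D_ij)^2 + alpha * sum_j |b_j|.

    D: M x J list of 0/1 haplotype indicators; y: list of +1/-1 phenotypes.
    Cyclic coordinate descent with an unpenalized intercept (an assumption).
    """
    M, J = len(D), len(D[0])
    beta0 = sum(y) / M
    beta = [0.0] * J
    resid = [yi - beta0 for yi in y]
    for _ in range(n_iter):
        # Intercept update: absorb the mean of the current residuals.
        shift = sum(resid) / M
        beta0 += shift
        resid = [r - shift for r in resid]
        for j in range(J):
            z = sum(D[i][j] ** 2 for i in range(M))
            if z == 0:
                continue
            # Correlation of column j with the residual that excludes beta_j.
            rho = sum(D[i][j] * (resid[i] + D[i][j] * beta[j]) for i in range(M))
            new = soft_threshold(rho, alpha / 2.0) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [resid[i] - D[i][j] * delta for i in range(M)]
                beta[j] = new
    return beta0, beta
```

In a toy window in which one haplotype is carried only by cases and two others only by controls, the fit with α=1 keeps a nonzero effect difference for the case haplotype alone; the two control haplotypes are shrunk to zero and jointly play the role of the reference, which illustrates why the reference haplotype is not known before fitting.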

A central concept in the adaptive model-selection method^{27} is the GDF.^{29} For a general modeling procedure, such as regularized regression, which involves variable selection, the GDF is introduced to correct selection bias and to accurately measure the complexity of the model obtained. Those who are interested in the details of the GDF can refer to the work of Ye.^{29} The adaptive model-selection method generalizes the Akaike information criterion (AIC), one of the model-selection criteria, to the extended AIC, where the degrees of freedom in AIC are replaced with the GDF. For the computational details of the GDF and the extended AIC, please see appendix A. To decide on the best value of α_{s}, the adaptive model-selection method chooses the α that minimizes the extended AIC. Denote the chosen α_{s} as $\hat{\alpha}_{s}$.
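For readers who want an intuition for the GDF, the sketch below estimates it by Monte Carlo perturbation: the phenotype vector is perturbed with small Gaussian noise, the model is refit, and the sensitivity of each fitted value to its own perturbation is summed. This is a generic illustration of the perturbation idea and may differ from the computation in appendix A; `estimate_gdf`, `fit`, `n_perturb`, and `tau` are hypothetical names and settings.

```python
import random


def estimate_gdf(fit, y, n_perturb=200, tau=0.5, seed=0):
    """Monte Carlo estimate of generalized degrees of freedom.

    fit: callable mapping a phenotype list to a list of fitted values.
    The GDF is the sum over observations of the regression slope of the
    fitted value on the perturbation applied to that observation.
    """
    rng = random.Random(seed)
    M = len(y)
    deltas, fits = [], []
    for _ in range(n_perturb):
        d = [rng.gauss(0.0, tau) for _ in range(M)]
        deltas.append(d)
        fits.append(fit([yi + di for yi, di in zip(y, d)]))
    gdf = 0.0
    for i in range(M):
        d_i = [d[i] for d in deltas]
        f_i = [f[i] for f in fits]
        d_bar = sum(d_i) / n_perturb
        f_bar = sum(f_i) / n_perturb
        sxx = sum((a - d_bar) ** 2 for a in d_i)
        sxy = sum((a - d_bar) * (b - f_bar) for a, b in zip(d_i, f_i))
        gdf += sxy / sxx  # sensitivity of fitted value i to perturbation i
    return gdf
```

As a sanity check, a procedure that returns the data unchanged has a GDF equal to the number of observations, and a procedure that fits only the overall mean has a GDF close to 1. A selection procedure such as the regularized regression typically lands above the count of its nonzero coefficients, because the selection step itself reacts to the data.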

### Using the GDF to Calculate Unbiased *P* Values

The estimated haplotype-effect differences at the chosen hyperparameter $\hat{\alpha}_{s}$ are now $\hat{\beta}^{s}_{j}(\hat{\alpha}_{s})$, *j*=1,…,*J*. Those haplotypes whose values are not equal to zero are selected as informative haplotypes. We test the disease-haplotype association by testing the statistical significance of the informative haplotypes, using the *F* test. Under the null hypothesis of no association between the disease and the haplotypes in the variable-sized window, all the haplotypes have no effect difference with respect to the reference haplotype. Hence, the sum of squared error of the null model is $SSE_{0}=\sum_{i=1}^{M}(y^{*}_{i}-\bar{y}^{*})^{2}$, where $\bar{y}^{*}=\frac{1}{M}\sum_{i=1}^{M}y^{*}_{i}$. Under the alternative hypothesis, all the noninformative haplotypes have the same effect as the reference haplotype, and all the informative haplotypes have different effect differences. Under the assumption that the indices of the informative haplotypes are *t*_{1},…,*t*_{G}, the sum of squared error of the alternative model is $SSE_{1}=\sum_{i=1}^{M}(y^{*}_{i}-\hat{y}^{*}_{i})^{2}$, where $\hat{y}^{*}_{i}=\hat{\beta}^{s}_{0}+\sum_{g=1}^{G}\hat{\beta}^{s}_{t_{g}}D^{s}_{it_{g}}$. The alternative model is $\hat{\alpha}_{s}$-indexed, and we denote its GDF as $GDF(\hat{\alpha}_{s})$. The statistic^{30} to test the significance of the contributions of the informative haplotypes is

$$F=\frac{(SSE_{0}-SSE_{1})/(GDF(\hat{\alpha}_{s})-1)}{SSE_{1}/(M-GDF(\hat{\alpha}_{s}))},$$

which follows the *F*-distribution asymptotically under the null hypothesis, with the first degrees of freedom being $GDF(\hat{\alpha}_{s})-1$ and the second degrees of freedom being $M-GDF(\hat{\alpha}_{s})$.

There are only *G* nonzero coefficients and one intercept in the alternative model; however, when the test statistic is calculated, the degrees of freedom of the alternative model are taken to be its GDF, which is usually >*G*+1. If we use *G*+1 as the degrees of freedom of the alternative model, the resulting *P* value will be biased downward. The GDF is used to correct the selection bias in the regularized regression^{29}; thus, the *P* value is called “unbiased” because it is calculated on the basis of the GDF of the model.
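The effect of substituting the GDF for *G*+1 can be seen in a small calculation. The sketch below assumes the usual nested-model form of the *F* statistic, with the null model (intercept only) having 1 degree of freedom and the alternative model's degrees of freedom set to its GDF; the function name and argument names are hypothetical.

```python
def gdf_f_statistic(sse_null, sse_alt, gdf, M):
    """F statistic comparing the intercept-only model with the selected model.

    gdf: generalized degrees of freedom of the alternative model, used in
    place of the naive count G + 1. Returns (F, df1, df2).
    """
    df1 = gdf - 1.0   # extra degrees of freedom over the null model
    df2 = M - gdf     # residual degrees of freedom
    if df1 <= 0 or df2 <= 0:
        raise ValueError("GDF must lie strictly between 1 and M")
    f_value = ((sse_null - sse_alt) / df1) / (sse_alt / df2)
    return f_value, df1, df2
```

For the same pair of error sums, a larger GDF gives a smaller *F* value and fewer residual degrees of freedom, hence a larger *P* value. This is the direction of the correction described above: using *G*+1 instead would understate the *P* value.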

### Simulation Data

All the simulation data were generated using the ms program.^{31} First, 4,000 haplotypes were generated using the following parameters: region size of 300 kb; effective population size of 10,000; recombination rate per site per generation of 10^{−9} or 10^{−7}, and 300 SNPs within the region. Then, 2,000 individual samples were generated by randomly pairing the haplotypes. One or two SNPs with minor-allele frequency (MAF) of ~0.05 were randomly selected as the disease-causing variant(s) from the region (see explanation below about the disease model). Under the assumption of a multiplicative model of disease inheritance and an equal case:control ratio, the phenotype of each individual was simulated using the logistic regression model (appendix B) and an odds ratio for the heterozygous genotypes at the causal SNP(s) in the range 1.2–2.5. After generation of the phenotypes, the genotypic information of the selected causal SNP(s) was removed from the simulated haplotypes before statistical analysis.
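As a rough illustration of the phenotype step, the sketch below draws case-control status from a logistic model in which each copy of the risk allele multiplies the odds of disease by the same odds ratio (the multiplicative model). The exact procedure is the one in appendix B; here the baseline log odds, the function names, and the omission of the resampling needed to reach an equal case:control ratio are all simplifying assumptions.

```python
import math
import random


def case_probability(risk_alleles, odds_ratio, baseline_log_odds):
    """P(case) when each risk-allele copy multiplies the odds by odds_ratio."""
    log_odds = baseline_log_odds + risk_alleles * math.log(odds_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


def simulate_phenotypes(risk_allele_counts, odds_ratio, baseline_log_odds, seed=0):
    """Draw 0/1 disease status for each individual from the logistic model."""
    rng = random.Random(seed)
    return [
        1 if rng.random() < case_probability(g, odds_ratio, baseline_log_odds) else 0
        for g in risk_allele_counts
    ]
```

Under this parameterization, the odds for a homozygous carrier are the square of the heterozygous odds ratio relative to a noncarrier.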

We simulated two types of data on the basis of two different disease models. In the first model, there is only one disease-causing SNP within the simulated region. In the second model, there are two disease-causing SNPs within the simulated region that act jointly (not interactively). When selecting the two risk SNPs within a region, in addition to the requirement that both SNPs have an MAF of ~0.05, the pairwise *r*^{2} between the two risk SNPs is required to be <0.1, and they are separated by as many SNPs as possible. The odds ratios for the two causal SNPs are set to be the same. Detailed simulation procedures are described in appendix B.

The decay of LD, seen in both *D′* and *r*^{2}, in our simulated data (after filtering out the SNPs with MAF <0.03) was compared with that in the HapMap project.^{34} The overall patterns of our simulated data (data not shown) are similar to those from the HapMap project.^{34} Whereas the simulation data with a low recombination rate of 10^{−9} shows overall stronger LD and a slower LD decay than that of the HapMap data, the simulation data with a high recombination rate of 10^{−7} shows overall weaker LD and a faster LD decay than that of the HapMap data. Given the fact that the average recombination rate across 500 kb in the human ENCODE regions ranges from 0.19 to 1.25 cM,^{34} the rate of 10^{−9} represents the low end of recombination rates in the human genome, whereas the rate of 10^{−7} represents a high recombination rate observed in some parts (hot spots) of the human genome. It is therefore expected that reasonable differences of LD pattern will be seen between our simulation data and the HapMap data, and our simulation data are suitable for evaluating the performance of our method for analyzing real human population data.

### Experimental Data

Chinese subjects who received a diagnosis of idiopathic Parkinson disease from neurologists at two major movement disorder centers in Singapore (Singapore General Hospital and National Neuroscience Institute) were included in the study. The diagnosis of Parkinson disease was made in accordance with the diagnostic criteria of the United Kingdom Parkinson Disease Society Brain Bank. Healthy controls of similar age and matching sex and race were recruited at the same clinics. Institutional ethics committees approved the study, and informed consent was obtained from all study subjects.

## Results

We compared the performance of our method of variable-sized sliding windows by use of regularized regression (referred to in the table and figures as “VSSWRR”) with three other methods for association analyses: allele-based single-locus χ^{2} test (hereafter referred to as “SINGLE”), cladistic analysis of haplotypes^{21} (hereafter referred to as “CLADHC”), and association mapping by use of variable-length Markov chains^{18} (hereafter referred to as “VLMC”). SINGLE is used as a comparison baseline; CLADHC is used as a benchmark of haplotype analyses in a fixed-sized sliding-window framework; and VLMC is used as a benchmark of haplotype tests that are in the haplotype-clustering framework and make use of local LD pattern. Since CLADHC adopts a fixed window size, we analyze each replicate data set by using window sizes of 4–10 separately and present the highest power for each odds ratio. Throughout the article, we used Bonferroni correction to adjust for multiple testing (multiple sliding windows that start at different positions for our method, multiple haplotype partitions and multiple sliding windows for CLADHC, multiple single-SNP tests for SINGLE, and multiple haplotype cluster tests for VLMC).

### Analysis of Simulated Data

For each of the two disease models, with a recombination rate per site per generation of 10^{−9}, the simulation procedure in appendix B was invoked 100 times to generate 100 replicate data for each of the odds ratios: 1.2, 1.4, 1.6, 1.8, 2, and 2.5. To mimic a typical genetic association study, we first filtered out rare SNPs (MAF <0.03) and then identified tagging SNPs in 90 randomly selected individual samples (180 haplotypes), using a haplotype *R*^{2} value of 0.85.^{32} On average, 18 SNPs remained after filtering by MAF and haplotype *R*^{2}. The phenotypes and tagging-SNP haplotypes of 2,000 simulated cases and simulated controls were subsequently used in genetic association analyses.

The performance comparisons among the four methods were done in two ways. First, the performance was evaluated in terms of the detection power—that is, the rate of declaring association on the basis of the smallest adjusted *P* value at a significance level of .05 within a region. Second, the performance was evaluated by calculating the type I error rate for each method, which was done by randomly permuting the disease status for each datum and then averaging over the disease models.
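The power calculation described above can be sketched as follows: within each replicate, the raw *P* values of all tests are Bonferroni adjusted, and the replicate counts as a detection when the smallest adjusted value falls below the significance level. The function names are hypothetical, and the sketch ignores the two-level correction that CLADHC requires.

```python
def bonferroni(p_values):
    """Multiply each raw P value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def detection_power(replicate_p_values, level=0.05):
    """Fraction of replicates whose smallest adjusted P value is below level.

    replicate_p_values: one list of raw P values per simulated replicate,
    e.g. one P value per sliding window.
    """
    hits = sum(
        1 for p_values in replicate_p_values
        if min(bonferroni(p_values)) < level
    )
    return hits / len(replicate_p_values)
```

Applying the same function to replicates whose disease status has been randomly permuted gives the corresponding type I error rate.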

Our method consistently outperformed the other three methods in terms of the detection power at various odds ratios under the two different disease models. Under the single-disease-allele model (fig. 1), our method provides the best detection power among the four methods, although the difference between our method and VLMC (the second best) is moderate. At very low odds ratios (1.2–1.4), all the methods have poor power of detection, which is expected, given the low population frequency of the simulated disease allele and the limited size of the simulated sample of 1,000 cases and 1,000 controls. For moderate odds ratios (1.8–2.5), both our method and VLMC provide significantly higher detection power than do CLADHC and SINGLE. Under the model of two disease alleles, for a moderate odds ratio (1.6–2), our method consistently provides 10%–20% more power than do the other three methods (fig. 2). All three haplotype-based methods perform better than SINGLE for odds ratios of 1.8–2.5; however, they do not perform better for low odds ratios, for which all the methods have poor detection power.

To further investigate the performance of each method, the haplotype complexity was increased by increasing the recombination rate per site per generation from 10^{−9} (fig. 2) to 10^{−7} (fig. 3) for the model of two disease alleles. With a recombination rate of 10^{−7}, LD strength within the region under study was greatly reduced, and the haplotype complexity within the region therefore increased significantly. Specifically, for each of the 100 simulated data replicates, we calculated the number of unique haplotypes of length from 2 to *L*-*s*+1 (under the assumption that there are *L* SNPs) for each possible window beginning at position *s.* The average number of unique haplotypes over all *s* in the 100 simulation data increases from 94 to 1,760. The percentage of sliding windows that have >2,000 unique haplotypes of different lengths increases from 0% to 21%. For the region associated with high haplotype complexity (fig. 3), our method provides much better detection power than that of the other three methods, and ~30%–50% more power than that of CLADHC, the second-best method. As the second-best method, CLADHC still performs significantly better than SINGLE and VLMC for odds ratios 1.8–2.5. Interestingly, VLMC, which is the second-best method for the simulated data with low haplotype complexity (a recombination rate per site per generation of 10^{−9}), has the worst performance, with very poor detection power even for the high odds ratio of 2.5. Overall, the improved performance of our method compared with the other three methods is more significant in the region with high recombination rate and thus with high haplotype complexity.

We also compared the type I error rates of the four methods (table 1). Our method has the highest type I error rate; however, it is still below the nominal value of 5%, and the difference in the type I error rates between our method and the other three methods is moderate. This indicates that the significant improvement of our method in terms of detection power does not lead to a significant increase in the type I error rate or false-positive rate.

### Analysis of Experimental Data

We also evaluated the performance of the four methods by using experimental data generated in a genetic association study of Parkinson disease. The data include the genotypes of 96 SNPs (from a single candidate gene) obtained from 211 cases and 215 healthy controls. The most likely haplotype pair for each individual was inferred by PLEM.^{33} Of 95 sliding windows, 81% have >426 (the number of total samples) unique haplotypes of different lengths. Among the four methods, only our method detected a significant association at the 5% experimentwise significance level after Bonferroni correction for multiple testing. The sliding window Ω^{31} (see the “Methods” section for description), beginning with SNP 31, had the smallest raw *P* value of .000285, which was significant after Bonferroni correction (fig. 4). The analysis was also performed using a permutation method for multiple-testing correction, and the conclusion remained the same. The smallest permutation-corrected *P* value of our method (for 1,000 permutations) was 0.019, whereas no significant evidence was detected by the other three methods (SINGLE and VLMC used 1,000 permutations, whereas CLADHC used 10,000 permutations because it involves two-level multiple testing). Within the sliding window of Ω^{31}, the longest informative haplotype selected by the regularized regression has length 18; hence, the identified critical region of the putative risk allele(s) was from SNP 31 to SNP 48. To further evaluate the significance of this finding, we performed 100 cross-validation analyses. In each cross-validation, we randomly selected 174 cases and 176 controls from the whole sample. Since the sample size of the cross-validation analysis was only 350, the *P* values obtained were inevitably nonsignificant after multiple-window adjustment. 
However, of the 100 cross-validation analyses, 58 mapped the critical region of the putative risk alleles (defined by the longest informative haplotype) to the interval between SNPs 29 and 49, and 15 analyses mapped the critical region to the interval between SNPs 16 and 34, which overlapped with the original critical region determined in the whole sample. This suggested that the identified critical region of the putative risk allele was unlikely to be caused by sampling bias, although further validation analyses are warranted.
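The Bonferroni result reported above can be checked with simple arithmetic. The sketch below assumes that the number of tests equals the 95 sliding windows; the function name `bonferroni_adjust` is illustrative and is not taken from the authors' R code.

```python
def bonferroni_adjust(raw_p, n_tests):
    """Bonferroni-adjusted P value: raw P times the number of tests, capped at 1."""
    return min(1.0, raw_p * n_tests)


# Assumption: one test per sliding window, 95 windows in total.
adjusted = bonferroni_adjust(0.000285, 95)
significant = adjusted < 0.05  # 5% experimentwise significance level
```

Under that assumption the adjusted value is about 0.027, which is below 0.05 and agrees with the statement that the window beginning with SNP 31 remained significant after correction.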

## Discussion

In this article, we have proposed a haplotype-based method that works with variable-sized sliding windows to detect disease-haplotype associations in population-based case-control studies. For each variable-sized sliding window, the maximum window size is determined on the basis of local haplotype diversity as well as sample size. Through a systematic performance evaluation under different disease models, we have shown that our method consistently outperforms the commonly used single-SNP–based association test and two haplotype association methods that have been demonstrated to be among the most effective methods to date. The advantage of our method over the other three methods becomes much more pronounced when the region under study shows low LD. When the region under study exhibits extensive LD, our method provides good detection power (>60%) for disease alleles with moderate effects (odds ratio 1.8) and a low population frequency of 5% and provides almost full power for the model of two disease alleles with an odds ratio of 2.5 and a population allele frequency of 5%. Importantly, the improvement in detection power achieved by our method does not lead to a significant increase in type I error rate, and the overall rate is well below the nominal value of 5%. Meanwhile, it is worth pointing out that, when the region under study has relatively low LD, the detection power of our method is still not optimal (<40%), although it performs much better than the other three methods. There is therefore still room for further improvement of our method.

To our knowledge, our method is the first application of GDF and *l*_{1}-norm regularized regression to haplotype association analyses. A major challenge for haplotype association approaches is the large number of haplotypes to be tested, and the issue becomes even more challenging when an exhaustive analysis of haplotypes is performed. It is expected that, in an exhaustive analysis of haplotypes within a region, many haplotypes have the same prefix and are thus highly correlated (e.g., haplotypes 122, 1221, and 12212, all beginning with SNP *s,* are highly correlated). Meanwhile, because haplotypes may have complementary effects, considering one haplotype at a time will weaken association strength. A series of 1-df tests, followed by permutation-based multiple-testing adjustment, can take into account the dependency among the tests but ignores the complementariness among haplotypes. In contrast, the conventional single, multiple-df global test can take into account the complementariness but fails to consider the dependency among haplotypes, because it treats each haplotype as a totally independent entity. To account for the dependency among haplotypes, the penalized log-likelihood method^{22} introduced a penalty term to force similar haplotypes to have similar estimated effects. The regularized regression we adopted in this article behaves differently. The dependency among haplotypes is evaluated on the basis of disease status; that is, two haplotypes are considered to be highly redundant if the association of one haplotype with the disease is not much affected by consideration of the two haplotypes together. The regularized regression shrinks the effects of redundant haplotypes to zero, so that the effective degrees of freedom of the model are much smaller than the given sample size. Thanks to the fast estimation of coefficients in *l*_{1}-norm regularized regression, the best model (i.e., the best trade-off parameter) can be found by the adaptive model-selection method,^{27} which was observed to be better than cross-validation methods.^{28} By taking into account both the redundancy and the complementariness among haplotypes and by using the GDF technique, our regularized regression method provides a more efficient and effective way to analyze a large number of haplotypes in an association test than does a series of 1-df tests or a single multiple-df global test.
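The way an *l*_{1}-norm penalty sets some haplotype effects exactly to zero can be illustrated with a small coordinate-descent sketch. This is not the authors' R implementation: the function names, the squared-error objective without an intercept, and the toy design matrix are all assumptions made for illustration.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return 0 if |value| <= alpha."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def l1_regression(D, y, alpha, n_sweeps=200, tol=1e-10):
    """Minimize 0.5 * sum((y - D b)^2) + alpha * sum(|b|) by coordinate descent.

    D is a list of rows (one per chromosome), each a list of 0/1 indicators
    (one per haplotype); y is the list of responses. No intercept is fitted.
    """
    n, p = len(D), len(D[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - D beta while beta is all zeros
    col_ss = [sum(D[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(p):
            if col_ss[j] == 0:
                continue  # haplotype absent from the sample
            # Inner product of column j with the residual that excludes column j.
            rho = sum(D[i][j] * (residual[i] + D[i][j] * beta[j]) for i in range(n))
            new_beta = soft_threshold(rho, alpha) / col_ss[j]
            change = new_beta - beta[j]
            if change != 0.0:
                for i in range(n):
                    residual[i] -= D[i][j] * change
                beta[j] = new_beta
            largest_change = max(largest_change, abs(change))
        if largest_change < tol:
            break
    return beta


# Toy data: haplotype 0 tracks the response, haplotype 1 adds no signal.
D_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_toy = [1.0, 1.0, 0.0, 0.0]
beta_toy = l1_regression(D_toy, y_toy, alpha=0.5)
```

In this toy example the coefficient of the second haplotype is set exactly to zero, whereas the first is shrunk from its least-squares value of 1 to 0.75; a sufficiently large trade-off parameter removes both.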

For the first time, we implement a procedure in which the maximum window size of a sliding-window analysis is determined on the basis of local haplotype diversity and sample size. It is well known that, in a linear model, the number of covariates that can be accurately estimated is constrained by the number of observations or samples, although the form of the constraint may differ among model-building techniques. Given a fixed sample size, to ensure an accurate estimation of the model parameters, the number of covariates or unique haplotypes that can be considered should be limited. Unconstrained inclusion of a large number of haplotypes in an association test will not increase but, instead, will decrease the detection power.
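The idea of capping the window size by haplotype diversity can be sketched as follows. The exact rule is given in the "Methods" section and is not reproduced here; the `budget` argument below is a hypothetical stand-in for a cap derived from the sample size, and the function name is illustrative.

```python
def max_window_size(chromosomes, start, budget):
    """Largest window length from `start` whose haplotype count stays within budget.

    chromosomes: list of equal-length allele strings (phased haplotypes).
    budget: hypothetical cap on the number of unique haplotypes of all
            lengths in the window (an assumed stand-in for the paper's rule).
    Returns (window_length, unique_haplotype_count).
    """
    n_snps = len(chromosomes[0])
    total = 0
    best_length, best_total = 0, 0
    for length in range(1, n_snps - start + 1):
        # Unique haplotypes of this length that begin at SNP `start`.
        unique = {c[start:start + length] for c in chromosomes}
        total += len(unique)
        if total > budget:
            break
        best_length, best_total = length, total
    return best_length, best_total


toy = ["1221", "1221", "1212", "2112"]
window_low_budget = max_window_size(toy, start=0, budget=7)
window_high_budget = max_window_size(toy, start=0, budget=100)
```

Because the count of unique haplotypes grows faster where LD is weak, a rule of this kind yields shorter windows in high-diversity regions and longer windows in regions of extensive LD.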

Our approach has better performance than CLADHC, which has been shown to be one of the most powerful approaches for haplotype analysis to date. The improvement of our approach over CLADHC is reflected in several ways. First, our approach is based on a variable sliding-window size, whereas CLADHC employs a fixed window size. As pointed out above, the use of a fixed window size reduces the power for detecting risk haplotype(s). In our performance evaluation, multiple fixed window sizes were explored in the CLADHC analysis, which is what is usually done in practice. Strictly speaking, the results of CLADHC should then have been adjusted for the multiple window sizes, which would have made the performance of CLADHC worse. Second, when the number of haplotypes in a window is large, CLADHC needs to first merge rare haplotypes into one category before hierarchical clustering is performed. In contrast, our method considers both common and rare haplotypes directly in the regularized regression model and can include rare haplotypes in the final model if they are statistically significantly associated with the disease of interest. Third, the haplotype partition in CLADHC is obtained by hierarchical clustering, which is greedy in nature. Hence, it is likely that the “best” haplotype partition identified is only suboptimal among all possible ones. By contrast, the regularized sum of squared error that serves as the objective function in the regularized regression is globally minimized. Of course, CLADHC has its own advantage in terms of mapping disease susceptibility loci, because its hierarchical clustering is based on haplotype evolution to some extent. Another point of comparison is computational efficiency. Within a sliding window, CLADHC tries different haplotype partitions and thus needs to adjust for multiple testing (multiple haplotype partitions) within a window, which is usually done by conservative Bonferroni correction. 
Alternatively, the adjustment can be done by a permutation approach, but it requires at least 1,000 permutations to get accurate adjusted *P* values, which is computationally intensive. Our method tests the association between the disease and all the haplotypes within a window in one model; hence, there is no need to do multiple-testing adjustment within a window. However, to calculate unbiased *P* values, our method does require a parametric bootstrapping procedure to estimate the GDF of the model built by the regularized regression. Fortunately, 100 bootstraps are usually enough to accurately estimate the GDF, which is much less computationally intensive than the permutation method for multiple-testing adjustment. In summary, our method outperforms CLADHC by enjoying the flexibility of window sizes and the effectiveness of managing a large number of both common and rare haplotypes.

Another unique feature of our method is that it takes into account the observed haplotype associations with the phenotypes of interest when the regression model is being built for testing. In contrast, CLADHC and VLMC first group haplotypes on the basis of haplotype similarity and then perform association tests. Thus, neither method considers the observed haplotype associations with phenotypes when grouping the haplotypes, although both take haplotype evolution into account to some extent. Modeling haplotype evolution in association tests may give an easier biological interpretation of association evidence; however, a problem may be encountered when a region is analyzed for which either there is more than one disease-risk allele or both risk and protective alleles exist and these alleles have different evolutionary histories. The haplotype clustering–based method may also encounter a problem when the region under study exhibits a complex LD pattern and thus contains a large number of unique haplotypes. For example, the analyses of our simulation data indicated that VLMC grouped a larger number of unique haplotypes into a smaller number of haplotype clusters (merging edges in the fitted graph) in the region with a high recombination rate than in the region with a low recombination rate (data not shown). When a region with a large number of unique haplotypes is analyzed, the merging method of VLMC can result in a limited number of merging edges or haplotype clusters for testing, but the resulting haplotype clusters will become more heterogeneous and thus may fail to capture the true association between particular haplotype(s) and risk variant(s). This may explain, at least partially, the poor performance of VLMC under a high recombination rate. 
Therefore, the sequential nature (unsupervised haplotype grouping followed by association testing) of the existing haplotype clustering–based methods may be another reason why these methods have lower detection power than our method does.

There are, however, some limitations of our simulation analysis. First, our simulated data were constructed without the modeling of recombination hotspots. However, we did simulate the data by assuming a high or low recombination rate per site per generation (10^{−7} or 10^{−9}). Given the observed range of the average recombination rate (0.19–1.25 cM) across 500 kb in the human ENCODE regions,^{34} the rate of 10^{−9} represents the low end of recombination rates observed in the human genome, whereas 10^{−7} represents a high recombination rate observed in some parts (hotspots) of the human genome. The performance of our method in a region with recombination hotspots (a mixture of 10^{−7} and 10^{−9} recombination rates) is expected to be intermediate between the performances achieved with assumptions of low and high recombination rates across the whole region.

Second, we only simulated disease alleles with an MAF of 0.05. It is well known that the MAF of a disease allele has a major impact on the relative power of haplotype and single-marker methods. If there is only one common risk allele (MAF >0.05) within a region, it is likely that a single-marker test will have detection power similar to that of our haplotype analysis. However, when there is more than one common risk allele within a region, our method will have better power than a single-marker analysis, because association evidence from different haplotypes (associated with different risk alleles) may be built into a single regression model for testing in our method. For detecting disease-risk alleles with low MAFs, our method is likely to have even higher detection power than single-marker analyses and haplotype clustering–based approaches, because our method directly includes both common and rare haplotypes in the regression model. By simulating a disease allele with low MAF and moderate effect, we explored a rather difficult scenario for detecting a risk allele to evaluate the power of our method.

Third, we directly simulated case and control samples instead of first simulating a source population and then randomly selecting cases and controls. Typical genetic association analyses are performed using cases and controls, often of similar numbers, that represent a very small proportion of the source population. Under the assumption of a relatively large sample size and a multifactorial disease model in which disease phenotype is influenced by multiple factors, each with a moderate effect, the random sampling of cases and controls from the source population should not cause a significant difference in LD pattern between the cases and the controls or between the selected samples and the source population. In our simulation data analysis, instead of simulating a source population and then a random sampling process, we directly simulated equal numbers of cases and controls that have a similar LD pattern and reflect the simulated disease model with a low disease-risk allele frequency and a moderate relative risk. By doing so, we eliminated the impact of a random sampling process. But, given the relatively large number of our simulated samples, they should be suitable for demonstrating the applicability of our method to real data.

Fourth, the use of Bonferroni correction for multiple-testing adjustment may affect the result of the power comparison between our method and the three competing methods, because Bonferroni correction overly penalizes a method that involves highly correlated tests. If one method involves a larger number of highly correlated tests, it will be overly penalized and therefore will appear to have lower power than would have been achieved if a more appropriate correction method (such as permutation-based correction) were used. Given that the sliding windows of our method are highly overlapping for adjacent SNPs, our method and SINGLE probably suffer a similar penalty from Bonferroni correction. As for VLMC, it is not clear whether the merging edges that are used for association testing suffer more from Bonferroni correction than do our variable-sized sliding windows; however, the type I error rates of VLMC and our method are very similar, suggesting that they are penalized to a similar degree. CLADHC may be penalized more than the other three methods, because it needs to adjust for two-level multiple testing (one for multiple haplotype partitions within a window and the other for multiple windows). However, this cannot be the sole reason why our method performed better than CLADHC, as we discussed above. The fact that the use of permutation tests in the analysis of the experimental data leads to the same conclusion as the use of Bonferroni correction suggests that our method might still have better power than the other three methods when a permutation test is used for multiple-testing correction.
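For comparison with Bonferroni correction, a permutation-based adjustment across windows can be sketched as below. This is a generic min-*P* procedure, not the code used in the study: `window_p_values` is a hypothetical callable that returns one raw *P* value per window for a given labeling, and the "+1" in the numerator and denominator is a common convention assumed here rather than something stated in the text.

```python
import random


def permutation_adjusted_p(window_p_values, labels, n_perm=1000, seed=0):
    """Min-P permutation adjustment across sliding windows.

    window_p_values: hypothetical callable mapping a list of case/control
        labels to a list of raw P values, one per window.
    Returns the fraction of label permutations (plus the observed data)
    whose smallest P value is at most the observed smallest P value.
    """
    rng = random.Random(seed)
    observed_min = min(window_p_values(labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any label-haplotype association
        if min(window_p_values(shuffled)) <= observed_min:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the smallest *P* value is recomputed over all windows in every permutation, the correlation among overlapping windows is respected automatically, which is why such an adjustment is less conservative than Bonferroni correction for highly correlated tests.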

To mimic a typical genetic association study, our analysis of the simulated data was preceded by removal of rare variants and use of a tagging-SNP strategy; however, our regularized regression–based method can also be performed by consideration of all the typed SNPs. On one hand, including all the SNPs allows the full usage of genetic information and thus may increase the significance of the test within a sliding window. On the other hand, it also increases the total number of sliding windows and thus the total number of tests that need to be adjusted for. Currently, there is not a good solution to finding an optimal trade-off point between the maximization of genetic-information usage and the minimization of multiple-testing adjustment, because such a point seems to be different from case to case. For example, in our analysis of the candidate gene for Parkinson disease, use of all 96 typed SNPs allowed us to identify a significant association at the adjusted significance level of 0.05, whereas use of 15 tagging SNPs failed. One possible solution might be the application of the regularized regression to tests of the significance of all the informative haplotypes selected in all the sliding windows.

Our method can also be easily generalized to analysis of phase-unknown genotype data. For example, for phase-known haplotype data, *D*^{s}_{ij} in the regularized regression is an indicator of whether chromosome *i* contains haplotype *j*; for phase-unknown genotype data, *D*^{s}_{ij} can be set to the expected dosage of haplotype *j* in subject *i.*^{35}^{,}^{36} In particular, *D*^{s}_{ij} can be the weighted average number of copies of haplotype *j* in the haplotype pairs that are compatible with the genotype of subject *i,* with the weights equal to the estimated haplotype frequencies. However, the simultaneous estimation of the haplotype frequencies and the haplotype effects in the regularized regression, as is done in a standard logistic regression,^{37} needs further research. Our method works in a regression framework; hence, other risk factors can be incorporated as covariates in the regression model.
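The expected-dosage coding for phase-unknown data can be illustrated with a short sketch. It assumes that each compatible haplotype pair is weighted by the product of the estimated frequencies of its two haplotypes (doubled for unlike pairs, as under random mating) and that the weights are normalized within a subject; the function name and toy frequencies are hypothetical.

```python
def expected_dosage(compatible_pairs, freq, haplotype):
    """Expected number of copies of `haplotype` carried by one subject.

    compatible_pairs: unordered haplotype pairs consistent with the genotype.
    freq: dict of estimated haplotype frequencies.
    Assumption: a pair (h1, h2) is weighted by freq[h1] * freq[h2], doubled
    when h1 != h2 (random mating), and weights are normalized to sum to 1.
    """
    weights = []
    for h1, h2 in compatible_pairs:
        w = freq[h1] * freq[h2]
        if h1 != h2:
            w *= 2.0
        weights.append(w)
    total = sum(weights)
    dosage = 0.0
    for (h1, h2), w in zip(compatible_pairs, weights):
        copies = (h1 == haplotype) + (h2 == haplotype)
        dosage += copies * w / total
    return dosage


freq_toy = {"11": 0.4, "22": 0.1, "12": 0.3, "21": 0.2}
pairs_toy = [("11", "22"), ("12", "21")]  # both consistent with a double heterozygote
dosage_11 = expected_dosage(pairs_toy, freq_toy, "11")
```

With these toy frequencies the two phase resolutions receive weights 0.4 and 0.6, so the subject contributes a dosage of 0.4 for haplotype 11 and 0.6 for haplotype 12 instead of a 0/1 indicator.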

Our method has the potential to be applied to genomewide haplotype analysis, a very challenging task at the moment. Given the greatly varying LD patterns across the human genome, our variable-sized sliding-window method has a clear advantage over the methods that assume a fixed sliding-window size. Our method also has an advantage over the exhaustive haplotype analysis in a genomewide scan. In the genomewide analysis, the maximum window size of the exhaustive analysis can be as big as the whole chromosome. Consequently, the total number of unique haplotypes will be enormous, leading to a serious loss of detection power. Our method overcomes this problem by determining the maximum window size on the basis of the local haplotype diversity and the sample size. In this study, we used the simple Bonferroni correction for the multiple-testing adjustment across sliding windows, which is overly conservative, especially for a whole-genome analysis. Permutation-based adjustment is one alternative. However, on the basis of our experience from this study, we think that the GDF approach is more favorable. Therefore, one future development will be to explore the application of the regularized regression and GDF to testing, in one model, the significance of all the informative haplotypes selected in all the sliding windows, for which only a small number of parametric bootstrap replicates, rather than a large number of permutations, is needed.

The method for variable-sized sliding windows with use of regularized regression, coded in R, is available on request from the corresponding author.

## Acknowledgments

We thank Dr. C. Durrant for providing the program of CLADHC and Dr. S. Browning for providing the program of VLMC (programs that were used to compare simulation data and experimental data); M. Seielstad and K. Humphreys, for their useful comments; and Dr. E. K. Tan, for allowing us to use his experimental data for this study. We also thank the two anonymous reviewers for their helpful suggestions. This study was supported by funding from the Agency for Science, Technology and Research of Singapore.

## Appendix A

In classical linear models, the number of covariates and the covariate identities are fixed, even if different observed responses are given; hence, the degrees of freedom are equal to the number of covariates. The situation is different in the regularized regression. With α fixed but given slightly different observed responses, the regularized regression produces different β^{s}(α) values; as a result, the number of nonzero coefficients and the identities of the nonzero coefficients may be quite different. In other words, the α-indexed model found by the regularized regression may be unstable, that is, sensitive to small changes in the observed responses. Hence, the number of nonzero coefficients can no longer accurately measure the model complexity; that is, the degrees of freedom of the α-indexed model are no longer equal to the number of nonzero coefficients in the model. For a general modeling procedure that involves variable selection, such as the regularized regression, the GDF were introduced^{29} to correct selection bias and to accurately measure the complexity of the model obtained. The GDF of a model is defined as the *average* sensitivity of the fitted values to a small change in the observed values. The parametric bootstrapping method proposed by Ye^{29} estimates the GDF by slightly perturbing the observed responses, refitting the model to the perturbed data, and computing the rate of change of the fitted responses with respect to the perturbations. Usually, 100 bootstrap replicates are enough to accurately estimate the GDF; hence, the procedure is relatively efficient.

Suppose the observed value *y*_{i}, *i*=1,…,*n*, is modeled as $y_i = \mu_i + \varepsilon_i$, where μ_{i} is the expectation of *y*_{i} and $\varepsilon_i$ is Gaussian white noise with variance σ^{2}. An estimate *s*^{2} for σ^{2} can be obtained by an ordinary regression. Given a modeling procedure $M: y \to \hat{\mu}$, GDF(*M*), the GDF of the modeling procedure *M*, can be estimated as follows:

- 1. For *t*=1,…,*T*, first generate $\delta_{ti} \sim \mathrm{normal}(0, s^2)$, *i*=1,…,*n*. Then, evaluate $\hat{\mu}(y + \delta_t)$ on the basis of the modeling procedure *M*.
- 2. Calculate $\hat{h}_i$ as the regression slope from $\hat{\mu}_i(y + \delta_t) = a_i + \hat{h}_i \delta_{ti}$, *t*=1,…,*T*.
- 3. $\mathrm{GDF}(M) = \sum_{i=1}^{n} \hat{h}_i$.

The estimation of GDF is relatively insensitive to the choice of *s* for $s \in [0.5\sigma, \sigma]$.
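The perturbation procedure for estimating the GDF can be sketched in a few lines. The code below is a minimal illustration rather than the authors' R implementation: it assumes that the GDF is taken as the sum over observations of the least-squares slope of the fitted value on the perturbation, and the two fitting procedures `fit_mean` and `fit_identity` are hypothetical stand-ins for the regularized regression.

```python
import random


def estimate_gdf(fit, y, s, n_perturb=100, seed=0):
    """Monte Carlo GDF estimate by data perturbation.

    fit: modeling procedure mapping a response list to fitted values.
    s: standard deviation of the normal perturbations.
    """
    rng = random.Random(seed)
    n = len(y)
    deltas, fits = [], []
    for _ in range(n_perturb):
        delta = [rng.gauss(0.0, s) for _ in range(n)]
        deltas.append(delta)
        fits.append(fit([yi + di for yi, di in zip(y, delta)]))
    gdf = 0.0
    for i in range(n):
        d = [deltas[t][i] for t in range(n_perturb)]
        m = [fits[t][i] for t in range(n_perturb)]
        d_bar = sum(d) / n_perturb
        m_bar = sum(m) / n_perturb
        sxy = sum((a - d_bar) * (b - m_bar) for a, b in zip(d, m))
        sxx = sum((a - d_bar) ** 2 for a in d)
        gdf += sxy / sxx  # least-squares slope = sensitivity of fitted value i
    return gdf


def fit_mean(values):
    """Hypothetical procedure: fit every observation by the sample mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def fit_identity(values):
    """Hypothetical procedure: reproduce the data exactly."""
    return list(values)


y_toy = [0.3, 1.2, -0.7, 2.0, 0.5]
gdf_identity = estimate_gdf(fit_identity, y_toy, s=0.5)
gdf_mean = estimate_gdf(fit_mean, y_toy, s=0.5, n_perturb=2000)
```

The two toy procedures bracket the behavior of a real model: fitting only the mean uses about one degree of freedom, whereas reproducing the data uses one per observation. A procedure that selects variables would fall in between, typically above its count of nonzero coefficients.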

Given GDF(*M*), the extended AIC is defined as .

## Appendix B

The procedure that we used to generate the simulation data is as follows.

- 1. Generate genotype data.
  - (a) Invoke the ms program^{31} to generate 4,000 chromosomes, with the required invoking parameters.
  - (b) Form genotype data by randomly pairing the haplotypes.
- 2. Generate the phenotype (disease status).
  - (a) Randomly select the required number of disease-causing SNPs whose MAF is approximately the desired MAF.
  - (b) Generate the disease status based on the genotypes of the causal SNPs and the disease model by use of the following logistic regression model: $\mathrm{logit}\,P(\text{disease}) = \text{constant} + \sum_{i=1}^{I} x_i \ln(\mathrm{OR})$, where *I* is the number of causal SNPs, OR is the specified odds ratio for the heterozygous genotype of the causal SNP, *x*_{i} is the 0-1-2 genotype coding for the *i*th causal SNP, and “constant” is a constant that renders the required case:control ratio.
- 3. Remove the genomic information for the selected causal SNPs from the simulated haplotypes.
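Steps 1(b) and 2(b) of the procedure above can be sketched as follows. The sketch assumes a logistic model in which the log odds of disease equal a constant plus ln(OR) times the sum of the 0-1-2 genotype codes at the causal SNPs; haplotypes are assumed to be supplied as lists of 0/1 alleles (for instance, parsed from ms output, which is not shown), and all function names are illustrative.

```python
import math
import random


def disease_probability(causal_genotypes, odds_ratio, constant):
    """P(disease) from a logistic model, multiplicative in the risk alleles.

    causal_genotypes: 0-1-2 genotype codes, one per causal SNP.
    """
    logit = constant + math.log(odds_ratio) * sum(causal_genotypes)
    return 1.0 / (1.0 + math.exp(-logit))


def pair_haplotypes(haplotypes, seed=0):
    """Form 0-1-2 genotypes by randomly pairing 0/1-coded haplotypes."""
    rng = random.Random(seed)
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    return [[a + b for a, b in zip(shuffled[k], shuffled[k + 1])]
            for k in range(0, len(shuffled) - 1, 2)]


def simulate_status(genotypes, causal_sites, odds_ratio, constant, seed=0):
    """Draw a 0/1 disease status for each subject.

    genotypes: list of per-subject lists of 0-1-2 codes over all SNPs.
    causal_sites: indices of the SNPs chosen as disease-causing.
    """
    rng = random.Random(seed)
    statuses = []
    for g in genotypes:
        p = disease_probability([g[k] for k in causal_sites], odds_ratio, constant)
        statuses.append(1 if rng.random() < p else 0)
    return statuses


haps_toy = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
genotypes_toy = pair_haplotypes(haps_toy)
status_toy = simulate_status(genotypes_toy, causal_sites=[0], odds_ratio=1.8, constant=-1.0)
```

In practice the constant would be tuned until the simulated case:control ratio matches the target, and the causal SNP columns would then be dropped from the haplotypes before analysis, as in step 3.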

## References

*SCN1A*: implications for linkage-disequilibrium gene mapping. Am J Hum Genet 73:551–565

**American Society of Human Genetics**


Association Mapping via Regularized Regression Analysis of Single-Nucleotide–Polymorphism Haplotypes in Variable-Sized Sliding Windows. American Journal of Human Genetics. Apr 2007; 80(4):705.
