- PMC2784738

# A Variable-Sized Sliding-Window Approach for Genetic Association Studies via Principal Component Analysis

^{1}Department of Mathematical Sciences Michigan Technological University, Houghton, MI 49931

^{2}Department of Mathematics, Heilongjiang University, Harbin 150080, China

## Abstract

Recently, with the rapid improvements in high-throughput genotyping techniques, researchers are facing the very challenging task of analyzing large-scale genetic associations, especially at the whole-genome level, without an optimal solution. In this study, we propose a new approach for genetic association analysis that is based on a variable-sized sliding-window framework and employs principal component analysis to find the optimum window size. With the help of the bisection algorithm in window-size searching, our method is more computationally efficient than available approaches. We evaluate the performance of the proposed method by comparing it with two other methods: a single-marker method and a variable-length Markov chain method. We demonstrate that, in most cases, the proposed method outperforms the other two methods. Furthermore, since the proposed method is based on genotype data, it does not require any computationally intensive phasing program to account for uncertain haplotype phase.

## Background

Currently, with the availability of large-scale genotyping technologies, the genotyping cost of genome-wide association (GWA) studies has been greatly reduced and a boom of large-scale GWA studies is underway. Nevertheless, the success of most association studies depends on the linkage disequilibrium (LD) between the functional mutations and markers in a local region of the genome. A variety of statistical approaches that rely on LD patterns have been developed to map functional variants (Spielman et al. 1993; Olson et al. 1994; Rannala and Reeve 2001; Ardlie et al. 2002). The most straightforward LD-based association analysis is single-marker analysis, which tests each single nucleotide polymorphism (SNP) for association with the disease. However, many studies have shown that this simple method may be inefficient in most cases because it uses only limited genetic information to find the functional mutations. We need methods that make better use of the joint information from multiple markers. An alternative to single-marker analysis is multiple-marker analysis based on either haplotypes or genotypes (Morris and Kaplan 2002; Clayton et al. 2004; Seaman and Müller-Myhsok 2005). This approach has the disadvantage that the test statistic usually involves many degrees of freedom because of the large number of haplotypes. For mapping complex disease genes, it is still hard to say which of the two methods is more powerful (Sevice et al. 1999; Barton 2000; Maclean et al. 2000; Zöllner and von Haeseler 2000; Akey et al. 2001; Morris and Kaplan 2002; Wessel and Schork 2006). Under certain disease models and certain LD patterns one method outperforms the other, so it is likely that there is no single best approach to detect the common risk factors. In practice, researchers have employed both single-marker and multiple-marker analysis in genetic association studies. If conducting a multiple-marker analysis, a researcher has to determine how many neighboring SNPs should be included in the analysis.

Recent studies have suggested that the human genome can be partitioned into blocks with limited haplotype diversity within each block (Gabriel et al. 2002). Therefore, most of the genetic variation can be captured by a limited number of haplotypes, and haplotype association tests are performed within each predefined block (Gabriel et al. 2002). For haplotype block approaches, several different criteria have been proposed to predetermine the blocks, but it is still not clear which one is the best (Perola et al. 2002; Zhang and Li 2003; Zhang et al. 2004; Zhu et al. 2004). Furthermore, it is hard to determine the boundaries of the blocks, and the partition usually results in many single-marker blocks, which offer no advantage over single-marker analysis. For these reasons, haplotype block approaches may not be the most efficient way to conduct association studies (Zhao et al. 2003).

The sliding-window approach is another strategy for multiple-marker analysis. In this approach, a genome region under study is divided into windows and a multiple-marker association test is performed in each window. There are two groups of sliding-window methods: uniform-sized sliding-window approaches and variable-sized sliding-window approaches (Clayton et al. 1999; Bourgain et al. 2000; Toivonen et al. 2000; Mathias et al. 2006; Yang et al. 2006; Yi et al. 2007; Huang et al. 2007). For the uniform-sized sliding-window approaches, it is hard to decide the optimal window size under different scenarios. This becomes more problematic when the uniform-sized sliding-window approaches are applied over a large genome region or over the whole genome, where the LD pattern varies from region to region. Therefore, variable-sized sliding-window approaches, in which the window size is decided by the underlying LD pattern, are better suited to large-scale data analysis. The problem for the variable-sized sliding-window approach is in finding the optimal window size.

Browning (2006) proposed a variable-sized sliding-window approach based on a variable-length Markov chain model, which automatically adapts to the LD pattern between markers. Browning argued that this approach can be thought of as haplotype testing with sophisticated windowing that accounts for extent of LD to reduce both the degrees of freedom and number of tests. Li et al. (2007) also proposed a variable-sized sliding-window approach in which the maximum size of a sliding window is determined by local haplotype diversity and a regularized regression analysis is used to tackle the problem of multiple degrees of freedom in the haplotype test. However, both Browning’s and Li et al.’s methods require phased data as input. Even though haplotype phasing programs are now available, it is still very time-consuming to phase a large number of markers.

In this study, we propose a novel method for multiple-marker association analysis of genotype data. Based on the variable-sized sliding-window frame, we decide the optimal window size from the local LD pattern via Principal Component Analysis (PCA). Then we use a score test based on a logistic model to test association within a window. Simulation studies are used to compare the power of the proposed approach with that of the single-marker association test and the haplotype clustering method based on variable-length Markov chains by Browning and Browning (2007). Our simulation studies demonstrate that the proposed method provides better performance than the single-marker association test and Browning’s method in most of the scenarios. Our method is much faster computationally than Browning’s and Li et al.’s methods because our method is based on genotypes and thus does not need to estimate haplotypes.

## Methods

### Optimum Window-Size Searching Procedure

Consider a case-control sample with a total of *M* individuals and assume each individual has been genotyped at *N* SNPs. Let $G_i = (g_{i1}, g_{i2}, \ldots, g_{iN})^{T}$ ($i = 1, 2, \ldots, M$) denote the multi-marker genotype of the *i*th individual, where $g_{ij}$ denotes the genotype of the *i*th individual at the *j*th SNP and is coded as 0, 1, or 2 (the number of minor alleles). Let $y_i$ denote the trait value of individual *i* (1 for cases and 0 for controls).

In the sliding-window frame, a window, denoted as ${w}_{l}^{b}$, is a set of neighboring SNPs {*b*, *b* + 1, *b* + 2, …, *b* + *l* − 1}. A variable-sized sliding window which begins with SNP *b*, denoted as $\Omega^{b}$, is a collection of windows ${w}_{l}^{b}$ with *l* ranging from *s* to $\Gamma^{b}$, where *s* and $\Gamma^{b}$ are the predefined smallest and largest window sizes, respectively (in our simulation studies, we use *s* = 4 and $\Gamma^{b}$ = 35).

In this study, we apply PCA to define the optimum window size. The optimal window size for windows beginning with SNP *b* is defined as the maximum window size among windows ${w}_{l}^{b}$ such that a proportion *c*_{0} of the total information can be explained by the first *k* Principal Components (PCs), where *c*_{0} and *k* are predefined. In our searching procedure, we start with a window ${w}_{l}^{b}$ with *l* = *s* = *k* + 1, so that the window length is always longer than *k*, the number of important PCs.

To carry out the PCA, we let ${\Sigma}_{g}^{b}=\sum_{i=1}^{M}({G}_{i}^{b}-\bar{G}^{b}){({G}_{i}^{b}-\bar{G}^{b})}^{T}$, an *l* × *l* matrix, denote the sample variance-covariance matrix of genotypic numerical codes (up to the constant factor 1/*M*, which does not affect the proportions below), where ${G}_{i}^{b}={({g}_{i,b},{g}_{i,b+1},\cdots ,{g}_{i,b+l-1})}^{T}$ and $\bar{G}^{b}=\frac{1}{M}\sum_{i=1}^{M}{G}_{i}^{b}$. Let ${e}_{j}^{b}$ be the eigenvector corresponding to the *j*th largest eigenvalue ${\lambda}_{j}^{b}$ of the sample variance-covariance matrix ${\Sigma}_{g}^{b}$. Thus in window ${w}_{l}^{b}$, the proportion of the total variance in the original data set explained by the *j*th PC is ${\lambda}_{j}^{b}/({\lambda}_{1}^{b}+{\lambda}_{2}^{b}+\cdots +{\lambda}_{l}^{b})$. Let $C=({\lambda}_{1}^{b}+\cdots +{\lambda}_{k}^{b})/({\lambda}_{1}^{b}+{\lambda}_{2}^{b}+\cdots +{\lambda}_{l}^{b})$ be the proportion of the total variability explained by the first *k* PCs. The following three steps show a natural way to search for the optimum window size.

- Step 1: Among the set of windows $\Omega^{b}$, conduct PCA on the genotypes within the window ${w}_{s}^{b}$, the window that begins at SNP *b* and has the shortest window size *s* = *l* = *k* + 1.
- Step 2: Calculate *C*, the proportion of the total variability explained by the first *k* PCs for window ${w}_{l}^{b}$. If *C* > *c*_{0}, we let *l* = *l* + 1, which enlarges the window by including one more SNP, and we continue to step 3. Otherwise, we say that *l* is the best window size for the windows that begin at SNP *b*.
- Step 3: Repeat step 2.
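The three steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 0-based SNP index `start`, the cap at the last available SNP, and the treatment of a window with no variation are all choices made here. The loop follows the steps literally and returns the size at which the search stops; a reader who prefers the largest window that still satisfies *C* > *c*_{0} could subtract one when the stop was caused by the threshold.

```python
import numpy as np


def explained_proportion(window, k):
    """Return C: share of total variance carried by the k largest eigenvalues."""
    centered = window - window.mean(axis=0)
    # l x l scatter matrix; proportional to the sample covariance matrix.
    eigvals = np.linalg.eigvalsh(centered.T @ centered)[::-1]  # descending
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    total = eigvals.sum()
    # Assumption: a window with no variation is treated as fully explained.
    return 1.0 if total == 0.0 else float(eigvals[:k].sum() / total)


def linear_window_search(genotypes, start, k=3, c0=0.95, max_size=35):
    """Grow the window at SNP index `start` one SNP at a time (steps 1 to 3)."""
    G = np.asarray(genotypes, dtype=float)  # M individuals x N SNPs, coded 0/1/2
    largest = min(max_size, G.shape[1] - start)  # cannot run past the last SNP
    size = k + 1  # shortest window, s = k + 1
    if largest < size:
        raise ValueError("not enough SNPs for the shortest window")
    # Keep enlarging while the first k PCs still explain more than c0.
    while size < largest and explained_proportion(G[:, start:start + size], k) > c0:
        size += 1
    return size
```

Each call to `explained_proportion` requires a fresh eigendecomposition, so a starting SNP in a long LD block may need close to `max_size - k` decompositions.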

### Adapt Bisection Method to Modify the Optimum Window-Size Searching Procedure

The optimum window-size searching procedure described above is computationally demanding for genome-wide analysis, especially when the window size gets larger, because PCA must be repeated every time the window is enlarged by one SNP. It is therefore necessary to relieve the computational burden of our sliding-window method. In mathematics, the bisection method is a root-finding algorithm that works by repeatedly dividing an interval in half and then selecting the subinterval in which the root lies.

By adapting the bisection method, we modified our optimum window-size searching procedure as follows:

- Step 1: Let *l* = [(*s* + $\Gamma^{b}$)/2], where *s* and $\Gamma^{b}$ are the predefined smallest and largest window sizes among a set of windows $\Omega^{b}$, and [*a*] is the largest integer that is less than or equal to *a*.
- Step 2: Among $\Omega^{b}$, we first conduct PCA within the window ${w}_{l}^{b}$, the window that begins at SNP *b* and has window size *l*.
- Step 3: Calculate *C* for this window ${w}_{l}^{b}$. If *C* > *c*_{0}, we let *s* = *l*, which enlarges the window size by including more SNPs, and we continue to step 4. Otherwise, we let $\Gamma^{b}$ = *l*, which shortens the window size by excluding more SNPs but does not change the start position of this window.
- Step 4: Repeat step 1 to step 3 until $\Gamma^{b}$ − *s* ≤ 1.

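By employing the bisection algorithm in the optimal window-size searching process, the number of windows on which PCA must be performed for each starting SNP drops from as many as $\Gamma^{b}$ − *s* + 1 to about log_{2}($\Gamma^{b}$ − *s*) (about 5 instead of up to 32 for *s* = 4 and $\Gamma^{b}$ = 35), so the computational burden is considerably relieved.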
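A minimal sketch of the bisection variant follows, again in Python with NumPy and again with hypothetical function names. Two assumptions are worth stating plainly. First, bisection locates the boundary only if *C* does not rise back above *c*_{0} once it has fallen below it as the window grows; the text does not guarantee this, so the result may differ from the one-SNP-at-a-time search. Second, the steps do not say which end of the final bracket is reported, so the sketch returns both ends together with the number of PCA runs.

```python
import numpy as np


def explained_proportion(window, k):
    """Return C: share of total variance carried by the k largest eigenvalues."""
    centered = window - window.mean(axis=0)
    eigvals = np.linalg.eigvalsh(centered.T @ centered)[::-1]  # descending
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    total = eigvals.sum()
    # Assumption: a window with no variation is treated as fully explained.
    return 1.0 if total == 0.0 else float(eigvals[:k].sum() / total)


def bisection_window_search(genotypes, start, k=3, c0=0.95, max_size=35):
    """Bracket the window size at SNP index `start` by bisection (steps 1 to 4).

    Returns (lo, hi, n_pca): the final bracket with hi - lo <= 1 and the
    number of PCA evaluations that were needed.
    """
    G = np.asarray(genotypes, dtype=float)  # M individuals x N SNPs, coded 0/1/2
    lo = k + 1                               # s, the smallest window size
    hi = min(max_size, G.shape[1] - start)   # Gamma, capped at the last SNP
    if hi < lo:
        raise ValueError("not enough SNPs for the shortest window")
    n_pca = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2                 # [(s + Gamma) / 2]
        n_pca += 1
        if explained_proportion(G[:, start:start + mid], k) > c0:
            lo = mid                         # criterion met: try larger windows
        else:
            hi = mid                         # criterion failed: try smaller windows
    return lo, hi, n_pca
```

Note that the end points *s* and Γ themselves are never evaluated by the loop, which is how the steps are written; whether to test them separately is left to the reader.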

### Score Test

After we find the optimum window size for a window, we can apply any appropriate test statistic to test for association in this window. In our study, we use the score test statistic based on a logistic model to test for association. Consider ${w}_{l}^{b}$, a window beginning at SNP *b* with the optimum window size *l*. Let $G_i$, ${x}_{i}={({x}_{i1}^{\ast},{x}_{i2}^{\ast},\cdots ,{x}_{ik}^{\ast})}^{T}$, and $y_i$ denote the genotype, the first *k* PCs of the genotype, and the trait value (1 for cases and 0 for controls) of the *i*th individual, where *i* = 1, 2, …, *M*. Let $p_i$ denote the probability of disease given genotype $G_i$. Suppose that the trait is related to the *k* PCs through the logistic model $\mathrm{log}\frac{{p}_{i}}{1-{p}_{i}}={\beta}_{0}+{\beta}^{T}{x}_{i}$, where $\beta = (\beta_1, \ldots, \beta_k)^{T}$; then the score test statistic is given by (Clayton et al. 2004)

$$T^{2} = U^{T} V^{-1} U,$$

where $U={\sum}_{i=1}^{M}({y}_{i}-\bar{y})({x}_{i}-\bar{x})$, $V=\mathrm{var}\left(y\right)\left[{\sum}_{i=1}^{M}({x}_{i}-\bar{x}){({x}_{i}-\bar{x})}^{T}\right]$, $\mathrm{var}\left(y\right)=\frac{1}{M}{\sum}_{i=1}^{M}{({y}_{i}-\bar{y})}^{2}$, and *M* is the sample size. The statistic *T*^{2} asymptotically follows the χ^{2} distribution with *k* degrees of freedom.
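To make the statistic concrete, the sketch below computes *T*^{2} = *U*^{T}*V*^{−1}*U* from the definitions of *U*, *V*, and var(*y*) given above, using the first *k* PC scores of the genotypes in one window. It is a minimal illustration in Python with NumPy; the function name is hypothetical, and the code assumes the window has at least *k* non-zero eigenvalues so that *V* is invertible.

```python
import numpy as np


def pc_score_statistic(genotypes, y, k=3):
    """Score statistic T^2 = U^T V^{-1} U on the first k PCs of one window."""
    G = np.asarray(genotypes, dtype=float)   # M individuals x l SNPs, coded 0/1/2
    y = np.asarray(y, dtype=float)           # 1 for cases, 0 for controls
    centered = G - G.mean(axis=0)
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    top = eigvecs[:, ::-1][:, :k]            # l x k, leading eigenvectors
    x = centered @ top                       # M x k PC scores, already mean zero
    yc = y - y.mean()
    U = x.T @ yc                             # sum_i (y_i - ybar)(x_i - xbar)
    V = np.mean(yc ** 2) * (x.T @ x)         # var(y) * sum_i (x_i - xbar)(x_i - xbar)^T
    # Assumption: V is nonsingular (at least k non-zero eigenvalues in the window).
    return float(U @ np.linalg.solve(V, U))
```

The returned value would then be referred to a χ^{2} distribution with *k* degrees of freedom, for example with `scipy.stats.chi2.sf(t2, k)` if SciPy is available. With these definitions, *T*^{2} equals *M* times the *R*^{2} from a least-squares regression of the centered trait on the PC scores, which gives a convenient independent check.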

In this score test, we use PCA to reduce the degrees of freedom from *l* to *k*. In our experience, the degrees of freedom can be reduced greatly while the first *k* PCs still explain more than 90% of the total variability. Since the proposed method greatly reduces the number of degrees of freedom while keeping the majority of the information, the power of the test can be increased.

### Comparison of Methods and Adjustment for Multiple Testing

We compare the power of the proposed method (TPCSW) with the single-marker association test (TSingle) and the variable-length Markov chain test (Tbeagle) proposed by Browning and Browning (2007). We propose to use permutation tests to adjust for multiple testing. For Tbeagle, we use its built-in multiple-testing correction via permutation. For TSingle and TPCSW, the permutation procedure is as follows. Suppose that there are *L* SNPs (windows). Let $p_i$ denote the p-value of the test in the *i*th SNP (window) (*i* = 1, …, *L*). For each permutation, we randomly shuffle the case and control status and recalculate the test statistics and p-values based on the permuted data. Let $p_{ij}$ denote the p-value of the test in the *i*th SNP (window) and the *j*th permutation, and let ${p}_{\mathrm{min}}^{j}=\mathrm{min}\left\{{p}_{1j},\dots ,{p}_{Lj}\right\}$. Suppose that we perform *J* permutations. Then, the adjusted p-value of the test in the *i*th SNP (window) is given by ${P}_{i}^{o}=\frac{\#\left\{j:{p}_{\mathrm{min}}^{j}<{p}_{i}\right\}}{J}$. In this study, we use 1000 permutations to evaluate the adjusted p-values.
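The permutation adjustment described above can be sketched as follows. The code is a generic illustration in plain Python: `pvalue_fn` is a hypothetical callback, supplied by the user, that returns one p-value per SNP (or window) for a given assignment of case-control labels; nothing about its internals is taken from the paper.

```python
import random


def minp_adjusted_pvalues(pvalue_fn, data, labels, n_perm=1000, seed=0):
    """Adjusted p-values P_i = #{j : p_min^j < p_i} / J from label permutations.

    pvalue_fn(data, labels) is a user-supplied (hypothetical) function that
    returns a list with one p-value per SNP or window.
    """
    rng = random.Random(seed)
    observed = pvalue_fn(data, labels)       # p_1, ..., p_L on the real labels
    shuffled = list(labels)
    min_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                # permute case-control status
        min_null.append(min(pvalue_fn(data, shuffled)))  # p_min^j
    # Strict inequality, as in the formula for the adjusted p-value.
    return [sum(m < p for m in min_null) / n_perm for p in observed]
```

Because the minimum is taken over all *L* tests within each permutation, the correlation between neighboring SNPs or overlapping windows is carried into the null distribution automatically.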

## Simulation Setup

To evaluate the performance of the proposed method, we conduct simulation studies under a variety of scenarios. We generate haplotypes using the *ms* program by Hudson (2002). In the *ms* program, we use a mutation rate of 2.5×10^{−8} per nucleotide per generation, a recombination rate of 10^{−8} per pair of nucleotides per generation, and an effective population size of 10,000. These choices were also adopted in Nordborg and Tavare (2002), Kimmel and Shamir (2006), and Feng et al. (2007). Using the *ms* program, we first generate a haplotype pool with 10,000 haplotypes (1,000 SNPs); a genotype is then generated by randomly choosing two haplotypes from the pool.
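As a rough sketch of the last step, genotypes can be formed from a haplotype pool as shown below. The assumptions here are illustrative and not taken from the paper: the pool is held as a 0/1 matrix with one row per haplotype, the two haplotypes of an individual are drawn with replacement, and the function name is hypothetical. The resulting code counts copies of the allele coded 1, which is the minor allele only if the pool has been recoded that way.

```python
import numpy as np


def sample_genotypes(haplotype_pool, n_individuals, rng):
    """Form 0/1/2 genotypes by pairing two randomly chosen haplotypes."""
    pool = np.asarray(haplotype_pool)        # haplotypes x SNPs, entries 0 or 1
    # Assumption: the two haplotypes are drawn independently, with replacement.
    pairs = rng.integers(0, pool.shape[0], size=(n_individuals, 2))
    return pool[pairs[:, 0]] + pool[pairs[:, 1]]
```

A call such as `sample_genotypes(pool, 600, np.random.default_rng(1))` would give one genotype matrix for a sample of 600 individuals.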

When generating data to evaluate the type I error, the genotype of each individual is composed of two haplotypes randomly chosen from the haplotype pool. We randomly assign each individual as a case or a control, independent of the genotypes. There are four sample sizes: 600, 800, 1000, and 1200 (half cases and half controls), and the proportion of the total variability explained by the first *k* (*k* = 3) PCs is *c*_{0} = 95%. For each scenario, we generate 1,000 replicated samples to evaluate the type I error rate.

For power comparison, we consider two sets of disease models. In the first set, we consider four three-locus disease models, denoted as model *L*_{1} to model *L*_{4}, which are similar to those used by Millstein et al. (2006) in their simulation studies. We randomly choose three SNPs with minor allele frequencies between 0.1 and 0.33 as the three disease loci (the genotypes of the disease loci are kept in the data set for analysis). A logistic model is used to relate genotypes at the disease loci to the trait. Let *p* = Pr(affected | genotype) and let *x*_{1}, *x*_{2}, and *x*_{3} be the numerical codes of the genotypes at the three disease loci. The relationship between *p* and *x*_{1}, *x*_{2}, *x*_{3} is given by the logistic model $\mathrm{log}\frac{p}{1-p}={\beta}_{0}+{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2}+{\beta}_{3}{x}_{3}+{\beta}_{123}{x}_{1}{x}_{2}{x}_{3}$.

Assume that the overall population prevalence is 10%. Then the value of *β*_{0} can be determined by the values of the other parameters. The four different models are determined by different values of the parameters. The values of the parameters are given in Table 1. In models *L*_{1} to *L*_{4}, $x_k$ = 0, 1, or 2 corresponds to genotypes $a_k a_k$, $A_k a_k$, or $A_k A_k$ at the *k*th disease locus (*k* = 1, 2, 3), an additive coding of the genotypes.

For the second set of disease models, we consider a single-locus disease model. Let *p* be defined as above and let *x* be the additive code of the genotype at the disease locus. The relationship between *p* and *x* is given by the logistic model $\mathrm{log}\frac{p}{1-p}={\beta}_{0}+{\beta}_{1}x$. Assume that the overall population prevalence is 10%. Then the value of *β*_{0} can be determined by the values of the other parameters. We consider four different disease models (denoted by *L*_{5} to *L*_{8}) based on the above logistic model, corresponding to different values of *β*_{1} and different intervals of the minor allele frequency (MAF) of the disease locus. The values of *β*_{1} and the intervals of the MAF of the disease locus for the four disease models are given in Table 2. When the interval of the MAF at the disease locus is given, we randomly choose a SNP with MAF in the interval as the disease locus (the genotypes of the disease locus are kept in the data set for analysis).
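The statement that *β*_{0} is determined by the other parameters can be illustrated for the single-locus model. The sketch below finds, by bisection, the *β*_{0} for which the genotype-averaged disease probability equals the target prevalence. It rests on assumptions that are not spelled out in the text: genotype frequencies follow Hardy-Weinberg proportions for the given allele frequency (plausible when genotypes are formed by pairing haplotypes at random), and *x* counts copies of the allele whose frequency is `maf`. The function name and search interval are illustrative.

```python
import math


def solve_beta0(beta1, maf, prevalence=0.10, tol=1e-10):
    """Find beta0 so that the population prevalence matches the target."""
    q = maf
    # Assumption: Hardy-Weinberg genotype frequencies for x = 0, 1, 2.
    freqs = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

    def population_prevalence(beta0):
        # Average of the logistic disease probability over genotypes.
        return sum(f / (1.0 + math.exp(-(beta0 + beta1 * x)))
                   for x, f in enumerate(freqs))

    lo, hi = -30.0, 30.0                      # wide bracket for beta0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Prevalence increases with beta0, so move the matching end point.
        if population_prevalence(mid) < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With `beta1 = 0` the answer reduces to the logit of the prevalence, log(0.1/0.9), which is a quick sanity check. The three-locus models would need the same idea applied to the joint genotype distribution of the three disease loci.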

## Results

Throughout the simulation studies, we set *c*_{0}, the proportion of the total variability explained by the first *k* PCs (*k* = 3), to 95% for TPCSW. The type I error rates are given in Table 3. With 1,000 replicated samples, the standard deviations of the estimated type I error rate are $\sqrt{0.05\times 0.95/1000}\approx 0.007$ and $\sqrt{0.01\times 0.99/1000}\approx 0.0031$ for the nominal levels of 0.05 and 0.01, respectively. The corresponding 95% confidence intervals are (0.036, 0.064) and (0.004, 0.016). The results in Table 3 show that the estimated type I error rates of all three methods fall within the 95% confidence intervals, which indicates that they are not significantly different from the nominal levels.
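The intervals quoted above follow from the binomial standard deviation and a normal approximation. A small sketch of the arithmetic, with a hypothetical helper name and the usual 1.96 multiplier assumed:

```python
import math


def type1_interval(alpha, n_rep=1000, z=1.96):
    """Binomial standard deviation and normal-approximation 95% interval."""
    sd = math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return sd, (alpha - z * sd, alpha + z * sd)
```

Calling `type1_interval(0.05)` and `type1_interval(0.01)` gives the two intervals used in the text after rounding to three decimals.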

The power comparison results of the three methods under the four three-locus disease models are shown in Figure 1. From Figure 1, we can see that our method consistently outperforms the other two methods in terms of detection power at various sample sizes under the four disease models. The power of the other two methods, Tbeagle and TSingle, is very similar.

The power comparison results of the three methods under the four single-locus disease models are shown in Figure 2. When MAF is low, i.e. [0.045, 0.05], Tbeagle is the most powerful one and the proposed method and TSingle have similar power. When MAF is high, i.e. [0.29, 0.30], all three methods have similar power. When MAF is in the middle, i.e. between 0.05 and 0.3, the proposed method and Tbeagle have similar power and both methods are more powerful than TSingle. Overall, we can conclude that, except for the case of low MAF, our method is always one of the most powerful methods.

## Discussion

In this article, we have proposed a genotype-based method that uses PCA to find optimal window sizes of variable-sized sliding windows to detect disease associations. We use intensive simulation studies to evaluate the performance of the proposed method. The simulation results show that in most cases our method outperforms the commonly used single-marker association test and Browning’s variable-length Markov chain method. Our method works directly on phase-unknown genotype data, whereas Browning’s variable-length Markov chain method requires phase-known data, so intensive computation to phase the data is first required. Our method significantly outperforms the other two methods in multi-locus disease models.

There is a common problem for variable-size sliding-window approaches because they are usually based on haplotype data which demands a computationally intensive method to phase the genotype data first. Our method tackled the common disadvantage for variable-size sliding-window approaches by finding the optimal window size using genotype-based method. Therefore our method has the potential to be applied to genome-wide association studies.

To improve our methodology, one thing still needs further consideration: how to choose the values of the parameters *k* and *c*_{0}, the number of PCs used and the proportion of the total variability explained by the first *k* PCs. In this study we set *c*_{0} = 95% and *k* = 3. Through intensive simulation studies, we conclude that *c*_{0} = 95% and *k* = 3 are good choices in most cases. However, it is hard to find optimal values for the two parameters; choosing them is therefore one of the future steps in developing this method.

In our simulation studies, the disease-related SNPs are not removed from the genotype data before analysis. In this case, it seems that the single-marker analysis should give the best results. However, even if the disease-related SNPs are kept in the genotype data, several studies have shown that multiple-marker methods may be more powerful than the single-marker analysis (Zhao et al. 2000; Zhang et al. 2003). The following example may partially explain why multiple-marker methods can be more powerful. Consider a case-control study with 1000 cases and 1000 controls. Suppose that the frequencies of the disease allele in cases and controls are 0.2 and 0.15, respectively. Then, the p-value of the allelic chi-square test is 3.2×10^{−3}. Consider five markers around the disease locus (including the disease locus) and assume that a mutation occurred at haplotype 11111 many years ago. The frequency of haplotype 11111 in cases should be higher than that in controls. Suppose the frequencies of haplotype 11111 in cases and controls are 0.05 and 0.00, respectively. Then, the p-value of the allelic chi-square test for association of haplotype 11111 (haplotype 11111 as one allele and all other haplotypes as another allele) is 8.0×10^{−13}, and the p-value after adjustment for multiple testing (at most 32 haplotypes) is 2.56×10^{−11}. This example shows that disease-marker association may not be detectable as first-order association between a single marker and the disease locus but may be detected by extended marker haplotypes.
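The p-values in this example can be checked with a few lines of arithmetic. The sketch below computes the allelic chi-square test for a 2×2 table of allele counts, using the fact that the upper tail of a χ^{2} distribution with 1 degree of freedom at *x* equals erfc(√(*x*/2)). One assumption should be noted: the quoted values 3.2×10^{−3} and 8.0×10^{−13} are approximately reproduced when each group contributes 1000 alleles (chromosomes) to the table. That count is inferred here from the numbers and is not stated in the text; with 2000 alleles per group the p-values would be considerably smaller. The function name and arguments are illustrative.

```python
import math


def allelic_chi2_pvalue(case_freq, control_freq, n_case_alleles, n_control_alleles):
    """P-value of the 1-df chi-square test on a 2 x 2 table of allele counts."""
    a = case_freq * n_case_alleles            # target allele in cases
    b = n_case_alleles - a                    # other alleles in cases
    c = control_freq * n_control_alleles      # target allele in controls
    d = n_control_alleles - c                 # other alleles in controls
    n = n_case_alleles + n_control_alleles
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of chi-square with 1 df: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Assumption: 1000 alleles per group, inferred from the quoted p-values.
p_single = allelic_chi2_pvalue(0.20, 0.15, 1000, 1000)
p_haplotype = allelic_chi2_pvalue(0.05, 0.00, 1000, 1000)
p_haplotype_adjusted = 32 * p_haplotype       # Bonferroni over at most 2^5 haplotypes
```

The adjustment by 32 corresponds to the at most 2^{5} haplotypes that five biallelic markers can form.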

In summary, the proposed method is simpler and faster than the recently developed variable-length Markov chain method of Browning, and it is more powerful in most of the scenarios we considered. Its computational efficiency and power relative to its peers make our method an attractive choice for detecting disease-associated SNP(s) in genome-wide association studies.

## Acknowledgments

This work was supported by NIH grant R01 GM069940 and the Overseas-Returned Scholars Foundation of Department of Education of Heilongjiang Province (1152HZ01).
