# Genome Association Studies of Complex Diseases by Case-Control Designs

^{1}Department of Statistics, Texas A&M University, College Station, TX; and

^{2}Institute of Medical Biometry, Informatics and Epidemiology, University of Bonn, Bonn

## Abstract

One way to perform linkage-disequilibrium (LD) mapping of genetic traits is to use single markers. Since dense marker maps—such as single-nucleotide polymorphism and high-resolution microsatellite maps—are available, it is natural and practical to generalize single-marker LD mapping to high-resolution haplotype or multiple-marker LD mapping. This article investigates high-resolution LD-mapping methods, for complex diseases, based on haplotype maps or microsatellite marker maps. The objective is to explore test statistics that combine information from haplotype blocks or multiple markers. Based on two coding methods, genotype coding and haplotype coding, Hotelling’s *T*^{2} statistics *T*_{G} and *T*_{H} are proposed to test the association between a disease locus and two haplotype blocks or two markers. The validity of the two *T*^{2} statistics is proved by theoretical calculations. A statistic *T*_{C}, an extension of the traditional χ^{2} method of comparing haplotype frequencies, is introduced by simply adding the χ^{2} test statistics of the two haplotype blocks together. The merit of the three methods is explored by calculation and comparison of power and of type I errors. In the presence of LD between the two blocks, the type I error of *T*_{C} is higher than that of *T*_{H} and *T*_{G}, since *T*_{C} ignores the correlation between the two blocks. For each of the three statistics, the power of using two haplotype blocks is higher than that of using only one haplotype block. By power comparison, we notice that *T*_{C} has higher power than that of *T*_{H}, and *T*_{H} has higher power than that of *T*_{G}. In the absence of LD between the two blocks, the power of *T*_{C} is similar to that of *T*_{H} and higher than that of *T*_{G}. Hence, we advocate use of *T*_{H} in the data analysis. 
In the presence of LD between the two blocks, *T*_{H} takes into account the correlation between the two haplotype blocks and has a lower type I error and higher power than *T*_{G}. Besides, the feasibility of the methods is shown by sample-size calculation.

## Introduction

With the development of the Human Genome Project and of high-resolution microsatellite and early chromosomewide haplotype maps of the human genome, enormous amounts of genetic data on human chromosomes are becoming available. The opportunities for genomewide scans to map complex-disease genes are tremendous. However, it is not yet clear how to extract the most useful information for mapping complex-disease genes. To fully utilize the massive amount of genetic data for mapping complex-disease genes, novel mathematical and statistical methods are crucial. One urgent need is to explore statistical approaches of high-resolution haplotype or multiple-marker linkage disequilibrium (LD) mapping of complex diseases. One way to perform LD mapping of genetic traits is to use single markers. Since dense marker maps—such as single-nucleotide polymorphism (SNP) and high-resolution microsatellite maps—are available (Broman et al. ^{1998}; The International SNP Map Work Group ^{2001}; Kong et al. ^{2002}), it is natural and practical to generalize single-marker LD mapping to high-resolution haplotype or multiple-marker LD mapping. With the recent discovery of haplotype block structures in the human genome and with the development of early chromosomewide haplotype maps, it is important to develop better statistical methods for analysis of data on SNPs, haplotype patterns, and related patterns of LD. The chromosomewide haplotype maps are expected to be key resources for mapping complex-disease genes. For example, a systematic case-control analysis of common haplotype variants in the human genome would reveal major causative genetic contributions to a disease.

For a case-control study, one can use a χ^{2} statistic to test the null hypothesis that the marker allele or haplotype frequencies are equal in the cases and controls on the basis of a multiple-allele marker (Olson and Wijsman ^{1994}; Chapman and Wijsman ^{1998}; Nielsen et al. ^{1998}; Kaplan and Morris ^{2001}). The method, however, cannot be directly used for multiple markers or haplotype blocks, since the phase of a double heterozygote may be unknown (Ott ^{1999}, p. 7). For multiple biallelic markers, such as SNPs, Xiong et al. (^{2002}) proposed a Hotelling’s *T*^{2} statistic for LD mapping of qualitative traits for case-control studies, which cannot be used for haplotype block data with multiple haplotypes. Hence, it is necessary to develop methods of genomic LD mapping of qualitative trait loci based on haplotype block data or multiple-marker data for case-control studies.

This paper investigates methods of high-resolution LD mapping for complex diseases based on haplotype maps or microsatellite marker maps. The objective is to explore test statistics that combine information from haplotype blocks or multiple markers. For uniformity of notation, we use “haplotype blocks” or simply “blocks” in our analysis, which can be changed to “multiallelic markers.” We are interested in developing statistical methods for efficient use of genomic patterns of LD to identify genetic variants that contribute to qualitative complex diseases on the basis of multiallelic markers or haplotype block data in a case-control study. Based on two coding methods, genotype coding and haplotype coding, we propose Hotelling’s *T*^{2} statistics to test the association between a disease locus and two haplotype blocks. The statistical property of the above *T*^{2} statistics will be investigated. An extension of traditional χ^{2} method of comparing haplotype frequencies is proposed by simply adding two χ^{2} test statistics of the two haplotype blocks together. The merit of the three methods will be explored by calculation and comparison of power and type I errors. Also, the feasibility of the methods is shown by calculation of sample sizes.

## Methods

### Test Statistics

Suppose that a disease locus *D* is flanked by two haplotype blocks *H*_{1} and *H*_{2}, where *H*_{1} is a haplotype block on the left-hand side of *D* and *H*_{2} is a haplotype block on the right-hand side. Let us denote the haplotypes of block *H*_{1} by *H*_{11},…,*H*_{1l} and the haplotypes of block *H*_{2} by *H*_{21},…,*H*_{2r}, where *l* and *r* denote the number of observed haplotypes of blocks *H*_{1} and *H*_{2}, respectively. Consider a case-control design with *N* cases from an affected population and *M* controls from an unaffected population. Let us define a coding vector for each case or control in one of the following two ways (Schaid ^{1996}, p. 430).

#### Genotype coding

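For the *i*th case, let *Hap*_{1i} be his/her two haplotypes at block *H*_{1}, and let *Hap*_{2i} be his/her two haplotypes at block *H*_{2}. Depending on the haplotypes *Hap*_{1i} (or *Hap*_{2i}), let us define an indicator vector *X*_{1i} (or *X*_{2i}) that contains at most one component with value 1 and all other components with value 0. That is, *X*_{1i}=[*x*_{1i1},…,*x*_{1i(l-1)},*x*_{1i12},…,*x*_{1i1l},…,*x*_{1i(l-1)l}]^{τ}, *X*_{2i}=[*x*_{2i1},…,*x*_{2i(r-1)},*x*_{2i12},…,*x*_{2i1r},…,*x*_{2i(r-1)r}]^{τ}, and

$$X_i=\left[X_{1i}^{\tau},\,X_{2i}^{\tau}\right]^{\tau},$$

where the indicator variables *x*_{1ij}, *x*_{1ijk},*j*<*k*, *x*_{2is}, and *x*_{2ist},*s*<*t* are defined by

$$x_{1ij}=\begin{cases}1 & \text{if } Hap_{1i}=H_{1j}H_{1j},\\ 0 & \text{otherwise,}\end{cases}\qquad x_{1ijk}=\begin{cases}1 & \text{if } Hap_{1i}=H_{1j}H_{1k},\\ 0 & \text{otherwise,}\end{cases}$$

$$x_{2is}=\begin{cases}1 & \text{if } Hap_{2i}=H_{2s}H_{2s},\\ 0 & \text{otherwise,}\end{cases}\qquad x_{2ist}=\begin{cases}1 & \text{if } Hap_{2i}=H_{2s}H_{2t},\\ 0 & \text{otherwise.}\end{cases}$$

The dimension of *X*_{1i} (or *X*_{2i}) is *l*(*l*+1)/2-1 (or *r*(*r*+1)/2-1), that is, the total number *l*(*l*+1)/2 (or *r*(*r*+1)/2) of genotypes of haplotype block *H*_{1} (or *H*_{2}) minus 1 to remove the redundancy.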

#### Haplotype coding

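Define *X*_{i}=[*z*_{1i1},…,*z*_{1i(l-1)},*z*_{2i1},…,*z*_{2i(r-1)}]^{τ}, where *z*_{uij} is the number of copies of haplotype *H*_{uj} carried by the *i*th case, i.e.,

$$z_{uij}=\begin{cases}2 & \text{if } Hap_{ui}=H_{uj}H_{uj},\\ 1 & \text{if } Hap_{ui}=H_{uj}H_{uk},\ k\neq j,\\ 0 & \text{otherwise.}\end{cases}$$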

For the *i*th control, one may define a vector *Y*_{i} in the same way. To illustrate the above coding methods, table 1 gives an example of genotype and haplotype codings for a block *H*_{1} with three haplotypes. The coding method for block *H*_{2} is similar.
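The two codings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: haplotypes are labelled by the integers 1..*l*, phase within a block is taken as known, and the function names `genotype_coding` and `haplotype_coding` are hypothetical helpers, not part of any published software. The last homozygote (*H*_{1l}*H*_{1l}) is treated as the dropped reference category, which matches the stated dimension *l*(*l*+1)/2-1.

```python
from itertools import combinations


def genotype_coding(hap_pair, n_haps):
    """Genotype coding for one block with haplotypes labelled 1..n_haps.

    Returns an indicator vector of length n_haps*(n_haps+1)/2 - 1, ordered as
    homozygotes (1,1)..(n_haps-1,n_haps-1), then heterozygotes (j,k) with j<k.
    The homozygote (n_haps, n_haps) is the dropped reference category.
    """
    pair = tuple(sorted(hap_pair))
    categories = [(j, j) for j in range(1, n_haps)]
    categories += list(combinations(range(1, n_haps + 1), 2))
    return [1 if pair == cat else 0 for cat in categories]


def haplotype_coding(hap_pair, n_haps):
    """Haplotype coding: copies of haplotypes 1..n_haps-1 carried by the individual."""
    return [sum(1 for h in hap_pair if h == j) for j in range(1, n_haps)]


# Block H_1 with three haplotypes (l = 3): both codings for all six genotypes.
for pair in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3), (2, 3)]:
    print(pair, genotype_coding(pair, 3), haplotype_coding(pair, 3))
```

For a full coding vector, the vectors for block *H*_{1} and block *H*_{2} would simply be concatenated.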

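Let $\bar{X}=\sum_{i=1}^{N}X_i/N$ and $\bar{Y}=\sum_{i=1}^{M}Y_i/M$ be the mean vectors. Define a pooled-sample variance-covariance matrix by

$$S=\frac{1}{N+M-2}\left[\sum_{i=1}^{N}(X_i-\bar{X})(X_i-\bar{X})^{\tau}+\sum_{i=1}^{M}(Y_i-\bar{Y})(Y_i-\bar{Y})^{\tau}\right].$$

A Hotelling’s *T*^{2} statistic can be defined as

$$T^{2}=\frac{NM}{N+M}\,(\bar{X}-\bar{Y})^{\tau}S^{-1}(\bar{X}-\bar{Y})$$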

(Hotelling ^{1931}; Anderson ^{1984}). Hereafter, we will denote the Hotelling’s *T*^{2} for haplotype coding as *T*_{H} and the Hotelling’s *T*^{2} for genotype coding as *T*_{G}. Assume that the sample sizes *N* and *M* are sufficiently large that the large-sample theory applies. Under the null hypothesis of no association, the statistic *T*_{H} (or *T*_{G}) is asymptotically distributed as central χ^{2} with *l*+*r*-2 (or [*l*(*l*+1)/2-1]+[*r*(*r*+1)/2-1]) df. Under the alternative hypothesis of association, *T*_{H} (or *T*_{G}) is asymptotically distributed as noncentral χ^{2}. If only one haplotype block *H*_{1} is used in the analysis, the Hotelling’s *T*^{2} for haplotype coding will be denoted as *T*_{H1}, and the Hotelling’s *T*^{2} for genotype coding will be denoted as *T*_{G1}. Under the null hypothesis of no association, the statistic *T*_{H1} (or *T*_{G1}) is asymptotically distributed as central χ^{2} with *l*-1 (or *l*(*l*+1)/2-1) df. Under the alternative hypothesis of association, *T*_{H1} (or *T*_{G1}) is asymptotically distributed as noncentral χ^{2}. Similarly, one may introduce test statistics *T*_{H2} (or *T*_{G2}) if only one haplotype block *H*_{2} is used in the analysis.
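As a rough illustration of how such a statistic might be computed, the sketch below assumes the usual two-sample form of Hotelling’s *T*^{2} with a pooled covariance matrix (Anderson 1984) and uses NumPy. The function name `hotelling_t2` is hypothetical; rows are individuals and columns are the components of the coding vector.

```python
import numpy as np


def hotelling_t2(cases, controls):
    """Two-sample Hotelling T^2 with a pooled covariance matrix.

    cases, controls: 2-D arrays (individuals x coding-vector components).
    """
    X = np.asarray(cases, dtype=float)
    Y = np.asarray(controls, dtype=float)
    n, m = X.shape[0], Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    pooled = (Xc.T @ Xc + Yc.T @ Yc) / (n + m - 2)
    # solve() avoids forming the explicit inverse; it raises LinAlgError if the
    # pooled matrix is singular (e.g. a haplotype never observed in the sample).
    return n * m / (n + m) * float(diff @ np.linalg.solve(pooled, diff))


# One-column toy example; here T^2 equals the squared pooled two-sample t statistic.
print(hotelling_t2([[0], [1], [2]], [[1], [2], [3]]))
```

Under the null hypothesis the value would be referred to a central χ^{2} distribution whose df equal the number of columns, for example *l*+*r*-2 under haplotype coding.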

If each haplotype block *H*_{u} has only two haplotypes *H*_{u1},*H*_{u2},*u*=1,2, then the Hotelling’s *T*^{2} by haplotype coding described above coincides with the test statistic introduced by Xiong et al. (^{2002}). To see this, notice that *z*_{ui1}=1+(*z*_{ui1}-1), where *z*_{ui1}-1 is equal to the indicator variable *X*_{ij} (defined in Xiong et al. [2002], p. 1257). Hence, our method generalizes the method of using two biallelic markers in Xiong et al. (^{2002}) to two haplotype blocks with multiple haplotypes.

In the definition above, we consider only two haplotype blocks *H*_{1} and *H*_{2}. In practice, the test statistics *T*_{H} and *T*_{G} can be easily generalized to multiple haplotype blocks. To make the notation as simple as possible, we will focus on two haplotype blocks throughout the present article. In appendices A, B, and C, we will justify the use of the Hotelling’s *T*^{2} as an appropriate statistic to test association between the disease locus and the haplotype blocks by either the genotype coding method or the haplotype coding method. The basic idea is to show that the expectation of the difference between the case and control mean vectors, $\bar{X}-\bar{Y}$, is equal to 0 if there is no association between the disease locus and the haplotype blocks. Then one may construct a test statistic based on the difference vector $\bar{X}-\bar{Y}$, which leads to the Hotelling’s *T*^{2}.

### Noncentrality Parameters

Let Σ_{A1}=*Cov*{[*z*_{1i1},…,*z*_{1i(l-1)}]|*Aff*} and Σ_{A2}=*Cov*{[*z*_{2i1},…, *z*_{2i(r-1)}]|*Aff*} be variance-covariance matrices of vectors [*z*_{1i1},…,*z*_{1i(l-1)}]^{τ} and [*z*_{2i1},…,*z*_{2i(r-1)}]^{τ}, respectively, in affected individuals. Similarly, let *z*_{1i(l-1)}]|*Unaff*} and *z*_{2i(r-1)}]|*Unaff*} be variance-covariance matrices of column vectors [*z*_{1i1},…,*z*_{1i(l-1)}]^{τ} and [*z*_{2i1},…,*z*_{2i(r-1)}]^{τ} in controls. Let (or ) be the column vector of measures of LD between haplotype block *H*_{1} (or *H*_{2}) and the disease locus *D.* Let α_{D} be the average effect of gene substitution and , and let *A* be the disease prevalence in population and (appendix A). Based on *E*(*z*_{uij}|*Aff*) and *E*(*z*_{uij}|*Unaff*), given in equation (A5) of appendix A, the noncentrality parameter λ_{Hu} of Hotelling’s test statistic *T*_{Hu} is given by

The elements of variance-covariance matrices Σ_{Au} and are calculated in appendix D. If the haplotype *H*_{u} has only two haplotypes—*H*_{u1},*H*_{u2} and *N*=*M*—then λ_{Hu} = 4*N*Δ^{2}_{u1}(α_{D}/*A* − + *Var*(*z*_{ui1}|*Unaff*)]^{-1}, where *Var*(*z*_{ui1}|*Aff*) and *Var*(*z*_{ui1}|*Unaff*) are given in equations (D1) and (D2) in appendix D.

Let Σ_{A}=*Cov*{[*z*_{1i1},…,*z*_{1i(l-1)},*z*_{2i1},…,*z*_{2i(r-1)}]|*Aff*} be a variance-covariance matrix of column vector [*z*_{1i1},…,*z*_{1i(l-1)},*z*_{2i1},…,*z*_{2i(r-1)}]^{τ} in affected individuals. Similarly, let *Unaff*} be a variance-covariance matrix of column vector [*z*_{1i1},…,*z*_{1i(l-1)},*z*_{2i1},…,*z*_{2i(r-1)}]^{τ} in controls. Let us denote

Then the noncentrality parameter λ_{H} of Hotelling’s test statistic *T*_{H} is given by

The elements of variance-covariance matrices Σ_{A} and are calculated in appendices D and E. The noncentrality parameter λ_{G} (or λ_{G1} or λ_{G2}) of *T*_{G} (or *T*_{G1} or *T*_{G2}) is given in appendix F.

For a case-control study using only one haplotype block *H*_{1}, one may use a χ^{2} statistic to test the null hypothesis that the haplotype frequencies are equal in the cases and controls (Olson and Wijsman ^{1994}; Chapman and Wijsman ^{1998}; Kaplan and Morris ^{2001}). Assume that *N* cases and *N* controls are sampled. Then the test statistic is given by , where is the frequency of haplotype *H*_{1j} in the cases, and is the frequency of haplotype *H*_{1j} in the controls. Using haplotype block *H*_{2}, one may construct a similar test statistic , where is the frequency of haplotype *H*_{2j} in the cases, and is the frequency of haplotype *H*_{2j} in the controls. Using both haplotype blocks *H*_{1} and *H*_{2}, one may construct a test statistic *T*_{C}=*T*_{C1}+*T*_{C2} by summing *T*_{C1} and *T*_{C2} together. If the two statistics *T*_{C1} and *T*_{C2} are independent, *T*_{C} is asymptotically distributed as central χ^{2}_{l+r-2}, with *l*+*r*-2 df under the null hypothesis of no association. Under the alternative hypothesis, it is asymptotically distributed as noncentral χ^{2}_{l+r-2}(λ_{C}), where λ_{C}=λ_{C1}+λ_{C2},

and +2*P*(*H*_{2j})]. To calculate λ_{C1}, one needs to notice that *P*(*H*_{1j}|*Aff*)=*E*(*z*_{1ij}|*Aff*)/2, and so the conditional expected frequencies *P*(*H*_{1j}|*Aff*)=Δ_{1j}α_{D}/*A*+*P*(*H*_{1j}) and (appendices A and C). However, the independence of *T*_{C1} and *T*_{C2} can be true only in the case that there is linkage equilibrium between the two blocks. Hence, *T*_{C} may not be a valid test statistic unless one has strong evidence that the two blocks are in linkage equilibrium.
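To make the sum statistic concrete, the following sketch assumes that each single-block statistic is the ordinary Pearson χ^{2} for a 2 × (number of haplotypes) table of haplotype counts in cases and controls. That form is an assumption made for illustration, and the helper names `chi2_haplotype_stat` and `t_c` are hypothetical.

```python
def chi2_haplotype_stat(case_counts, control_counts):
    """Pearson chi-square comparing haplotype counts in cases and controls.

    case_counts[j], control_counts[j]: number of chromosomes carrying haplotype j.
    """
    n_case = sum(case_counts)
    n_ctrl = sum(control_counts)
    total = n_case + n_ctrl
    stat = 0.0
    for a, b in zip(case_counts, control_counts):
        col = a + b
        if col == 0:          # haplotype not observed at all: contributes nothing
            continue
        for observed, row in ((a, n_case), (b, n_ctrl)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat


def t_c(case1, ctrl1, case2, ctrl2):
    """Sum of the two single-block statistics; ignores LD between the blocks."""
    return chi2_haplotype_stat(case1, ctrl1) + chi2_haplotype_stat(case2, ctrl2)


# Toy counts for 20 cases and 20 controls (40 chromosomes per group and block).
print(t_c([30, 10], [20, 20], [15, 15, 10], [15, 15, 10]))
```

Because the two terms are simply added, the reference χ^{2} distribution with *l*+*r*-2 df is appropriate only when the two block statistics are independent, which is the caveat stated above.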

## Results

### Type I Errors

To explore the performance of the test statistics, we calculate type I errors for statistics *T*_{C}, *T*_{H}, and *T*_{G} for the four scenarios in table 2. We simulate 10,000 samples under an assumption of penetrance probabilities (*f*_{DD},*f*_{Dd},*f*_{dd})=(0.05,0.05,0.05), which implies that the disease is not associated with the two haplotype blocks. Every sample contains 100 cases and 100 controls (*N*=*M*=100). For each sample, we calculate the empirical test statistics *T*_{C}, *T*_{H}, and *T*_{G}. The type I error is calculated by dividing the count of those empirical test statistics that are greater than or equal to the cut-off point at the significance level α=0.01 by 10,000. We repeat the above process a total of 100 times to get 100 type I errors for each of the test statistics *T*_{C}, *T*_{H}, and *T*_{G} for the four models in table 2. On the basis of the 100 type I errors of each statistic, we calculate their mean, standard deviation (SD), minimum, and maximum, which are presented in table 2. For model I in table 2, a strong LD between the two blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=2, is assumed (Δ_{H11H21}=0.20); in this case, the type I error of *T*_{C} (mean 0.026) is much greater than those of *T*_{H} (mean 0.011) and *T*_{G} (mean 0.012). In model II in table 2, we assume that block *H*_{1} has two haplotypes and block *H*_{2} has three haplotypes, and the measures of LD are Δ_{H11H21}=0.15 and Δ_{H11H22}=-0.075; in this case, the type I error of *T*_{C} (mean 0.017) is the highest, and *T*_{G} (mean 0.015) has higher type I error than *T*_{H} (mean 0.012). In models III and IV of table 2, the block *H*_{2} has four haplotypes; in model III, the measures of LD are Δ_{H11H21}=Δ_{H11H22}=0.075 and Δ_{H11H23}=Δ_{H11H24}=-0.075, and in model IV, the two blocks are in linkage equilibrium; in these two cases, the type I errors of *T*_{G} (mean 0.20) are the highest, which may be due to the large number of degrees of freedom of *T*_{G}. With LD (model III), *T*_{C} (mean 0.16) has a slightly higher type I error than *T*_{H} (mean 0.13); without LD (model IV), *T*_{H} (mean 0.013) has a slightly higher type I error than *T*_{C} (mean 0.010). Figure 1 shows the QQ plot for each statistic of *T*_{C}, *T*_{H}, and *T*_{G} for the four models in table 2. Each of the QQ plots in figure 1 is drawn by comparing 10,000 sample statistic values with 10,000 related χ^{2}-distribution values (*X*-axis). These QQ plots are consistent with the results of table 2. Moreover, it is evident that the type I error level of statistic *T*_{H} is reasonable for *N*=*M*=100.
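The Monte Carlo recipe described above (simulate under the null, count statistics at or above the cut-off, divide by the number of replicates) can be sketched as follows. This is a deliberately simplified stand-in, not the simulation behind table 2: it uses a single block with two haplotypes, a Pearson-type frequency statistic with 1 df, 2,000 replicates instead of 10,000, and hypothetical names (`freq_chi2`, `estimate_type1`). The cut-off 6.635 is the upper 1% point of the central χ^{2} distribution with 1 df.

```python
import random

CRIT_1DF_ALPHA_01 = 6.635  # upper 1% point of the central chi-square with 1 df


def freq_chi2(a, b, n):
    """n * sum_j (p_j - q_j)^2 / (p_j + q_j) for a block with two haplotypes.

    a, b: copies of the first haplotype among n case and n control chromosomes.
    """
    stat = 0.0
    for x, y in ((a, b), (n - a, n - b)):
        if x + y > 0:
            p, q = x / n, y / n
            stat += n * (p - q) ** 2 / (p + q)
    return stat


def estimate_type1(n_ind=100, hap_freq=0.5, reps=2000, seed=1):
    """Fraction of null replicates whose statistic reaches the 1% cut-off."""
    rng = random.Random(seed)
    n = 2 * n_ind  # chromosomes per group
    rejections = 0
    for _ in range(reps):
        # Under the null, cases and controls share the same haplotype frequency.
        a = sum(rng.random() < hap_freq for _ in range(n))
        b = sum(rng.random() < hap_freq for _ in range(n))
        if freq_chi2(a, b, n) >= CRIT_1DF_ALPHA_01:
            rejections += 1
    return rejections / reps


print(estimate_type1())
```

Repeating the call with different seeds gives a set of empirical type I errors whose mean, SD, minimum, and maximum can be summarized in the same way as in the text.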

### Power Calculation and Comparison

To calculate the noncentrality parameters, we assume a deterministic population genetic model. Assume that a single disease mutation was introduced into the population *T* generations ago, with a frequency *P*_{D}. First, we consider only one haplotype block *H*_{u}, *u*=1,2. At the initial generation of the occurrence of the mutation, the haplotype frequencies are *P*(*H*_{u1}*D*)(0)=*P*_{D} and *P*(*H*_{uj}*D*)(0)=0,*j*=2,…,*l*, if *u*=1, or *j*=2,…,*r*, if *u*=2. Moreover, *P*(*H*_{u1}*d*)(0)=*P*(*H*_{u1})-*P*_{D} and *P*(*H*_{uj}*d*)(0)=*P*(*H*_{uj}),*j*=2,…,*l*, if *u*=1, or *j*=2,…,*r*, if *u*=2. Let θ_{u} be the recombination fraction between haplotype block *H*_{u} and disease locus *D*,*u*=1,2. Given a map distance λ_{u} (expressed in morgans) between haplotype block *H*_{u} and disease locus *D*, the recombination fraction θ_{u} can be calculated by Haldane’s map function θ_{u}=[1-*exp*(-2λ_{u})]/2, under the assumption of no interference. At generation *T,* the haplotype frequencies can be approximately calculated by *P*(*H*_{uj}*D*)(*T*)=*P*(*H*_{uj}*D*)(0)*e*^{-Tθu}+*P*_{D}*P*(*H*_{uj})(1-*e*^{-Tθu}) and *P*(*H*_{uj}*d*)(*T*)=*P*(*H*_{uj}*d*)(0)*e*^{-Tθu}+*P*_{d}*P*(*H*_{uj})(1-*e*^{-Tθu}),*j*=1,…,*l*, if *u*=1, or *j*=1,…,*r*, if *u*=2. Second, we consider both haplotype blocks *H*_{1} and *H*_{2}. At the initial generation of the occurrence of mutation, the haplotype frequencies are *P*(*H*_{11}*DH*_{21})(0)=*P*_{D} and *P*(*H*_{1j}*DH*_{2s})(0)=0,*j*=1,…,*l*,*s*=1,…,*r*, and (*j*,*s*)≠(1,1). That is, the disease-susceptibility allele *D* was carried by haplotype *H*_{11}*H*_{21} at the initial generation of mutation. The other initial haplotype frequencies are *P*(*H*_{11}*dH*_{21})(0)=*P*(*H*_{11}*H*_{21})-*P*_{D} and *P*(*H*_{1j}*dH*_{2s})(0)=*P*(*H*_{1j}*H*_{2s}),*j*=1,…,*l*,*s*=1,…,*r* and (*j*,*s*)≠(1,1).
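The single-block part of this model is easy to evaluate numerically. The sketch below is a minimal illustration with hypothetical function names (`haldane`, `hap_allele_freq`); it assumes the map distance is supplied in morgans, so a distance given in cM is divided by 100 first.

```python
import math


def haldane(map_distance_morgans):
    """Haldane's map function: recombination fraction for a map distance in morgans."""
    return (1.0 - math.exp(-2.0 * map_distance_morgans)) / 2.0


def hap_allele_freq(p0, p_hap, p_allele, theta, generations):
    """Frequency at generation T of a haplotype carrying a given disease-locus allele.

    p0       : frequency at generation 0, e.g. P(H_uj D)(0)
    p_hap    : haplotype frequency P(H_uj)
    p_allele : allele frequency at the disease locus (P_D or P_d)
    """
    decay = math.exp(-generations * theta)
    return p0 * decay + p_allele * p_hap * (1.0 - decay)


# Two equally frequent haplotypes, P_D = 0.10, block 2 cM from D, T = 50 generations.
theta = haldane(2.0 / 100.0)
p_h1_d = hap_allele_freq(0.10, 0.5, 0.10, theta, 50)  # mutation arose on H_u1
p_h2_d = hap_allele_freq(0.00, 0.5, 0.10, theta, 50)
print(theta, p_h1_d, p_h2_d, p_h1_d + p_h2_d)
```

The two frequencies always sum to *P*_{D}, which is a quick sanity check that the decay formula only redistributes the disease allele across haplotypes.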

At generation *T,* the haplotype frequencies can be approximately calculated by *P*(*H*_{1j}*DH*_{2s})(*T*)=Δ_{jDs}(0)*e*^{-T(θ1+θ2)} + *P*(*H*_{1j})Δ_{2s}(0)*e*^{-Tθ2} + *P*(*H*_{2s})Δ_{1j}(0)*e*^{-Tθ1}+ *P*(*H*_{1j})*P*_{D}*P*(*H*_{2s}) and *P*(*H*_{1j}*dH*_{2s})(*T*)=*P*(*H*_{1j}*H*_{2s})-*P*(*H*_{1j}*DH*_{2s})(*T*),*j*=1,…, *l,* *s*=1,…, *r,* where Δ_{jDs}(0) = *P*(*H*_{1j}*DH*_{2s})(0) − *P*(*H*_{1j})Δ_{2s}(0) − *P*(*H*_{2s})Δ_{1j}(0) − *P*(*H*_{1j})*P*_{D}*P*(*H*_{2s}) is the measure of initial LD at the three loci for haplotypes *H*_{1j} and *H*_{2s}, Δ_{1j}(0)=*P*(*H*_{1j}*D*)(0)-*P*(*H*_{1j})*P*_{D} is the measure of initial LD between haplotype *H*_{1j} and disease locus *D*, and Δ_{2s}(0)=*P*(*DH*_{2s})(0)-*P*_{D}*P*(*H*_{2s}) is the measure of initial LD between haplotype *H*_{2s} and disease locus *D* (Akey et al. ^{2001}).
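A direct transcription of the two-block formula is sketched below, under the same assumptions as the text (deterministic decay, no interference). The function name `three_locus_freq` and its argument names are hypothetical; the values in the usage lines are illustrative only.

```python
import math


def three_locus_freq(p0, p1j, p2s, p_d, d1j0, d2s0, theta1, theta2, t):
    """P(H_1j D H_2s)(T) from the decay formula in the text.

    p0         : initial frequency P(H_1j D H_2s)(0)
    p1j, p2s   : haplotype frequencies P(H_1j), P(H_2s)
    p_d        : disease-allele frequency P_D
    d1j0, d2s0 : initial pairwise LD measures Delta_1j(0), Delta_2s(0)
    """
    base = p1j * p_d * p2s
    d_jds0 = p0 - p1j * d2s0 - p2s * d1j0 - base  # three-locus LD at generation 0
    return (d_jds0 * math.exp(-t * (theta1 + theta2))
            + p1j * d2s0 * math.exp(-t * theta2)
            + p2s * d1j0 * math.exp(-t * theta1)
            + base)


# Founder haplotype H_11 D H_21 with P_D = 0.10 and P(H_11) = P(H_21) = 0.5,
# so Delta_11(0) = Delta_21(0) = 0.10 - 0.5 * 0.10 = 0.05.
for t in (0, 50, 5000):
    print(t, three_locus_freq(0.10, 0.5, 0.5, 0.10, 0.05, 0.05, 0.0196, 0.0196, t))
```

At *T*=0 the function returns the initial frequency, and for very old mutations it approaches *P*(*H*_{1j})*P*_{D}*P*(*H*_{2s}), the two limits implied by the formula.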

To make a power comparison, we consider four genetic models: heterogeneous recessive, heterogeneous dominant, additive, and multiplicative. First, we consider optimistic penetrance probabilities and genotype relative risks given in table 3 (Nielsen et al. ^{1998}). For less optimistic models, with lower penetrance probabilities and genotype relative risks, we consider the four models in table 4. For each model in table 4, the population disease prevalence is ~0.05 and the sib recurrence risk is ~0.06 (Iles ^{2002}). We assume that the distance between the two haplotype blocks is 4 cM. The block *H*_{1} is located at position 0 cM, and the block *H*_{2} is located at position 4 cM. Since the location of the disease locus *D* is usually unknown, we assume that it is located in the interval between *H*_{1} and *H*_{2}. Given the location of disease locus *D,* the map distance λ_{u} between *H*_{u} and *D* can be used to calculate the recombination fraction θ_{u} by Haldane’s map function, *u*=1,2, λ_{1}+λ_{2}=4 cM. To calculate the power, we first partition the interval of 4 cM between blocks *H*_{1} and *H*_{2} into 100 subintervals with 101 end-points. Given that the disease locus *D* is located at an end-point, we may perform power calculation at this locus. We assume that the haplotype block *H*_{1} has two haplotypes *H*_{11} and *H*_{12} with equal frequencies, *P*_{D}=0.10, *N*=*M*=100, and *T*=50 for the four models in table 3. For the four models in table 4, *P*_{D}=0.30, *N*=*M*=500. For each genetic model in table 4, figures 2, 3, and 4 show power curves of *T*_{C},*T*_{H},*T*_{G},*T*_{C2},*T*_{H2}, and *T*_{G2} for *r*=2,3,4 haplotypes of block *H*_{2}, respectively. The related parameters, such as measures of LD between block *H*_{1} and block *H*_{2}, are given in the legend of each figure. First, it is clear from these three figures that the power of using two haplotype blocks is generally higher than that of using one block. When the disease locus *D* is far from block *H*_{2}, the power of using two haplotype blocks is significantly higher. When the disease locus *D* is close to block *H*_{2}, the power of using two haplotype blocks is similar to that of using only one block *H*_{2}. Second, the power of *T*_{C} is generally higher than or similar to that of *T*_{H}, and the power of *T*_{H} is higher than or similar to that of *T*_{G}. The apparent advantage of *T*_{C} may reflect its failure to account for the correlation between the two blocks (see the type I error comparison in table 2). Third, the power of *T*_{C2} is similar to that of *T*_{H2} and higher than that of *T*_{G2}.
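Once a noncentrality parameter is available, the asymptotic power is the upper-tail probability of a noncentral χ^{2} distribution at the central critical value. The sketch below is one possible way to evaluate it with the standard library only, writing the noncentral χ^{2} as a Poisson mixture of central χ^{2} distributions. The function names are hypothetical, the noncentrality values in the usage lines are arbitrary illustrations rather than values from the tables, and the critical value must be supplied by the caller (for 2 df it has the closed form -2 ln α).

```python
import math


def reg_lower_gamma(a, x, max_terms=10000):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0
    total = 1.0
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))


def ncx2_power(df, noncentrality, crit, tol=1e-12, max_terms=10000):
    """P(noncentral chi-square(df, noncentrality) >= crit) as a Poisson mixture."""
    half = noncentrality / 2.0
    weight = math.exp(-half)  # Poisson(half) probability of j = 0
    cumulative = 0.0
    power = 0.0
    j = 0
    while cumulative < 1.0 - tol and j < max_terms:
        power += weight * (1.0 - reg_lower_gamma((df + 2 * j) / 2.0, crit / 2.0))
        cumulative += weight
        j += 1
        weight *= half / j
    return power


# Haplotype coding with l = r = 2 gives l + r - 2 = 2 df.
crit = -2.0 * math.log(0.01)  # exact 1% critical value for 2 df
for lam in (0.0, 5.0, 10.0, 15.0):
    print(lam, ncx2_power(2, lam, crit))
```

With zero noncentrality the function returns the significance level, which is a convenient check. The sketch is meant for moderate noncentrality values; very large ones would underflow the initial Poisson weight.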

Figure 2. Power curves of *T*_{C}, *T*_{H}, *T*_{G}, *T*_{C2}, *T*_{H2}, and *T*_{G2} at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=2, when *P*(*H*_{11})=*P*(*H*_{12})=*P*(*H*_{21})=*P*(*H*_{22})=0.50, Δ_{H11H21} **...**

Figure 3. Power curves of *T*_{C}, *T*_{H}, *T*_{G}, *T*_{C2}, *T*_{H2}, and *T*_{G2} at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=3, when *P*(*H*_{11})=*P*(*H*_{12})=0.5, *P*(*H*_{21})=0.4, *P*(*H*_{22})= **...**

Figure 4. Power curves of *T*_{C}, *T*_{H}, *T*_{G}, *T*_{C2}, *T*_{H2}, and *T*_{G2} at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4, when *P*(*H*_{11})=*P*(*H*_{12})=0.5, *P*(*H*_{21})=*P*(*H*_{22})=*P*(*H*_{23})= **...**

To explore the effect of the degree of LD on the test statistics, figure 5 plots power curves under an assumption of linkage equilibrium between the two blocks *H*_{1} and *H*_{2} for the four models in table 4. From the four graphs of figure 5, the power of *T*_{H} is similar to or slightly higher than that of *T*_{C}, except for the heterogeneous recessive and multiplicative models, in which the power of *T*_{H} is slightly lower than that of *T*_{C}. In all graphs of figure 5, the power of *T*_{C} and *T*_{H} is higher than that of *T*_{G}. Figure 6 plots power curves for different mutation ages of the disease allele *D* for the four models in table 4. For the four models in table 4, the power is very high for a disease mutation of *T*=30, high for *T*=40, and relatively high for *T*=50 generations old. Figure 7 plots power curves of *T*_{H} for different disease frequencies *P*_{D} for the four models in table 4. For the recessive disease model in table 4, a disease with frequency *P*_{D}≥0.30 would have high power if the haplotype block is close to the disease locus. For the other three models in table 4, a disease with frequency *P*_{D}≥0.20 would have high power if the haplotype block is close to the disease locus (fig. 7).

Figure 5. Power curves of *T*_{C}, *T*_{H}, and *T*_{G} at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4, when *P*(*H*_{11})=*P*(*H*_{12})=0.5, *P*(*H*_{21})=*P*(*H*_{22})=*P*(*H*_{23})= **...**

Figure 6. Power curves of *T*_{H} for different mutation ages at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4, when *P*(*H*_{11})=*P*(*H*_{12})=0.5, *P*(*H*_{21})=*P*(*H*_{22})=*P*(*H*_{23})= **...**

Figure 7. Power curves of *T*_{H} for different disease frequencies at significance level α=0.01, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4, when *P*(*H*_{11})=*P*(*H*_{12})=0.5, *P*(*H*_{21})=*P*(*H*_{22})= **...**

Corresponding to the six figures for the less optimistic models in table 4, we provide six figures for the optimistic models in table 3 on our Web site. The power of the heterogeneous recessive model in table 3 is low (figs. 1, 2, and 3 on our Web site). In contrast, the power of the heterogeneous recessive model in table 4 is reasonably high (figs. 2, 3, and 4). In the absence of LD, the power of *T*_{H} is similar to or slightly higher than that of *T*_{C} for the four models in table 3 (fig. 4 on our Web site). For the recessive disease model in table 3, the power is low even for a very young disease mutation (*T*=10) (fig. 5 on our Web site). For the recessive disease model in table 3, a disease with frequency *P*_{D}≥0.15 would have high power if the haplotype block is close to the disease locus. For the other three models in table 3, a disease with frequency *P*_{D}≥0.10 would have high power (fig. 6 on our Web site).

### Sample Size

Table 5 gives the sample sizes required for the four genetic models in table 3 at significance level 0.01 and 80% power, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4. Except for heterogeneous recessive disease with low disease-allele frequency *P*_{D}=0.05, the sample sizes required are <400 and are feasible in practice. For most cases, the sample sizes required are <100. Table 6 gives the sample sizes required for the four genetic models in table 4 at significance level 0.01 and 80% power, using two haplotype blocks *H*_{1}, *l*=2, and *H*_{2}, *r*=4. Compared with the sample sizes in table 5 for the four models in table 3, the sample sizes in table 6 for the four models in table 4 are much greater. For the recessive disease model in table 4, the sample sizes required for low frequency (*P*_{D}≤0.10) are >5,000, and so it may not be realistic to recruit enough patients for such disease studies. For all dominant disease models and recessive disease models with high disease frequency (*P*_{D}=0.20 or 0.30), the sample sizes required are <1,000 and are feasible in practice. For the additive and multiplicative disease models in table 4, the sample sizes required are <1,000, except for low–disease-frequency cases (*P*_{D}=0.05) or old disease mutations (*T*≥50).

For the sample sizes given in tables 5 and 6, we perform an empirical power calculation by 10,000 replicates. The results for *T*_{H} are fairly consistent with the theoretical value of 0.80.

## Discussion

The objective of this paper is to explore methods for high-resolution haplotype or multiple-marker genome-association studies of complex diseases by case-control designs. We investigated test statistics that combine information from haplotype blocks or multiple markers. We introduced two Hotelling’s *T*^{2} statistics *T*_{G} and *T*_{H} to test association between a disease locus and two haplotype blocks on the basis of two coding methods, genotype coding and haplotype coding. By theoretical analysis, we showed that they are valid test statistics. Ignoring the correlation between the two blocks, one may use an extension sum statistic, *T*_{C}, of two traditional χ^{2} test statistics, *T*_{C1} and *T*_{C2}, for comparing haplotype frequencies in cases and controls. For each of the three statistics, the power of using two haplotype blocks is higher than that of using only one haplotype block. By power comparison, we notice that *T*_{C} has higher power than *T*_{H}, and *T*_{H} has higher power than *T*_{G}.

In the absence of LD between the two blocks, the power of *T*_{C} is similar to that of *T*_{H} and is higher than that of *T*_{G}. In the presence of LD between the two blocks, the type I error of *T*_{C} is higher than those of *T*_{H} and *T*_{G}. Hence, we advocate the use of *T*_{H} in the data analysis. In the presence of LD between the two blocks, *T*_{H} takes into account the correlation between the two haplotype blocks and has the lowest type I error and a higher power than *T*_{G}. On the one hand, *T*_{G} has the lowest power, although it takes into account the correlation between the two haplotype blocks. On the other hand, the type I error of *T*_{G} gets bigger as the number of haplotypes increases, which may be due to the large number of degrees of freedom. Therefore, *T*_{G} is less favorable than *T*_{H}.

Several empirical studies showed that the haplotypes have block structures in human genome, and each haplotype block has limited diversity (Daly et al. ^{2001}; Goldstein ^{2001}; Patil et al. ^{2001}; Reich et al. ^{2001}; Rioux et al. ^{2001}; Stephens et al. ^{2001}; Gabriel et al. ^{2002}). The haplotype blocks are punctuated by apparent sites of recombination or hot-spot areas. Within a haplotype block, there are only a few (2–4) haplotypes, and LD decays only gradually with distance. Within the hot-spot areas, however, there may have been several recombination events, and thus LD decays rapidly with distance. The recombination events are clustered to be hot spots. These patterns of LD are very relevant to genomewide association studies for mapping complex-disease genes. However, the general properties of haplotype structure in human genome are not fully understood. It is necessary to characterize patterns of LD in the human genome, and to investigate approaches of high resolution LD mapping of complex traits based on haplotype block data.

The test statistics, such as *T*_{H} and *T*_{G}, that are based on multiallelic markers or haplotype blocks can usually lead to a large number of df. However, when haplotype block data are used, the df would not be very large if one took into account the recent discovery of haplotype structure in human genome. Although a haplotype block may enclose many SNPs, it takes only a few SNPs to uniquely identify each of the haplotypes in the block. This implies that the number of df when haplotype block data are used may be even less than that when multiple SNP markers are used in an analysis. Moreover, haplotype block data already take into account the haplotype structure and potentially are more powerful.

In our analysis, only two haplotype blocks are discussed. One could generalize the method to use multiple haplotype blocks in the analysis. One interesting topic is to study the merit of a generalized *T*_{H} that uses multiple haplotype blocks, instead of the current version of *T*_{H}, which uses only two haplotype blocks. Moreover, the methods can be generalized to analyze pedigree data, including sib pairs (Cordell and Clayton ^{2002}). Other issues, such as population-stratification effects and methods of combining population and pedigree data, are exciting research topics (Ardlie et al. ^{2000}; Rannala and Reeve ^{2001}). If the data contain individuals with missing genotypes within the haplotype blocks or with genotyping errors, some potential problems can arise in actual data analysis. The effect of uncertainty in the haplotype block’s start and stop positions is unclear. More investigations will be necessary to cope with these challenges.

## Acknowledgments

We thank two reviewers for very detailed and thoughtful critiques, which made the paper more clear. R.F. was supported partially by a research fellowship from the Alexander von Humboldt Foundation, Germany, and an International Research Travel Assistance Grant from Texas A&M University. M.K. was supported by grant KN 370/1-1 (Project D1 of FOR 423) from the Deutsche Forschungsgemeinschaft.

## Appendix A

Suppose that the disease locus has two alleles *D* and *d*, *D* being the allele for disease susceptibility and *d* being normal. Assume that the disease-susceptibility allele *D* has population frequency *P*_{D}, and normal allele *d* has population frequency *P*_{d}. Let *f*_{DD}, *f*_{Dd}=*f*_{dD}, and *f*_{dd} be the probabilities that an individual with genotype *DD*, *Dd*, or *dd*, respectively, is affected with the disease. Since allele *D* is disease susceptible, one may assume *f*_{DD}≥*f*_{Dd}≥*f*_{dd}. Let and . Denote the disease prevalence in the population by *A*=*f*_{DD}*P*^{2}_{D}+2*f*_{Dd}*P*_{D}*P*_{d}+*f*_{dd}*P*^{2}_{d}, and . As in quantitative genetics, let us introduce some notation. Let *a*=*f*_{DD}-(*f*_{DD}+*f*_{dd})/2, *d*=*f*_{Dd}-(*f*_{DD}+*f*_{dd})/2, δ_{D}=2*d*, and α_{D}=*a*+(*P*_{d}-*P*_{D})*d*. In terms of quantitative genetics, α_{D} is the average effect of gene substitution, and δ_{D} is the dominance deviation (Falconer and Mackay ^{1996}). Similarly, denote , and . Denote the measures of LD between haplotype *H*_{1j} of the first haplotype block *H*_{1} and the disease locus *D* by Δ_{1j}=*P*(*H*_{1j}*D*)-*P*(*H*_{1j})*P*_{D}, *j*=1,…,*l*, and the measures of LD between haplotype *H*_{2s} of the second haplotype block *H*_{2} and the disease locus *D* by Δ_{2s}=*P*(*DH*_{2s})-*P*(*H*_{2s})*P*_{D}, *s*=1,…,*r*. For *u*=1,2, the frequencies of heterozygous genotype *H*_{uj}*H*_{uk}, *j*≠*k*, in affected and unaffected individuals are calculated in appendix B as

The frequencies of homozygous genotype *H*_{uj}*H*_{uj} in affected and unaffected individuals are calculated in appendix B as

Under the null hypothesis of no association between the haplotype blocks *H*_{u}, *u*=1,2, and the disease locus *D* (that is, Δ_{uj}=0 for all *j*), equations (A1), (A2), (A3), and (A4) imply the expectation for the genotype coding method. In appendix C, we show

Hence, we have which implies the expectation for the haplotype coding method, under the null hypothesis of no association between the haplotype blocks *H*_{u},*u*=1,2 and the disease locus *D.*
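The quantities defined at the start of this appendix can be evaluated directly. The sketch below computes the prevalence *A*, the parameters *a* and *d*, the average effect α_{D}, and the dominance deviation δ_{D} from a set of penetrances, and checks the expanded form of α_{D} that is used in appendix B. The penetrances and allele frequency are hypothetical values chosen only for illustration, and the function and variable names are assumptions.

```python
def disease_model(f_DD: float, f_Dd: float, f_dd: float, P_D: float) -> dict:
    """Prevalence and quantitative-genetics parameters for a biallelic disease locus."""
    P_d = 1.0 - P_D
    A = f_DD * P_D**2 + 2 * f_Dd * P_D * P_d + f_dd * P_d**2  # disease prevalence
    a = f_DD - (f_DD + f_dd) / 2
    d = f_Dd - (f_DD + f_dd) / 2
    alpha_D = a + (P_d - P_D) * d   # average effect of gene substitution
    delta_D = 2 * d                 # dominance deviation
    return {"A": A, "a": a, "d": d, "alpha_D": alpha_D, "delta_D": delta_D}


def ld_coefficient(P_HD: float, P_H: float, P_D: float) -> float:
    """LD measure Delta = P(H D) - P(H) P_D between a haplotype and allele D."""
    return P_HD - P_H * P_D


# Hypothetical example: penetrances 0.6, 0.3, 0.1 and P_D = 0.2.
model = disease_model(f_DD=0.6, f_Dd=0.3, f_dd=0.1, P_D=0.2)

# Expanded form of alpha_D quoted in appendix B.
alpha_expanded = 0.2 * 0.6 + 0.8 * 0.3 - 0.2 * 0.3 - 0.8 * 0.1

if __name__ == "__main__":
    print(model)
    print("alpha_D (expanded form):", alpha_expanded)
```

A haplotype in linkage equilibrium with the disease locus gives `ld_coefficient(P_H * P_D, P_H, P_D) == 0`, which is the null hypothesis Δ_{uj}=0 considered above.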

## Appendix B

Notice that *P*(*H*_{uj}*D*)=Δ_{uj}+*P*(*H*_{uj})*P*_{D},*P*(*H*_{uj}*d*)=-Δ_{uj}+*P*(*H*_{uj})*P*_{d}, *P*(*H*_{uk}*D*)=Δ_{uk}+*P*(*H*_{uk})*P*_{D}, and *P*(*H*_{uk}*d*)=-Δ_{uk}+*P*(*H*_{uk})*P*_{d} for *u*=1,2. Using the expression α_{D}=(*f*_{DD}-*f*_{dd})/2+(*P*_{d}-*P*_{D})[*f*_{Dd}-(*f*_{DD}+*f*_{dd})/2]=*P*_{D}*f*_{DD}+*P*_{d}*f*_{Dd}-*P*_{D}*f*_{Dd}-*P*_{d}*f*_{dd}, the frequency of genotype *H*_{uj}*H*_{uk},*j*≠*k*, in affected can be calculated as

Similarly, the frequency of genotype *H*_{uj}*H*_{uj} in affected can be calculated as
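The calculation described in this appendix can be checked numerically. The sketch below computes the frequency of genotype *H*_{uj}*H*_{uk} in affected or unaffected individuals by summing over the disease genotypes, assuming that the two haplotype-allele combinations of an individual pair at random. It compares the result with a closed form rederived from the definitions in appendix A, *P*(*H*_{uj}*H*_{uk}|*Aff*)=2*P*(*H*_{uj})*P*(*H*_{uk})+2[α_{D}(Δ_{uj}*P*(*H*_{uk})+Δ_{uk}*P*(*H*_{uj}))-δ_{D}Δ_{uj}Δ_{uk}]/*A* for *j*≠*k*, with the factor 2 dropped for *j*=*k* and with the bracket divided by -(1-*A*) for unaffected individuals. This closed form is a reconstruction and may differ in notation from the authors' equations; all names and numerical values are hypothetical.

```python
from itertools import product


def _model(f, P_D):
    """Return (A, alpha_D, delta_D) for penetrances f = {'DD':..., 'Dd':..., 'dd':...}."""
    P_d = 1.0 - P_D
    A = f["DD"] * P_D**2 + 2 * f["Dd"] * P_D * P_d + f["dd"] * P_d**2
    d = f["Dd"] - (f["DD"] + f["dd"]) / 2
    alpha = (f["DD"] - f["dd"]) / 2 + (P_d - P_D) * d
    return A, alpha, 2 * d


def freq_enumerated(j, k, p, Delta, f, P_D, affected=True):
    """Genotype frequency by direct summation over the four ordered disease genotypes."""
    A, _, _ = _model(f, P_D)
    pen = {("D", "D"): f["DD"], ("D", "d"): f["Dd"],
           ("d", "D"): f["Dd"], ("d", "d"): f["dd"]}

    def joint(i):
        # P(H_i D) = Delta_i + p_i P_D and P(H_i d) = -Delta_i + p_i P_d
        return {"D": Delta[i] + p[i] * P_D, "d": -Delta[i] + p[i] * (1.0 - P_D)}

    hj, hk = joint(j), joint(k)
    total = 0.0
    for x, y in product("Dd", repeat=2):
        weight = pen[(x, y)] if affected else 1.0 - pen[(x, y)]
        total += hj[x] * hk[y] * weight
    if j != k:
        total *= 2  # a heterozygote arises from two orderings
    return total / (A if affected else 1.0 - A)


def freq_closed_form(j, k, p, Delta, f, P_D, affected=True):
    """Reconstructed closed form in terms of alpha_D, delta_D, and the LD measures."""
    A, alpha, delta = _model(f, P_D)
    shift = alpha * (Delta[j] * p[k] + Delta[k] * p[j]) - delta * Delta[j] * Delta[k]
    value = p[j] * p[k] + (shift / A if affected else -shift / (1.0 - A))
    return value if j == k else 2 * value


# Hypothetical block with three haplotypes; the LD measures sum to zero.
p = [0.5, 0.3, 0.2]
Delta = [0.02, -0.015, -0.005]
f = {"DD": 0.6, "Dd": 0.3, "dd": 0.1}
P_D = 0.2

if __name__ == "__main__":
    for j in range(3):
        for k in range(j, 3):
            print(j, k,
                  round(freq_enumerated(j, k, p, Delta, f, P_D, True), 6),
                  round(freq_enumerated(j, k, p, Delta, f, P_D, False), 6))
```

Setting every entry of `Delta` to zero makes the case and control frequencies coincide with the population values *P*(*H*_{uj})^{2} and 2*P*(*H*_{uj})*P*(*H*_{uk}), which is the reason the expected difference between cases and controls vanishes under the null hypothesis.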

## Appendix C

## Appendix D

Using the notation in equations (A1), (A2), (A3), and (A4), we calculate the variance-covariance matrices Σ_{A1} and . First, we calculate the variance of the number of haplotypes *H*_{uj} in affected individuals by equations (A3) and (A5)

Similarly, the variance of the number of haplotypes *H*_{uj} in controls is
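The variance of a haplotype count can be obtained from the genotype frequencies alone, since an individual carries 0, 1, or 2 copies of *H*_{uj}. The sketch below is a minimal illustration of that calculation, not the authors' derivation: it takes a table of genotype frequencies within one block (for cases or for controls) and returns the means and the variance-covariance matrix of the haplotype counts. The input in the example uses Hardy-Weinberg proportions with hypothetical haplotype frequencies, for which the counts are binomial and the answer is known.

```python
def haplotype_count_moments(geno_freq: dict, n_hap: int):
    """Means and covariance matrix of haplotype counts within one block.

    geno_freq maps (j, k) with j <= k to the frequency of genotype H_j H_k.
    """
    def g(j, k):
        return geno_freq[(min(j, k), max(j, k))]

    # E[z_j]: homozygotes carry two copies, heterozygotes one copy.
    mean = [2 * g(j, j) + sum(g(j, k) for k in range(n_hap) if k != j)
            for j in range(n_hap)]
    # E[z_j^2]: homozygotes contribute 4, heterozygotes contribute 1.
    second = [4 * g(j, j) + sum(g(j, k) for k in range(n_hap) if k != j)
              for j in range(n_hap)]

    cov = [[0.0] * n_hap for _ in range(n_hap)]
    for j in range(n_hap):
        for k in range(n_hap):
            # E[z_j z_k] for j != k is nonzero only for the heterozygote H_j H_k.
            e_jk = second[j] if j == k else g(j, k)
            cov[j][k] = e_jk - mean[j] * mean[k]
    return mean, cov


# Hypothetical haplotype frequencies; genotype frequencies in Hardy-Weinberg proportions.
p = [0.5, 0.3, 0.2]
hwe = {(j, k): (p[j] ** 2 if j == k else 2 * p[j] * p[k])
       for j in range(3) for k in range(j, 3)}
mean, cov = haplotype_count_moments(hwe, 3)

if __name__ == "__main__":
    print("means:", mean)
    for row in cov:
        print([round(v, 4) for v in row])
```

For a test statistic, the same function would be applied to the case and control genotype frequencies separately, and one haplotype per block would be dropped because the counts sum to 2.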

## Appendix E

To calculate the covariance between *z*_{1ij},*z*_{2is}, denote for *j*≠*k*,*s*≠*t*

For *j*=1,…,*l*-1 and *s*=1,…,*r*-1, the covariance

Similarly, for *j*=1,…,*l*-1 and *s*=1,…,*r*-1, the covariance

where and are expected genotype frequencies in controls like those defined in equation (E1) for cases.

## Appendix F

To calculate the noncentrality parameter λ_{G}, we notice first that the expectation

where is equal to , and is equal to .

Let Σ_{G} be the variance-covariance matrix of the genotype coding vector *X*_{i}. Then its elements can be calculated by *Var*(*x*_{uij}|*Aff*)=*a*_{ujj}-*a*^{2}_{ujj}, where *j*=1,…,*l*-1 if *u*=1 and *j*=1,…,*r*-1 if *u*=2; *Var*(*x*_{uijk}|*Aff*)=*a*_{ujk}-*a*^{2}_{ujk}, *j*≠*k*; *Cov*(*x*_{uij},*x*_{ui(j+k)}|*Aff*)=-*a*_{ujj}*a*_{u(j+k)(j+k)} if *k*≥1; *Cov*(*x*_{uij},*x*_{uimk}|*Aff*)=-*a*_{ujj}*a*_{umk} for *m*≠*k*; and *Cov*(*x*_{uijk},*x*_{uist}|*Aff*)=-*a*_{ujk}*a*_{ust} for *j*≠*k* and *s*≠*t*.
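The within-block elements listed above are those of a multinomial indicator vector: each variance is a frequency minus its square, and each covariance is minus the product of the two frequencies. A minimal sketch of the resulting matrix for one block follows. The example frequencies are hypothetical Hardy-Weinberg values for a block with two haplotypes, and the ordering of the indicators is an assumption. The covariances between blocks require the joint genotype frequencies of appendix E and are not covered here.

```python
def indicator_covariance(freqs):
    """Covariance matrix of genotype indicators with the given genotype frequencies.

    Diagonal entries are a - a^2; off-diagonal entries are -a * a'.
    """
    n = len(freqs)
    return [[freqs[i] - freqs[i] ** 2 if i == m else -freqs[i] * freqs[m]
             for m in range(n)]
            for i in range(n)]


# Hypothetical block with haplotype frequencies 0.6 and 0.4 in Hardy-Weinberg proportions.
a_11, a_12, a_22 = 0.36, 0.48, 0.16

# The coding vector keeps (x_11, x_12); the last homozygote is redundant and is dropped.
sigma_block = indicator_covariance([a_11, a_12])

if __name__ == "__main__":
    for row in sigma_block:
        print([round(v, 4) for v in row])
```

Dropping the redundant genotype keeps the matrix invertible, which is needed when the matrix enters a Hotelling-type quadratic form.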

Using the notation in equations (A1), (A3), and (E1), the covariances between *x*_{1ij},*x*_{1ijk} and *x*_{2is},*x*_{2ist} are given by

Similarly, we may calculate the variance-covariance matrix for the controls. Then the noncentrality parameter λ_{G} of *T*_{G} is given by

Using the variance-covariance matrices Σ_{G1} and of the genotype coding vector *X*_{1i} in affected and unaffected individuals, one may calculate the noncentrality parameter λ_{G1} similarly.

## Electronic-Database Information

Accession numbers and URLs for data presented herein are as follows:

## References

*T*^{2} test for genome association studies. Am J Hum Genet 70:1257–1268

**American Society of Human Genetics**


Genome Association Studies of Complex Diseases by Case-Control Designs. American Journal of Human Genetics. Apr 2003; 72(4):850.
