
# Generalized *T*^{2} Test for Genome Association Studies

## Abstract

Recent progress in the development of single-nucleotide polymorphism (SNP) maps within genes and across the genome provides a valuable tool for fine-mapping and has led to the suggestion of genomewide association studies to search for susceptibility loci for complex traits. Test statistics for genome association studies that consider a single marker at a time, ignoring the linkage disequilibrium between markers, are inefficient. In this study, we present a generalized *T*^{2} statistic for association studies of complex traits, which can utilize multiple SNP markers simultaneously and considers the effects of multiple disease-susceptibility loci. This generalized *T*^{2} statistic is a corollary to that originally developed for multivariate analysis and has a close relationship to discriminant analysis and common measure of genetic distance. We evaluate the power of the generalized *T*^{2} statistic and show that power to be greater than or equal to those of the traditional χ^{2} test of association and a similar haplotype-test statistic. Finally, examples are given to evaluate the performance of the proposed *T*^{2} statistic for association studies using simulated and real data.

## Introduction

Lack of tangible success of genetic linkage analyses for mapping of multifactorial trait loci with small-to-moderate effects, coupled with progress in the development of detailed SNP maps of the human genome (Gray et al. 2000), has led to the suggestion of population-based genomewide association studies (Risch and Merikangas 1996) that are based on linkage disequilibrium (LD). Traditional population-based association studies compare marker-allele frequencies between cases and control subjects, separately for each marker. However, when a collection of SNP markers is available, using only a single marker at a time and ignoring the nonindependence among markers is inefficient. In addition, it is well known that complex diseases are influenced by multiple genes, requiring the development of statistical methods for evaluation of several trait loci collectively (Longmate 2001). Recently, discriminant analysis (Li et al. 2000), logistic regression (Czika et al. 2000), decision trees (Zhang and Bonney 2000), and neural networks (Bhat et al. 1999; Sherriff and Ott 2001) have been applied to genetic association studies using multiple marker loci. However, such methods provide only classification accuracy as a measure of significance, rather than the *P* values that are widely used to show significant evidence of association in the traditional context. Therefore, there is a need to describe the relationship between classification methods and traditional statistical testing.

In this article, we present, for population-based association studies of complex diseases, a generalized *T*^{2} test that simultaneously utilizes multiple SNP markers. The power of the generalized *T*^{2} statistic for the detection of a disease locus (or loci) will be evaluated, as will the comparability of the genotype *T*^{2} and haplotype *T*^{2} statistics.

In addition, we formulate the problem of identification of SNP markers or a combination of SNP markers, which make the largest contribution to disease risk, as a combinatorial optimization problem, and we develop efficient search algorithms. Finally, examples will be given to illustrate the applications of the proposed *T*^{2} statistic to association studies.

## Test Statistic

Consider a design in which *n*_{A} cases from an affected population and *n*_{Ā} control subjects from a comparable unaffected population are sampled. Suppose that *J* markers have been typed in the sample of cases and control subjects. The *j*th marker has alleles *B*_{j} and *b*_{j}, with population frequencies *P*_{Bj} and *P*_{bj}, respectively. Define an indicator variable for the genotype of the *j*th marker for the *i*th individual from the affected population:

$$X_{ij} = \begin{cases} 1 & \text{if the genotype is } B_jB_j \\ 0 & \text{if the genotype is } B_jb_j \\ -1 & \text{if the genotype is } b_jb_j \end{cases}$$

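Similarly, we define an indicator variable, *Y*_{ij}, for the *i*th of the *n*_{Ā} individuals from the unaffected population. Let

$$X_i = (X_{i1},\ldots,X_{iJ})^T,\quad Y_i = (Y_{i1},\ldots,Y_{iJ})^T,\quad \bar X = \frac{1}{n_A}\sum_{i=1}^{n_A} X_i,\quad \bar Y = \frac{1}{n_{\bar A}}\sum_{i=1}^{n_{\bar A}} Y_i .$$

The pooled-sample variance-covariance matrix of the indicator variables for the marker genotypes is defined as

$$S = \frac{1}{n_A+n_{\bar A}-2}\left[\sum_{i=1}^{n_A}(X_i-\bar X)(X_i-\bar X)^T + \sum_{i=1}^{n_{\bar A}}(Y_i-\bar Y)(Y_i-\bar Y)^T\right].$$

Hotelling’s (1931) *T*^{2} statistic is then defined as

$$T^2 = \frac{n_A n_{\bar A}}{n_A+n_{\bar A}}\,(\bar X-\bar Y)^T S^{-1} (\bar X-\bar Y).$$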

Under the null hypothesis that LD between any marker being tested and a disease locus does not exist, the covariance matrix of the indicator variables for the marker genotypes of the individuals from the affected population, Σ_{A}=*Cov*(*X*_{i},*X*_{i}), and the covariance matrix of the indicator variables for the marker genotypes of the individuals from the unaffected population, Σ_{Ā}=*Cov*(*Y*_{i},*Y*_{i}), are equal. Therefore, when the sample size is large enough to allow asymptotic theory to apply, under the null hypothesis,

$$\frac{n_A+n_{\bar A}-J-1}{J\,(n_A+n_{\bar A}-2)}\,T^2$$

is asymptotically distributed as a central *F* distribution with *J* and *n*_{A}+*n*_{Ā}−*J*−1 degrees of freedom, where *n*_{Ā} is the number of control subjects. Under the alternative hypothesis that there is at least one marker showing LD with a disease locus, the covariance matrices Σ_{A} and Σ_{Ā} are no longer equal, and the same scaled statistic is not asymptotically distributed as a noncentral *F* distribution. In this case, it can be shown that *T*^{2} is asymptotically distributed as a noncentral χ^{2}_{(J)} distribution.
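The two-sample Hotelling *T*^{2} computation described in this section can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: it assumes genotypes are already coded 1, 0, −1 per marker, uses the textbook pooled-covariance form of the statistic, and the function names (`hotelling_t2`, `solve`, and so on) are hypothetical.

```python
def column_means(rows):
    """Mean of each marker column."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter(rows, mean):
    """Sum of outer products of deviations from the mean (J x J)."""
    J = len(mean)
    s = [[0.0] * J for _ in range(J)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(J)]
        for a in range(J):
            for b in range(J):
                s[a][b] += d[a] * d[b]
    return s


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def hotelling_t2(cases, controls):
    """Return (T^2, scaled F statistic) for two lists of genotype vectors.

    Each genotype vector holds one value per marker, coded 1, 0, or -1.
    """
    n_a, n_u = len(cases), len(controls)
    J = len(cases[0])
    xbar, ybar = column_means(cases), column_means(controls)
    sa, su = scatter(cases, xbar), scatter(controls, ybar)
    pooled = [[(sa[a][b] + su[a][b]) / (n_a + n_u - 2) for b in range(J)]
              for a in range(J)]
    diff = [xbar[j] - ybar[j] for j in range(J)]
    weights = solve(pooled, diff)  # S^{-1} (xbar - ybar)
    t2 = n_a * n_u / (n_a + n_u) * sum(d * w for d, w in zip(diff, weights))
    f_stat = (n_a + n_u - J - 1) / (J * (n_a + n_u - 2)) * t2
    return t2, f_stat
```

To obtain a *P* value under the null hypothesis, the returned F statistic could be passed to an F-distribution survival function with *J* and *n*_{A}+*n*_{Ā}−*J*−1 degrees of freedom, for example `scipy.stats.f.sf` if SciPy is available.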

## Power Evaluation

### Noncentrality Parameter

To evaluate power, we need to calculate the noncentrality parameter of the χ^{2}_{(J)} distribution of the *T*^{2} statistic under the alternative hypothesis. We begin by computing the allele frequencies in the affected and unaffected populations. Consider a disease locus with alleles *D* and *d*. The alleles *D* and *d* have population frequencies *P*_{D} and *P*_{d}, respectively. Let *f*_{DD}, *f*_{Dd}, and *f*_{dd} be the penetrances of the genotypes *DD*, *Dd*, and *dd*, respectively. Let *P*_{A} denote the prevalence of the disease in the population. Then, under Hardy-Weinberg equilibrium, *P*_{A} is given by

$$P_A = P_D^2 f_{DD} + 2P_DP_d f_{Dd} + P_d^2 f_{dd}.$$

Let *P*_{B}(*A*) and *P*_{B}(*Ā*) be the frequencies of marker allele *B* in the affected and unaffected populations, respectively. Let *P*_{BD}, *P*_{Bd}, *P*_{bD}, and *P*_{bd} be the frequencies of haplotypes *BD*, *Bd*, *bD*, and *bd*, respectively. The frequency *P*_{B}(*A*) is given by

Similarly, we have

where *,* *,* and *.*
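The quantities in this subsection (disease prevalence and the frequency of marker allele *B* among affected and unaffected individuals) can be checked numerically by enumerating pairs of haplotypes. The sketch below is an illustration from first principles rather than the article's closed-form expressions: it assumes Hardy-Weinberg equilibrium (random pairing of haplotypes), and the function name and dictionary layout are hypothetical.

```python
from itertools import product


def case_control_allele_freqs(hap_freq, penetrance):
    """Return (prevalence, P_B in affected, P_B in unaffected).

    hap_freq:   frequencies of haplotypes 'BD', 'Bd', 'bD', 'bd' (sum to 1);
                first character is the marker allele, second the disease allele.
    penetrance: maps the number of D alleles (0, 1, 2) to f_dd, f_Dd, f_DD.
    """
    prevalence = 0.0
    b_affected = 0.0    # P(affected and a randomly drawn allele is B)
    b_unaffected = 0.0  # P(unaffected and a randomly drawn allele is B)
    for (h1, p1), (h2, p2) in product(hap_freq.items(), repeat=2):
        pair_prob = p1 * p2  # Hardy-Weinberg: haplotypes pair at random
        n_d = (h1[1] == "D") + (h2[1] == "D")
        frac_b = ((h1[0] == "B") + (h2[0] == "B")) / 2
        f = penetrance[n_d]
        prevalence += pair_prob * f
        b_affected += pair_prob * f * frac_b
        b_unaffected += pair_prob * (1 - f) * frac_b
    return prevalence, b_affected / prevalence, b_unaffected / (1 - prevalence)
```

When the marker and disease loci are in linkage equilibrium, the two conditional frequencies both equal the population frequency of *B*; any LD between the loci pulls them apart, which is what the test statistic detects.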

Consider the *j*th marker and the *j*^{′}th marker. Let *P*_{BjBj′}(*A*) and *P*_{BjBj′}(*Ā*) be the frequencies of haplotype *B*_{j}*B*_{j′} in the affected and unaffected populations, respectively. If Hardy-Weinberg equilibrium is assumed, then it is easy to see that (see Appendix A)

where

Define

It is clear that the covariance matrices Σ_{A} and Σ_{Ā} depend on the pairwise LD between the marker and trait loci. When *,* the noncentrality parameter of the *T*^{2} statistic under the alternative hypothesis is given by

where μ=[μ_{1},…,μ_{J}]^{T}*.* Let

*G*^{2} can be considered to be a genetic-distance measure between two populations that is similar to that proposed by Balakrishnan and Sanghvi (1968). Intuitively, then, the noncentrality parameter λ can be expressed as a function of this genetic distance between the case and control populations;—that is, *.* In the case in which all pairwise LD is equal to zero, *G*^{2} is reduced to

and the noncentrality parameter λ is

For a single marker, we have

and

Therefore, the noncentrality parameter and power depend on the sample size and the genetic distance, which, in turn, are a function of allele frequencies and LD between the marker and trait loci.

The classic test statistic for a single-marker case-control study is given by (see Chapman and Wijsman 1998)

where , , , and are the corresponding observed allele frequencies. Its noncentrality parameter, λ_{c}, is given by

It can be shown (see Appendix B) that λ ≥ λ_{c}. Therefore, for case-control association studies, the proposed *T*^{2} statistic has higher (or equivalent) power than does the classic *T*_{c} statistic. Figure 1 compares the power, for detection of a disease gene, of the *T*^{2} statistic and the classic χ^{2} statistic. From figure 1, we can see that, in all cases, the power of the *T*^{2} statistic is higher than that of the classic χ^{2} statistic. However, when the allele frequencies are small, the differences in the power of these two statistics are very small. It can be shown that, even in more complicated situations, such as multiple marker and trait loci, the *T*^{2} statistic has higher power than does the classic χ^{2} statistic (data not shown).
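The single-marker comparison can be illustrated numerically. The sketch below is a working reconstruction under stated assumptions, not the article's own expressions: it assumes equal numbers *n* of cases and control subjects, Hardy-Weinberg equilibrium in both groups (so the genotype indicator has variance 2*P*_{B}*P*_{b}), and the usual allele-count Pearson χ^{2} test for the classic statistic. The power calculation uses the fact that a noncentral χ^{2} variable with 1 df can be written as (*Z*+√λ)^{2} for standard normal *Z*. Function names are hypothetical.

```python
from statistics import NormalDist


def noncentrality_pair(p_case, p_ctrl, n):
    """Noncentrality parameters (T^2, classic chi-square) for one marker.

    p_case, p_ctrl: frequency of allele B in cases and in control subjects.
    n:              number of cases, assumed equal to the number of controls.
    """
    diff2 = (p_case - p_ctrl) ** 2
    # T^2: group-specific variances appear in the denominator
    lam_t2 = 2 * n * diff2 / (p_case * (1 - p_case) + p_ctrl * (1 - p_ctrl))
    # classic allele-based chi-square: variance from the pooled frequency
    p_bar = (p_case + p_ctrl) / 2
    lam_chi2 = n * diff2 / (p_bar * (1 - p_bar))
    return lam_t2, lam_chi2


def power_1df(lam, alpha):
    """Power of a 1-df chi-square test with noncentrality parameter lam."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # square root of the chi-square cutoff
    shift = lam ** 0.5
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)
```

Under these assumptions the two parameters differ only through the term (*P*_{B}(*A*)−*P*_{B}(*Ā*))^{2}/4 in the pooled variance, which is consistent with the small power differences reported when allele frequencies are small.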

### Two-Disease-Loci Model

To further evaluate the power of the *T*^{2} statistic, we consider two-locus disease models. Assume that there are two disease loci, *D* and *d.* Each disease locus has two alleles. The frequencies of the alleles *D*_{1} and *D*_{2} at disease locus *D* and of the alleles *d*_{1} and *d*_{2} at disease locus *d* can be denoted by *P*_{D1}*,* *P*_{D2}*,* *P*_{d1}*,* and *P*_{d2}*,* respectively. The frequencies of the genotypes *D*_{u}*D*_{v} and *d*_{k}*d*_{l} in the disease and normal populations are denoted by *P*_{DuDv} and *P*_{dkdl}*,* respectively. The penetrance of the genotypes *D*_{u}*D*_{v}*d*_{k}*d*_{l} will be denoted by *f*_{uvkl}*.* Then, the prevalence of the disease in the population is given by

Denote the indicator variables for the genotypes of the first and second markers for the first individual from the affected population and for the first individual from the unaffected population by *X*_{11}, *X*_{12}, *Y*_{11}, and *Y*_{12}, respectively. Let

and let

The elements of the vector μ and of the variance-covariance matrices Σ_{A} and Σ_{Ā} are given in Appendix C. The noncentrality parameter of the *T*^{2} statistic for the two-locus disease model is then given by

For convenience of presentation, we assume that the two disease loci are unlinked. Table 1 presents six types of two-locus disease models (Neuman and Rice 1992; Schork et al. 1993; Ott 1999). To illustrate the performance of the *T*^{2} statistic for the detection of disease loci, we plot figure 2, showing the power of the *T*^{2} statistic as a function of the allele frequency under the six types of two-locus–disease models in table 1.
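As a small illustration of how a two-locus penetrance table translates into disease prevalence, the sketch below sums penetrance over the nine two-locus genotypes. It assumes Hardy-Weinberg equilibrium at each locus and independence between the two unlinked loci. The penetrance function `both_dominant` is a hypothetical example (penetrance *f* for carriers of at least one risk allele at both loci, zero otherwise) and is not taken from table 1.

```python
def genotype_freqs(p):
    """HWE frequencies for carrying 2, 1, or 0 copies of an allele with frequency p."""
    q = 1 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}


def two_locus_prevalence(p1, p2, penetrance):
    """Prevalence = sum over two-locus genotypes of frequency times penetrance.

    p1, p2:     risk-allele frequencies at the two disease loci.
    penetrance: function (copies at locus 1, copies at locus 2) -> probability.
    """
    g1, g2 = genotype_freqs(p1), genotype_freqs(p2)
    return sum(g1[a] * g2[b] * penetrance(a, b) for a in g1 for b in g2)


def both_dominant(f):
    """Hypothetical model: affected with probability f only if both loci carry a risk allele."""
    return lambda a, b: f if (a >= 1 and b >= 1) else 0.0
```

For example, `two_locus_prevalence(0.1, 0.1, both_dominant(0.6))` multiplies the carrier probability at each locus (0.19) and the penetrance 0.6.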

Figure 2. Power of the *T*^{2} test, with significance level α=0.0001, as a function of allele frequency, in the case of Dom Dom, Dom Rec, Rec Rec, epistasis, threshold, and modifying models, when …, *P*_{D1}=*P*_{d1}, and *f*=0.6 are …

## The Haplotype *T*^{2} Statistic

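When haplotype information is available, we can define an indicator variable for the allele of the *j*th marker on the *i*th chromosome from the affected population:

$$x_{Hij} = \begin{cases} 1 & \text{if the allele is } B_j \\ 0 & \text{if the allele is } b_j \end{cases}$$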

Similarly, we define an indicator variable *y*_{Hij} for the marker alleles located on the chromosomes from the unaffected population. Following the same development in the genotype *T*^{2} statistic, we can define the haplotype *T*^{2} statistic. Let

The covariance matrix is defined as

The haplotype *T*^{2} statistic is then defined as

To compare the powers of the genotype *T*^{2} and haplotype *T*^{2}_{H}*,* we can compare their noncentrality parameters, because both *T*^{2} and *T*^{2}_{H} follow a χ^{2}_{(J)} distribution under the alternative hypothesis. It can be shown that the noncentrality parameter λ of the *T*^{2} statistic and the noncentrality parameter λ_{H} of the *T*^{2}_{H} statistic are equal (Appendix D).

Therefore, the power of the multilocus *T*^{2} statistic is the same as that of the haplotype *T*^{2} statistic. Equivalence of the two statistics is important, because unequivocal haplotypes are usually not available in the majority of case-control studies. Intuitively, this equivalence can be attributed to the fact that the covariance matrices of the multilocus *T*^{2} statistic, Σ_{A} and Σ_{Ā}, contain the same pairwise LD information that is contained in the haplotypes.
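A short scaling argument shows why the two noncentrality parameters can coincide. The sketch below rests on assumptions that are not spelled out in the text above: both parameters are taken to have the usual two-sample quadratic form with a common covariance matrix Σ, the haplotype indicator is coded 1/0 so that μ_{H}=μ/2 under Hardy-Weinberg equilibrium, Σ_{H}=Σ/2 as in Appendix D, and the haplotype sample consists of 2*n*_{A} and 2*n*_{Ā} chromosomes. The symbols *c* and *c*_{H} are introduced here for the sample-size factors.

```latex
\begin{align*}
\lambda   &= c\,\mu^{T}\Sigma^{-1}\mu,
            \qquad c = \frac{n_A n_{\bar A}}{n_A+n_{\bar A}},\\
\lambda_H &= c_H\,\mu_H^{T}\Sigma_H^{-1}\mu_H,
            \qquad c_H = \frac{(2n_A)(2n_{\bar A})}{2n_A+2n_{\bar A}} = 2c,\\
          &= 2c\left(\tfrac12\mu\right)^{T}\left(\tfrac12\Sigma\right)^{-1}\left(\tfrac12\mu\right)
           = 2c\cdot\tfrac14\cdot 2\;\mu^{T}\Sigma^{-1}\mu
           = \lambda .
\end{align*}
```

The halving of the mean difference and of the covariance is exactly offset by the doubling of the effective sample size when chromosomes rather than individuals are counted.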

## Search Algorithm

To identify SNP markers (or the combination of SNP markers) that make the greatest contribution to disease risk and drug response, search algorithms are fundamental. In this study, we use a heuristic algorithm that seeks the best combination of SNP markers for risk assessment. The algorithm is based on the sequential forward floating selection (SFFS) algorithm of Pudil et al. (1994), which is easy to implement and computationally inexpensive. The SFFS algorithm builds on the sequential forward selection (SFS) algorithm, which proceeds as follows:

1. Compute the desired criterion value for each of the markers, and select the marker with the best value.
2. Form all possible two-dimensional vectors that contain the winner from the previous step, compute the criterion value for each of them, and select the best one.
3. Form all three-dimensional vectors expanded from the two-dimensional winner, and select the best one; continue this process until the prespecified dimension of the feature vector, say *l*, is reached.

The SFS algorithm requires less computational burden than do other search algorithms, but it suffers from the so-called nesting effect—that is, once a marker is chosen, there is no way for it to be discarded in later steps. To overcome this problem, the SFFS algorithm was proposed. The SFFS algorithm balances the required computational time and overall optimality. (For details, interested readers are referred to Pudil et al. [1994] and Xiong et al. [2001].)
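The sequential forward selection steps listed above can be sketched as follows. This is plain SFS only, without the floating (backtracking) step that SFFS adds, and it is an illustration rather than the authors' code. The `criterion` argument is a hypothetical user-supplied function that scores a list of marker indices, with larger values being better; in the present setting it could be the *T*^{2} statistic computed on that subset of markers.

```python
def sequential_forward_selection(n_markers, criterion, target_size):
    """Greedily grow a marker subset, one marker per step.

    n_markers:   total number of candidate markers (indexed 0..n_markers-1).
    criterion:   function taking a list of marker indices, returning a score.
    target_size: prespecified dimension l of the final feature vector.
    """
    selected = []
    remaining = set(range(n_markers))
    while len(selected) < target_size and remaining:
        # try every remaining marker added to the current winner;
        # sorting makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=lambda m: criterion(selected + [m]))
        selected.append(best)
        remaining.remove(best)  # nesting effect: a chosen marker is never dropped
    return selected
```

Because a selected marker is never removed, a marker that looks best alone stays in the subset even when a different pair would score higher; this is the nesting effect that the floating variant is designed to relax.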

## Examples

The proposed *T*^{2} test was applied to a simulated data set from Genetic Analysis Workshop 12 (GAW12) (Almasy et al. 2001). Simulated data were provided for an isolated population founded ~20 generations ago by 100 individuals from the general population. Unrelated cases and control subjects () were obtained by selection of founders and their spouses from 23 extended pedigrees. Sequence data are available for a major gene, *MG6* on chromosome 6, that directly influences affection status of the individuals. *MG6* is known to account for 25.3% of disease liability, and, in the GAW12 data, the sequence data are labeled “GENE 1.” Site 557 was identified as the SNP closely related to disease liability. The *P* values of the *T*^{2} statistic and of the classic χ^{2} statistic, for testing the association between SNP markers and affection status that are included within GENE 1 are summarized in table 2. We can see from table 2 that both the *T*^{2} test and the χ^{2} test identified a common set of SNP markers showing significant association with affection status but that, in all cases, the *T*^{2} test had smaller *P* values than did the χ^{2} test. The *T*^{2} test identified site 557 as having the smallest *P* value. Since strong LD exists between many SNP markers within GENE 1 (Czika et al. 2000; Huang et al. 2001), both the *T*^{2} test and the χ^{2} test identified a number of SNP markers that had small *P* values.

Table 2. *T*^{2} Test and the Classic χ^{2} Test, When Applied to Simulated Data within Gene 1 from GAW12

Table 3 shows (1) the results of the *T*^{2} test for 17 two-SNP combinations that have *P* values <10^{−14} and (2), of all possible three-SNP combinations, the top 15 that have the smallest *P* value. Two features are evident from table 3: first, the *P* values of the optimal combination of two or three SNPs are smaller than that of each single SNP in the combination; second, an individual SNP may have a large *P* value, but its combinations with other SNPs may have a very small *P* value.

Table 3. *T*^{2} Test Applied to Simulated Data within GENE 1 from GAW12, When Two or Three SNPs Are Used

The proposed *T*^{2} test was also applied to a real data set of cases of scleroderma, or systemic sclerosis (SSC) (X. Zhou, F. K. Tan, J. D. Reveille, C. Ahn, A. Wang, and F. C. Arnett, personal communication). SSC is a multisystem disease of unknown etiology and is characterized by cutaneous and visceral fibrosis, small-blood-vessel damage, and autoimmune features (Medsger 1997; X. Zhou, F. K. Tan, J. D. Reveille, C. Ahn, A. Wang, and F. C. Arnett, personal communication). Three SNP markers—SPARC 998, SPARC 1551, and SPARC 1992—were genotyped in 20 unrelated patients with SSC and in 75 normal control subjects from the Oklahoma Choctaw population. It has been reported that, in this population, (*a*) the clinical disease pattern is relatively homogeneous and (*b*) the prevalence of SSC in this population is high (X. Zhou, F. K. Tan, J. D. Reveille, C. Ahn, A. Wang, and F. C. Arnett, personal communication). To further evaluate the performance of the *T*^{2} test, both the *T*^{2} test and the χ^{2} test were applied to the samples from the Oklahoma Choctaw, to examine association between the SPARC gene and SSC. Table 4 presents the results. It is evident from table 4 that marker SPARC 998 has a *T*^{2}-associated *P* value that is much smaller than that associated with the classic χ^{2} test. It has been reported that expression of the SPARC gene is ~2.5-fold and ~5-fold increased, respectively, when cDNA microarrays and western-blot analysis are used, in case-control comparisons for SSC (X. Zhou, F. K. Tan, J. D. Reveille, C. Ahn, A. Wang, and F. C. Arnett, personal communication). The SPARC gene is being increasingly recognized as playing a variety of roles in tissue development, remodeling, and fibrosis (Motamed 1999).

## Discussion

In this study, we have proposed a generalized *T*^{2} statistic to relate DNA sequence variations to the occurrence of disease. We show that the noncentrality parameter of the *T*^{2} statistic is at least as large as that of the well-known χ^{2} statistic, indicating that the power of the *T*^{2} statistic is greater than or equal to that of the χ^{2} statistic. In addition, the examples with simulated and real data show that, for the case-control data analyzed here, the *P* value of the *T*^{2} test is smaller than that of the χ^{2} test.

The proposed generalized *T*^{2} statistic has utility in three areas of contemporary human genomic analysis. First, there is general interest in using a dense set of SNPs spanning each of the chromosomes, to localize genes via genomewide association analyses. For such association studies to be effective, it is not necessary that the SNPs be the disease-susceptibility loci; rather, the SNPs may aid in identification of the location of a disease-susceptibility locus, on the basis of LD between the SNP marker loci and nearby disease-susceptibility loci. The magnitude of LD among SNPs (including disease-susceptibility loci) is largely determined by the recombination rates among loci and by stochastic sampling variation, including genetic drift, migration, and sampling. Therefore, association between an SNP (or SNPs) and a trait of interest is generally attributable to LD between the SNP (or SNPs) and a disease-susceptibility locus. Such association may indicate proximity of the inferred disease-susceptibility locus to the SNP marker locus.

The second area in which the proposed statistic has utility is in analysis of the spectrum of variation within a gene, to identify sites or combinations of sites influencing the trait of interest. Variations at these sites are candidates for further experimental and functional studies. An association-mapping perspective is of little utility in this situation, because the recombination rate among sites is practically zero. In this case, one should first identify the complete menu of variable sites within the gene and then consider the ability of these sites to predict levels or prevalence rates of the phenotype of interest. Recent studies (e.g., see Horikawa et al. 2000) have indicated that there are not sufficient data to predict a priori which sites (e.g., cSNPs) are likely to predict and which are likely not to predict.

The third application of the *T*^{2} analysis is to the development of a more comprehensive vision of the genetic architecture (see Boerwinkle et al. 1986) of a trait. It is widely accepted that risk to a common disease is influenced by multiple genes and that these genes are interacting both among themselves and with environmental factors. One of the goals of studying the genetics of common diseases is to identify the contributing genes and mutations and to characterize their interaction as they combine with other agents to influence disease risk. To achieve this goal, it is necessary to have methods that evaluate multiple loci—and their interactions—simultaneously. The proposed *T*^{2} statistic, by virtue of the fact that it simultaneously considers the effects of multiple loci and does not assume additivity among those effects, is an important step in this direction. Future developments will include (1) extension of the *T*^{2} method to quantitative traits and (2) stepwise site-selection procedures. One aspect of considerable interest in the area of genome association studies is the use of haplotype information. Recently, some have argued that haplotypes may be the relevant functional unit in the consideration of genotype-phenotype relationships (Drysdale et al. 2000). In addition, haplotype information can facilitate a cladistic approach to genotype-phenotype relationships (Templeton et al. 1987). In the case of the *T*^{2} test, haplotype information has here been shown not to lend additional information about genotype-phenotype relationships, relative to multilocus genotype information. Initially, this result was surprising. However, on further investigation it was realized that the sample variance-covariance matrix, **S**, contains the pairwise relationships among loci. Therefore, the *T*^{2} statistic captures the pairwise-association information found in haplotypes. 
Higher-order associations, however, may not be included in the regular *T*^{2} statistic, indicating that the *T*^{2}_{H} statistic may have advantages in those situations.

LD analyses and association mapping are powerful tools for contemporary human genetics. Efforts to build a collection of SNP markers in all genes of the human genome (e.g., see The International SNP Map Working Group 2001) and advances in genotyping technologies bode well for large-scale applications in the near future. Such undertakings are not without complications, however. The cost-per-locus test for SNPs remains high. The pattern of LD may vary considerably between populations. And there is a further need to develop, evaluate, and apply novel methods, such as the *T*^{2} test proposed here, for relating the considerable genomic information to risk of disease.

## Acknowledgments

M.X. and J.Z. are supported by NIH grants GM56515 and HL 5448, and E.B. is supported by NIH grant HL 5448.

## Appendix A

Assuming Hardy-Weinberg equilibrium, we can calculate and as follows:

Therefore, we have

Next, we calculate the variance-covariances, *Var*(*X*_{j}) and *Cov*(*X*_{j},*X*_{j′}). Note that

where δ_{jj′}(*A*)=*P*_{BjBj′}(*A*)-*P*_{Bj}(*A*)*P*_{Bj′}(*A*).

Combining the above equations yields *Cov*(*X*_{ij},*X*_{ij′})=*E*[*X*_{ij}*X*_{ij′}]-*E*[*X*_{ij}]*E*[*X*_{ij′}]=2δ_{jj′}(*A*). It is not difficult to see that *Var*(*X*_{ij})=*E*[*X*^{2}_{ij}]-(*E*[*X*_{ij}])^{2}=*P*^{2}_{Bj}(*A*)+*P*^{2}_{bj}(*A*)-[*P*_{Bj}(*A*)-*P*_{bj}(*A*)]^{2}=2*P*_{Bj}(*A*)*P*_{bj}(*A*). Similarly, we have *Cov*(*Y*_{ij},*Y*_{ij′})=2δ_{jj′}(*Ā*) and *Var*(*Y*_{ij})=2*P*_{Bj}(*Ā*)*P*_{bj}(*Ā*).
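The variance and covariance identities above can be verified by brute-force enumeration. The sketch below assumes Hardy-Weinberg equilibrium (two haplotypes drawn independently) and the genotype coding 1, 0, −1 for *BB*, *Bb*, and *bb*, which is equivalent to counting *B* alleles and subtracting 1. The function name and the two-character haplotype labels are hypothetical.

```python
from itertools import product


def genotype_moments(hap_freq):
    """Return (Var of the marker-j indicator, Cov between markers j and j').

    hap_freq maps two-marker haplotypes 'BB', 'Bb', 'bB', 'bb' to frequencies;
    the first character is the allele at marker j, the second at marker j'.
    """
    e_j = e_k = e_jj = e_jk = 0.0
    for (h1, p1), (h2, p2) in product(hap_freq.items(), repeat=2):
        prob = p1 * p2  # Hardy-Weinberg: independent haplotypes
        x_j = (h1[0] == "B") + (h2[0] == "B") - 1  # 1, 0, -1 coding at marker j
        x_k = (h1[1] == "B") + (h2[1] == "B") - 1  # same coding at marker j'
        e_j += prob * x_j
        e_k += prob * x_k
        e_jj += prob * x_j * x_j
        e_jk += prob * x_j * x_k
    return e_jj - e_j ** 2, e_jk - e_j * e_k
```

With haplotype frequencies 0.4, 0.1, 0.2, and 0.3 for *BB*, *Bb*, *bB*, and *bb*, the allele frequencies are 0.5 and 0.6 and δ=0.4−0.5×0.6=0.1, so the identities predict a variance of 2×0.5×0.5 and a covariance of 2×0.1.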

## Appendix B

Note that *P*_{b}(*A*)=1-*P*_{B}(*A*) and *P*_{b}(*Ā*)=1-*P*_{B}(*Ā*). Thus,

However, , which implies that

Therefore, we have

It follows that

But, when , λ is reduced to

## Appendix C

To calculate the noncentrality parameter, λ_{2}, we begin with the calculation of the frequencies of the genotypes at the two disease loci, in the affected population and in the control population. It follows from the definition of the genotype frequencies in the disease population that

Similarly, we have

Let be the probability that an individual with genotypes *D*_{i}*D*_{j} and *d*_{k}*d*_{l} is unaffected. By an argument similar to that used above, we obtain

Now we calculate the expectation of the indicator variables *X*_{11} and *Y*_{11}. Using the definition of the indicator variable, we have

Thus, the vector μ can be calculated by μ=(*E*[*X*_{11}]-*E*[*Y*_{11}],*E*[*X*_{12}]-*E*[*Y*_{12}])^{T}. Next we calculate the variance-covariance matrix Σ_{A}. It is easy to see that

Therefore, we obtain

and

## Appendix D

If we assume Hardy-Weinberg equilibrium, we find that it is not difficult to show that

Thus, μ_{H}=(μ_{H1},…,μ_{HK})^{T}; Σ_{HA}=(1/2)Σ_{A} and Σ_{HĀ}=(1/2)Σ_{Ā}. The noncentrality parameter λ_{H} is then given by

## References

Drysdale et al. (2000) Complex promoter and coding region β_{2}-adrenergic receptor haplotypes alter receptor expression and predict *in vivo* responsiveness. Proc Natl Acad Sci USA 97:10483–10488

**American Society of Human Genetics**


Generalized *T*^{2} Test for Genome Association Studies. American Journal of Human Genetics. May 2002; 70(5):1257.
