
# Test for Interaction between Two Unlinked Loci

^{*}Present affiliation: Division of Cardiology, Emory University School of Medicine, Atlanta.

## Abstract

Despite the growing consensus on the importance of testing gene-gene interactions in genetic studies of complex diseases, the effect of gene-gene interactions has often been defined as a deviance from genetic additive effects, which is essentially treated as a residual term in genetic analysis and leads to low power in detecting the presence of interacting effects. To what extent the definition of gene-gene interaction at the population level reflects the genes' biochemical or physiological interaction remains a mystery. In this article, we introduce a novel definition and a new measure of gene-gene interaction between two unlinked loci (or genes). We developed a general theory for studying linkage disequilibrium (LD) patterns in a disease population under two-locus disease models. The properties of the LD measure in a disease population as a function of the measure of gene-gene interaction between two unlinked loci were also investigated. We examined how interaction between two loci creates LD in a disease population and showed that the mathematical formulation of the new definition for gene-gene interaction between two loci was similar to that of the LD between two loci. This finding motivated us to develop an LD-based statistic to detect gene-gene interaction between two unlinked loci. The null distribution and type I error rates of the LD-based statistic for testing gene-gene interaction were validated using extensive simulation studies. We found that the new test statistic was more powerful than the traditional logistic regression under three two-locus disease models and demonstrated that the power of the test statistic depends on the measure of gene-gene interaction. We also investigated the impact of using tagging SNPs for testing interaction on the power to detect interaction between two unlinked loci. Finally, to evaluate the performance of our new method, we applied the LD-based statistic to two published data sets. Our results showed that the *P* values of the LD-based statistic were smaller than those obtained by other approaches, including logistic regression models.

Complex diseases are typically caused by multiple factors, including multiple genes, primarily through nonlinear gene-gene interactions and gene-environment interactions. Gene-gene interaction is an important but complex concept.^{1} Despite growing recognition of the importance of gene interactions in genetic studies of complex diseases, classical genetic analysis either ignores gene interactions or defines the effect of gene interactions as a deviance from genetic additive effects, which is essentially treated as a residual term in genetic analysis.^{2} Fisher^{3} mathematically defined the effect of gene interactions as a statistical deviance from the additive effects of single genes, which is often referred to as “statistical interaction” between genes. This was further developed by Cockerham^{4} and Kempthorne^{5} into the modern representation that treats statistical gene interactions as interaction terms in a regression model or a generalized linear model on allelic effects.^{2,6–11} Modeling a trait as an additive combination of its single-locus main effects and interaction terms is likely to limit the power to detect interaction.

In the past several years, combinatorial partitioning^{12} and various data-mining methods^{1,13–21} have been explored to detect gene-gene interaction. The limitations of these methods include (1) the lack of clear biological interpretation of gene-gene interaction, (2) the requirement of intensive computation, and (3) the fact that the power to detect gene-gene interaction may depend on the data structure.

To overcome these limitations, we propose to define interaction between two unlinked loci (or genes) for a qualitative trait as the deviance of the penetrance for a haplotype at two loci from the product of the marginal penetrance of the individual alleles that span the haplotype. This definition of gene-gene interaction between two unlinked loci measures the dependence of the penetrance at one marker locus on the genotypes at another locus, which is not derived from the additive model. Interaction between two unlinked loci will result in deviation of the penetrance of the two-locus haplotype from independence of the marginal penetrance of the alleles at an individual locus, which in turn will create linkage disequilibrium (LD) even if two loci are unlinked. The level of LD created depends on the magnitude of interaction between two unlinked loci. Therefore, it is possible to develop statistics for detection of interaction between two unlinked loci by use of deviations from LD. Such statistics for interaction detection between two unlinked loci have advantages, as follows. First, since interaction between two unlinked loci can be characterized by LD between two interacting loci, the LD-based statistics for detection of interaction between two unlinked loci will have a clear biological interpretation. Second, they will not treat interaction as a residual term in the model and can implicitly consider nonlinear interaction between two unlinked loci. Hence, LD-based statistics for detection of interaction between two unlinked loci will have higher power than that of the traditional Fisher's method. Third, computation of LD-based statistics is much faster than logistic regression models; thus, they are particularly suitable for genomewide association studies.

To date, formal statistics for testing gene interactions by use of LD among loci have not yet been developed, although several empirical studies to assess the role of gene interaction by use of LD have been conducted.^{22–25} These studies assessed deviations from equilibrium in the affected population to indicate interaction between two unlinked loci, but they have limitations. Most of the LD-based empirical studies are descriptive. They separately tested deviation from equilibrium in cases and controls but did not provide a unified statistic to test gene interaction by assessing the difference in LD between cases and controls. Furthermore, they did not examine the null distributions, type I error rates, and power of the test statistics. As a consequence, in the presence of complex LD patterns in populations, these LD-based empirical studies for identifying gene interactions may have high false-positive rates.

The main purpose of this article is to develop statistics with high power for detection of interaction between two unlinked loci. To accomplish this, we first develop general theory to study LD patterns under two-locus disease models. We then develop a novel definition of gene interaction and a measure of interaction between two unlinked disease loci under the framework of LD analysis. The pattern of LD between two unlinked loci created by gene-gene interaction provides a foundation for developing statistics for detection of interaction. This motivates us to develop the LD-based statistics for testing interactions between two unlinked loci. We also investigate type I error rates of the LD-based statistics. Furthermore, we explore the possibility of using two unlinked tagging SNPs (tSNPs) for detecting interaction between two disease loci that are in LD with the chosen tSNPs. To investigate the impact of using tSNPs on interaction detection, we evaluate the power of directly using interacting disease loci and of using tSNPs that are in high LD with the interacting disease loci to detect interaction. To evaluate the performance of the new statistic, we also apply it to two real examples. We conclude with a discussion of the advantages and potential limitations of the proposed statistic.

## Methods

### LD Generated by Gene-Gene Interactions

To investigate the LD pattern generated by gene-gene interaction, we assume that two disease-susceptibility loci are in Hardy-Weinberg equilibrium (HWE) and are unlinked. Let *D*_{1} and *d*_{1} be the two alleles at the first disease locus, with frequencies *P*_{D1} and *P*_{d1}, respectively. Let *D*_{2} and *d*_{2} be the two alleles at the second disease locus, with frequencies *P*_{D2} and *P*_{d2}, respectively. Alleles *D*_{1} and *d*_{1} can be indexed by 1 and 2, respectively. At the first disease locus, let *D*_{1}*D*_{1} be genotype 11, *D*_{1}*d*_{1} be genotype 12, and *d*_{1}*d*_{1} be genotype 22. The genotypes at the second disease locus are similarly defined. Two-locus genotypes are simply denoted by *ijkl* for individuals carrying the haplotypes *ik* and *jl* arranged from left to right. Let *f*_{ijkl} be the penetrance of the individuals with haplotypes *ik* and *jl* arranged from left to right. Let *P*_{11}, *P*_{12}, *P*_{21}, and *P*_{22} be the frequencies of haplotypes *H*_{D1D2}, *H*_{D1d2}, *H*_{d1D2}, and *H*_{d1d2} in the general population, respectively. Let *P*^{A}_{11}, *P*^{A}_{12}, *P*^{A}_{21}, and *P*^{A}_{22} be their corresponding haplotype frequencies in the disease population. Let *P*^{A}_{D1}, *P*^{A}_{d1}, *P*^{A}_{D2}, and *P*^{A}_{d2} be the frequencies of the alleles *D*_{1}, *d*_{1}, *D*_{2}, and *d*_{2} in the disease population, respectively.

For ease of discussion, we introduce a concept of haplotype penetrance. Consider a haplotype with allele *i* at the first disease locus and allele *k* at the second disease locus. Then, the penetrance of haplotype *H*_{ik} is defined as

Let δ=*P*_{11}-*P*_{D1}*P*_{D2} be the LD measure in the general population. In appendix A, we show that haplotype frequencies in disease population can be expressed as

where *P*_{A} denotes disease prevalence and is given by

Now, we calculate the LD measure in the disease population under a general two-locus disease model. The measure of LD in the disease population is defined as δ^{A}=*P*^{A}_{11}*P*^{A}_{22}-*P*^{A}_{12}*P*^{A}_{21}. We can show (appendix A) that it can be given by

where *I*=*h*_{11}*h*_{22}-*h*_{12}*h*_{21}, which is defined as a measure of interaction between two unlinked loci and quantifies the magnitude of interaction. Absence of interaction between two unlinked loci is then defined as

Under this definition, in the absence of interaction, two unlinked loci in the disease population will be in linkage equilibrium.

From equation (2), we can see that, if *h*_{11}*h*_{22}≠*h*_{12}*h*_{21}, even if two loci are in linkage equilibrium in the general population, two loci will be in LD in the disease population. LD in the disease population is created by the interaction between two unlinked loci. This provides a basis for testing interaction between two unlinked loci, as shown in the “Test Statistic” section.
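As a sketch of how the quantities in this section fit together, the expressions below are a reconstruction from the stated definitions, assuming random pairing of haplotypes under HWE and Bayes' rule applied to a haplotype drawn from an affected individual. The typeset form and numbering of equations (1) to (3) in the original may differ.

```latex
\begin{align}
h_{ik} &= \sum_{j=1}^{2}\sum_{l=1}^{2} P_{jl}\, f_{ijkl}, \\
P^{A}_{ik} &= \frac{P_{ik}\, h_{ik}}{P_{A}}, \qquad
P_{A} = \sum_{i=1}^{2}\sum_{k=1}^{2} P_{ik}\, h_{ik}, \\
\delta^{A} &= P^{A}_{11}P^{A}_{22} - P^{A}_{12}P^{A}_{21}
  = \frac{P_{11}P_{22}\,h_{11}h_{22} - P_{12}P_{21}\,h_{12}h_{21}}{P_{A}^{2}}, \\
\delta^{A}\Big|_{\delta=0} &= \frac{P_{D_1}P_{d_1}P_{D_2}P_{d_2}}{P_{A}^{2}}
  \left(h_{11}h_{22}-h_{12}h_{21}\right)
  = \frac{P_{D_1}P_{d_1}P_{D_2}P_{d_2}}{P_{A}^{2}}\; I .
\end{align}
```

The last line uses $P_{11}P_{22}=P_{12}P_{21}=P_{D_1}P_{d_1}P_{D_2}P_{d_2}$ when $\delta=0$, so under these assumptions $\delta^{A}$ vanishes exactly when $I=0$.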

Define *h*_{D1}=*P*(*Affected*|*D*_{1}) and *h*_{D2}=*P*(*Affected*|*D*_{2}). In appendix A, we show that equation (3) implies that

Similar to linkage equilibrium, where the frequency of a haplotype is equal to the product of the frequencies of the component alleles of the haplotype, absence of interaction between two unlinked loci implies that the proportion of individuals carrying a haplotype in the disease population is equal to the product of the proportions of individuals carrying the component alleles of the haplotype in the disease population, if we assume that the disease is caused by only two investigated disease loci. In other words, interaction between two disease-susceptibility loci occurs when contribution of one locus to the disease depends on another locus.

Suppose that the first locus postulated above is a disease-susceptibility locus and that the second is a marker locus that does not predispose carriers to a disease phenotype. Let *f*_{ij} be the penetrance of the genotype *ij* at the disease-susceptibility locus. Then, we have *h*_{11}=*P*_{D1}*f*_{11}+*P*_{d1}*f*_{12}, *h*_{22}=*P*_{D1}*f*_{21}+*P*_{d1}*f*_{22}, *h*_{12}=*P*_{D1}*f*_{11}+*P*_{d1}*f*_{12}, and *h*_{21}=*P*_{D1}*f*_{21}+*P*_{d1}*f*_{22}, which implies that

That is, the measure of LD between a disease locus and a marker locus in the disease population (δ^{A}) can be expressed as the measure of LD in the general population (δ) times a multiplicative factor. If the disease locus and the marker locus are unlinked, then δ=0 in the general population, and the disease and marker loci will also be in linkage equilibrium in the disease population. This demonstrates that, in the absence of interaction between the unlinked marker and the disease loci, LD in the disease population cannot be created.
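The quantities in this section can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes haplotype penetrance is the average of *f*_{ijkl} over the partner haplotype *jl*, and the two penetrance functions (`dom_or_dom`, `locus1_only`) and all numeric values are hypothetical illustrations rather than entries from table 1.

```python
from itertools import product


def two_locus_summary(f, p1, p2, delta=0.0):
    """Return (h, P_A, I, delta_A) for a two-locus penetrance function.

    f(i, j, k, l): penetrance of an individual carrying haplotypes ik and jl,
                   with alleles coded 1 = D and 2 = d as in the text.
    p1, p2:        frequencies of D1 and D2 in the general population.
    delta:         LD in the general population (0 for unlinked loci).
    """
    q1, q2 = 1.0 - p1, 1.0 - p2
    # Haplotype frequencies P_ik in the general population.
    P = {(1, 1): p1 * p2 + delta, (1, 2): p1 * q2 - delta,
         (2, 1): q1 * p2 - delta, (2, 2): q1 * q2 + delta}
    pairs = list(product((1, 2), repeat=2))
    # Haplotype penetrance: average over the partner haplotype jl (assumed form).
    h = {(i, k): sum(P[j, l] * f(i, j, k, l) for j, l in pairs)
         for i, k in pairs}
    prevalence = sum(P[ik] * h[ik] for ik in pairs)
    # Haplotype frequencies among affected individuals.
    PA = {ik: P[ik] * h[ik] / prevalence for ik in pairs}
    interaction = h[1, 1] * h[2, 2] - h[1, 2] * h[2, 1]
    delta_A = PA[1, 1] * PA[2, 2] - PA[1, 2] * PA[2, 1]
    return h, prevalence, interaction, delta_A


def dom_or_dom(i, j, k, l, f=0.5):
    """Hypothetical model: penetrance f if at least one D1 or D2 allele is carried."""
    return f if 1 in (i, j, k, l) else 0.0


def locus1_only(i, j, k, l):
    """Second locus is a neutral marker; penetrances depend on genotype ij only."""
    return {(1, 1): 0.6, (1, 2): 0.3, (2, 1): 0.3, (2, 2): 0.1}[i, j]


_, _, I_marker, dA_marker = two_locus_summary(locus1_only, 0.3, 0.4)
_, PA_dd, I_dd, dA_dd = two_locus_summary(dom_or_dom, 0.3, 0.4)
print(I_marker, dA_marker)  # both vanish up to rounding: no interaction, no LD in cases
print(I_dd, dA_dd)          # nonzero: interaction creates LD in cases
```

With a neutral marker the haplotype penetrances satisfy *h*_{11}=*h*_{12} and *h*_{21}=*h*_{22}, so the interaction measure is zero, which matches the argument in the text.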

To further understand the measure of interaction between two unlinked loci, we examined the interactions between two unlinked loci under six two-locus disease models. Results are listed in table 1, in which the values represent the penetrances of the given genotypes.^{26–28} The measure of interaction between two unlinked loci depends not only on penetrance but also on the frequencies of the disease alleles.

### Indirect Interaction between Two Unlinked Marker Loci

In the previous section, we studied interaction between two unlinked disease loci. Now, we consider two marker loci, each of which is in LD with one of the two interacting loci. Although there is no physiological interaction between the two marker loci, if each marker locus is in LD with one of the two unlinked interacting loci, we can still observe LD between the two unlinked marker loci in the disease population. Assume that marker *M*_{1} is in LD with disease locus *D*_{1} and that marker *M*_{2} is in LD with disease locus *D*_{2}. Furthermore, we assume that the two disease loci, *D*_{1} and *D*_{2}, are unlinked. Let δ^{A}_{M} be the LD measure between the two marker loci in the disease population. Let δ_{i} be the LD measure between marker *M*_{i} and disease locus *D*_{i} (*i*=1,2) in the general population. Then, we can show (appendix B) that

where δ^{A} is the measure of LD between two unlinked disease loci in the disease population. It is clear that, when the marker loci are the disease loci themselves, δ^{A}_{M} is reduced to δ^{A}. Equation (4) can also be written in terms of the measure of interaction between two unlinked loci:

Since |δ_{i}|≤*P*_{Di}*P*_{di}, the absolute value of the LD measure between two unlinked marker loci in the disease population, |δ^{A}_{M}|, will be less than or equal to the absolute value of the LD measure between two unlinked disease loci in the disease population.

Equation (4) shows that the LD between unlinked marker loci in the disease population is proportional to the product of LD between each marker locus and its linked disease locus, δ_{1}δ_{2}. Since the criteria for tSNP selection are based on only one pairwise LD between the marker and disease loci, the LD between tSNPs and interacting loci may not be large enough to ensure that indirect interaction between two unlinked marker loci will be detected. Thus, if the interacting disease loci are not selected as tSNPs, many loci with interactions will be missed. This will have a profound implication on tSNP selection.
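To make the attenuation concrete, here is a minimal sketch. It assumes that equation (4) has the form δ^{A}_{M} = δ_{1}δ_{2}δ^{A}/(*P*_{D1}*P*_{d1}*P*_{D2}*P*_{d2}), which is consistent with the two properties stated in the text (proportionality to δ_{1}δ_{2}, and reduction to δ^{A} when the markers are the disease loci) but is not reproduced from the original display. The function names and numeric values are hypothetical.

```python
def marker_ld_in_cases(delta_A, delta1, delta2, pD1, pD2):
    """LD between two unlinked markers in cases, under the assumed form of equation (4)."""
    return delta1 * delta2 * delta_A / (pD1 * (1 - pD1) * pD2 * (1 - pD2))


def ld_from_r2(r2, p_marker, p_disease):
    """Magnitude of delta implied by r^2 and the marker and disease allele frequencies."""
    return (r2 * p_marker * (1 - p_marker) * p_disease * (1 - p_disease)) ** 0.5


pD1, pD2 = 0.3, 0.4
delta_A = 0.02  # hypothetical LD between the disease loci in cases

# Markers that are the disease loci themselves (r^2 = 1, same allele frequencies).
full = marker_ld_in_cases(delta_A, ld_from_r2(1.0, pD1, pD1),
                          ld_from_r2(1.0, pD2, pD2), pD1, pD2)

# tSNPs with r^2 = 0.8 to each disease locus and matching allele frequencies.
tagged = marker_ld_in_cases(delta_A, ld_from_r2(0.8, pD1, pD1),
                            ld_from_r2(0.8, pD2, pD2), pD1, pD2)
print(full, tagged)
```

Under these assumptions, two tSNPs that each meet an *r*^{2} threshold of 0.8 retain only about 80% of the LD signal between the disease loci, because the two attenuation factors multiply.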

### Test Statistic

In the previous section, we showed that interaction between unlinked loci will create LD. Intuitively, we can test for interaction by comparing the LD levels between two unlinked loci in cases and in controls. Precisely, if we denote the estimators of the LD measures in cases and controls by δ̂^{A} and δ̂^{N}, respectively, then the test statistic can be defined as

where

and *n*_{A} and *n*_{G} denote the number of sampled individuals in cases and controls, respectively. *P*^{A}_{11}, *P*^{A}_{D1}, *P*^{A}_{D2}, *P*^{N}_{11}, *P*^{N}_{D1}, and *P*^{N}_{D2} are defined as before, and *P̂*^{A}_{11}, *P̂*^{A}_{D1}, *P̂*^{A}_{D2}, *P̂*^{N}_{11}, *P̂*^{N}_{D1}, and *P̂*^{N}_{D2} are their estimators. The variance of the LD measure is the large-sample variance,^{29} and *V̂*_{A} and *V̂*_{N} are the estimators of the variances *V*_{A} and *V*_{N}, respectively. This statistic will be referred to as the “LD-based statistic” throughout the article. We can show that test statistic *T*_{I} is asymptotically distributed as a central χ^{2}_{(1)} distribution under the null hypothesis of no interaction between two unlinked loci (appendix C).

In theory, when there is no interaction between two unlinked loci, the LD between them should be zero. Thus, we can use case-only design to study interaction between two loci. In this case, equation (5) will be reduced to

However, in practice, background LD between two unlinked loci may exist in the population because of many unknown factors. Therefore, using equation (6) to test for interaction will increase type I error rates. The test statistic defined in equation (5) is more robust than that in equation (6). In appendix C, we showed that, for an admixed population, if differences in allele frequencies between two subpopulations at each of the two loci in cases and controls are the same, test statistic *T*_{I} in equation (5) is still a valid test for interaction between two unlinked loci.
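The case-control and case-only tests described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes phased haplotype counts are available (in practice haplotype frequencies would be estimated from genotypes), and it uses the standard large-sample variance of the LD coefficient estimated from *n* haplotypes, which may differ in detail from the variance in equation (5). The function names and the counts are hypothetical.

```python
import math


def ld_and_variance(n11, n12, n21, n22):
    """Estimate delta and its large-sample variance from haplotype counts.

    n11, n12, n21, n22: counts of haplotypes D1D2, D1d2, d1D2, d1d2.
    """
    n = n11 + n12 + n21 + n22  # number of haplotypes (two per individual)
    p11, p1, p2 = n11 / n, (n11 + n12) / n, (n11 + n21) / n
    delta = p11 - p1 * p2
    var = (p1 * (1 - p1) * p2 * (1 - p2)
           + (1 - 2 * p1) * (1 - 2 * p2) * delta - delta ** 2) / n
    return delta, var


def ld_contrast_test(case_counts, control_counts=None):
    """Return (T, p). With control_counts=None this is the case-only variant."""
    d_a, v_a = ld_and_variance(*case_counts)
    d_n, v_n = ld_and_variance(*control_counts) if control_counts else (0.0, 0.0)
    stat = (d_a - d_n) ** 2 / (v_a + v_n)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


cases = (60, 40, 40, 60)      # hypothetical counts with LD in cases
controls = (50, 50, 50, 50)   # hypothetical counts in linkage equilibrium
stat_cc, p_cc = ld_contrast_test(cases, controls)
stat_co, p_co = ld_contrast_test(cases)
print(stat_cc, p_cc)
print(stat_co, p_co)
```

The case-only statistic is larger for the same data because it attributes all LD in cases to interaction, which is why background LD in the population inflates its type I error rate, whereas the case-control contrast subtracts that background.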

## Results

### Patterns of Pairwise LD under Two-Locus Disease Models

Knowledge about differences in LD patterns between disease and general populations is crucial for association studies of complex diseases. To illustrate how the differences in LD patterns between disease and general populations are influenced by disease models, we examined the LD patterns between unlinked loci by assuming several two-locus disease models. We first studied the LD between two unlinked loci under three two-locus disease models: the union of dominant and dominant (Dom Dom), the union of recessive and recessive (Rec Rec), and threshold models (table 1). Figure 1 shows the LD between two unlinked loci, which is generated by the joint actions of two disease loci, as a function of the allele frequency at the first locus, under the assumption that the allele frequency at the second locus *P*_{D2}=0.1 and penetrance parameter *f*=1. Figure 1 shows that, although two unlinked loci in the general population are in linkage equilibrium, LD between the two unlinked loci does exist in the disease population. The LD in the disease population depends on the disease models and the allele frequencies at the two loci.

### Pairwise Interaction Measure

The proposed measure of interaction between two unlinked loci quantifies the magnitude of interaction between two unlinked loci. To further explore the properties of the interaction measure between two unlinked loci, we investigated the impact of the two-locus disease models on the measure of interaction. Figure 2 plots the measure of interaction between two unlinked loci under six two-locus disease models (table 1) as a function of penetrance parameter *f*, under the assumption that the allele frequencies at the two loci are 0.3 and 0.8 (fig. 2*A*) or 0.2 and 0.4 (fig. 2*B*). The figures show that the measure of interaction is a monotonic function of the penetrance parameter. The measure of interaction depends on both the disease models and the allele frequencies at the two loci. However, the relationship between the measure of interaction and disease models is complex. For example, when the allele frequencies at two loci are 0.2 and 0.4, the measure of interaction for the Dom Dom model is much larger than that for the Rec Rec model, whereas when the allele frequencies at two loci are 0.3 and 0.8, the measure of interaction for the Dom Dom model is smaller than that for the Rec Rec model. This may partially explain why gene-gene interaction detected in one population cannot be replicated in another population, because allele frequencies are different between populations.
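One way to see the monotonic dependence on *f* is the following sketch. It assumes that every penetrance in a model equals *f* times a fixed genotype-specific constant $c_{ijkl}$, and that haplotype penetrance is the partner-haplotype average of $f_{ijkl}$; both are assumptions made here for illustration, since table 1 is not reproduced in this copy.

```latex
\begin{align}
f_{ijkl} = f\,c_{ijkl}
\;\Longrightarrow\;
h_{ik} = f\sum_{j,l}P_{jl}\,c_{ijkl} \equiv f\,a_{ik},
\qquad
I = h_{11}h_{22}-h_{12}h_{21} = f^{2}\left(a_{11}a_{22}-a_{12}a_{21}\right).
\end{align}
```

Under these assumptions $|I|$ grows as $f^{2}$, while the sign and size of the bracketed term depend on the model and on the allele frequencies through the $a_{ik}$, which is in line with the model ranking changing between the two allele-frequency settings.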

### Null Distribution of Test Statistics

In the previous sections, we have shown that, when sample size is large enough to apply large-sample theory, distribution of the statistic *T*_{I} for testing interaction between two unlinked loci under the null hypothesis of no interaction is asymptotically a central χ^{2}_{(1)} distribution. To examine the validity of this statement, we performed a series of simulation studies. The computer program SNaP^{30} was used to generate two-locus genotype data of the sample individuals. A total of 10,000 individuals who were equally divided into cases and controls were generated in the general population. From each group of the cases and controls, 100–500 individuals were randomly sampled; 10,000 simulations were repeated.

Figure 3*A* and 3*B* plot the histograms of the test statistic *T*_{I} for testing gene-gene interaction between two unlinked loci with sample sizes *n*_{A}=*n*_{G}=150 and *n*_{A}=*n*_{G}=250, respectively. It can be seen that the distributions of the test statistic *T*_{I} are similar to the theoretical central χ^{2}_{(1)} distribution. Table 2 shows that the estimated type I error rates of the statistic *T*_{I} for testing interaction were not appreciably different from the nominal levels α=0.05, α=0.01, and α=0.001.

Figure 3. Histograms of the test statistic *T*_{I} by use of 150 individuals (*A*) or 250 individuals (*B*) from both the cases and the controls in a homogeneous population.

Table 2. Type I Error Rates of the Test Statistic *T*_{I} in Testing Interaction between Two Unlinked Loci in a Homogeneous Population

To examine the impact of population substructure on the null distribution of the test statistic *T*_{I}, we performed a series of simulations. We assumed that allele frequencies at the first locus were 0.7 and 0.3 in population 1 and 0.3 and 0.7 in population 2. The allele frequencies at the second locus were assumed to be 0.2 and 0.8 in population 1 and 0.8 and 0.2 in population 2. From each population, 10,000 individuals were sampled, and these individuals were mixed to form an admixed population, which was then equally divided into cases and controls. Three hundred individuals were randomly sampled from each group of the cases and controls, and 10,000 simulations were repeated. Figure 4 shows the histograms of test statistic *T*_{I}. It can be seen that the distribution of *T*_{I} is similar to the theoretical central χ^{2}_{(1)} distribution, which shows that population admixture has only a mild impact on the null distribution of test statistic *T*_{I}.
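The homogeneous-population check of the null distribution can be imitated with a short simulation. The sketch below is not the SNaP-based procedure used in the study: it draws phased haplotypes directly with independent alleles at the two loci, uses a hypothetical helper `t_i` with the standard large-sample variance of the LD coefficient, and picks allele frequencies, replicate count, and seed arbitrarily for illustration.

```python
import math
import random


def t_i(cases, controls):
    """LD-contrast statistic from two 4-tuples of haplotype counts (n11, n12, n21, n22)."""
    deltas, variances = [], []
    for n11, n12, n21, n22 in (cases, controls):
        n = n11 + n12 + n21 + n22
        p1, p2 = (n11 + n12) / n, (n11 + n21) / n
        d = n11 / n - p1 * p2
        deltas.append(d)
        variances.append((p1 * (1 - p1) * p2 * (1 - p2)
                          + (1 - 2 * p1) * (1 - 2 * p2) * d - d * d) / n)
    total_var = sum(variances)
    return (deltas[0] - deltas[1]) ** 2 / total_var if total_var > 0 else 0.0


def sample_counts(rng, n_hap, p1, p2):
    """Draw haplotypes with independent alleles (no LD, hence no interaction signal)."""
    counts = [0, 0, 0, 0]
    for _ in range(n_hap):
        a = 0 if rng.random() < p1 else 1  # 0 = D1, 1 = d1
        b = 0 if rng.random() < p2 else 1  # 0 = D2, 1 = d2
        counts[2 * a + b] += 1
    return counts


def type_i_error(n_people=150, p1=0.3, p2=0.4, reps=2000, alpha=0.05, seed=1):
    """Fraction of null replicates rejected at level alpha using chi-square(1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        cases = sample_counts(rng, 2 * n_people, p1, p2)
        controls = sample_counts(rng, 2 * n_people, p1, p2)
        p_value = math.erfc(math.sqrt(t_i(cases, controls) / 2))
        rejections += p_value < alpha
    return rejections / reps


rate = type_i_error()
print(rate)  # expected to be close to the nominal 0.05
```

Extending the sketch to the admixed setting would mean drawing each individual's haplotypes from one of two subpopulations with different allele frequencies before splitting into cases and controls.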

### Power Evaluation

To further evaluate the performance of the proposed statistic in testing gene-gene interaction, we compared the power of the LD-based statistic with that of the logistic model. We considered three types of genotype coding (genetic covariate variables). For a recessive model, homozygous wild-type, heterozygous, and homozygous mutant genotypes were coded as 0, 0, and 1, respectively. For a dominant model, these three genotypes were coded as 0, 1, and 1. For an additive model, they were coded as 0, 1, and 2. We considered two loci, denoted as *G* and *H*, respectively. Power for the logistic regression model in testing gene-gene interaction was calculated using the software QUANTO.^{31} Figure 5*A*, 5*B*, and 5*C* present the power comparisons between the logistic regression model and the LD-based statistic under the three genetic interaction models: recessive × recessive, dominant × dominant, and additive × additive. We can see that the power of both logistic regression and the new LD-based statistic in detecting gene-gene interaction was a monotonic function of the interaction odds ratio, a widely used measure in quantifying the strength of interaction between two loci. This implies that the proposed new interaction measure and test statistic are closely related to the traditional interaction measure. Figure 5*A*, 5*B*, and 5*C* also show that the power of the test statistic *T*_{I} is much higher than that of the logistic regression model.
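The three genotype codings can be written down directly. The sketch below also builds the product covariate that a logistic model would use for the interaction term; this product-term parameterization, with interaction odds ratio *R*_{GH} = exp(β_{GH}), is the usual convention and is assumed here rather than taken from the text. The names `CODING`, `code_genotype`, and `interaction_covariate` are hypothetical.

```python
# Covariate value for 0, 1, or 2 copies of the mutant allele under each model.
CODING = {
    "recessive": (0, 0, 1),
    "dominant": (0, 1, 1),
    "additive": (0, 1, 2),
}


def code_genotype(n_mutant_alleles, model):
    """Map a genotype (number of mutant alleles) to its covariate value."""
    return CODING[model][n_mutant_alleles]


def interaction_covariate(g, h, model_g, model_h):
    """Product term x_G * x_H for loci G and H under the chosen codings."""
    return code_genotype(g, model_g) * code_genotype(h, model_h)


# A heterozygote at G and a mutant homozygote at H under each pair of codings.
for model in CODING:
    print(model, interaction_covariate(1, 2, model, model))
```

Under the recessive × recessive coding only double mutant homozygotes carry a nonzero interaction covariate, which is one reason power under that model depends strongly on allele frequencies.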

Figure 5. Power of the test statistic *T*_{I} and logistic regression analysis as a function of interaction odds ratio (*R*_{GH}) under three different models. *A*, Recessive × recessive model, under the assumption that the risk allele frequencies at both loci *G* and ...

Pairwise LD is widely used in tSNP selection^{32}; that is, the chosen tSNPs are in LD (measured by *r*^{2}) above a preset threshold with the nearby SNPs that were not selected. This approach ensures enough power to detect a disease locus. We now investigate whether the selected threshold can ensure enough power to detect interaction between two unlinked loci. Figure 6*A*, 6*B*, and 6*C* show the power of the statistic *T*_{I} for detecting interaction between two unlinked disease loci (using two tSNPs) as a function of the interaction measure under three two-locus disease models: Dom Dom, Dom Rec, and Rec Rec (table 1). For simplicity of presentation, we assume that each of the two unlinked marker loci has an equal correlation coefficient with one of the two unlinked interacting disease loci. We fix the allele frequency at the second locus and change the allele frequency at the first locus to vary the measure of interaction between the two loci. Several remarkable features emerge from figure 6*A*, 6*B*, and 6*C*. First, in many cases, power increases as the measure of interaction increases. Second, using neighboring tSNPs has much lower power than does using the two interacting disease loci themselves directly. Third, the magnitude of *r*^{2} has a large impact on the power of interaction detection.

Figure 6. Power of the statistic *T*_{I} as a function of the interaction measure between two unlinked loci under a two-locus disease model. *A*, Dom Dom, under the assumption that the number of individuals in both cases and controls are 500, penetrance ...

In figure 6*A*, 6*B*, and 6*C*, we studied the power as a function of the measure of interaction. However, in practice, a measure of interaction cannot be directly observed. To provide more practically useful information for tSNP selection and association studies, we plot figure 7*A*, 7*B*, and 7*C*, showing the power of statistic *T*_{I} for interaction detection of two unlinked loci as a function of the allele frequency at the first locus under three two-locus disease models: Dom Dom, Dom Rec, and Rec Rec (table 1). Like figure 6*A*, 6*B*, and 6*C*, figure 7*A*, 7*B*, and 7*C* demonstrate that using tSNPs to detect interaction between two disease loci has much lower power than does using the disease loci themselves. Figure 7*A*, 7*B*, and 7*C* also show that allele frequencies have a large impact on the power of interaction detection, although the patterns of the impact are different under different two-locus disease models.

### Application to Real Data Examples

The proposed LD-based statistic was also applied to two real data sets. The first data set is a case-control study. It includes 398 white patients with breast cancer and 372 matched controls from the Ontario Familial Breast Cancer Registry.^{33} A total of 19 SNPs from 18 key genes from the pathways of DNA repair, cell cycle, carcinogen/estrogen metabolism, and immune system were typed. All SNPs were in HWE. Under a codominant model, multivariate logistic analysis found significant gene-gene interactions between four pairs of genes: *XPD* and *IL10*, *GSTP1* and *COMT*, *COMT* and *CCND1*, and *BARD1* and *XPD*.^{33} We used the statistic *T*_{I} to test interactions between these four pairs of genes. The results are summarized in table 3, which also includes the crude *P* values obtained by Onay et al.^{33} When calculating the crude *P* values, Onay et al.^{33} included all the main effects and only the interaction term of interest in their multivariate logistic regression model. Using our LD-based statistic, we also found these four pairs of interactions to be significant, but with much smaller *P* values. Moreover, two pairs of significant interactions, *XPD* (Lys751Gln) with *IL10* (G−1082A) and *GSTP1* (Ile241Val) with *COMT* (Met108/158Val), remained significant after adjustment for multiple testing by use of Bonferroni correction, whereas all four pairs of interactions identified by logistic regression became nonsignificant after the same Bonferroni correction. Onay et al.^{33} noted that these four identified interactions are supported by experiments and by the biological relationships of the genes.^{33–37}
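For readers unfamiliar with the adjustment, a Bonferroni correction can be sketched as below. The *P* values and the number of tests in the example are hypothetical placeholders and are not the values from table 3; the text does not state how many tests were counted in the correction.

```python
def bonferroni(p_values, n_tests=None, alpha=0.05):
    """Return (adjusted_p, significant) pairs under a Bonferroni correction.

    n_tests defaults to the number of P values supplied; pass a larger value
    when more tests were performed than are being reported.
    """
    m = n_tests if n_tests is not None else len(p_values)
    return [(min(1.0, p * m), p <= alpha / m) for p in p_values]


# Hypothetical crude P values for three pairs, corrected for 10 tests.
results = bonferroni([0.0001, 0.004, 0.03], n_tests=10)
for adjusted, significant in results:
    print(round(adjusted, 4), significant)
```

The comparison in the text follows the same logic: a pair stays significant only if its crude *P* value is below the nominal level divided by the number of tests.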

The second data set was a birth cohort study that recorded the incidence of hospital admission with malaria and severe malaria from Kilifi District Hospital on the coast of Kenya in Africa.^{38} A total of 2,104 children from the study were genotyped for both hemoglobin (Hb) and α^{+}-thalassemia genes to test their interaction. The Hb gene has two alleles, A and S. The mutant S causes sickle cell disease. The normal and mutant alleles in the gene α^{+}-thalassemia are denoted by α and −. We applied the proposed statistic *T*_{I} to these data to test interaction between the Hb and α^{+}-thalassemia genes. The results are summarized in table 4. For comparison, table 4 also lists *P* values obtained by Poisson regression analysis performed by Williams et al.^{38} We can see that the *P* values of the test statistic *T*_{I} were smaller than those of the Poisson regression analysis. The structural variant HbS and α^{+}-thalassemia are each protective against severe *Plasmodium falciparum* malaria. However, if they were inherited together, protection against malaria was lost. The negative epistasis between these two genes can be explained by their biochemical functions.^{38} The malaria-protective effect of HbAS comes from allele HbS, which might increase binding of hemichromes to the erythrocyte membrane, leading to opsonization and accelerating the removal of infected erythrocytes by phagocytosis. However, coexistence of α^{+}-thalassemia with HbS reduces the concentration of HbS, which in turn reduces the protective effect of HbS against malaria.

## Discussion

Understanding how genomic information underlies the development of complex diseases is one of the greatest challenges in the 21st century. In the past several decades, genetic studies of human disease have focused on a “locus-by-locus” paradigm.^{39} However, biological information is processed in complex networks. Disease emerges as the result of interactions between genes and between genes and environments. Studying one gene or polymorphism at a time to explore the cause of the disease, while ignoring the interaction between loci (genes), is unlikely to unravel the mechanism of disease. With the imminent completion of the International HapMap Project, development of statistical methods for detecting gene-gene interaction is of great importance. The purpose of this article is to present a new statistic for identifying interaction between two unlinked loci.

Association studies rely heavily on the LD pattern between pairs of loci. Knowledge about the difference in LD between the disease and general populations is essential for understanding the interaction between two loci and their association with the disease. However, little is known about how the multiple-locus disease models influence the pattern of LD in the disease population and how the interaction between two functional SNPs generates the LD in a disease population. Therefore, before presenting the new statistic for detection of the interaction between two unlinked loci, we first developed the general theory to study LD patterns in a disease population under two-locus disease models. We introduced a new concept of haplotype penetrance and developed a measure of interaction between two unlinked loci. Surprisingly, the formula for calculating the interaction measure was very similar to that for calculating the LD measure. The proposed measure of interaction characterizes the contribution of interaction between two loci to the cause of disease. We also investigated how two-locus disease models and population parameters affect the measure of interaction between two unlinked loci. Intuitively, interaction indicates the joint action of two genes in the development of disease. This implies that some haplotypes spanned by the interacting loci occur more often in the disease population than expected. In other words, the interaction between two unlinked loci generates LD in the disease population and the LD level generated by gene-gene interaction depends on the magnitude of the interaction between two unlinked loci. We have rigorously proved that the measure of LD between two unlinked loci generated by their interaction was proportional to the measure of the interaction, which provided us the motivation to propose a statistic for testing interaction between two unlinked loci by comparing the difference in LD between the disease and general populations. 
Here, we should point out that, after finishing this manuscript, we noticed that a similar statistic was proposed to test association between a single gene and disease.^{40} Zaykin et al.^{41} called it the “LD contrast test.” However, this LD contrast test was originally designed to test the association of SNPs by assuming a single disease model. It has not been extended to testing gene-gene interaction.

To use the proposed LD-based statistic to test gene-gene interaction between two unlinked loci, we first examined its distribution under the null hypothesis of no interaction. Through extensive simulation studies (under the assumption of large-sample theory), we showed that the null distribution of the proposed LD-based statistic in both homogeneous and admixed populations was close to a central χ^{2}_{(1)} distribution. We also calculated type I error rates of the LD-based statistic by simulation and found that they were close to the nominal significance levels. We then investigated the power of the new statistic to detect gene-gene interaction by analytic methods. The analysis showed that power is a function of the interaction measure, which implies that this new statistic can indeed be used to test interaction between two unlinked loci. However, the power of the proposed statistic is a complex function: in addition to the measure of interaction, it also depends on allele frequencies. Moreover, when the measure of interaction is beyond some range, power is no longer an increasing function of the interaction measure (data not shown). Power comparison with logistic regression analysis demonstrated that, under the two-locus disease models we considered, the LD-based test statistic has much higher power to detect interaction than does the logistic regression method.
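As a rough illustration of how a statistic of this kind can be computed, the sketch below contrasts the LD estimate in cases with that in the general population and refers the result to a χ^{2}_{(1)} distribution. It assumes the form (δ̂_{A} − δ̂)^{2} / [*V̂*_{A}/(2*n*_{A}) + *V̂*/(2*n*_{G})], which is what the asymptotic results in Appendix C suggest, together with the standard large-sample variance of an LD estimate. The function names and the example haplotype frequencies are hypothetical, and haplotype frequencies are taken as given rather than estimated from genotype data.

```python
import math


def ld_and_variance(p11, p12, p21):
    """Return (delta, V) for two biallelic loci.

    p11, p12, p21 are the frequencies of haplotypes D1D2, D1d2 and d1D2;
    the frequency of d1d2 is implied. V is the large-sample variance factor
    p(1-p)q(1-q) + (1-2p)(1-2q)*delta - delta**2.
    """
    p = p11 + p12          # frequency of allele D1
    q = p11 + p21          # frequency of allele D2
    delta = p11 - p * q    # LD coefficient
    v = p * (1 - p) * q * (1 - q) + (1 - 2 * p) * (1 - 2 * q) * delta - delta ** 2
    return delta, v


def ld_contrast_test(case_haps, n_cases, pop_haps, n_pop):
    """Assumed form of the statistic: squared LD difference over its variance.

    n_cases and n_pop are numbers of individuals (2n haplotypes each).
    Returns (statistic, p_value) using a chi-square(1) reference distribution.
    """
    delta_a, v_a = ld_and_variance(*case_haps)
    delta_g, v_g = ld_and_variance(*pop_haps)
    variance = v_a / (2 * n_cases) + v_g / (2 * n_pop)
    stat = (delta_a - delta_g) ** 2 / variance
    # Survival function of chi-square with 1 df: P(Z^2 > t) = erfc(sqrt(t/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical frequencies: no LD in the general population, LD in cases.
    stat, p_value = ld_contrast_test((0.10, 0.10, 0.20), 500, (0.06, 0.14, 0.24), 500)
    print(f"T_I = {stat:.3f}, p = {p_value:.4g}")
```

In this toy example the allele frequencies are the same in both groups, so the whole signal comes from the difference in LD, which is the quantity the test is designed to pick up.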

The widely used strategies for tSNP selection are based on a single-disease-gene model. The criterion for tSNP selection is the level of LD between the tSNP and the disease-susceptibility locus, which ensures a certain power to detect association of a single disease locus with the disease. Our theoretical analysis and power studies demonstrated that tSNPs selected in this way are highly unlikely to ensure that interactions between two unlinked loci will be detected.

To further evaluate its performance for detection of interaction between two loci, the proposed LD-based statistic was applied to two published data sets. Our results showed that, in general, *P* values of the test statistic *T*_{I} were much smaller than those of other approaches, including logistic regression analysis.

Like all population-based methods for association studies, the proposed LD-based statistic for testing gene-gene interaction between two unlinked loci also suffers from the attribution-of-causality confound in situations of pleiotropy or overlapping clinical conditions. The detected interaction for a particular disease could actually relate to other diseases that may share common etiological effects with the disease of interest and are only indirectly associated with the disease of interest. Similar to population structure, epistatic selection will also create LD between two unlinked loci. If epistatic selection between two unlinked loci is irrelevant to the disease of interest, the level of LD created by epistatic selection in both cases and controls will be similar, and, in this case, the impact of epistatic selection on the false-positive rate is limited. However, when epistatic selection underlies the phenotypes that are indirectly associated with the disease of interest, it will cause confounding.

Similar to most models for LD, the proposed test statistic and measure of interaction between two unlinked loci require the assumption of HWE. Deviation from HWE will affect the false-positive rates. The measure of interaction in the presence of Hardy-Weinberg disequilibrium (HWD) is a complicated function of penetrance, allele frequencies, and the measure of HWD. A detailed analysis of the impact of HWD on the test for interaction is needed.

In recent years, increasingly detailed and comprehensive evidence has shown that genetic and molecular interactions govern cell behaviors, including cell division, differentiation, and death, and are primary factors in the development of diseases. In many cases, single-locus analysis fails to unravel the mechanism of disease. The locus-by-locus paradigm for genetic studies of complex diseases should give way to a new paradigm that incorporates gene-gene interaction.

The results in this article are preliminary. Interaction between two linked loci or high-order interactions among multiple loci have not been studied. Gene-gene interaction is an important but complex concept. There are several ways to define gene-gene interaction. How the definition of gene-gene interaction on a population level reflects the genes' biochemical or physiological interaction is still a mystery. We hope that this work provides further motivation to conduct theoretical research in deciphering genetic and physiological meaning of gene-gene interactions and to develop more statistical methods for testing gene-gene interaction. In the coming years, the integration of gene-gene interaction into genomewide association analysis will be a major task in genetic studies of complex diseases.

## Acknowledgments

We thank three anonymous reviewers for helpful comments on the manuscript, which led to much improvement of the article. M.X. is supported by National Institutes of Health (NIH)–National Institute of Arthritis and Musculoskeletal and Skin Diseases grant P01 AR052915-01A1, NIH grants HL74735 and ES09912, and Shanghai Commission of Science and Technology grant 04dz14003. J.Z. is supported by NIH grant ES09912.

## Appendix A

## Appendix B

Assume that marker locus *M*_{1} has two alleles, *M*_{1} and *m*_{1}, and the marker locus *M*_{2} has two alleles, *M*_{2} and *m*_{2}. Let the frequencies of the haplotypes *D*_{1}*M*_{1}, *D*_{1}*m*_{1}, *d*_{1}*M*_{1}, and *d*_{1}*m*_{1} be *P*_{D1M1}, *P*_{D1m1}, *P*_{d1M1}, and *P*_{d1m1}, respectively. The frequencies of the haplotypes *D*_{2}*M*_{2}, *D*_{2}*m*_{2}, *d*_{2}*M*_{2}, and *d*_{2}*m*_{2} can be similarly defined. Let the frequencies of the haplotypes *M*_{1}*M*_{2}, *M*_{1}*m*_{2}, *m*_{1}*M*_{2}, and *m*_{1}*m*_{2} in the disease population be *q*^{A}_{11}, *q*^{A}_{12}, *q*^{A}_{21}, and *q*^{A}_{22}, respectively. Then, we have

Similarly, we have

and

Thus, after some algebra, we can obtain the LD between two marker loci in the disease population:

Recall that the LD between two unlinked disease loci in the disease population is given by

Therefore, the LD between two unlinked marker loci in the disease population can be rewritten as

## Appendix C

It is well known that the estimators of the haplotype frequencies *P̂*_{11}, *P̂*_{12}, and *P̂*_{21} are asymptotically distributed as a multivariate normal distribution *N*[*P*,(1/2*n*_{G})Σ], where *P*=[*P*_{11},*P*_{12},*P*_{21}]^{T} and Σ=*diag*(*P*_{11},*P*_{12},*P*_{21})-*PP*^{T}. Let *P*^{A}=[*P*^{A}_{11},*P*^{A}_{12},*P*^{A}_{21}]^{T}. Similarly, *P̂*^{A} is asymptotically distributed as *N*[*P*^{A},(1/2*n*_{A})Σ^{A}], where Σ^{A}=*diag*(*P*^{A}_{11},*P*^{A}_{12},*P*^{A}_{21})-*P*^{A}(*P*^{A})^{T}.

Since δ̂ is a function of the estimated haplotype frequencies *P̂*_{11}, *P̂*_{12}, and *P̂*_{21}, the estimated measure of LD, δ̂, is asymptotically distributed as shown by Serfling^{42}:

where

However, we can show that

First, we note that ∂*h*/∂*P*_{D1D2}=1-*P*_{D1}-*P*_{D2}, ∂*h*/∂*P*_{D1d2}=-*P*_{D2}, and ∂*h*/∂*P*_{d1D2}=-*P*_{D1}. Let *V*=*C*Σ*C*^{T}. After some algebra, we have

Since (1-*P*_{D1}-*P*_{D2})*P*_{D1D2}-*P*_{D2}*P*_{D1d2}-*P*_{D1}*P*_{d1D2}=δ-*P*_{D1}*P*_{D2}, we have

Note that

Substituting equation (C3) into equation (C2) yields

Collecting the coefficient of δ in the above equation (C4), we obtain

Substituting equation (C5) into equation (C4), we have

which proves equation (C1). Similarly, δ̂_{A} is asymptotically distributed as *N*[δ_{A},(1/2*n*_{A})*V*_{A}]. Under the null hypothesis of no interaction between two unlinked loci, we have δ_{A}=δ=0. Therefore, the statistic *T*_{I} is asymptotically distributed as a central χ^{2}_{(1)} distribution under the null hypothesis.
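The algebra behind the variance *V*=*C*Σ*C*^{T} can be checked directly from the definitions above. The following is a sketch, not a reproduction of equations (C1) through (C5): it writes *p*=*P*_{D1} and *q*=*P*_{D2}, identifies *P*_{11}, *P*_{12}, and *P*_{21} with *P*_{D1D2}, *P*_{D1d2}, and *P*_{d1D2}, uses δ=*P*_{11}-*pq*, and takes *C* to be the row vector of partial derivatives of *h*.

```latex
\begin{align*}
C &= \bigl(1-p-q,\; -q,\; -p\bigr), \qquad
\Sigma = \operatorname{diag}(P_{11}, P_{12}, P_{21}) - PP^{T},\\
V &= C\Sigma C^{T}
   = (1-p-q)^{2} P_{11} + q^{2} P_{12} + p^{2} P_{21} - (CP)^{2},\\
CP &= (1-p-q)P_{11} - qP_{12} - pP_{21} = \delta - pq,\\
(1-p-q)^{2} P_{11} + q^{2} P_{12} + p^{2} P_{21}
  &= (1-2p-2q+2pq)(\delta + pq) + pq(p+q),\\
V &= pq(1-p)(1-q) + (1-2p)(1-2q)\,\delta - \delta^{2}.
\end{align*}
```

The fourth line uses *P*_{11}+*P*_{21}=*q*, *P*_{11}+*P*_{12}=*p*, and *P*_{11}=δ+*pq*; the last line follows by subtracting (δ-*pq*)^{2} and collecting the coefficient of δ. When δ=0 the variance reduces to *pq*(1-*p*)(1-*q*).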

Now, we show that, under some assumption, the statistic *T*_{I} is still a valid test in the admixed population. Consider an admixed population that is mixed from two subpopulations with proportions α and (1-α). It is known that the measure of LD in the admixed population is given by

where *P*^{(k)}_{Di} and δ^{(k)} are the frequency of the allele *D*_{i} and the measure of LD between two loci in the *k*th subpopulation (*k*=1,2), respectively. If we assume that

where *P*^{A(k)}_{Di} is the frequency of the allele *D*_{i} in the *k*th disease subpopulation, then we have

Therefore, under the assumption (C6), the statistic *T*_{I} is also asymptotically distributed as a central χ^{2}_{(1)} distribution under the null hypothesis of no interaction between two unlinked loci in the admixed population.
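The effect of admixture on LD can be illustrated numerically. The sketch below mixes two subpopulations in proportions α and 1-α and checks that the LD computed from the mixed haplotype frequencies equals αδ^{(1)} + (1-α)δ^{(2)} + α(1-α)(*P*^{(1)}_{D1}-*P*^{(2)}_{D1})(*P*^{(1)}_{D2}-*P*^{(2)}_{D2}). This is the standard two-population admixture identity, which we assume is the form of the admixture expression referred to above. The function names and numerical values are hypothetical.

```python
def ld(p11, p12, p21):
    """LD coefficient and allele frequencies from haplotype frequencies."""
    p = p11 + p12          # frequency of allele D1
    q = p11 + p21          # frequency of allele D2
    return p11 - p * q, p, q


def mix(haps1, haps2, alpha):
    """Haplotype frequencies in a population mixed with proportions alpha, 1-alpha."""
    return tuple(alpha * a + (1 - alpha) * b for a, b in zip(haps1, haps2))


def admixture_ld(haps1, haps2, alpha):
    """LD predicted by the two-population admixture identity."""
    d1, p1, q1 = ld(*haps1)
    d2, p2, q2 = ld(*haps2)
    return alpha * d1 + (1 - alpha) * d2 + alpha * (1 - alpha) * (p1 - p2) * (q1 - q2)


if __name__ == "__main__":
    # Two subpopulations in linkage equilibrium with different allele frequencies.
    sub1 = (0.02, 0.08, 0.18)   # p = 0.1, q = 0.2, delta = 0
    sub2 = (0.30, 0.20, 0.30)   # p = 0.5, q = 0.6, delta = 0
    alpha = 0.4
    direct, _, _ = ld(*mix(sub1, sub2, alpha))
    print(direct, admixture_ld(sub1, sub2, alpha))
```

Neither subpopulation has any LD in this example, yet the mixture does. This is why the validity of the test in admixed populations depends on a condition such as (C6), under which the admixture-induced LD is the same in the disease and general populations and cancels in the contrast.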

## References

*T*^{2} test for genome association studies. Am J Hum Genet 70:1257–1268

^{Ser-15} phosphorylation and a G_{1}/S arrest following ionizing radiation-induced DNA damage. J Biol Chem 279:31251–31258. doi:10.1074/jbc.M405372200

^{+}-thalassemia and the sickle cell trait. Nat Genet 37:1253–1257. doi:10.1038/ng1660

**American Society of Human Genetics**

Test for Interaction between Two Unlinked Loci. American Journal of Human Genetics. Nov 2006; 79(5):831.
