
# A Generalized Combinatorial Approach for Detecting Gene-by-Gene and Gene-by-Environment Interactions with Application to Nicotine Dependence

## Abstract

The determination of gene-by-gene and gene-by-environment interactions has long been one of the greatest challenges in genetics. The traditional methods are typically inadequate because of the problem referred to as the “curse of dimensionality.” Recent combinatorial approaches, such as the multifactor dimensionality reduction (MDR) method, the combinatorial partitioning method, and the restricted partition method, have a straightforward correspondence to the concept of the phenotypic landscape that unifies biological, statistical genetics, and evolutionary theories. However, the existing approaches have several limitations, such as not allowing for covariates, that restrict their practical use. In this study, we report a generalized MDR (GMDR) method that permits adjustment for discrete and quantitative covariates and is applicable to both dichotomous and continuous phenotypes in various population-based study designs. Computer simulations indicated that the GMDR method has superior performance in its ability to identify epistatic loci, compared with current methods in the literature. We applied our proposed method to a genetics study of four genes that were reported to be associated with nicotine dependence and found significant joint action between *CHRNA4* and *NTRK2.* Moreover, our example illustrates that the newly proposed GMDR approach can increase prediction ability, suggesting that its use is justified in practice. In summary, GMDR serves the purpose of identifying contributors to population variation better than do the other existing methods.

Most, if not all, phenotypic traits of biomedical relevance in humans and of economic importance in plants and animals are the result of a series of dynamic, interrelated, and hierarchical metabolic pathways under the control of jointly acting networks of genes and environmental factors.^{1}^{–}^{5} When the change of a genetic factor or an alteration in environment perturbs the overall homeostasis of such a system, there may be detectable marginal effects on a phenotypically relevant outcome. When some factors approximately meet the criteria of conditional independence, as defined by Bayesian network theory, their marginal effects can be viewed as being independent of one another. This is the basic logic of traditional approaches that typically attempt to isolate one factor at a time and to ascribe the phenotype to some kind of additive or combinatorial effects of these factors. Such strategies, however, can fail to detect the determinants if their measured effects on variation depend on the context defined by other genes and/or by exposures to environments, that is, if there exist gene-by-gene interaction (epistasis) and/or gene-by-environment interaction (plastic reaction norms).^{6}^{,}^{7}

It has been well documented that joint actions are the norm rather than the exception in inherited traits, as natural consequences of complex networks and of the ubiquitous intermolecular dependence in gene regulation and in biochemical and metabolic systems.^{7}^{–}^{12} Traditional methods for detecting these as statistical interactions are usually extensions of single-factor–based approaches.^{13} In such methods, the number of possible parameters can rapidly outpace the size of any sample as the dimension increases, which brings several practical limitations: a heavy computational burden (often to the point of intractability), increased type I and II errors, and reduced robustness and potential bias as a result of highly sparse data in a multifactorial model. Thus, they are hardly appropriate for tackling elusive gene-gene and gene-environment interactions.

Recently, several approaches have emerged as promising tools for detecting gene-by-gene and gene-by-environment interactions in either dichotomous or continuous phenotypes. For example, Ritchie and her colleagues^{14}^{–}^{16} proposed an algorithm, called the “multifactor dimensionality reduction” (MDR) method, for balanced case-control or discordant sib-pair designs. Hahn and Moore^{17} presented a mathematical proof that MDR ideally discriminates between discrete clinical endpoints by the use of multilocus genotypes. More recently, Martin et al.^{18} extended the MDR method for family-based designs, and Velez et al.^{19} proposed a balanced accuracy function to detect interactions in unbalanced data sets.

Since publication of the original report,^{14} MDR has been applied by many research groups to detect interactions for a number of complex disorders (for a detailed list of publications, see the Epistasis Blog). However, there still exist several limitations in the currently established MDR implementation that may restrict its practical use in genetic data analysis: (1) it does not allow for adjustment of covariates such as ethnicity, sex, weight, and/or age, and (2) it is applicable only to dichotomous phenotypes, not to continuous phenotypes that contain more information.

Nelson et al.^{20} developed the combinatorial partitioning method (CPM) for quantitative traits, which shares a similarity with the MDR method. Prohibitively intensive computation makes this method less practical for dealing with cases with more than two loci. Culverhouse et al.^{21} advocated the restricted partition method (RPM). Although it substantially reduces the computational burden, as compared with the CPM, the RPM still requires significant computational effort for high-dimensional data. Further, the validity of the RPM relies on a reasonably good partitioning of genotypes into subgroups implemented iteratively by multiple comparison tests.^{22} Moreover, neither Nelson et al.^{20} nor Culverhouse et al.^{21} included covariates in their approaches.

In this article, we propose a generalized MDR (GMDR) framework based on the score of a generalized linear model, of which the original MDR method is a special case. Our proposed approach has several advantages: (1) it permits adjustment for covariates, (2) it provides a unified framework for coherently handling both dichotomous and quantitative phenotypes, and (3) it is applicable to a variety of flexible population-based study designs—for example, it can be applied without modification to unbalanced case-control samples and to both random and selected samples. To help readers follow our approach, we first present the theory and then demonstrate, through a series of simulations for both continuous and dichotomous phenotypes, the improvements it leads to in testing accuracy and cross-validation consistency when an informative covariate exists. Finally, we apply our proposed novel approach to a genetic data set on tobacco dependence and find significant joint action of genes for nicotine dependence (ND).

## Methods

In this section, we first introduce the generalized linear model commonly used for either dichotomous or continuous phenotypes. We then introduce the concept of a score statistic into the current MDR framework and formulate our GMDR approach. We should emphasize that the score-based derivation should be considered merely a device for obtaining an appropriate statistic to classify multifactor cells into two different groups. It is not necessarily implied that the GMDR method is likelihood based; for example, we can use other measures computed via least-squares regression or other statistical methods for nonnormal continuous traits. Moreover, like MDR,^{16} the GMDR method can also be considered a constructive induction approach.

### Models

Let *y*_{i} denote the phenotype of individual *i*, either dichotomous or continuous, with expectation *E*(*y*_{i})=μ_{i}. In general, it can be represented by a generalized linear model in the exponential family of distributions that includes the normal, Poisson, and Bernoulli distributions^{23}^{,}^{24}:

$$l(\mu_i) = \alpha + x_i^{\top}\beta + z_i^{\top}\gamma, \tag{1}$$

where *l*(μ_{i}) is an appropriate link function, α is the intercept, *x*_{i} is the predictor-variable vector that codes gene-by-gene and/or gene-by-environment interactions of interest, *z*_{i} is the vector coding for covariates, and β and γ are the corresponding parameter vectors. In what follows, we call β the “target effects.” With dichotomous phenotypes following, say, a Bernoulli distribution, a natural link function is the logit,

$$l(\mu_i) = \ln\frac{\mu_i}{1-\mu_i}.$$

For continuous phenotypes having a normal distribution, the natural link is the identity. In the presence of statistical interactions between the target attributes and covariates, the above model can be further extended to

$$l(\mu_i) = \alpha + x_i^{\top}\beta + z_i^{\top}\gamma + (x_i \otimes z_i)^{\top}\delta, \tag{2}$$

where δ is the vector of the interaction effects and ⊗ represents a direct (Kronecker) product.

### Score Statistics

The log-prospective likelihood of independent observations *y*_{i}, *i*=1,2,…,*n*, with conditioning on the predictor-variable vectors *x*_{i} and *z*_{i}, can be written as^{23}^{,}^{24}

where *y* is the vector of observations, Ω is the vector of parameters, Ω=(α,β,γ) in model (1) and Ω=(α,β,γ,δ) in model (2), and is a function of *l*(μ_{i}) with the property that when *l*(μ_{i}) is a canonical link function. The first partial derivative of the log-likelihood, also termed the “score,” is

where θ∈Ω. Setting β=0 in model (1) yields the residual score vector

where component , the estimated expectation is , and are the maximum-likelihood estimates (MLEs) under the null hypothesis *H*_{0} (β=0) (i.e., no target effects of study), and is the contribution of individual *i* to the score for β_{j}. Likewise, we obtain the residual score vector for model (2) by setting β=0 and δ=0:

where component is analogous to that in equation (3), component

is the MLE under *H*_{0} (β=0 and δ=0), and is the contribution of individual *i* to the score for δ_{k}.

Then, we define the following score-based statistics for individual *i,* on the basis of normalized contributions:

and

where is the estimated variance of *y*_{i}, and where *S*^{T}_{i}, *S*^{TC}_{i}, and *S*^{T+TC}_{i}, respectively, measure the normalized contributions to the scores of the target effects, target-by-covariate interactions, and both target and target-by-covariate effects. We can use any one of the three according to our purpose. We use *S*^{T}_{i} to illustrate our GMDR method, which we call the “score-based MDR” method for the time being.
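As a rough illustration of the null-model step, the following Python sketch fits model (1) with β=0 for a dichotomous phenotype and a single covariate, and returns the residual *y*_{i} minus the fitted expectation for each individual. The function names, the toy data, and the use of the unnormalized residual as the individual score are assumptions made for illustration only; the normalization by the estimated variance is not reproduced here.

```python
import math

def fit_null_logistic(y, z, iters=25):
    """Newton-Raphson fit of logit(mu_i) = alpha + gamma * z_i (no target effects)."""
    alpha, gamma = 0.0, 0.0
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-(alpha + gamma * zi))) for zi in z]
        # Gradient of the log-likelihood with respect to (alpha, gamma)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * zi for yi, mi, zi in zip(y, mu, z))
        # 2x2 Fisher information matrix
        w = [mi * (1.0 - mi) for mi in mu]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 linear system explicitly
        alpha += (h11 * g0 - h01 * g1) / det
        gamma += (h00 * g1 - h01 * g0) / det
    return alpha, gamma

def residual_scores(y, z):
    """Assumed individual score: y_i minus the fitted null expectation."""
    alpha, gamma = fit_null_logistic(y, z)
    return [yi - 1.0 / (1.0 + math.exp(-(alpha + gamma * zi)))
            for yi, zi in zip(y, z)]

# Toy data (hypothetical): dichotomous phenotype and one quantitative covariate
y = [0, 0, 1, 0, 1, 1, 0, 1]
z = [-2, -1, -1, 0, 0, 1, 1, 2]
scores = residual_scores(y, z)
```

At the fitted null model, these residuals sum to zero and are orthogonal to the covariate, which is what the score equations for α and γ require; any association left in the residuals is then attributable to factors outside the null model.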

### Score-Based MDR Method

The score-based MDR method proposed in this article uses the same data-reduction strategy as does the original MDR method^{14}^{,}^{15}—that is, the possible cells classified by a set of factors are pooled into two distinct groups, effectively reducing the dimensionality from multidimensional to one-dimensional and thereby identifying, from all potential combinations, the specific combinations of factors that show the strongest association with the phenotype. To make the presentation self-contained, we first briefly review the current MDR procedure and then describe our generalization under the same framework, using the score statistic to define the two distinct groups. As shown below, the current MDR method is a specific case of our GMDR method.

Figure 1, adapted from the work of Ritchie et al.^{14} and Hahn et al.,^{15} illustrates the general steps involved in implementing the MDR method for case-control or discordant-sib studies. In the first step, the data are randomly split into some number of equal parts for cross-validation; for an illustrative purpose, the use of 10-fold cross-validation is shown in figure 1. One subdivision is used as the testing set and the rest as the independent training set. Then, steps 2–5 are run for the training set and step 6 for the testing set. (To reduce the fluctuations due to chance divisions of the data, each possible training set and its corresponding testing set are used, and the results are averaged. The consistency of the model across cross-validation training sets [i.e., how many times the same MDR model is identified in all the possible training sets] is also evaluated.) In the second step, a set of *n* genetic and/or discrete environmental factors is selected from the list of all factors. In the third step, the possible multifactor classes or cells defined by the *n* factors are represented in *n*-dimensional space. Then, the ratio of the number of cases to the number of controls is calculated within each multifactor cell. In the fourth step, each multifactor cell in *n*-dimensional space is labeled either as “high-risk” if the ratio of cases to controls meets or exceeds a preassigned threshold *T* (e.g., *T*=1), including the cells that have cases but no controls, as “low-risk” if the threshold is not exceeded, including the cells that have controls but no cases, or as “empty” otherwise. A model is formulated by pooling high-risk cells into one group and low-risk cells into another group. In the fifth step, all potential combinations of *n* factors are evaluated sequentially for their ability to classify cases and controls in the training data, and the best *n*-factor model that yields minimum misclassification error is chosen. 
In the sixth step, the independent testing set is used to estimate the prediction error of the best model selected from the fifth step. Finally, among this set of best models, we pick the model that has minimum prediction error and/or maximum cross-validation consistency.
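Steps 3 and 4 of the MDR procedure, together with the misclassification error used in step 5, can be sketched in a few lines of Python. This is a minimal sketch for a single candidate factor combination; the function names and the toy data are hypothetical, and cross-validation and the search over factor combinations are omitted.

```python
from collections import defaultdict

def mdr_label_cells(cells, status, threshold=1.0):
    """Label each observed multifactor cell as high- or low-risk.

    cells:  one multilocus genotype (any hashable value) per subject
    status: 1 for a case, 0 for a control
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, controls]
    for cell, s in zip(cells, status):
        counts[cell][0 if s == 1 else 1] += 1
    labels = {}
    for cell, (cases, controls) in counts.items():
        # Comparing cases with threshold * controls avoids division by zero,
        # so cells with cases but no controls are labeled high-risk.
        labels[cell] = "high" if cases >= threshold * controls else "low"
    return labels

def misclassification_error(labels, cells, status):
    """Fraction of subjects whose status disagrees with the cell label."""
    wrong = 0
    for cell, s in zip(cells, status):
        predicted = 1 if labels.get(cell) == "high" else 0
        wrong += int(predicted != s)
    return wrong / len(status)

# Toy two-locus data (hypothetical)
cells = [("AA", "BB")] * 3 + [("Aa", "Bb")] * 3 + [("aa", "BB")] * 2
status = [1, 1, 0, 1, 0, 0, 0, 0]
labels = mdr_label_cells(cells, status)
error = misclassification_error(labels, cells, status)
```

Empty cells never appear in `labels`; in this sketch a subject falling in such a cell would be predicted to be a control, which is one possible convention rather than a documented one.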

Figure 1. General steps of the MDR and GMDR methods (adapted from Ritchie et al.^{14} and Hahn et al.^{15}). The two methods share the same reduction strategy. The difference is that, in the GMDR method, we substitute a score **...**

In our generalization, we replace the ratio of cases to controls by the score in each cell to discriminate between high-risk and low-risk cells and assess classification accuracy and prediction error, while keeping the rest of the procedure unchanged. First, we compute the MLEs and the score values of all individuals under the null hypothesis, *H*_{0}: β=0 for model (1) or β=0 and δ=0 for model (2). Since the null hypothesis assumes there are no effects of the putative factors or their interactions, the score values need to be computed only once and are the same for all different factor classifications. Now, in the third step, the cumulative score value is calculated within each multifactor cell. In the fourth step, each multifactor cell is labeled either as “high-risk” if the average score meets or exceeds a preassigned threshold *T* (e.g., 0) or as “low-risk” if the threshold is not exceeded. Correspondingly, we substitute the score values for the numbers of cases and controls, to evaluate the classification and prediction errors, and thereby identify the best model in later steps. The original MDR method is a specific version of the method proposed in this report. For balanced case-control studies with no covariates, the sample prevalence is 0.5, so every case contributes a positive score and every control a negative score of the same magnitude. The case:control ratio within each cell is then replaced by the cell’s average score; for example, a ratio of 1:1 is equivalent to a score value of 0.
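The score-based labeling rule can be sketched as follows. The sketch assumes, for illustration only, that the individual score in a balanced case-control sample without covariates is the residual of the phenotype around the sample prevalence; `gmdr_label_cells` and the toy data are hypothetical names and values.

```python
from collections import defaultdict

def gmdr_label_cells(cells, scores, threshold=0.0):
    """Label each observed multifactor cell by its average score."""
    total = defaultdict(float)
    size = defaultdict(int)
    for cell, s in zip(cells, scores):
        total[cell] += s
        size[cell] += 1
    return {c: "high" if total[c] / size[c] >= threshold else "low"
            for c in total}

# Balanced toy sample without covariates: 4 cases and 4 controls
cells = ["X", "X", "X", "Y", "Y", "Z", "Z", "Z"]
status = [1, 1, 0, 1, 0, 1, 0, 0]
prevalence = sum(status) / len(status)      # 0.5 in a balanced sample
scores = [s - prevalence for s in status]   # assumed score: y_i - mu_hat
labels = gmdr_label_cells(cells, scores)
```

In this toy sample, cell X (two cases, one control) has a positive average score, cell Y (one case, one control) has an average score of exactly 0 and therefore meets the threshold, and cell Z (one case, two controls) has a negative average score. These are the same labels that a case:control threshold of 1 would give.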

This generalization offers much flexibility in the use of covariates, different study designs, and different types of phenotypes. The method allows for covariate adjustments and provides a unified framework for analyzing both continuous and dichotomous traits, as well as others, under generalized linear models. It can also be applied without modification to unbalanced case-control, random, and selected samples. Moreover, although we borrow the concept of score functions to formulate it, our GMDR method is not dependent on the usual score or likelihood properties. The validity of the GMDR method depends only on the availability of an appropriate statistic that can provide a measure of the association between the putative factors and the phenotype. Other statistics, such as moment and least-squares statistics, can also be used. Thus, like the MDR method, the GMDR method can be considered model free.

## Results

### Simulation Results

To evaluate the ability of the GMDR method to detect factor interactions, we simulated a series of data sets on a sample consisting of 1,000 unrelated subjects for both continuous and dichotomous phenotypes under three different epistatic models that have been considered before—that is, digenic, trigenic, and tetragenic interaction models—but each with one extra risk factor (covariate) that contributes to the phenotype. Genotypes were simulated, for a total of 10 unlinked diallelic loci with equifrequent alleles, including two, three, or four disease-causing genes and the rest nonfunctional loci, with the assumption of Hardy-Weinberg equilibrium and linkage equilibrium. To simplify our exposition, phenotypes were generated under model (1) with one covariate but no interaction between genes and the covariate. We simulated patterns for digenic, trigenic, and tetragenic interactions, similar to those in the work of Ritchie et al.^{14} and Culverhouse et al.,^{21} for models in which the independent-locus main effects are small—for example, diagonal (i.e., genotypes AABB, AaBb, and aabb are considered high-value groups and the rest low-value groups), antidiagonal (i.e., AAbb, AaBb, and aaBB vs. the others), and checkerboard (i.e., AABb, AaBB, Aabb, and aaBb vs. the others).
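The three two-locus patterns can be written compactly if each genotype is coded by the number of copies of the capital-letter allele (0, 1, or 2). The following Python sketch uses that coding, which is an assumption made for illustration, together with hypothetical function names; it also computes the expected frequency of the high-value group under Hardy-Weinberg equilibrium with equifrequent alleles.

```python
import random

def simulate_genotype(rng, freq=0.5):
    """Copies (0, 1, or 2) of one allele at a diallelic locus under HWE."""
    return int(rng.random() < freq) + int(rng.random() < freq)

def diagonal_high(g1, g2):
    """AABB, AaBb, and aabb form the high-value group."""
    return g1 == g2

def antidiagonal_high(g1, g2):
    """AAbb, AaBb, and aaBB form the high-value group."""
    return g1 + g2 == 2

def checkerboard_high(g1, g2):
    """AABb, AaBB, Aabb, and aaBb form the high-value group."""
    return (g1 + g2) % 2 == 1

def high_group_frequency(pattern, freq=0.5):
    """Expected frequency of the high-value group for two unlinked loci."""
    q = 1.0 - freq
    geno_freq = [q * q, 2.0 * freq * q, freq * freq]  # 0, 1, 2 copies
    return sum(geno_freq[a] * geno_freq[b]
               for a in range(3) for b in range(3) if pattern(a, b))

rng = random.Random(2007)
sample = [(simulate_genotype(rng), simulate_genotype(rng)) for _ in range(1000)]
```

With equifrequent alleles, the diagonal and antidiagonal groups each have an expected frequency of 0.375 and the checkerboard group 0.5.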

For the purpose of comparison between the GMDR and original MDR methods, we used a balanced case-control design for dichotomous phenotypes, although GMDR can also accommodate unbalanced designs. We simulated 500 cases and 500 controls on the basis of a logit model with α=-5.29, β=3.09, and γ=1, where the genotypes of high risk have a penetrance of ~0.1 and the others have a risk of ~0.005 when the value of the covariate is 0. The covariate was assumed to have a normal distribution, with mean 0 and variance 10, and to be observed for all subjects; when the covariate variance is 10, the separation between groups is ~1 SD.
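The stated penetrances follow directly from the logit model and can be checked numerically. In the short Python sketch below, `inv_logit` is a hypothetical helper, and the reading of "~1 SD" as the ratio of β to the standard deviation of the covariate term is an interpretation, not something stated explicitly in the text.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: probability for a given linear predictor."""
    return 1.0 / (1.0 + math.exp(-eta))

alpha, beta, gamma = -5.29, 3.09, 1.0

# Risks when the covariate equals 0
low_risk = inv_logit(alpha)          # low-risk genotypes
high_risk = inv_logit(alpha + beta)  # high-risk genotypes

# Genotype effect relative to the SD of the covariate term (variance 10)
sd_covariate_term = gamma * math.sqrt(10.0)
separation_in_sd = beta / sd_covariate_term
```

The computed risks are close to 0.005 and 0.1, and the genotype effect is slightly less than one standard deviation of the covariate term.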

Subjects were sampled randomly from a reference population for studying a continuous phenotype. Continuous phenotypes were generated in terms of a normal model with α=0, γ=1, and *e* ~ *N*(0,1) and a separation of β=0.5 between groups. As in the work of Culverhouse et al.,^{21} in addition to two groups (high value and low value), we also performed simulations under a diagonal model, with three groups for digenic interaction models, to assess the performance of GMDR in a more general case. This results in a bimodal or trimodal distribution. The group separation was further modeled by a (0, 1) covariate assumed to come from a Bernoulli distribution with probability 0.5 and to be available for all subjects.

Scores for all the individuals, both with and without inclusion of the covariate, were computed using equation (4), under the null hypothesis for two types of phenotypes. The GMDR method with a threshold of 0 was employed in the subsequent analysis, with the use of scores with or without covariate adjustment. In the case of a dichotomous phenotype, GMDR without adjustment was equivalent to the original MDR method with a threshold case-control ratio of 1:1. An exhaustive computational search strategy was performed for all possible one- to nine-locus models in our simulations. The average cross-validation consistency and prediction accuracy, as well as the SEMs, were computed on the basis of 200 simulation replicates. Since the different interaction forms, such as diagonal, antidiagonal, and checkerboard models, gave similar results, for the purpose of a clear presentation, we list only partial results.

Table 1 summarizes, for the dichotomous trait, the means and SEMs of both the cross-validation consistency and the prediction accuracy. As expected, with use of the correct model for analysis, both GMDR and MDR always gave maximum prediction accuracies and cross-validation consistencies. Analysis with use of a model in which only the one-locus main effects were considered resulted in the poorest performance among all incorrect models. The SEMs of prediction accuracy and cross-validation consistency were also the lowest for analyses under the correct model. As compared with the original MDR method, allowing for the covariate with GMDR increased prediction accuracies under the correct analysis model—for example, from 0.688 to 0.802, from 0.675 to 0.799, and from 0.636 to 0.762 for the digenic, trigenic, and tetragenic models, respectively. This indicates that GMDR can effectively eliminate the noise from the covariate and can increase prediction accuracy, whereas failing to consider the covariate would lead to an increased prediction error. Although all cross-validation consistencies listed in table 1 were 10.000 for both GMDR and MDR—that is, the same models were found in each possible training sample—it was not always true that the original MDR had the same cross-validation consistency as did GMDR. In some cases (data not shown), GMDR had higher cross-validation consistency—for example, the means (±SEMs) of cross-validation consistency and prediction accuracy with GMDR were 9.925±0.436 and 0.673±0.026, respectively, whereas those with MDR were 8.510±2.091 and 0.566±0.027, respectively, under one of the tetragenic models we evaluated. Taken together, we conclude that the GMDR method consistently had higher or at least equal prediction accuracy and cross-validation consistency and better ability than did the MDR method to identify the correct model.

Table 2 presents the means and SEMs of both the cross-validation consistency and the prediction accuracy for a continuous trait. Because the original MDR method cannot handle continuous traits, no analogous simulation was conducted for MDR. Here, we compared the results of GMDR with and without covariate adjustment. The results indicated that GMDR could identify the correct model irrespective of the presence of two or three underlying groups, demonstrating that GMDR is applicable to more-general cases, not to just discrete clinical endpoints or two risk groups of genotypes. Although GMDR with no covariate adjustment gave reasonably good estimates, it had consistently lower prediction accuracy and cross-validation consistency than did GMDR with covariate adjustment, verifying that ignoring a covariate leads to loss of prediction ability. The accuracy seemed to be decreased for trigenic and tetragenic interaction models, and this might be, in part, because of a lower frequency of the high-value group and heritability.

In summary, GMDR is valid for both dichotomous and quantitative traits and for balanced case-control and random samples, as well as for more than two penetrance functions. The existing methods, which fail to consider causative covariates, would lead to reduced accuracy arising from the increased background noise contributed by such covariates. The GMDR method, with inclusion of any covariate that confers an increased disease risk or affects a phenotypic value, is able to remedy such limitations because of its capability to account for the variation ascribable to the covariate and, thus, leads to improved accuracy.

### Application to ND Data

To illustrate use of the method proposed here, we present an application to identify susceptibility genes for ND, with a set of genotype data including 23 SNPs located in four candidate genes: brain-derived neurotrophic factor (*BDNF* [MIM 113505]); neurotrophic tyrosine kinase, receptor, type 2 (*NTRK2* [MIM 600456]); cholinergic receptor, nicotinic, alpha 4 (*CHRNA4* [MIM 118504]); and cholinergic receptor, nicotinic, beta 2 (*CHRNB2* [MIM 118507]). Detailed information on the gene structures and SNPs is given in tables 3 and 4; for DNA extraction and genotyping information, please refer to our other reports.^{25}^{–}^{27} The participants involved in this study were of either African American or European American ancestry and were enrolled during 1999–2004 in the U.S. Mid-South Tobacco Family (MSTF) cohort for family-based linkage and/or association studies. Detailed demographic and clinical characteristics of this sample have been reported elsewhere^{27} and are not included here. A total of 191 unrelated smokers and 191 nonsmokers were selected from this family cohort (the majority of this cohort are smokers) to meet the requirement of a balanced case-control design.

After we examined genotyping quality and excluded possible genotyping errors on the basis of the genotype data from other family member(s) of subjects, ethnicity, sex, and age were modeled as covariates to compute the scores under the null hypothesis. GMDR was performed with the computed score. For the purpose of comparison, we also used MDR^{14} to analyze the same data set. An exhaustive search of all possible one- to five-locus models was first performed for all 23 SNPs. If these models had not attained the maximum prediction accuracy and cross-validation consistency, higher-order models were then evaluated until the extrema were reached. *P* values were determined by the sign test, a robust nonparametric test implemented in the MDR software.^{15} Permutation testing was also conducted to gain empirical *P* values of prediction accuracy as a benchmark based on 10,000 shuffles.

Since inclusion of age as a covariate did not improve the prediction accuracy, we report the results from the analyses in which only ethnicity and sex were included as covariates. Given that the four-locus model had attained the best prediction accuracy and cross-validation consistency, higher-order models were not evaluated. Table 5 lists the best models, prediction accuracies, cross-validation consistencies, and *P* values by the sign test obtained from GMDR and MDR, for each number of loci from one to five. GMDR and MDR yielded the same best four-locus model that had maximum prediction accuracy and cross-validation consistency. However, GMDR had better prediction ability than did MDR. For example, the prediction accuracy and cross-validation consistency were 0.603 and 7, respectively, for GMDR, whereas they were 0.596 and 6, respectively, for MDR. GMDR yielded a *P* value of .011 by the sign test, whereas MDR yielded a *P* value of .055, which does not even reach the traditional cut-off significance level of .05. The empirical *P* values of prediction error by permutation testing were .014 and .021 for GMDR and MDR, respectively.
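The sign-test *P* values can be checked with a short calculation. The Python sketch below assumes that the test counts the cross-validation folds (out of 10) in which the testing accuracy exceeds 0.5 and refers that count to a binomial distribution with success probability 0.5. This reading of the test is an assumption and is not stated in the text, and `sign_test_p` is a hypothetical helper name.

```python
from math import comb

def sign_test_p(successes, folds=10):
    """One-sided P(X >= successes) for X ~ Binomial(folds, 0.5)."""
    tail = sum(comb(folds, k) for k in range(successes, folds + 1))
    return tail / 2 ** folds

p_nine_of_ten = sign_test_p(9)   # 11/1024
p_eight_of_ten = sign_test_p(8)  # 56/1024
```

Under this assumption, the reported values of .011 and .055 would correspond to 9 and 8 of 10 folds, respectively, with testing accuracy above 0.5.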

Table 5. Best Models, Prediction Accuracies, Cross-Validation Consistencies, and *P* Values Identified by GMDR and MDR for ND Data

The best prediction model identified in our analysis included one SNP, *rs2229959,* in *CHRNA4* and three SNPs, *rs993315, rs1122530,* and *rs736744,* in *NTRK2,* suggesting that the *CHRNA4* and *NTRK2* genes were significant contributors to ND in the MSTF cohort. The prediction accuracies of the one-locus models by GMDR (MDR) were 0.453 (0.456), 0.508 (0.503), 0.508 (0.503), and 0.503 (0.525) for SNPs *rs2229959, rs993315, rs1122530,* and *rs736744,* respectively, and the minimum *P* value was .377 (.623), suggesting that the contribution was not from their main effects but from the joint action of the two genes. Figure 2 shows the identified best model. The patterns of high-risk and low-risk cells differ across each of the different multilocus dimensions; that is, the influence that each genotype of SNP *rs2229959* in *CHRNA4* has on ND is dependent on the genotypes of the other three SNPs in *NTRK2* and vice versa, which also provides evidence of the joint action of the two genes (fig. 2).


Both *CHRNA4* and *NTRK2* have plausible biological bases for being involved in smoking behaviors that are modulated by a series of complex neurobiological and psychological processes, from nicotine metabolic pathways to neural signal transduction to the reward circuitry of the brain. Nicotine, the primary psychoactive, addictive agent in tobacco, produces pleasant and rewarding psychopharmacologic effects through functionally diverse neuronal nicotinic acetylcholine receptors (nAChRs).^{28}^{,}^{29} *CHRNA4* encodes the α4 subunit of nAChRs, which, together with the β2 subunit encoded by *CHRNB2,* forms the most prevalent nAChRs in brain. *NTRK2* (also known as the “tyrosine kinase receptor gene” [*TRKB*]) encodes the neurotrophic tyrosine kinase receptor 2 (NTRK2), which is stimulated by neurotrophins and is responsible for the transduction of signals controlling neuropoiesis and neuron survival in the CNS and peripheral nervous system.^{30} The binding of NTRK2 to BDNF regulates short-term synaptic functions and long-term potentiation of brain synapses.^{31} Furthermore, *NTRK2* is essential for the development of γ-aminobutyric acid (GABA)ergic neurons and regulates synapse formation, in addition to its role in the development of axon terminals.^{32} The significant joint contribution supports their roles in the etiology of ND. Although *CHRNA4* and *NTRK2* do not interact directly with each other from a biological point of view, they still exhibit significant joint action, indicating that, as found by other investigators, such joint actions of genes located in biochemically distinct circuits are common.^{33}^{,}^{34} Despite the potential importance from a biological viewpoint, no noticeable joint action was detected between the SNPs in *CHRNA4* and *CHRNB2* or between those in *BDNF* and *NTRK2* in this data set. 
The possible reasons may include the narrow allelic spectrum of these genes in our sample, low linkage disequilibrium between the SNPs under study and the causative locus, and/or insufficient statistical power due to small sample size.

## Discussion

Although the magnitude and prevalence of interactions or joint actions of multiple factors in biological systems are largely unknown, “cryptic” interaction and decanalization (canalization is a particular sort of joint action) have been increasingly appreciated in exquisite studies,^{35}^{–}^{39} suggesting that they may be the rule rather than the exception. The possible mechanisms contributing to such joint actions may include, but are not limited to, the following. First, apparent interaction is an inherent property of a network system. As recognized by Kacser and Burns^{40} and Nijhout,^{41} the effect of a gene on the flux (phenotype) is context dependent, as a result of enzyme saturation even in an unbranched multistep enzymatic pathway where the encodings of the genes are independent of one another. A highly interconnected metabolic network behaves similarly, except that the nonlinearity becomes more complicated.^{41}^{,}^{42} Second, there is a vast repertoire of joint action mechanisms, with positive and negative feedback regulation at several levels, including the biomolecular, functional-module, tissue, and organ levels, implicated in transcription, translation, and/or signal transduction and in biochemical, metabolic, and physiological processes.^{43}^{–}^{46} And, third, it has been hypothesized that interactions are a consequence of evolutionary processes.^{47} Phenotypic robustness to genetic and nongenetic perturbations, canalization, developmental homeostasis, and buffering can all be attributed to a response to stabilizing selection and other selective forces in evolution.^{33}^{,}^{34}^{,}^{48}^{,}^{49} If these actions are the result of effects of factor levels that differ in magnitude or direction contingent on the background, they may lead to a weak marginal correlation between the levels of each factor individually and the phenotype. This makes these determinants elude traditional hunting strategies that consider them only in isolation. 
To track down such determinants with interactive behaviors is a daunting challenge.

Although the ubiquity of joint actions appears to be a natural property of complex inherited traits, the nature of joint actions has not yet been well investigated and understood. Central reasons include the lack of application of appropriate methodologies and a common rift between biological mechanism and statistical abstraction. For example, “epistasis,” a term coined for a specific type of gene-by-gene interaction, has evolved to have different meanings in biological and statistical genetics.^{12}^{,}^{50} To date, most of the findings and biologically supported models have been those of the joint action of multiple factors without a clear distinction of whether they can be adequately described without statistical “interaction” terms. In statistics, interactions are represented as a deviation from additivity in a linear model, with the result that whether and to what extent they exist depends on the scale of measurement employed for analysis, which is rarely determined by biological principles. To shed light on the biological basis for phenotype formation and trait variation, it will be necessary to have innovative methodologies that integrate the scale on which a trait is measured with the mathematical model used.^{51}^{–}^{53}

The conflicting definitions of interaction in biology and statistics can be reconciled under the emerging concept of the phenotypic landscape in hyperspace,^{41}^{,}^{54}^{}^{–}^{56} in which different aspects of the same phenomic architecture are described. A phenotype can be hypothesized as a function of the underlying genetic and environmental factors and can be geometrically plotted as a landscape in a hyperspace, each axis of which describes a range of variation for the corresponding factor, specifically on the scale in which that factor is measured. (A subset of underlying factors that build the phenotype comprises a “slice” of the whole phenotypic landscape, if all other factors are held constant.) The topographical features of the landscape, characterized by parameters such as gradient, curvature, etc., are determined by the developmental network that governs the joint action of the underlying factors, which provides a straightforward relationship between the terminology of biological “interaction” and the geometry of landscape. An individual is a point in the hyperspace with location determined by the values of his/her underlying factor levels and the phenotypic value at the corresponding coordinate on the phenotype surface. The point can have different profiles along the axes or other directions depending on its locality, implying differential response to alterations of the underlying factors. The factor(s) controlling rate-limiting step(s), or the “hub” node(s) of the network, may have a steep profile while the others still have relatively flat slopes and curvatures, so that the phenotype is sensitive to the former but robust to the latter; but it must be remembered that the shape of the profiles depends on how the factors (the axis scales) are measured. The profiles of a point are region specific—that is, they vary with position. 
Factors may have steep slopes in regions where the range of robust variation is narrow and relatively flat slopes in regions where that range is broader. Individuals in a population occupy a limited region of the landscape, and the total phenotypic variation is determined by both the distribution of individuals (that is, their spectrum and density) and the local geometry of the various regions (for example, the limits for robust variation). When a population under selection moves from one region to another, there is phenotypic evolution. Biological joint action (“interaction”), the underlying mechanism generating phenotype, determines the topography of the hyperdimensional landscape, whereas statistical interaction reflects, in addition, how the phenomic architecture is measured over the distribution of individuals in a population, not just the intrinsic property of the interactive system in which the factors are embedded. The phenotypic landscape model, which captures the factor-phenotype mapping relationships well, offers a general framework for unifying the insights from studies at the molecular genetic, gross phenotypic, and evolutionary biological levels.

The *biological* concept of interaction focuses on characterizing biological mechanisms, whereas the *statistical* concept is purely descriptive of population variation. Although constructing the landscape is a major aim in contemporary biology, identifying the determinants that contribute to population variation is, for pragmatic reasons, more important for public health and for making genetic improvements in crops and animals. Because of buffering in the system, not all changes in the underlying factors yield large marginal effects on phenotypic variation. Only those factors that vary sufficiently to exceed the limits for robust variation are responsible for population variation. Factors having no measurable effects, although playing important roles from a biological viewpoint, are of relatively less interest. The identification of phenotypically relevant factors is the core mission of genetics and epigenetics. Considerable effort is being expended in attempts to develop powerful methods for identifying factors with interactive behaviors in the statistical sense, unfortunately often without taking biological plausibility into account.

Among the recently emerging methods,^{16}^{,}^{22}^{,}^{57} combinatorial approaches such as MDR, the CPM, and the RPM have a straightforward correspondence to the concept of the phenotypic landscape and could bridge the gap between statistical theory and its application to questions of biological interest. Building on recent progress in combinatorial approaches, we have developed a more general combinatorial approach that can accommodate both qualitative and quantitative phenotypes, can allow for both discrete and continuous covariates, and can offer more flexibility in study design. The original MDR method is a special case of our new approach. In other words, the new approach can do whatever the original MDR method can do and can also handle what MDR cannot, such as quantitative traits and covariates. Our simulation results demonstrate that this new method can substantially increase the prediction accuracy when the phenotype is subject to the influence of covariate(s), even when applied to complex models that may or may not be common in the real world. Our working example also provides support that the use of the new approach is justified in practice and illustrates that, even when a few factors are involved, there is no need (in this example) to invoke complex statistical interaction to describe their joint action. In contrast to the CPM and RPM, GMDR, like MDR, looks for the major signal in the variation (i.e., whether there is a difference attributable to the underlying factors) and ignores minor signals (i.e., how many underlying groups there are). Thus, GMDR does not need to classify groups by using an analysis of variance, as implemented in the CPM, or multiple comparisons, as in the RPM; it thereby greatly reduces the computational burden and is more feasible for use with multilocus models. Also like MDR, GMDR tends to avoid chance fluctuations due to incorrect grouping arising from type I and II errors. 
For these reasons, we believe that GMDR can serve the purpose of identifying major factors contributing to population variation better than can other existing methods. The software for the reported GMDR method in this study can be downloaded from the GMDR program Web site.
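The relation between the original MDR and a score-based generalization can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GMDR software: we assume that each individual's score is the residual of the phenotype from its expected value (here, with no covariates, simply the overall mean) and that a genotype cell is labeled high-risk when its summed score is nonnegative. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def label_cells_by_score(cells, scores, threshold=0.0):
    """Label a cell high-risk when the sum of its members' scores >= threshold."""
    totals = defaultdict(float)
    for cell, s in zip(cells, scores):
        totals[cell] += s
    return {cell: ("high" if total >= threshold else "low")
            for cell, total in totals.items()}

def label_cells_by_ratio(cells, y):
    """MDR-style rule: high-risk when the cell's case:control ratio is at
    least the overall case:control ratio (1 in a balanced sample)."""
    cases, controls = defaultdict(int), defaultdict(int)
    for cell, yi in zip(cells, y):
        (cases if yi == 1 else controls)[cell] += 1
    n_case, n_ctrl = sum(y), len(y) - sum(y)
    # Cross-multiplied to avoid dividing by zero in cells without controls.
    return {cell: ("high" if cases[cell] * n_ctrl >= controls[cell] * n_case
                   else "low")
            for cell in set(cells)}

# Toy two-locus genotype cells (assumed data) with case (1) / control (0) status.
cells = [("AA", "BB"), ("AA", "BB"), ("AA", "BB"), ("Aa", "Bb"),
         ("Aa", "Bb"), ("Aa", "Bb"), ("aa", "bb"), ("aa", "bb")]
y = [1, 1, 0, 0, 0, 1, 1, 0]

# Assumption: with no covariates, the score is the phenotype minus its mean.
mean_y = sum(y) / len(y)
scores = [yi - mean_y for yi in y]

score_labels = label_cells_by_score(cells, scores)
ratio_labels = label_cells_by_ratio(cells, y)
print(score_labels == ratio_labels)
```

Under these assumptions the two labelings coincide for a dichotomous phenotype, whereas only the score-based rule extends directly to a continuous phenotype or to scores that have been adjusted for covariates.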

Several problems and limitations of the existing MDR methods that have been discussed in the literature,^{14}^{}^{–}^{16} such as the inability to handle continuous phenotypes, have been circumvented within our GMDR statistical framework. The theory of the phenotypic landscape can also give a clearer biological interpretation of joint action. One of the remaining problems is how to evaluate prediction errors for the cells that are empty in the training data set but are not empty in the testing data set. High dimensionality and a small sample usually lead to many such cells, for which the model has no basis for making predictions. One option is simply to leave those cells out when estimating prediction errors. An alternative strategy, as implemented in our GMDR algorithm, is to treat them as misclassification cells when summing the scores of high-risk and low-risk cells. Such a strategy is one way, consistent with statistical parsimony, to impose a penalty on oversubdividing a small sample.
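The two ways of handling cells that are empty in the training set can be compared in a short sketch. It is a simplified illustration and not the GMDR implementation: weighting each testing observation by the absolute value of its score is our assumption, and the function name, cell labels, and data are hypothetical.

```python
def prediction_error(train_labels, test_cells, test_scores, unseen="misclassify"):
    """Score-weighted prediction error on a testing set.

    train_labels: cell -> "high" or "low", learned on the training set.
    unseen: "skip" leaves out testing observations whose cell was empty in
            training; "misclassify" counts them as errors.
    """
    wrong = total = 0.0
    for cell, s in zip(test_cells, test_scores):
        weight = abs(s)  # assumption: observations are weighted by |score|
        label = train_labels.get(cell)
        if label is None:
            if unseen == "skip":
                continue
            wrong += weight  # penalize a cell the model never saw
        elif (label == "high") != (s >= 0):
            wrong += weight  # the cell label disagrees with the score's sign
        total += weight
    return wrong / total if total else float("nan")

# The cell "aa/bb" was empty in training but holds two testing observations.
train_labels = {"AA/BB": "high", "Aa/Bb": "low"}
test_cells = ["AA/BB", "Aa/Bb", "aa/bb", "aa/bb"]
test_scores = [0.5, 0.5, 0.5, -0.5]

err_skip = prediction_error(train_labels, test_cells, test_scores, unseen="skip")
err_penalized = prediction_error(train_labels, test_cells, test_scores)
print(err_skip, err_penalized)
```

In this toy case, leaving out the unseen cell gives an error of 0.5, whereas counting it as misclassified raises the error to 0.75, which shows how the second strategy penalizes models that subdivide a small sample too finely.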

The problem of high-dimensional computation remains with this new approach. The computational expense in the current version is substantial when >10 factors are considered but could be much reduced by limiting the combinations examined to the relatively few that are biologically more plausible.^{58} Initial attempts to use new strategies such as parallel genetic algorithms are also encouraging. We have started to tackle the problem of higher-dimensional computation by incorporating better optimization algorithms. To date, GMDR has been applicable only to population-based (unrelated) observations. Its extension to family-based designs will require further development of the GMDR method.
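The growth of the search space can be made concrete with a short calculation. The sketch assumes an exhaustive search over all factor combinations of a given order and three levels per factor (as for the genotypes of a biallelic SNP); both are our assumptions for illustration, and the function name `search_size` is hypothetical.

```python
from math import comb

def search_size(n_factors, max_order, levels=3):
    """Return (order, number of factor combinations, cells per combination)."""
    return [(k, comb(n_factors, k), levels ** k)
            for k in range(1, max_order + 1)]

# Ten candidate factors, models of up to four factors each.
for order, n_models, n_cells in search_size(n_factors=10, max_order=4):
    print(f"order {order}: {n_models} combinations, {n_cells} cells each")
```

With 10 factors there are already 210 four-factor combinations of 81 cells each, and every combination must be evaluated under cross-validation, so restricting the search to biologically plausible combinations reduces the work directly.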

## Acknowledgments

The original MDR Java source code was downloaded from Epistasis.org. We greatly appreciate Dr. Jason Moore and his colleagues at the Dartmouth Medical School for making their MDR Java source code available for this project. We also thank two anonymous reviewers for their constructive comments and suggestions for the manuscript. This project was funded in part by National Institutes of Health grants DA-12844 (to M.D.L.) and GM28356 (to R.C.E.) and National Science Foundation of China grant 30000097 (to X.Y.L.).

## References

*BDNF* haplotypes in European-American male smokers but not in European-American female or African-American smokers. Am J Med Genet B Neuropsychiatr Genet 139:73–80

*NTRK2*) with vulnerability to nicotine dependence in African-Americans and European-Americans. Biol Psychiatry 61:48–55. doi:10.1016/j.biopsych.2006.02.023

*CHRNA4*) with nicotine dependence. Hum Mol Genet 14:1211–1219. doi:10.1093/hmg/ddi132

**American Society of Human Genetics**

A Generalized Combinatorial Approach for Detecting Gene-by-Gene and Gene-by-Environment Interactions with Application to Nicotine Dependence. American Journal of Human Genetics. Jun 2007; 80(6):1125.
