
# Search for Haplotype Interactions That Influence Susceptibility to Type 1 Diabetes, through Use of Unphased Genotype Data

[…]^{1,2,3}, Faming Liang,^{4} Willem R. M. Dassen,^{5} Pieter A. Doevendans,^{6} and Mathisca de Gunst^{2,7}

^{1}Institute of Mathematics and Statistics, University of Kent at Canterbury, Kent, United Kingdom;

^{2}European Unit for Research and Analysis of Non-Deterministic Operational Models (EURANDOM), Eindhoven, The Netherlands;

^{3}The Chinese Academy of Sciences, Beijing;

^{4}Department of Statistics, Texas A&M University, College Station;

^{5}Department of Cardiology, Maastricht University, Maastricht, The Netherlands;

^{6}Interuniversity Cardiology Institute of The Netherlands, Utrecht; and

^{7}Department of Mathematics, Free University, Amsterdam

**This article has been corrected.** See Am J Hum Genet. 2006 February; 78(2): 360.

## Abstract

Type 1 diabetes is a T-cell–mediated chronic disease characterized by the autoimmune destruction of pancreatic insulin-producing β cells and complete insulin deficiency. It is the result of a complex interrelation of genetic and environmental factors, most of which have yet to be identified. Simultaneous identification of these genetic factors, through use of unphased genotype data, has received increasing attention in the past few years. Several approaches have been described, such as the modified transmission/disequilibrium test procedure, the conditional extended transmission/disequilibrium test, and the stepwise logistic-regression procedure. These approaches are limited either by being restricted to family data or by ignoring so-called “haplotype interactions” between alleles. To overcome this limit, the present study provides a general method to identify, on the basis of unphased genotype data, the haplotype blocks that interact to define the risk for a complex disease. The principle underpinning the proposal is minimal entropy. The performance of our procedure is illustrated for both simulated and real data. In particular, for a set of Dutch type 1 diabetes data, our procedure suggests some novel evidence of the interactions between and within haplotype blocks that are across chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 15, 16, 17, 19, and 21. The results demonstrate that, by considering interactions between potential disease haplotype blocks, we may succeed in identifying disease-predisposing genetic variants that might otherwise have remained undetected.

## Introduction

Insulin-dependent diabetes mellitus (IDDM [MIM 222100]), or type 1 diabetes, is a common chronic disease characterized by autoimmune destruction of pancreatic β cells and complete insulin deficiency (Cordell and Todd ^{1995}; Schranz and Lernmark ^{1998}; Friday et al. ^{1999}). The importance of some genetic factors for the etiology of type 1 diabetes, such as human leukocyte antigen (HLA), has been established unequivocally, although their precise mechanism has not been identified. Evidence that the immune system and apoptosis play a role is accumulating. Both processes contribute to the deterioration of β cells in the islets of Langerhans in the pancreas. Despite this information, no definite genetic cause can be determined in most patients, not even in the presence of a positive family history. In this article, we present a method, for testing the influence of haplotype interactions on developing disease, that can be used when unphased genotypes are available for a number of cases and controls, and we apply this method to genotype data of patients with type 1 diabetes and of healthy controls. Here, as in the article by Bugawan et al. (^{2003}), “haplotype interaction” is defined as the statistical dependence between alleles at different loci.

The increasing availability of polymorphic markers such as SNPs, automated genotyping technology, and large collections of family-based (or case-control–based) data have enabled the design of genomewide screens for several populations. Such screens have led to the location of susceptibility loci for type 1 diabetes in various chromosomal regions, suggesting that type 1 diabetes is a multigenic disorder, in the sense that onset of the disease requires the simultaneous presence of a subset of susceptibility genes. Most recent research efforts have concentrated on HLA genes (see Cox et al. [2001] and Pugliese [^{2001}] for reviews). The importance of the HLA class II haplotypes was shown by Noble et al. (^{2002}) in families with at least two children with insulin-dependent diabetes.

Once a disease-predisposing region has been localized, a number of potentially causative genetic variants may exist in the region, including a large number of SNPs. Whereas, for monogenic diseases, one base change in the coding region of a gene very often is sufficient to cause the disease, for multigenic diseases the effect of any single genetic variant on the risk of the disease may be small, which makes identification of these variants difficult (Drysdale et al. ^{2000}). Furthermore, the following questions related to identification of the multiple risk variants arise. First, it is not clear which combination of variants has a causative role in the disease. Second, it remains unknown whether susceptibility for the disease arises because of the effects of these variants acting independently or because of some important interactions between the variants.

These questions have received increasing attention recently (see, for example, Valdes and Thomson ^{1997}; Cox et al. ^{1999}; Dassen et al. ^{2001}; Cordell and Clayton ^{2002}; Bugawan et al. ^{2003}). Cordell and Clayton (^{2002}) proposed a simple but powerful stepwise logistic-regression procedure that allows for testing the dominance effects of different combinations of polymorphisms, as well as genotype interactions in the analysis of case-control data. In particular, they measured genotype interactions in terms of penetrance for developing disease. However, their approach cannot deal with haplotype interactions, which arise when the haplotype pairs underlying an unphased genotype carry different disease risks. To illustrate this, for the moment, we consider two diallelic variants of interest in a region: variant 1, which takes one of the unphased genotypes *aa, AA,* and *aA;* and variant 2, which takes one of the unphased genotypes *bb, BB,* and *bB*. There are nine possible combinations (also called “genotypes”) observed at the two variants: *aa/bb, aa/bB, aa/BB, AA/bb, AA/bB, AA/BB, aA/bb, aA/bB,* and *aA/BB,* where, for example, *aA/bb* means that the alleles in variants 1 and 2 are {*a*,*A*} and {*b*,*b*}, respectively. All of these genotypes except for *aA/bB* can be uniquely decomposed into a pair of haplotypes. For *aA/bB,* there are two compatible haplotype pairs, (*a*,*b*)/(*A*,*B*) and (*a*,*B*)/(*A*,*b*). The pairing described here indicates that allele *a* is coupled either with allele *b* or with allele *B*.

It is only when these two haplotype pairs have different disease risks that there may be potential disease-predisposing interactions between *a* and *b* or *a* and *B*. As pointed out by a reviewer, even when the haplotype pairs do have different disease risks, it does not necessarily mean that the alleles interact in anything other than a statistical sense, since this phenomenon could occur if alleles *a* and *b,* say, were in linkage disequilibrium with (and, thus, marking a haplotype containing) another predisposing variant not included in the analysis. Note that the stepwise logistic-regression procedure takes genotypes as explanatory variables and, therefore, the possible difference between the effects of the underlying haplotypes on the disease is ignored.

An alternative test is called the “haplotype method” (Valdes and Thomson ^{1997}), which compares the relative frequencies of alleles at a secondary locus on haplotypes that are identical at a primary locus (or loci). The problem with the haplotype method is that, often, the haplotypes are not known. Although one can statistically infer the haplotypes from unphased genotypes, it is unclear how to judge the significance of the results from the haplotype method if we want to take into account the possible haplotyping errors. Several other approaches have been described for simultaneous identification of genetic factors through use of unphased genotype data, such as the modified transmission/disequilibrium test procedure (Cucca et al. ^{2001}) and the conditional extended transmission/disequilibrium test (Koeleman et al. ^{2000}). These approaches are also limited, by being restricted either to family data or to haplotype data. This and the fact that there are 2^{m-1} possible haplotype pairs for a genotype of *m* heterozygous sites, which results in a considerable number of potential haplotype interactions when *m* is large, motivated us to develop a special procedure for testing such interactions. The proposed method is based on minimal entropy, reflecting the principle that a good prediction of haplotype interactions should extract a maximum amount of information from data and, thus, most parsimoniously explain the underlying haplotype structure, given unphased genotypes. In general, the computation of the entropy statistic is very intensive. To solve this problem, we have developed a new Markov chain Monte Carlo algorithm called the “structure-annealing algorithm.”

Two types of approaches for the investigation of interaction can be distinguished: those that consider interaction in the sense of linkage disequilibrium between closely linked loci (Wall and Pritchard ^{2003}) and those that consider interaction in the sense of effects on disease risk (Cordell and Todd ^{1995}; Cordell et al. ^{2001}). In this article, we focus on the linkage disequilibrium approach while investigating interaction between all loci and, hence, also between possibly unlinked loci. For any two haplotype blocks, let us denote by *p*_{1a} and *p*_{2b} the probabilities of occurrence for allele *a* at block 1 and for allele *b* at block 2, respectively. Let *p*_{ab} be the probability of simultaneous occurrence of *a* and *b*. We are trying to test whether, for all *a* and *b,* *p*_{ab}=*p*_{1a}*p*_{2b}. We assess the evidence for interactions between and within (possibly unlinked) haplotype blocks on different chromosomal regions by using a permutation procedure. Since the strength of a linkage disequilibrium pattern is not, typically, a monotonic function of recombination distance when there exist selective forces that favor certain haplotypes over others, as might be the case for type 1 diabetes (Fain and Eisenbarth ^{2001}), we needed to develop an approach that is independent of this distance. Naturally, we are mainly interested in identification of disease-predisposing interactions by comparisons between cases and controls. The disease-predisposing interactions are found in a second stage, by contrasting the interaction patterns observed for patients with the interaction patterns observed for healthy controls. These interactions could facilitate understanding of the pathological mechanisms involved in the disease, as well as the further identification of some haplotype blocks that provide significant association with the disease only when their interactions with other blocks are taken into account.

As an illustration of our method, we present in this article a reanalysis of a set of genotypes that was obtained from a cohort of 89 Dutch patients with type 1 diabetes and 47 healthy control individuals, with a 65-polymorphism detection assay originally designed for unraveling the multigenic cause of atherosclerosis (Dassen et al. ^{2001}). Since both diabetes mellitus and atherosclerosis can be regarded as metabolic diseases with many overlapping biochemical and clinical parameters, the variants that are susceptible to atherosclerosis may also be the cause of type 1 diabetes. Dassen et al. (^{2001}) examined whether certain types of combinations of SNPs confer susceptibility to type 1 diabetes in the cohort by logistic regression and self-learning neural networks. They found that a set of four polymorphisms could predict 79.9% of the cases correctly. However, a significant number of polymorphisms could not be interpreted by their method. Note that all of these variants were selected from the pathways of lipid and homocysteine metabolism, regulation of blood pressure and coagulation, inflammation, cellular adhesion, and matrix integrity. Therefore, we wondered whether the variants that were unexplained in the above-mentioned study may serve as transitive (or supporting) variants, in the sense that they interact with some etiological variants within and between these pathways.

Before we applied the proposed procedure to the above-mentioned Dutch type 1 diabetes data, we evaluated the power of our approach by conducting a simulation study in which four different combinations of mutation and recombination rates were considered. The results are presented below. They suggest that a high accuracy can be achieved if appropriate critical values for our entropy statistics are selected. Note that, although the coalescent model that we have used for our simulations has been shown to be very helpful in modeling haplotype populations (Stephens et al. ^{2001}), it is still not easy to statistically test whether this model fits real data, such as the Dutch type 1 diabetes data. Therefore, the thresholds that were obtained from the simulations were used as a guide to the corresponding parameters as we applied our method to the data. The results of our data analysis show some evidence for a haplotype interaction network that is potentially associated with type 1 diabetes and that includes the up interactions between the haplotype blocks from the chromosome pairs (1,4), (1,12), (1,19), (6,7), and (17,21), as well as the down interactions between blocks from the chromosome pairs (2,7), (3,19), (5,7), (6,21), and (7,11). There are several other less significant pairs. Here, “up interaction” and “down interaction” mean that there exists a significant increase or decrease, respectively, in interaction between two blocks for patients over that for controls. We further found some disease-predisposing intrablock interactions on chromosomes 1, 6, 7, 8, and 11. Finally, we searched for interactions between loci that may account for these block interactions. As a result, a total of 25 potential disease-predisposing interactions between loci are predicted, which indicates 19 gene-gene interactions among 19 candidate genes. Having found four dominant variants (Dassen et al. ^{2001}), we predicted, from the interaction network, 19 transitive variants. Our results clearly demonstrate that, by considering interactions between haplotype blocks, we may succeed in identifying disease-predisposing genetic variants that might otherwise have remained undetected.

## Methods

### Haplotype Likelihood

Let *G*=(*G*_{1},…,*G*_{n})^{T} denote the observed genotypes for *n* individuals from a population, where *G*_{i}=(*g*_{i1},…,*g*_{iL})^{T}, *g*_{ij} is the genotype of individual *i* at locus *j,* and *L* is the total number of observed loci per individual. For simplicity, let *g*_{ij} take values of 0, 1, or 2 for the cases in which the genotype at locus *j* is homozygous and identical to a prespecified reference, homozygous but different from the reference, or heterozygous, respectively. In addition, we let *g*_{ij}=7 if allele 0 is missing at locus *j,* *g*_{ij}=8 if allele 1 is missing, and *g*_{ij}=9 if both alleles are missing. A genotype is called “ambiguous” if it has at least two heterozygous sites. Let *H*=(*H*_{1},…,*H*_{n})^{T}, where *H*_{i}=(*H*_{i1},*H*_{i2}) denotes the unobserved haplotype pair of *G*_{i}, and *H*_{i} belongs to the set of all possible haplotype pairs compatible with *G*_{i}. Given *G*, under the assumption of Hardy-Weinberg equilibrium (Weir ^{1996}, chapter 3), the “haplotype likelihood” can then be written as

$$\mathcal{L}(G \mid H, p) = \prod_{i=1}^{n} p(H_{i1})\,p(H_{i2}), \tag{1}$$

where *p*(·) denotes the population frequency of the corresponding haplotype, and *p*=(*p*_{1},…,*p*_{m0}). Here we assume that, overall, there are *m*_{0} possible haplotypes compatible with *G*.
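The set of haplotype pairs compatible with a genotype can be enumerated directly for small numbers of loci. The following is a minimal Python sketch, assuming the 0/1/2 coding described above; the function names `compatible_pairs` and `haplotype_likelihood` are illustrative and not part of the original method, and the missing-data codes 7, 8, and 9 are deliberately not handled.

```python
from itertools import product
from math import prod


def compatible_pairs(genotype):
    """List the unordered haplotype pairs compatible with one genotype.

    Coding: 0 = homozygous reference, 1 = homozygous non-reference,
    2 = heterozygous. Missing-data codes (7, 8, 9) are not supported here.
    """
    het = [j for j, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        h = tuple(base)
        return [(h, h)]
    pairs = []
    # Haplotype 1 always carries allele 0 at the first heterozygous site, so
    # each unordered pair appears once: 2**(m-1) pairs for m heterozygous sites.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def haplotype_likelihood(H, p):
    """Product over individuals of p(H_i1) * p(H_i2) (Hardy-Weinberg form)."""
    return prod(p[h1] * p[h2] for h1, h2 in H)


pairs = compatible_pairs((2, 2, 0))   # the only ambiguous two-site pattern
freqs = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}
lik = haplotype_likelihood([pairs[0]], freqs)
```

Exhaustive enumeration of this kind grows as 2^{m-1} per genotype, which is why it is practical only for short blocks.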

### Haplotype Entropy

While performing a haplotype inference, we are usually interested only in *H*, and, hence, *p* works as a nuisance parameter in equation (1). Here, we follow Zhang et al. (^{2001}) in eliminating the nuisance parameter by a maximization procedure; that is, we replace *p* in equation (1) with its maximum likelihood estimate (MLE). Thus, we have the following profile log likelihood:

$$l(G \mid H) = \sum_{j=1}^{k_0} s_j \log\frac{s_j}{2n}, \tag{2}$$

where *k*_{0} denotes the number of different haplotypes in *H*, and *s*_{1},…,*s*_{k0} denote their respective frequencies (counts) among the 2*n* haplotypes in *H*. We define *S*(*H*)=−*l*(*G*|*H*), where *S*(*H*) is the entropy of the frequencies of different haplotypes in *H*, and *s*(*G*)=min{*S*(*H*): *H* is compatible with *G*}. Note that *S*(*H*) attains its minimum at *Ĥ*, the MLE of *H* in equation (1), so that

$$s(G) = S(\hat{H}) = -\,l(G \mid \hat{H}).$$

For example, suppose that

Then, there are two possible ways to decompose these genotypes into haplotypes—namely,

and

where *h*_{1}=(0,0,0)^{T}, *h*_{3}=(0,1,0)^{T}, *h*_{5}=(1,0,0)^{T}, *h*_{7}=(1,1,0)^{T}, and *h*_{8}=(1,1,1)^{T}. The corresponding values of the haplotype likelihood shown in equation (1) are

and

where the unknown population frequencies of the five different haplotypes in equation (3) satisfy the equation

and the unknown population frequencies of the four different haplotypes in equation (4) are constrained by

Given *H*_{1}, and under the constraint of equation (5), the maximum of the logarithm of the likelihood in equation (3) is given by

Analogously, given *H*_{2} and under the constraint of equation (6), the maximum of the logarithm of the likelihood in equation (4) is equal to

Obviously, *S*(*H*_{2})<*S*(*H*_{1}). Hence, *s*(*G*)=*S*(*H*_{2}).
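A comparison of this kind can be reproduced on a small data set by brute force. The sketch below uses three hypothetical genotypes (not the data of the example above) whose two decompositions involve five and four distinct haplotypes, respectively. It assumes that *S*(*H*) is computed from haplotype counts as −Σ *s*_{j} log(*s*_{j}/2*n*), with natural logarithms; the helper names are illustrative.

```python
from collections import Counter
from itertools import product
from math import log


def compatible_pairs(genotype):
    """Unordered haplotype pairs for a 0/1/2-coded genotype (2 = heterozygous)."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(H):
    """S(H) = -sum_j s_j * log(s_j / 2n), with s_j the count of haplotype j."""
    counts = Counter(h for pair in H for h in pair)
    total = 2 * len(H)
    return -sum(s * log(s / total) for s in counts.values())


def haplotype_entropy(G):
    """Brute-force s(G): minimise S(H) over every H compatible with G."""
    candidates = product(*(compatible_pairs(g) for g in G))
    best = min(candidates, key=entropy)
    return entropy(best), best


# Hypothetical sample: only the first genotype is ambiguous.
G = [(2, 2, 0), (1, 1, 2), (0, 2, 0)]
s_G, H_hat = haplotype_entropy(G)
```

Here the decomposition that phases the first genotype as (0,0,0)/(1,1,0) reuses haplotypes already present in the other individuals, so it has the smaller entropy, 4 log 3 + 2 log 6, compared with 2 log 3 + 4 log 6 for the alternative.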

In this article, we call *s*(*G*) the haplotype entropy of *G*. The quantity *s*(*G*) measures the diversity of the underlying haplotypes compatible with *G*, since the entropy is a well-known measure of variation for a system in information theory (Jones ^{1979}). The stronger the interactions among the loci of *G*, the less diverse the underlying haplotypes, and the smaller the value of *s*(*G*). To explain this claim intuitively, we consider only three diallelic loci, at which there are eight possible haplotypes: *h*_{1}=(0,0,0)^{T}, *h*_{2}=(0,0,1)^{T}, *h*_{3}=(0,1,0)^{T}, *h*_{4}=(0,1,1)^{T}, *h*_{5}=(1,0,0)^{T}, *h*_{6}=(1,0,1)^{T}, *h*_{7}=(1,1,0)^{T}, and *h*_{8}=(1,1,1)^{T}. Let *p*(*h*_{i}) be the population frequency of *h*_{i} for 1≤*i*≤8. The population haplotype entropy, defined as

$$-\sum_{i=1}^{8} p(h_i)\log p(h_i),$$

is a measure of the diversity of the above haplotype population. In practice, we might have only a sample of genotypes of size *n* (say, *G*), which are assumed to be generated from these haplotypes according to the Hardy-Weinberg equilibrium. Then, the haplotype entropy in equation (2) gives rise to an empirical version of the above population entropy. To see how the haplotype entropy changes as the strength of interaction (i.e., dependence) increases, we first calculate this entropy when there are no interactions among the three loci. In this situation, each of the above eight haplotypes occurs with equal probability 1/8 in individuals. As a result, the haplotype population reaches the highest diversity as the population entropy attains the maximum value of log 8 (Jones ^{1979}, chapter 2). Now, we consider the situation in which there exist some dependences among the three loci. Note that these dependences are apparent as increased frequencies of specific haplotypes compared with what would be expected if alleles at the three loci are combined at random. For example, if we set *p*(*h*_{1})=1/2, *p*(*h*_{5})=1/2, and *p*(*h*_{i})=0, *i*≠1,5, then the three loci are fully determined by the first locus. With the entropy being equal to log 2, the resulting haplotype population yields a smaller diversity than the previous one. We observe that, as an empirical version of the population entropy, the sample-based haplotype entropy is close to its population value when *n* is large. Therefore, in general, the population haplotype entropy (and, thus, its empirical version) tends to decrease as the strength of these dependences increases.
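For reference, the two entropy values quoted above follow directly from the definition of the population entropy (natural logarithms are assumed for the numerical values):

$$-\sum_{i=1}^{8}\tfrac{1}{8}\log\tfrac{1}{8} = \log 8 \approx 2.079, \qquad -\left(\tfrac{1}{2}\log\tfrac{1}{2}+\tfrac{1}{2}\log\tfrac{1}{2}\right) = \log 2 \approx 0.693 .$$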

### Testing for Interaction between Two Haplotype Blocks

Let *G*=(*G*_{1},…,*G*_{n})^{T} be partitioned into two blocks, say,

$$G_i = \bigl(G_i^{(1)},\, G_i^{(2)}\bigr), \qquad i=1,\dots,n.$$

Suppose that we are interested in testing whether there exists interaction between the two blocks *G*^{(1)} and *G*^{(2)}. This problem can be stated as testing the hypothesis that the two blocks are independent (i.e., the null hypothesis) versus the hypothesis that the two blocks are dependent (i.e., the alternative hypothesis). As pointed out in the “Haplotype Entropy” subsection, if the null hypothesis is true, *s*(*G*) will tend to have a large value; otherwise, it will tend to be small. Hence, *s*(*G*) can be used as a test statistic for this test. Because the distribution of *s*(*G*) under the null hypothesis is unknown, the following procedure is designed to calculate the *P* values of the test:

- Step 1: Generate *n*^{′} random permutations of (*G*^{(2)}_{1},…,*G*^{(2)}_{n}), and denote them by (*G*^{(2)}_{j,1},…,*G*^{(2)}_{j,n}), *j*=1,…,*n*^{′}.
- Step 2: Form a random sample *G*^{*}_{j}, *j*=1,…,*n*^{′}, where *G*^{*}_{j} is formed by pairing (*G*^{(1)}_{1},…,*G*^{(1)}_{n}) with (*G*^{(2)}_{j,1},…,*G*^{(2)}_{j,n}).
- Step 3: Calculate the haplotype entropy for each *G*^{*}_{j}. An empirical *P* value can then be defined by the proportion of values of *s*(*G*^{*}_{j}) that are ≤*s*(*G*); that is, #{*s*(*G*^{*}_{j}): *s*(*G*^{*}_{j})≤*s*(*G*)}/*n*^{′}.

The number *n*^{′} is usually set to a moderate number. For example, it is either 500 or 1,000 in this study.

On the basis of the central limit theorem, an empirical *Z* score statistic,

$$Z = \frac{s(G) - A}{\sqrt{V}},$$

can also be defined for the test, where *A* and *V* are the sample mean and variance, respectively, of the values of *s*(*G*^{*}_{j}). The empirical *P* value calculated in step 3 can be used to examine whether the between-block interaction existing in *G* was obtained by chance, whereas the empirical *Z* score statistic measures more sensitively the distance between the genotypes under investigation and the population of genotypes without block interactions.
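The permutation procedure and the *Z* score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the argument `stat` stands in for the haplotype entropy *s*(*G*) (any statistic that is small under dependence will do), and the toy statistic `mismatches` and the data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def permutation_test(block1, block2, stat, n_perm=500, seed=0):
    """Empirical P value and Z score for dependence between two blocks.

    block1, block2: per-individual genotype segments (tuples), same length n.
    stat: callable on the list of joined genotypes; plays the role of s(G).
    """
    rng = random.Random(seed)
    observed = stat([a + b for a, b in zip(block1, block2)])
    null = []
    for _ in range(n_perm):
        shuffled = list(block2)
        rng.shuffle(shuffled)                               # Step 1: permute block 2
        joined = [a + b for a, b in zip(block1, shuffled)]  # Step 2: re-pair
        null.append(stat(joined))                           # Step 3: recompute
    p_value = sum(v <= observed for v in null) / n_perm
    sd = sqrt(variance(null))
    z = (observed - mean(null)) / sd if sd > 0 else 0.0
    return p_value, z


def mismatches(joined):
    """Toy stand-in for s(G): small when the two single-locus blocks agree."""
    return sum(g[0] != g[1] for g in joined)


# Hypothetical data: two perfectly dependent single-locus blocks.
block = [(0,)] * 10 + [(1,)] * 10
p_value, z = permutation_test(block, block, mismatches)
```

For perfectly dependent blocks such as these, the observed statistic lies far below the permutation distribution, giving a small *P* value and a strongly negative *Z* score.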

The above procedure will be used below to test the significance of the pairwise interactions among haplotype blocks or loci. In each case, the significance of an interaction will be decided by a threshold for *P* values. Assessment of the overall significance, to account for multiple testing, is not straightforward, because there are many correlations among the tests. An alternative approach is to control the false discovery rate (FDR), which is defined by the expected proportion of false positives among those called significant: *E*[*V*^{*}/*R*^{*}|*R*^{*}>0]. Here, for a given threshold, *V*^{*} is the total number of false positives, and *R*^{*} is the total number of interactions called significant according to the threshold. We opt for the recent proposal of Storey and Tibshirani (^{2003}) to estimate the FDR and calculate the *q* value, a measure of statistical significance in terms of FDR, for each individual test under dependence.
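As a rough illustration of how *q* values are obtained from a list of *P* values, the following Python sketch uses a simplified variant of the Storey and Tibshirani estimator. It is an assumption-laden simplification: the proportion of true null hypotheses is estimated at a single fixed tuning value `lam` rather than by smoothing over a range of values, and the input *P* values are hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Simplified q values: pi0 estimated at one fixed lambda (no smoothing)."""
    m = len(p_values)
    # Estimated proportion of true nulls, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


# Hypothetical P values from eight tests.
q = q_values([0.001, 0.002, 0.003, 0.004, 0.2, 0.3, 0.6, 0.9])
```

Calling a test significant when its *q* value is below a chosen level then bounds the estimated FDR among the tests called significant at that level.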

### Structure Annealing Algorithm

In this section, we propose a new algorithm, the so-called “structure annealing algorithm,” to minimize *S*(*H*). The algorithm is proposed on the basis of the following observation. Let *G*=(*G*^{(1)},*G*^{(2)}) be a random partition of *G*, and let *H*=(*H*^{(1)},*H*^{(2)}) be the corresponding partition of *H*. It is easy to see that, if *H* is compatible with *G*, then *H*^{(1)} is compatible with *G*^{(1)}. Furthermore, if *S*(*H*^{(1)}) is a good approximation of *s*(*G*^{(1)}), then *S*(*H*^{(1)},*H*^{(2)}) should be a good approximation of *s*(*G*), provided that *H* is compatible with *G* and the number of loci in *G*^{(2)} is not large. This observation motivated our use of the following sequential way to minimize the objective function *S*(*H*).

Suppose now that *G* is partitioned into *z* blocks, *G*=(*G*^{(1)},…,*G*^{(z)}), where *G*^{(b)} comprises *k*_{b} loci and *k*_{1}+…+*k*_{z}=*L*. It is preferable that *k*_{b} be set to a small number; for example, *k*_{b}≤8 for all examples in this article. The structure annealing algorithm consists of two building blocks: a local updating algorithm and an extrapolation algorithm. The local updating algorithm (described in appendix A) is designed to simulate from the distributions

$$f_b\bigl(H^{(1:b)}\bigr) \propto \exp\bigl\{-S\bigl(H^{(1:b)}\bigr)/t_b\bigr\}$$

for *b*=1,…,*z*, where *t*_{b} is called the “temperature” of this distribution, and *H*^{(1:b)}=(*H*^{(1)},…,*H*^{(b)}), which is compatible with *G*^{(1:b)}=(*G*^{(1)},…,*G*^{(b)}). The extrapolation algorithm (described in appendix B) is designed to extrapolate *H*^{(1:b)} to *H*^{(1:b+1)}. The structure annealing algorithm starts with the simulation from *f*_{1} by the local updating algorithm, where […]. Since the block size of *G*^{(1)} is usually small, the iteration number of the local updating steps is also moderate at this step. We denote this iteration number by *m*_{1}, and set *m*_{1}=10,000 for all examples in this article. Then, the algorithm proceeds for *z*-1 steps. The (*b*+1)th step consists of two substeps, which are described as follows.

- 1. Extrapolation: extrapolate the haplotype *H*^{(1:b)}, which is obtained at the last iteration of the *b*th step, to a compatible haplotype pair of *G*^{(1:b+1)}.
- 2. Local updating: simulate from the distribution *f*_{b+1} by the local updating algorithm for *m*_{b+1} steps.

The iteration number *m*_{b} is a monotone increasing function of *b;* for example, we set *m*_{b}=*m*_{1}×*b* for *b*=1,2,…,*z*-1 and *m*_{z}=10×*m*_{1}×*z*. Here, as in the article by Kirkpatrick et al. (^{1983}), we set a large iteration number for the simulation at the last step.
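Because appendixes A and B are not reproduced in this section, the following Python sketch does not implement the structure annealing algorithm itself. It shows two simpler things under clearly labeled assumptions: the iteration schedule stated above, and a plain simulated-annealing search (Metropolis acceptance with geometric cooling, in the spirit of Kirkpatrick et al.) that minimizes *S*(*H*) by re-phasing one individual at a time. All function names, the cooling schedule, and the data are hypothetical, and *S*(*H*) is assumed to be computed from haplotype counts.

```python
import random
from collections import Counter
from itertools import product
from math import exp, log


def iteration_schedule(m1, z):
    """m_b = m1 * b for b = 1..z-1, and m_z = 10 * m1 * z, as stated in the text."""
    return [m1 * b for b in range(1, z)] + [10 * m1 * z]


def compatible_pairs(genotype):
    """Unordered haplotype pairs for a 0/1/2-coded genotype (2 = heterozygous)."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(H):
    """S(H) = -sum_j s_j * log(s_j / 2n), with s_j the count of haplotype j."""
    counts = Counter(h for pair in H for h in pair)
    total = 2 * len(H)
    return -sum(s * log(s / total) for s in counts.values())


def anneal(G, n_iter=2000, t_start=1.0, t_end=0.01, seed=0):
    """Plain simulated annealing over phase choices (NOT structure annealing)."""
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in G]
    current = [rng.choice(opts) for opts in options]
    cur_s = entropy(current)
    best, best_s = list(current), cur_s
    for k in range(n_iter):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (k / max(1, n_iter - 1))
        i = rng.randrange(len(G))
        proposal = list(current)
        proposal[i] = rng.choice(options[i])   # re-phase one individual
        new_s = entropy(proposal)
        if new_s <= cur_s or rng.random() < exp(-(new_s - cur_s) / t):
            current, cur_s = proposal, new_s
            if cur_s < best_s:
                best, best_s = list(current), cur_s
    return best_s, best


schedule = iteration_schedule(10_000, 3)
G = [(2, 2, 0), (1, 1, 2), (0, 2, 0)]          # hypothetical genotypes
best_s, best_H = anneal(G)
```

The sketch recomputes the entropy from scratch at every proposal, which is adequate for a toy sample but would need incremental count updates for realistic data.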

## Results

### Simulated Data Sets

We used a coalescent-based program called MS, by R. Hudson, to simulate haplotypes for the four different situations described by quantities (θ,*R*)=(4,0),(4,4), (4,20), and (16,16). Here θ=4*N*_{e}μ, *R*=4*N*_{e}*r*, *N*_{e} is the effective population size, μ is the total per-generation mutation rate across the region sequenced, and *r* is the length, in morgans, of the region sequenced.

For each setting of (θ,*R*), this generated 40 independent data sets, each containing 40 haplotypes. For each data set, the haplotypes were randomly paired to form 20 genotypes. As a result, for each case of (θ,*R*), we had 40 sets of 20 genotypes. They are denoted by *G*_{1},…,*G*_{40}, with *G*_{i}=(*G*_{i,1},…,*G*_{i,20})^{T}. We split each *G*_{i,j} into two parts of equal length, *G*^{(1)}_{i,j} and *G*^{(2)}_{i,j}, for *i*=1,…,40 and *j*=1,…,20. In total, we have 80 genotype segments. With these segments, 20 new data sets, which are denoted by *G*^{*}_{1},…,*G*^{*}_{20}, are formed, where *G*^{*}_{k} is formed by attaching the segment *G*^{(2)}_{20+k,j} to the segment *G*^{(1)}_{k,j} for *k*=1,…,20. The above construction procedure shows that there are two independent blocks in each *G*^{*}_{k}.
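The construction of the independent-block data sets can be sketched as follows. This is a minimal Python illustration with hypothetical toy data (four data sets of two genotypes each, instead of 40 data sets of 20); the function names are illustrative, and genotypes are assumed to be stored as tuples.

```python
def split_halves(genotype):
    """Split one genotype into two segments of (near) equal length."""
    mid = len(genotype) // 2
    return genotype[:mid], genotype[mid:]


def independent_sets(data_sets):
    """Pair first halves from data set k with second halves from set half + k.

    data_sets: list of data sets, each a list of genotypes (tuples).
    Returns len(data_sets) // 2 new data sets whose two blocks come from
    independently simulated data sets.
    """
    half = len(data_sets) // 2
    out = []
    for k in range(half):
        first = [split_halves(g)[0] for g in data_sets[k]]
        second = [split_halves(g)[1] for g in data_sets[half + k]]
        out.append([a + b for a, b in zip(first, second)])
    return out


# Hypothetical toy input: 4 data sets, 2 genotypes each, 4 loci per genotype.
data = [
    [(0, 0, 1, 1), (2, 2, 0, 0)],
    [(1, 1, 1, 1), (0, 0, 0, 0)],
    [(2, 0, 2, 0), (1, 2, 1, 2)],
    [(0, 1, 2, 0), (2, 1, 0, 1)],
]
recombined = independent_sets(data)
```

Because the two segments of every recombined genotype originate from separately simulated data sets, any dependence between the two blocks can arise only by chance.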

In the following, we will regard *G*^{*}_{1},…,*G*^{*}_{20} as samples from a population in which the two genotype blocks are independent, whereas we will regard *G*_{1},…,*G*_{20} as samples from a population in which the two genotype blocks are dependent. To evaluate the power of our procedure, we applied it to these genotype data sets. The resulting *P* values and *Z* scores are summarized in figure 1. To find the interesting blocks, we further analyzed these *P* values by setting lower and upper thresholds of .01 and .15. We say two blocks are dependent if the corresponding *P* value is ≤.01, whereas we say they are independent if the corresponding *P* value is ≥.15. The performance of our procedure is measured by the proportions of false positives and negatives, *F*_{a} and *F*_{n}. That is, *F*_{a} is the proportion of false rejections of the null hypothesis when the null hypothesis is true, and *F*_{n} is the proportion of false nonrejections of the null hypothesis when the alternative is true. For the above simulated data, we have (*F*_{a},*F*_{n})=(0,2/20) when (θ,*R*)=(4,0), and (*F*_{a},*F*_{n})=(0,0) when (θ,*R*)=(4,4), (4,20), and (16,16). These results show that our procedure is, indeed, an effective tool for detecting haplotype interactions. As pointed out in the Introduction, the coalescent model can capture certain main features in a haplotype population (Stephens et al. ^{2001}). The above simulated coalescent models might share some common features with real haplotype data. Thus, these thresholds were used to guide our choice of the corresponding thresholds when we applied our method to the Dutch type 1 diabetes data below.

### Type 1 Diabetes Data

Thirty-six candidate genes, listed in table 1, were selected from pathways that are potentially implicated in the development and progression of atherosclerosis: lipid and homocysteine metabolism, regulation of blood pressure and coagulation, inflammation, cellular adhesion, and matrix integrity (Cheng et al. ^{1999}; Dassen et al. ^{2001}). They have all been reported in the Online Mendelian Inheritance in Man (OMIM) database. Dassen et al. (^{2001}) described an assay, for genotyping a panel of 65 SNPs that represent variation within these genes, that is an early version of RMS Research Assay for Cardiovascular Disease Genetics designed by Roche Molecular Systems. Most of these SNPs have been shown to be implicated with some metabolic diseases, such as cardiovascular disease, coronary artery disease, hypertension, asthma, obesity, atherosclerosis, myocardial infarction, hyperlipidemia, Alzheimer disease, and others (see table 1 and ^{OMIM} for more details). The rest of these SNPs are either the polymorphisms at (or close to) the promoter regions that may (directly or indirectly) play certain dysregulation roles for the genes of interest or the polymorphisms at coding regions with nonsynonymous changes (Cheng et al. ^{1999}; Dassen et al. ^{2001}; Flori et al. ^{2003}; Vatay et al. ^{2003}). For example, V67 was selected because it could have a protective role against type 2 diabetes (NIDDM) (Vatay et al. ^{2003}). V66 was included because it often interfered with our ability to call V67 correctly. We had no prior functional information, other than that its proximity to V67 could mean that it would also have impact on the function of the gene TNF. As pointed out in the Introduction, since both diabetes mellitus and atherosclerosis can be regarded as metabolic diseases with many overlapping biochemical and clinical parameters, the variants that are susceptible to atherosclerosis may also be the cause of type 1 diabetes. 
Therefore, this assay was also applied to a Dutch cohort with diabetes that includes 136 unrelated individuals (89 patients with type 1 diabetes with impaired endothelial function and 47 healthy control individuals). Endothelial function was assessed by measuring changes in forearm blood flow after pharmacological interventions. The DNA samples from the 136 individuals were genotyped by use of PCR. This led to 136 genotypes of 65 loci. Nine loci (V58, V59, V66, V67, V5, V57, V51, V52, and V30) were not used in the following data analysis, since these loci have the so-called “heavy-missing” problem, in which ≥21% of the 136 individual genotypes were incomplete in the PCR experiments. The heavy-missing problem may introduce bias into our data analysis. The cutoff point of 21% was selected on the basis of our experience. We ended up with a 136×56 data matrix. Each genotype can be divided into 16 blocks, according to their chromosome identities (see table 1 for more details).

We started with the search for pairwise interactions among these 16 unlinked blocks. The search was performed on the cases and controls separately. The *P* values for the cases and controls were compared by plotting them in graphs, as shown in figures 2 and 3. We obtained 10 pairs of interacting blocks, located on chromosome pairs (1,4), (1,12), (1,19), (2,7), (3,19), (5,7), (6,7), (6,21), (7,11), and (17,21) (see table 2 for more details). These block pairs were selected by use of the following criteria: for the up interaction, we claimed that there was an increase in haplotype interaction if the *P* value of the controls was >.15, the *P* value of the cases was ≤.01, and the *Z* score of the cases was ≤−2. This says that, in contrast to the healthy individuals, there is a significant interaction between two haplotype blocks under consideration in the individuals with disease. For the down interaction, we claimed that there was a decrease in haplotype interaction if the *P* value of the cases was >.15, the *P* value of the controls was ≤.01, and the *Z* score of the controls was ≤−2. This implies that, in contrast to the healthy individuals, there is no significant interaction between two haplotype blocks under consideration in the individuals with disease. Among these selected blocks, the up-interaction pairs for chromosome pairs (1,4), (1,12), (1,15), (1,19), (6,7), and (17,21) indicate that the pathways harboring these variants may have been modified by adding some interactions between some genes in the individuals who had the disease. Analogously, the down-interaction pairs on chromosome pairs (3,19), (2,7), (6,21), (7,11), (5,7), and (12,15) indicate that the related pathways may have been changed, since interactions between some genes are disrupted. Note that, with the *P* value thresholds .01 and .15 for cases and controls, respectively, the corresponding estimated FDRs of these multiple tests for cases and controls are 0.017 and 0.029. 
There will be more interaction pairs if we take .035 and .2 as the thresholds for cases and controls, respectively. The FDRs will then become 0.040 and 0.048 (see table 2 for more details).
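The selection rule above can be sketched as a small function. This is only an illustration: the function name and argument names are hypothetical, and it assumes the .01 and −2 cutoffs are inclusive upper bounds, which the text does not state explicitly.

```python
def classify_interaction(p_case, z_case, p_ctrl, z_ctrl,
                         p_sig=0.01, p_null=0.15, z_cut=-2.0):
    """Label one block pair as 'up', 'down', or None.

    Thresholds default to the values quoted in the text; treating the
    cutoffs as inclusive is an assumption of this sketch.
    """
    # Up interaction: significant in cases, clearly absent in controls.
    if p_ctrl > p_null and p_case <= p_sig and z_case <= z_cut:
        return "up"
    # Down interaction: significant in controls, clearly absent in cases.
    if p_case > p_null and p_ctrl <= p_sig and z_ctrl <= z_cut:
        return "down"
    return None
```

Passing `p_sig=0.035` and `p_null=0.2` would correspond to the looser thresholds mentioned in the text.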

**Figure 2.** *P* values of testing the interactions of blocks 1–8 with the other blocks, for the cases and controls in the Dutch type 1 diabetes data. The dotted lines are for the cases, and the lines with small triangles are for the controls.

**Figure 3.** *P* values of testing the interactions of blocks 9–16 with the other blocks, for the cases and controls in the Dutch type 1 diabetes data. The dotted lines are for the cases, and the lines with small triangles are for the controls.

To see how these interactions modify the related pathways, we ran our procedure on the pairs of variants on these blocks. Consequently, 25 pairs of variants were found to show certain evidence of susceptibility to the disease. Table 2 indicates that these variants are distributed on 19 genes: NPPA, SELE, APOB, AGTR1, ADRB2, LPA, TNF, TNFb, DCP1, ADD1, SCNN1A, APOE, NOS3, LPL, LIPC, PON1, CBS, APOA4, and APOC3. Note that APOB, ADRB2, LPA, APOE, LPL, LIPC, PON1, and APOA4 are on the pathway of lipid metabolism; CBS is on the pathway of homocysteine metabolism; NPPA, AGTR1, ADRB2, DCP1, SCNN1A, and NOS3 are on the pathway of blood pressure; SELE is on the pathway of coagulation; SELE, TNF, and TNFb are on the pathway of inflammation; and ADD1 is on the pathway of matrix integrity. Thus, within the pathway of lipid metabolism there are seven up or down interactions, denoted by the symbols (+) and (−) respectively, among some genes. They are V9:V22 (APOB:LPL) (+), V8:V20 (APOB:LIPC) (−), V4:V26 (LPA:PON1) (+), V26:V7 (PON1:APOA4) (−), V25:V10 (PON1:APOC3) (−), and V25:V12 (PON1:APOC3) (−). These interactions are predisposing to the disease. Similarly, within the pathway of blood pressure, there is one down interaction: V50:V38 (ADRB2:NOS3) (−). The rest are related to interactions among the six pathways mentioned above. Here, up interaction (down interaction) describes the biological phenomenon through which the pathways of lipid metabolism, homocysteine metabolism, blood pressure, inflammation, and matrix integrity are modified by the creation (disruption) of interactions among some genes that lie in these pathways. Similar to what Sudbery (^{1998}, p. 144) has suggested, the up interactions would suggest that those interactions lead to a susceptibility to the disease, whereas the down interactions could imply that the related interactions may have a protective effect against developing the disease.
These results indicate a complicated feature of (possibly nonmultiplicative) effects of the interactions on the risk for type 1 diabetes.

Note that Dassen et al. (^{2001}) have identified a set of dominant variants—V4, V15, V28, and V50—that are on chromosomes 6, 11, 19, and 5, respectively. This, combined with the above results, yields the following transitive and disease-predisposing variants, in the sense that there are significant increases (or decreases) of interactions of these variants with some dominant variants: V26, V37, V38, V39, V7, V8, V10, V11, V12, V13, V65, V68, V20, V25, and V47.

In the next step, we screened for interactions in linked regions. For simplicity, we adopted the following strategy. Taking block 1 as an example, we sequentially tested six subblock pairs for the cases and controls: the first pair was {1},{2,3,4,5,6,7}, with 1 being the splitting location; the second pair was {1,2},{3,4,5,6,7}, with 2 being the splitting location; and so on. Here, the numbers 1, 2, 3, 4, 5, 6, and 7 denote the seven variants in block 1. The six subblock pairs are uniquely defined by six splitting locations: 1, 2, 3, 4, 5, and 6. We compared the resulting six pairs of *P* values and *Z* scores in table 3. The comparison suggests that there exists some disease-predisposing interaction between subblock pairs {1,2,3,4} and {5,6,7}. Following the same argument as above, for block 6 we may conclude that variant V64 might be a transitive disease-predisposing variant, because the dominant variant, V4, is at subblock {1,2,3}. The evidence of disease-predisposing interactions within the other blocks is reported in table 4, which yields the transitive variant V14. Note that, in practice, we need to test the interactions for all bipartitions of the seven loci, since the strength of linkage disequilibrium is not, typically, a monotonic function of genetic distance. Our procedure can be easily extended to this general setting, since it does not use any information on genetic distances among these loci.
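To make the difference between the sequential strategy and the general setting concrete, the following sketch enumerates both kinds of split for a block of seven variants. The function names are hypothetical; only the enumeration is shown, not the interaction test itself.

```python
from itertools import combinations

def sequential_splits(variants):
    """Splits defined by one cut point: ({1},{2..7}), ({1,2},{3..7}), ..."""
    return [(variants[:k], variants[k:]) for k in range(1, len(variants))]

def all_bipartitions(variants):
    """Every unordered split of the variants into two non-empty subblocks."""
    first, rest = variants[0], variants[1:]
    splits = []
    # Fix the first variant on the left side so each split is counted once;
    # r < len(rest) guarantees that the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [v for v in rest if v not in extra]
            splits.append((left, right))
    return splits

block1 = [1, 2, 3, 4, 5, 6, 7]
seq = sequential_splits(block1)
full = all_bipartitions(block1)
```

For seven variants the sequential strategy gives 6 tests, whereas the full set of bipartitions contains 2^6 − 1 = 63, which illustrates why the simpler strategy was convenient.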

## Discussion

The logistic regression mentioned in the Introduction is a very important genotype-based tool for detecting dominant polymorphisms and epistatic effects (i.e., genotype interactions) that are associated with disease. One disadvantage of this method relative to some haplotype-based methods is that it ignores the potential disease-predisposing haplotype interactions. To contend with this disadvantage, we have presented a procedure for evaluating the contributions of these haplotype interactions to susceptibility to disease, in which entropy is used to measure the diversity of a haplotype population. Our procedure can be easily generalized to other measures of haplotype diversity (Weir ^{1996}; Clayton ^{2002}). Of course, for applications, we should combine these two methods, in order to extract more complete information from unphased genotype data, in the following steps: first, apply the logistic regression to detect dominant disease-predisposing variants and genotype interactions; then, as a complement, use our procedure to find potential haplotype interactions; finally, predict the transitive variants by finding the variants that interact with the dominant ones.

In the first step, we assume a sample of *n*_{1} cases and *n*_{2} controls, each of whom is genotyped at *m* polymorphisms. Let *p*_{j} be the probability of individual *j* being a case rather than a control. Following McCullagh and Nelder (^{1989}), we model *p*_{j} as

$$\log\frac{p_j}{1-p_j}=\beta_0+\beta_1x_1+\cdots+\beta_mx_m,$$

where *x*_{1},…,*x*_{m} are covariates depending on the genotypes of the individual, and β_{0},…,β_{m} are coefficients to be estimated. To examine the effects of a set of polymorphisms, we can test whether the data are significantly better represented when these polymorphisms are included in the model than when they are not, through use of likelihood-ratio tests (Cordell and Clayton ^{2002}). This is equivalent to testing whether the corresponding coefficients are significantly different from 0. Similarly, we can account for the genotype interactions by adding some epistatic terms to the above model. A commonly used strategy for evaluation of the effects of the different polymorphisms is to fit these models in a stepwise fashion. Following Cordell and Clayton (^{2002}), for the Dutch type 1 diabetes data, we first code *x*_{j}=-0.5, 0.5, and 0.5 for genotypes 0, 2, and 1, respectively, and we also code 0.5 for the cases in which genotypes are missing. We set .05 as a nominal significance level for all the tests involved in the stepwise logistic-regression procedure. This yields seven dominant disease-susceptibility alleles on chromosomes 3, 6, 7, 6, 11, 19, and 2, respectively: V41(AA), V4(TT), V26(GG), V64(GG), V15(GG), V28(−), V9(missing), and one genotype interaction between V41(AA) and V64(GG), where, for example, in the notation “V41(AA),” V41 is the name of the variant and (AA) is one of its alleles (see table 5). The result is slightly different from that of the prediction-based logistic-regression procedure of Dassen et al. (^{2001}); this might be because different criteria were used.
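As a rough illustration of the likelihood-ratio test described above, the sketch below fits a logistic model with a single coded covariate by Newton-Raphson and compares it with the intercept-only model. It is a simplification, not the stepwise procedure used in the article: all function names are hypothetical, the toy counts are invented for the example, and the 1-df chi-square tail is computed through the identity P(χ²₁ > s) = erfc(√(s/2)).

```python
import math

def code_genotype(g):
    """Coding quoted in the text: genotype 0 -> -0.5; genotypes 1, 2, and missing -> 0.5."""
    return -0.5 if g == 0 else 0.5

def log_lik(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the model logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        # y * eta - log(1 + exp(eta)), written in a numerically stable form
        total += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def fit_logistic(xs, ys, iters=50):
    """Newton-Raphson for intercept b0 and slope b1 (2x2 system solved by hand)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

def lr_test(xs, ys):
    """Likelihood-ratio test of b1 = 0 against the one-covariate model."""
    b0, b1 = fit_logistic(xs, ys)
    ybar = sum(ys) / len(ys)
    ll_full = log_lik(b0, b1, xs, ys)
    ll_null = log_lik(math.log(ybar / (1.0 - ybar)), 0.0, xs, ys)
    stat = 2.0 * (ll_full - ll_null)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return b1, stat, p_value

# Invented toy data: genotype 0 has 10 cases and 20 controls,
# genotype 1 has 30 cases and 10 controls.
genos = [0] * 30 + [1] * 40
status = [1] * 10 + [0] * 20 + [1] * 30 + [0] * 10
xs = [code_genotype(g) for g in genos]
b1, stat, p_value = lr_test(xs, status)
```

With a single binary covariate the fitted slope equals the log odds ratio of the 2×2 table, which gives a quick sanity check on the fit.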

In the second step, we start with a search for the haplotype interaction between blocks located on different chromosomes, followed by testing of the interactions within each block. If two blocks are found to interact, we can further narrow the search area to identify which variants in the blocks are involved in this interaction. For the Dutch type 1 diabetes data, in the “Results” section we have shown nine pairs of interacting blocks that are predisposed to type 1 diabetes. Combining these with the result from the first step, we can infer some transitive disease-predisposing variants, as shown in the “Results” section. The results demonstrate a complicated gene-gene interaction network, which might predispose to type 1 diabetes through modifying the pathways of lipid metabolism, blood pressure, inflammation, coagulation, and matrix integrity.

Use of interaction between unlinked genomic regions has been suggested for improving power to detect loci of small effect on the disease phenotype; for example, in type 1 diabetes (Cordell et al. ^{1995}, ^{2000}; Bugawan et al. ^{2003}), type 2 diabetes (Cox et al. ^{1999}), and inflammatory bowel disease (Cho et al. ^{1998}). Cordell et al. (^{1995}) reported that there are interactions between the loci IDDM1 (chromosome 6p21) and IDDM2 (chromosome 11p15) and between the loci IDDM1 and IDDM4 (on chromosome 11q13.3), in the context of the logistic regression model. Cox et al. (^{1999}) showed that the loci on chromosomes 2 and 15 interact to increase susceptibility to type 2 diabetes, in the context of a nonparametric LOD score. Cox et al. (^{2001}) performed a systematic screen for correlation between family-specific nonparametric LOD scores to evaluate evidence of interactions between some unlinked regions on chromosomes 1, 2, 3, 4, 6, 11, and 19. These methods are usually restricted to family data. Unlike these authors, we focus here on interactions between genetic variants in a list of potential candidate genes across a number of chromosomes, where some of these variants have already been shown to be associated with some metabolic diseases. Moreover, the proposed approach is designed for unphased genotype data (possibly with missing values) from case-control studies. Thus, our method could be a valuable contribution to a genomewide association study of a complex disease, especially when direct determination of the molecular haplotypes from experiment or family data is not feasible.

Although significant and consistent linkage evidence was reported for the susceptibility intervals IDDM8 (on chromosome 6q27), IDDM4 (on 11q), and IDDM5 (on 6q25), evidence for most other intervals varies in different data sets, probably because of a weak effect of the disease genes, genetic heterogeneity, random variation, or inappropriate correction for multiple tests (see Pugliese ^{2001}). To reduce the possible effect of genetic heterogeneity, we need to confirm our initial finding by analyzing other populations in future studies. Since we compared correlated variants, it is important to take into account the potential effects of multiple tests on the power of our procedure. For our case, there are 120 pairwise tests among 16 haplotype blocks. A simple Bonferroni (or Dunn-Sidak) correction leads to the adjusted threshold of 4.17×10^{-4} for *P* values if we want to achieve the significance level of .05; there are only seven block pairs in table 2 that remained nominally significant after this correction. Such a correction seems too conservative, because of high dependences among these tests. This has been confirmed by Bugawan et al. (^{2003}) on the basis of a permutation procedure. Unfortunately, using resampling methods such as permutation can be computationally prohibitive in our case. However, we have shown that the recently developed procedure of Storey and Tibshirani (^{2003}) is applicable to our setting.
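The corrected thresholds quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative only.

```python
from math import comb

n_blocks = 16
alpha = 0.05

# Number of unordered block pairs tested.
n_tests = comb(n_blocks, 2)

# Bonferroni: divide the overall level by the number of tests.
bonferroni = alpha / n_tests

# Dunn-Sidak: exact per-test level under independence.
sidak = 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

print(n_tests, f"{bonferroni:.3g}", f"{sidak:.3g}")
```

Both corrections give nearly the same per-test level, and both ignore the dependence among the tests, which is why the text regards them as conservative here.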

## Acknowledgments

We are grateful to the Editor, the Deputy Editor, and other members of the Editorial Board, as well as to two anonymous reviewers, for their very constructive comments that have led to improvement of the presentation and results of the present article. We thank Roche Molecular Systems (Alameda, CA) for providing the multiplex genotyping assays, under a research collaboration. We thank Drs. Suzanne Cheng and Paul Schiffers for kindly providing some information on the SNPs used in this paper. We also thank Professors W. van Zwet, W. J. M. Senden, and M. Vingron and Dr. P. Lindsey for useful discussions. This work was partially performed when the first author was working for EURANDOM, in The Netherlands. The work was supported, in part, by the Programme of Computational Molecular Biology of EURANDOM, by the Institute for Mathematical Sciences, National University of Singapore, and by grant No. 01/1/21/19/217 from the Biomedical Research Council of Singapore.

## Appendix A: Local Updating

The local updating algorithm includes two operators, ν-mutation and peer learning. In every iteration, one of them is selected to be performed, with probabilities 0.2 and 0.8, respectively. Of course, the probabilities can be tuned by the user, but a large probability is usually assigned to the peer learning operator, since it tends to force the haplotypes to coalesce. The two operators are described below.

#### ν-Mutation Operator

In the ν-mutation operator, a total of *max*{1,γ_{b}ν} haplotype pairs at the heterozygous (*g*_{ij}=2) or missing (*g*_{ij}=9,8,7) loci are randomly selected to undergo changes, where γ_{b} is the total number of heterozygous and missing loci in *G*^{(b)}, and ν is the mutation rate specified by the user. The ν is usually set to a small number; for example, we set ν=0.001 for all examples in this article. The changes are accepted or rejected according to the Metropolis-Hastings rule (Metropolis et al. ^{1953}; Hastings ^{1970}); that is, the new haplotypes are accepted with probability *min*(1,*r*_{m}), where

where *T*(·→·) denotes the transition probability between the current and new haplotypes. The transition proceeds as follows. If the pair (*h*_{b,ij,1},*h*_{b,ij,2}) is selected to undergo a change and if *g*_{ij}=2, then the values of *h*_{b,ij,1} and *h*_{b,ij,2} are simply swapped by setting *h*_{b,ij,1}=1-*h*_{b,ij,1} and *h*_{b,ij,2}=1-*h*_{b,ij,2}. If *g*_{ij}=9, each of the pairs (0,0), (0,1), (1,0), and (1,1) is equally likely to be assigned to (*h*_{b,ij,1},*h*_{b,ij,2}). Similarly, if *g*_{ij}=8 or 7, each of the possible haplotype pairs is equally likely to be assigned to (*h*_{b,ij,1},*h*_{b,ij,2}). The other selected haplotype pairs are mutated in the same way, but independently. It is easy to see that the transition is symmetric, in the sense that the probability of moving from the current haplotypes to the new ones equals the probability of the reverse move.
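The proposal step of the ν-mutation operator can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (`geno[i][j]` for genotype codes, `haps[i][j]` for the allele pair) and the function names are hypothetical; the count *max*{1,γ_{b}ν} is truncated to an integer; and only codes 2 and 9 are handled, because the meaning of codes 8 and 7 is not spelled out in this section. The acceptance ratio needs the target distribution, so it is left as an input.

```python
import random

ALL_PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def nu_mutation_proposal(geno, haps, nu=0.001, rng=random):
    """Return a mutated copy of haps; the input is left unchanged."""
    # Candidate sites: heterozygous (2) or fully missing (9) loci.
    sites = [(i, j) for i, row in enumerate(geno)
             for j, g in enumerate(row) if g in (2, 9)]
    new = [list(row) for row in haps]
    if not sites:
        return new
    k = max(1, int(len(sites) * nu))
    for i, j in rng.sample(sites, k):
        if geno[i][j] == 2:
            # Heterozygous locus: swap the two alleles.
            a, b = new[i][j]
            new[i][j] = (1 - a, 1 - b)
        else:
            # Missing locus: draw one of the four pairs uniformly.
            new[i][j] = rng.choice(ALL_PAIRS)
    return new

def metropolis_accept(ratio, rng=random):
    """Accept with probability min(1, ratio)."""
    return rng.random() < min(1.0, ratio)
```

Because the proposal is symmetric, the ratio passed to `metropolis_accept` would reduce to the ratio of target densities at the new and current haplotypes.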

#### Peer Learning Operator

The peer learning operator works as follows.

1. Randomly select one haplotype, say *h*_{b,u,v}, from the set {*h*_{b,1,1},*h*_{b,1,2};…;*h*_{b,n,1},*h*_{b,n,2}}.

2. Randomly select one haplotype, say *h*_{b,s,t}, from the set

with probability proportional to *w*_{b,s,t}, where *w*_{b,i,j}=*exp*{-*d*(*h*_{b,u,v},*h*_{b,i,j})/*t*_{sel}}, *d*(*h*_{b,u,v},*h*_{b,i,j}) is the number of different haplotypes at the first loci of *h*_{b,u,v} and *h*_{b,i,j}, and *t*_{sel} is the so-called “selection temperature.”

3. For each genotype *g*_{uj}, if *g*_{uj}=0 or 1, we keep *h*_{b,uj,v} unchanged; if *g*_{uj}=2, 9, 8, or 7 and *h*_{b,uj,v}=*h*_{b,sj,t}, we keep *h*_{b,uj,v} unchanged with probability *p*_{l} and change it to 1-*h*_{b,sj,t} with probability 1-*p*_{l}; if *g*_{uj}=2, 9, 8, or 7 and *h*_{b,uj,v}≠*h*_{b,sj,t}, we keep *h*_{b,uj,v} unchanged with probability 1-*p*_{l} and change *h*_{b,uj,v} to *h*_{b,sj,t} with probability *p*_{l}. We update the complementary haplotype of *h*_{b,u,v} accordingly, such that the pair is compatible with *g*_{u}.

4. According to the Metropolis-Hastings rule, accept the new haplotype pair with probability *min*{1,*r*_{l}}, where

Here, the transition probability equals

where α_{1} is the total number of the common haplotypes of and at the heterozygous and missing loci, α_{2} is the total number of the different haplotypes of and at the heterozygous and missing loci, and α_{3} counts the total number of times that the haplotype values in the complementary haplotype pair of *h*_{b,u,v} are randomly assigned. The *p*_{l} is a user-specified parameter. We set *p*_{l}=0.9 for all examples in this article. The transition probability can be computed similarly.

This operator allows haplotypes to coalesce quickly whenever coalescence is compatible with the genotypes.
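Steps 2 and 3 of the peer learning operator can be illustrated with the short sketch below. It rests on several assumptions: haplotypes are plain 0/1 lists, the distance *d* is taken as the Hamming distance over all loci supplied, the candidate peers are passed in explicitly, and the update of the complementary haplotype and the acceptance step are omitted. All names are hypothetical.

```python
import math
import random

def hamming(a, b):
    """Number of loci at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def peer_probabilities(target, peers, t_sel=1.0):
    """Selection probabilities proportional to exp(-d / t_sel)."""
    weights = [math.exp(-hamming(target, h) / t_sel) for h in peers]
    total = sum(weights)
    return [w / total for w in weights]

def learn_from_peer(target, peer, ambiguous, p_l=0.9, rng=random):
    """At each ambiguous locus, take the peer's allele with probability p_l
    and the opposite allele otherwise; other loci are left unchanged."""
    out = list(target)
    for j in ambiguous:
        out[j] = peer[j] if rng.random() < p_l else 1 - peer[j]
    return out

target = [0, 1, 1, 0]
peers = [[0, 1, 1, 0], [1, 0, 0, 1]]
probs = peer_probabilities(target, peers, t_sel=1.0)
chosen = random.Random(1).choices(peers, weights=probs, k=1)[0]
```

A lower selection temperature concentrates the probability on peers that are already close to the selected haplotype, which is what drives the coalescing behaviour described above.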

## Appendix B: Extrapolation

The extrapolation operator extrapolates to by attaching the haplotype pairs compatible with *G*^{(b+1)}. We call a haplotype “original” if it first appears in in some scanning order—for instance, the natural order (*h*_{b,1,1},*h*_{b,1,2};…; *h*_{b,n,1},*h*_{b,n,2}) used in this article, where (*h*_{b,i,1},*h*_{b,i,2}) is the haplotype pair of the *i*th genotype in ; otherwise, we call it “duplicate.” The extrapolation proceeds, in the prefixed scanning order, as follows. If a haplotype and its complementary pair are both “original,” it is extrapolated independently—that is, if *g*_{ij} is a heterozygous or missing allele, then (*h*_{b+1,ij,1},*h*_{b+1,ij,2}) is equally likely to be set to one of the possible haplotype pairs. If a haplotype is “duplicate,” then it will be extrapolated according to the corresponding original copy. Note that, in this case, the extrapolation for the corresponding original copy has been finished. For example, if *h*_{b,u,v} is a duplicate of *h*_{b,s,t}, and if *g*_{uj} is a heterozygous or missing allele, then *h*_{b+1,uj,v} will be set to the same value as *h*_{b+1,sj,t} with probability *p*_{e} and will be set to a value that differs from *h*_{b+1,sj,t} with probability 1-*p*_{e}. The complementary pair of *h*_{b+1,uj,v} will be set accordingly, such that the pair is compatible with *g*_{u}. We usually set *p*_{e} to a large value—say, 0.95—for all examples in this article. Obviously, the extrapolation operator will provide a good starting point for the simulation from the distribution .
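The distinction between “original” and “duplicate” haplotypes in a fixed scanning order can be illustrated as follows. This sketch covers only the labeling, not the extrapolation itself; the function name and the flat list layout are assumptions made for the example.

```python
def label_originals(haplotypes):
    """Scan haplotypes in order and label each as original or duplicate.

    Returns a list of (label, index_of_original) tuples, where the index
    points at the first occurrence of the same haplotype.
    """
    first_seen = {}
    labels = []
    for idx, hap in enumerate(haplotypes):
        key = tuple(hap)
        if key in first_seen:
            labels.append(("duplicate", first_seen[key]))
        else:
            first_seen[key] = idx
            labels.append(("original", idx))
    return labels

labels = label_originals([(0, 1), (1, 1), (0, 1)])
```

In the operator described above, a duplicate would then follow the extension already chosen for its original copy with probability *p*_{e}.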

## References

_{2}-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci USA 97:10483–10488. doi:10.1073/pnas.97.19.10483

**American Society of Human Genetics**
