Logo of geneticsGeneticsCurrent IssueInformation for AuthorsEditorial BoardSubscribeSubmit a Manuscript
Genetics. Sep 2004; 168(1): 503–511.
PMCID: PMC1448119

Sequencing Complex Diseases With HapMap


Determining the patterns of DNA sequence variation in the human genome is a useful first step toward identifying the genetic basis of a common disease. A haplotype map (HapMap), aimed at describing these variation patterns across the entire genome, has been recently developed by the International HapMap Consortium. In this article, we present a novel statistical model for directly characterizing specific sequence variants that are responsible for disease risk based on the haplotype structure provided by HapMap. Our model is developed in the maximum-likelihood context, implemented with the EM algorithm. We perform simulation studies to investigate the statistical properties of this disease-sequencing model. A worked example from a human obesity study with 155 patients was used to validate this model. In this example, we found that patients carrying a haplotype constituted by allele Gly16 at codon 16 and allele Gln27 at codon 27 genotyped within the β2AR candidate gene display significantly lower body mass index than patients carrying the other haplotypes. The implications and extensions of our model are discussed.

CURRENT studies of genetic mapping have been developed to a point at which the quantitative variation of almost any complex trait can be dissected into its underlying Mendelian factors on the basis of molecular linkage maps (Lander and Botstein 1989) or association studies (Wu et al. 2002; Lou et al. 2003). Such Mendelian factors, i.e., the so-called quantitative trait loci (QTL), are hypothesized regions with alleles triggering an effect on quantitative variation. A number of statistical methods have been developed to map QTL for various mapping population structures and different marker and character types, each aimed at precisely identifying QTL that are close enough to markers for their positional cloning using biotechnology approaches, such as chromosome walks (Wu and Casella 2005). These methods have led to the detection of QTL contributing to complex traits in a range of species from microorganisms to plant and animal species to humans (Mackay 2001).

The basic principle for QTL mapping is the cosegregation of the alleles at a QTL with those at one or a set of known polymorphic markers genotyped on a genome. If a QTL is cosegregating with molecular markers, the genetic effects of QTL and their genomic positions can be estimated from the markers. This approach is robust and powerful for the detection of major QTL and presents the most efficient way to utilize marker information when marker maps are sparse. However, this approach is limited in two aspects. First, because the markers and QTL bracketed by them are located at different genomic positions, the significant linkage of a QTL detected with given markers cannot give any information about the sequence structure and organization of QTL. Second, the inference of the QTL positions using the nearby markers is sensitive to marker informativeness, marker density, and mapping population type. As a result of these, only a few QTL detected from markers have been successfully cloned (Frary et al. 2000), despite a considerable number of QTL reported in the literature.

A more accurate and useful approach for the characterization of genetic variants contributing to quantitative variation is to directly analyze DNA sequences associated with a particular disease. If a string of DNA sequence is known to increase disease risk, this risk can be reduced by the alteration of this string of DNA sequence using a specialized drug. The control of this disease can be made more efficient if all possible DNA sequences determining its variation are identified in the entire genome.

With the recent development of the human genome project, massive amounts of DNA sequence data have been available across the human genome (International HapMap Consortium 2003). In particular, single-nucleotide polymorphisms (SNPs), being the most common type of variant in the DNA sequence, provide a powerful means for genotyping the whole genome. This facilitates the complete identification of specific sequence variants responsible for complex diseases. A linear arrangement of alleles (i.e., nucleotides) at different SNPs on a single chromosome, or part of a chromosome, is called a haplotype. The cosegregation of SNP alleles on haplotypes leads to nonrandom association, i.e., linkage disequilibrium (LD), between these alleles in the population. Empirical analyses of LD for SNPs have shown that nearby SNPs in the human genome tend to display highly significant levels of LD and are often distributed in block-like patterns, rather than displaying random- or even-spaced distribution as originally predicted (Patil et al. 2001; Dawson et al. 2002; Gabriel et al. 2002). SNPs within haplotype blocks are much more strongly associated with each other than are those between different blocks. Haplotype diversity within each block can be well explained by only a finite number of SNPs, called tag SNPs or representative SNPs. The existence of these tag SNPs means that it is not necessary to associate a disease with all SNPs in the DNA sequence to understand the complete genetic control of the disease.

In this article, we present a novel statistical model for determining specific DNA sequences that are associated with the phenotypic variation of disease risk. This model is derived on the basis of multilocus haplotype analysis using a finite number of tag SNPs. We derived a closed-form solution for estimating the effects of haplotypes, haplotype frequencies, allele frequencies, and the degrees of LD of various orders among tag SNPs underlying the disease. We performed simulation studies to test the statistical behavior of this haplotype-based sequence-mapping model. A worked example is used to validate our model, in which a DNA sequence variant is detected to significantly reduce human obesity.


Suppose there is a random sample of size n drawn from a natural human population at Hardy-Weinberg equilibrium. In this sample, a number of SNPs are genotyped genome-wide, aimed at the identification of DNA sequences responsible for a complex disease. Consider R (R > 1) tag SNPs for a haplotype block. Each of these R SNPs has two alleles denoted by Arkr (kr = 1, 2; r = 1, … R), with allele frequencies denoted by prkr for the rth SNP. We use superscripts and subscripts to distinguish between different SNPs and different alleles within SNPs, respectively. These SNPs form 2R possible haplotypes expressed as A1k1A2k2ARkR, whose frequencies are denoted by pk1k2kR. The haplotype frequencies are composed of allele frequencies at each SNP and linkage disequilibria of different orders among SNPs (Lou et al. 2003). The random combination of maternal and paternal haplotypes generates 2R−1(2R + 1) diplotypes expressed as A1k1A2k2ARkRA1l1A2l2ARlR (1 ≤ k1l1 ≤ 2, … , 1 ≤ kRlR ≤ 2). These R SNPs form 3R observable multilocus zygotic genotypes, generally expressed as A1k1A1l1/A2k2A2l2/…/ARkRARlR. When at most one SNP is heterozygous, the diplotype is consistent to its zygotic genotype. However, when two or more SNPs are heterozygous, the genotype will have different diplotypes and, therefore, the number of multilocus genotypes will be less than the number of diplotypes. Let Pk1k2 … kRl1l2 … lR and Pk1l1/k2l2/ …/kRlR denote the diplotype and genotype frequencies, respectively, and nk1l1/k2l2/ …/kRlR denote the observations of various genotypes.

Our interest is to search for the haplotype diversity that can explain phenotypic variation in a complex disease. The association between haplotype diversity and phenotypic variation has been detected in several studies of drug responses (Judson et al. 2000; Bader 2001). This allows us to assume that a particular haplotype is different from other haplotypes for a given disease. Because haplotypes (comprising diplotypes) cannot be directly observed, the effects of different haplotypes on the phenotype need be postulated from observed zygotic genotypes. The inference of diplotypes for a particular genotype is statistically a missing data problem that can be formulated by a finite mixture model (Wu and Casella 2005).

Likelihood function:

In this study, the complete data are diplotype configurations at a given set of SNPs for each genotype and for disease outcomes of subjects, whereas the observed data are the genotypes of these SNPs and the disease outcomes. The missing data are at the connection from the genotypes to diplotypes. For any given genotype, all possible diplotypes can be written out. For example, genotype A11A11/A21A21 has one diplotype A11A21A11A21, as does genotype A11A11/A21A22, the diplotype A11A21A11A22. This is because in these situations where at most one SNP is heterozygous, the genotype and diplotype are identical. However, for a double-heterozygous genotype A11A12/A21A22, we have two diplotypes, A11A21A12A22 and A11A22A12A21.

Table 1 lists all possible genotypes and diplotypes at two SNPs genotyped from a sample of size n. Each genotype (and therefore each diplotype) is composed of two haplotypes, one from the mother and the other from the father. Two haplotypes composing a diplotype come from four possible haplotypes, A11A21, A11A22, A12A21, and A12A22, whose frequencies are expressed as p11, p12, p21, and p22, respectively. The diplotype frequencies can be expressed in terms of the haplotype frequencies (Table 1). Two diplotypes A11A21A12A22 and A11A22A12A21 of a double heterozygote A11A12/A21A22 have frequencies p11p22 and p12p12, respectively. Thus, the relative frequencies of these two diplotypes for this double heterozygote are a function of haplotype frequencies, which can be expressed as p11p22/(p11p22 + p12p21) and p12p21/(p11p22 + p12p21), respectively.

Possible diplotype configurations of nine genotypes at two SNPs and their haplotype composition frequencies

Table 1 also gives the relative expected frequencies of haplotypes contained in a given genotype. All genotypes, except for the double heterozygote, contain one or two known haplotypes. For example, genotype A11A11/A21A21 has one haplotype A11A21, whereas A11A11/A21A22 has one half haplotype A11A21 and one half haplotype A11A22. The double heterozygote contains four possible haplotypes, with the relative frequencies p11p22/(p11p22 + p12p21) for haplotypes A11A21 and A12A22 and p12p21/(p11p22 + p12p21) for haplotypes A11A22 and A12A21.

The haplotype frequencies, arrayed by Ωp = (p11, p12, p21, p22), that belong to population genetic parameters can be estimated using the nine observed genotypes (G) for two SNPs (Table 1). The log-likelihood function of unknown haplotype frequencies given observed genotypes can be written as a multinomial form, i.e.,

equation M17

Assuming that diplotypes are associated with phenotypic variation in a disease, we formulate a likelihood for unknown population (Ωp) and quantitative genetic parameters (Ωq) given observed phenotypes (y) and SNP genotypes (G). Generally speaking, a given 2-SNP genotype, A1k1A1l1/A2k2A2l2, can be partitioned into two possible diplotypes, A1k1A2k2A1l1A2l2 and A1k1A2l2 A1l1A2k2. Thus, such a log-likelihood function can be formulated on the basis of a two-component mixture model, i.e.,

equation M18

where the mixture proportion,

equation M19

represents the relative frequency of subject i whose diplotype is A1k1A2k2A1l1A2l2, and

equation M20

are the probability density functions for subject i who has two possible diplotypes, respectively, with the genotypic means of μk1k2l1l2 for diplotype A1k1A2k2A1l1A2l2 and μk1l2l1k2 for diplotype A1k1A2l2A1l1A2k2 and the common residual variance of σ2. Note that subscript i is used to describe the genotypic means in the two density functions above because these means are diplotype- (and therefore subject-) dependent although there are only a total of 10 genotypic means for two SNPs. These means and variance are contained in vector equation M21.

Suppose there is a particular haplotype, A1k1A2k2, which is different from the other three haplotypes, A1k1A2k2, A1k1A2k2, and A1k1A2k2 (where a bar means the alternative of that allele), in its effect on a complex disease. The phenotypic means of the three genotypes that contain these two distinct groups of haplotypes are denoted as μ1 for A1k1A1k1/A2k2A2k2, μ2 for A1k1A1k1/A2k2A2k2, and μ3 for A1k1A1k1/A2k2A2k2. On the basis of quantitative genetic theory, these three μ's can be partitioned into the overall mean (μ), the additive effect (a) due to the substitution of different haplotypes, and the dominant effect (d) due to the interaction between different haplotypes, i.e., μ1 = μ + a, μ2 = μ + d, and μ3 = μ − a. In Table 1, the genotypic means are also given for different genotypes by assuming that haplotype A11A21 is different from the rest of the haplotypes.

The log-likelihood function described by Equation 2 can be expanded to include all possible SNP genotypes, which is now expressed as

equation M22

given that these SNP genotypes are independent. It can be seen from the above likelihood function that, although most zygote genotypes contain a single component (diplotype), the double heterozygote is the mixture of two possible diplotypes weighted by π and 1 − π. Thus, an advanced statistical method should be implemented to obtain the MLEs of the underlying population and quantitative genetic parameters.

An integrative EM algorithm:

We derived a closed-form solution for estimating the unknown parameters with the EM algorithm (Dempster et al. 1977). Haplotype frequencies can be expressed as a function of allelic frequencies and LD. For a two-SNP haplotype, we have

equation M23

where D is the linkage disequilibrium between the two SNPs. Thus, once haplotype frequencies are estimated, we can estimate allelic frequencies and LD by solving Equation 3. The estimates of haplotype frequencies are based on the log-likelihood function of Equation 1, whereas the estimates of diplotype genotypic means (and therefore the overall mean, haplotype additive, and dominant effects) and residual variance are based on the log-likelihood function of Equation 2. These two different types of parameters can be estimated using an integrative EM algorithm.

In the E step, the expected number (πi) and probabilities (Πi) of subject i of a double-heterozygous genotype who carries diplotype A11A21A12A22 are calculated by

equation M24
equation M25

Note that for all the other genotypes, such probabilities do not exist.

In the M step, the probabilities calculated in the previous iteration are used to estimate the haplotype frequencies using

equation M26
equation M27
equation M28
equation M29

Assuming that haplotype A11A11 has an effect different from the other three haplotypes, these probabilities are used to estimate the haplotype additive and dominant effects and the residual variance by

equation M30
equation M31
equation M32
equation M33

where [n with dot above] = n11/12 + n12/11 and [n with umlaut] = n11/22 + n12/22 + n21/21 + n21/22 + n22/22. Iterations including the E and M steps are repeated among Equations 514 until the estimates of the parameters converge to stable values. The sampling errors of these parameters can be estimated by calculating Louis's (1982) observed information matrix.

Hypothesis tests:

We can test two major hypotheses in the following sequence: (1) the association between two SNPs by testing their LD and (2) the difference of a given haplotype from the rest of the haplotypes in its effect on the disease outcome by testing the significance of haplotype additive and dominant effects. The LD between two given SNPs can be tested using two alternative hypotheses:

equation M34

The log-likelihood-ratio test statistic (LR) for the significance of LD is calculated by comparing the likelihood values under H1 (full model) and H0 (reduced model) using

equation M35

where the tilde and hat denote the maximum-likelihood estimates (MLEs) of unknown parameters under H0 and H1, respectively. The LR1 is considered to asymptotically follow a χ2 distribution with 1 d.f. The MLEs of allelic frequencies under H0 can be estimated using the EM algorithm described above, but with the constraint p11p22 = p12p21.

Diplotype or haplotype effects on a complex trait can be tested using two alternative hypotheses expressed as

equation M36

The log-likelihood-ratio test statistic (LR2) under these two hypotheses can be similarly calculated. The LR2 may asymptotically follow a χ2 distribution with 2 d.f. However, the approximation of a χ2 distribution may be inappropriate when some regularity conditions, such as normal and uncorrelated residuals, are violated. The permutation test approach proposed by Churchill and Doerge (1994), which does not rely upon the distribution of the LR2, may be used to determine the critical threshold for determining the existence of a QTL.

R-SNP sequence model:

The idea for sequencing a complex trait is described for a two-SNP model. It is possible that the two-SNP model is too simple to characterize genetic variants for quantitative variation. With the analytical line for the two-SNP sequencing model, we can readily extend our model to include an arbitrary number of SNPs whose sequences are associated with the phenotypic variation.

Consider R SNPs that form 3R observable multilocus zygotic genotytpes, generally expressed as A1k1A1l1/A2k2A2l2/… /ARkRARlR. These genotypes are collapsed from a total of 2R−1(2R + 1) diplotypes expressed as A1k1A2k2ARkR A1l1A2l2ARlR (1 ≤ k1l1 ≤ 2, … , 1 ≤ kRlR ≤ 1). A key issue for the multi-SNP sequencing model is how to distinguish among 2r−1 different diplotypes for the same genotype heterozygous at r loci. The relative frequencies of these diplotypes can be expressed in terms of haplotype frequencies. The integrative EM algorithm can be employed to estimate the MLEs of haplotype frequencies. Lou et al. (2003) provided a general formula for expressing haplotype frequencies in terms of allele frequencies and linkage disequilibria of different orders. The MLEs of the latter can be obtained by solving a system of equations.

In the multi-SNP sequencing model, we face many haplotypes and haplotype pairs. An Akaike information criterion- or Bayes information criterion-based model selection strategy (Burnham and Andersson 1998) has been framed to determine the haplotype that is most distinct from the rest of the haplotypes in explaining quantitative variation. However, in practice, a simultaneous analysis of too many SNPs will encounter considerable computational load and, also, may not be necessary for the explanation of disease variation. These two factors should be considered in a further study of the determination of a maximal number of SNPs for sequencing mapping.



We performed Monte Carlo simulation experiments to examine the statistical properties of the model proposed for association studies between variation in DNA sequence and a complex trait. Our simulation studies were designed to consider the effects of different heritability levels (H2 = 0.1 vs. 0.4), different sample sizes (n = 100 vs. 400), and different gene action modes (additive vs. dominant vs. overdominant) on the estimation precision of parameters from our model. The ratio of dominant (d) over additive effect (a) is used to determine the degree of dominant. A haplotype is regarded as additive, dominant, or overdominant if this ratio is zero, one, and greater than one, respectively (Lynch and Walsh 1998).

Suppose two SNPs are segregating in a natural population at Hardy-Weinberg equilibrium. The allele frequencies and linkage disequilibrium between these two SNPs are given in Table 2. We assume that one of the four possible haplotypes for the two SNPs is different from the rest of the haplotypes in their effects on the phenotype of a complex trait, which leads to three distinct groups of diplotypes. By assigning the genetic values for these two contrasting haplotypes under different gene action modes (Table 2), we calculate the genotypic values for all possible diplotypes and further simulate their phenotypic values on the basis of normal distribution. The residual variance is determined according to different heritability levels (0.1 vs. 0.4) expressed as the relative proportion of genetic variance to total observed variance. The genetic variance is determined on the basis of the genotypic values of all diplotypes and their frequencies.

The MLEs and their square root mean square errors (in parentheses) of population and quantitative genetic parameters of DNA sequences associated with the phenotypic variation estimated from 200 simulation replicates under different (heritability-sample ...

Table 2 describes the results of parameter estimation from our simulation using the proposed model. In general, all parameters (including population and quantitative genetic) can be reasonably estimated. As expected, the precision of population parameter estimation is a function of sample size and does not rely on the H2 value or the gene action mode. Large sample sizes always lead to more precise estimation of allelic frequencies and linkage disequilibrium (Table 2).

The estimation precision of quantitative parameters is dependent on heritability level, sample size, and gene action mode. In almost all situations, the additive effects can be estimated better than the dominant effect. The estimation of additive and dominant effects responds differently to genetic action mode, depending on the degree of dominance. For example, given the same heritability H2 = 0.1 and same sample size n = 100, the sampling error of the additive effect estimation is increased by 121% from additive to dominant modes, but is stable from dominant to overdominant modes. These two percentage values are 1 and 42% for the sampling errors of the dominant effect estimation. It appears that the additive effect estimation is more sensitive than the dominant effect estimation to the changes of heritability and sample size.

We analyze the power to detect a significant diplotype difference from our model under different combinations of H2 value, sample size, and gene action mode (Table 2). To do so, we first obtain empirical critical threshold values for declaring the significance of diplotype difference through simulations. The thresholds at the significance level α = 0.05 or 0.01 are estimated as the 95th and 99th percentile for the simulated test statistics calculated from simulated data involving no diplotype difference over 1000 LR2 values. It is clear that increased H2 levels and increased sample sizes can increase the power to detect diplotype differences. When H2 = 0.4 or n = 400, such power is 100% (Table 2).

An additional simulation study was performed to investigate the statistical behavior of the multi-SNP sequence model. The results from the 3-SNP sequence model are summarized in Table 3. First, all genetic parameters can be reasonably estimated from the 3-SNP sequence model, although the estimation precision of the dominance effect is reduced compared to that from the 2-SNP model. Second, the power to detect significant sequence variants and the estimation precision of different genetic parameters from the 3-SNP model can be improved with increased heritability levels, increased sample sizes, and decreased gene interactions.

The MLEs and their square root mean square errors (in parentheses) of population and quantitative genetic parameters of DNA sequences associated with the phenotypic variation estimated from 200 simulation replicates under different (heritability-sample ...

A worked example:

A real example from an obesity study is used to demonstrate the usefulness of our model. Numerous genes have been investigated as potential obesity-susceptibility genes (Mason et al. 1999; Chagnon et al. 2003). The β-2 adrenoceptor (BAR-2) is a major lipolytic receptor in human fat cells, whose function was found to be determined by known polymorphisms in codons 16 and 27 (Green et al. 1995). In a genetic study of obesity involving 149 women by Large et al. (1997), the Arg16Gly polymorphism at codon 16 was found to be associated with altered BAR-2 function with Gly16 carriers showing a fivefold increased agonist sensitivity. The Gln27Glu polymorphism at codon 27 is markedly associated with obesity traits. The homozygotes for Glu27 display an average fat mass excess of 20 kg and 50% larger fat cells than controls. Although the Arg16Gly polymorphism is not significantly associated with obesity and the Gln27Glu polymorphism is not significantly associated with the BAR-2 function, genetic variability in these two sites of the human BAR-2 gene could be of major importance for obesity, energy expenditure, and lipolytic BAR-2 function in adipose tissue, at least in women (Large et al. 1997).

To determine whether sequence variants at these two polymorphisms of BAR-2 are associated with obesity phenotypes, we investigated a group of 155 women of ages 32 to 86 years old with a large variation in body fat mass. Each of these patients was determined for her genotypes at codon 16 with two alleles, Arg16 (A) and Gly16 (G), and at codon 27 with two alleles, Gln27 (C) and Glu27 (G), within the BAR-2 gene and measured for body mass index (BMI). These two SNPs form four haplotypes designated as AC, AG, GC, and GG, which lead to nine genotypes, AA/CC, AA/CG, AA/GG, AG/CC, AG/CG, AG/GG, GG/CC, GG/CG, and GG/GG, and the 10 corresponding diplotypes, [AC][AC], [AC][AG], [AG][AG], [AC][GC], and [AC][GG] or [AG][GC], [AG][GG], [GC][GC], [GC][GG] and [GG][GG]. Our model is used to associate diplotype differences with variation in BMI. The MLEs of the haplotype frequencies, allele frequencies, and linkage disequilibrium between the two SNPs were obtained (Table 4). These two SNPs are highly associated with each other, whose LD is estimated as 0.1261. The allele frequencies are ~0.61 for allele G at codon 16 and 0.38 for allele G at codon 27, respectively, suggesting that both of them have fairly high heterozygosity.

The maximum-likelihood estimates (MLEs) and their standard errors (SEs) of population and quantitative genetic parameters for DNA sequences associated with phenotypic variation in BMI for 155 patients

Our model is based on the assumption that one haplotype is different from the rest of the haplotypes. In a practical situation, as in our example used here, the issue of how these haplotypes differ in their effects is not known. By assuming that any one haplotype is different from the rest of the haplotypes, we can find a best-difference pattern that is based on the estimates of the LR2's using Equation 16. When haplotype GG, GC, AG, or AC is assumed to differ in BMI from the other haplotypes, the LR2 values were estimated as 10.35, 3.11, 1.52, and 2.32, respectively. Obviously, haplotype GG is considered as a reference haplotype that has an effect on BMI in a comparison with the other haplotypes. The difference between haplotype GG and the rest of the haplotypes can significantly explain some variation in BMI. The resultant LR2 for testing the association between DNA sequence and BMI phenotype (10.35) is well beyond the critical threshold value at the significance level of 1% (6.07) estimated from 1000 simulation replicates. Haplotype GG displays large additive and dominant effects. Patients with homozygous genotype GG/GG have significantly lower BMI than those with homozygous genotype AA/CC, as indicated by a large additive effect (a = −1.77). For double-heterozygote AG/GC, diplotype [GG][AC] reduces BMI by 3.05 (the dominant effect) compared to diplotype [AG][GC]. The difference between haplotype GG and the rest of the haplotypes can explain 6.3% of the observed variation in BMI. We estimate the standard errors of the MLEs of the population and quantitative genetic parameters on the basis of Louis's (1982) observed information matrix, suggesting that all MLEs have reasonable estimation precision although the estimates of quantitative genetic parameters are not as precise as those of population genetic parameters due to a small sample size used (see the simulation result in Tables 2 and and33).


The elucidation of the entire human genome using SNPs will make it possible to develop a haplotype map of the human genome. Although such a HapMap has been recently available due to international collaborations (International HapMap Consortium 2003), there is a serious lack of powerful statistical tools for specific utility of this expensive HapMap to detect genes and genetic variations that affect health and disease. Such a statistical model has been developed and presented in this article. The idea behind our statistical modeling is that differences between SNP-constructed haplotypes may be associated with differing susceptibility to disease (International HapMap Consortium 2003).

SNPs reside at sites in the DNA sequence where individuals differ at a single DNA base. Sets of nearby SNPs on the same chromosome are inherited in blocks. Blocks may contain a large number of SNPs, but a few SNPs are enough to uniquely identify the haplotypes in a block (Gabriel et al. 2002). The HapMap is a map of these haplotype blocks constructed by tag SNPs, those that explain most of haplotype diversity. The HapMap should be valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to ~500,000 tag SNPs. This will make genome scan approaches to finding regions with genes that affect diseases much more efficient and comprehensive, since effort will not be wasted typing more SNPs than necessary and all regions of the genome can be included.

Our model is founded on the discovery of tag SNPs in the genome, thus allowing for a fast scan for the association between variation in DNA sequence and traits. This model has three advantages. First, it can materialize the genetic basis for quantitative variation by directly characterizing specific DNA sequences predisposing to a certain disease. The traditional statistical models for genetic mapping attempt to postulate the position of hypothesized QTL that are linked with known markers genotyped from the genome (Lander and Botstein 1989; Lou et al. 2003). The QTL detected from these models are regarded as “hypothesized” because it is not possible to know their DNA sequences and, therefore, physiological function. As opposed to the traditional “indirect” approach, our model presents a “direct” approach. At present, the utility of the direct approach is limited to sequencing the functional parts of candidate genes with known biochemical or physiological function. With the release of HapMap, our model will make the direct approach more useful and more efficient in searching for causal variants throughout the whole genome. Our model is also different from traditional association studies (e.g., Zaykin et al. 2002) whose aims are the significance test for the association between haplotypes and phenotypes (mostly discrete traits) rather than the precise estimation of haplotype effects. Second, our model is statistically simple and computationally fast. The most difficult part for the estimation from our model is to construct diplotype configurations for heterozygous genotypes at two or more SNPs. The estimation of population genetic parameters is based on a multinomial-likelihood function of the observed genotype data, whereas the estimation of quantitative genetic parameters is based on a mixture-based likelihood function including different diplotypes. These two likelihood functions can be easily integrated to a unified estimation framework implemented with the EM algorithm.

Finally, our model is flexible to different genetic and experimental settings. The results from a simulation study indicate that the association between DNA sequence and phenotype can be well detected when the trait has a modest heritability level (0.1) or a modest sample size (100) is used. Our model can also obtain fairly precise estimation of parameters when diplotypes display overdominance in the situation with modest heritability and sample size. The specific utility of our model to a real example from an obesity study leads to the successful detection of a DNA sequence (haplotype) at codons 16 and 27 genotyped within the β2AR candidate gene (Chagnon et al. 2003) for its significant impact on human obesity. This haplotype, composed of the Gly16 form of codon 16 and the Gln27 form of codon 27, tends to reduce BMI when it is combined with itself or any other haplotypes and accounts for ~6% of the total observed variation in BMI for 155 patient samples.

Although our simulation and example were based on 2- or 3-SNP analyses, our sequencing model has been developed to allow for the detection of sequence variants involving any number of SNPs within a haplotype block. In addition to its use in studying genetic associations with disease, our sequencing model can be extended to study the genetic factors contributing to variation in response to environmental factors, in susceptibility to infection, and in the effectiveness of and adverse responses to drugs and vaccines (Evans and McLeod 2003; Weinshilboum 2003). It can also be modified to estimate the effects of sequence-sequence interaction on a complex trait. It is possible that a haplotype within a candidate gene interacts with haplotypes from other candidate genes. The characterization of specific DNA sequence variants for diseases should allow the development of tests to predict which drugs or vaccines would be most effective in individuals with particular genotypes for genes affecting drug metabolism.


We thank two anonymous referees for their constructive comments on this manuscript. This work is supported by an Outstanding Young Investigator Award of the National Natural Science Foundation of China (30128017), a University of Florida Research Opportunity Fund (02050259), and a University of South Florida biodefense grant (7222061-12) to R.W. The publication of this manuscript was approved as journal series no. R-10063 by the Florida Agricultural Experiment Station.


  • Bader, J. S., 2001. The relative power of SNPs and haplotype as genetic markers for association tests. Pharmacogenomics 2 11–24. [PubMed]
  • Burnham, K. P., and D. R. Andersson, 1998 Model Selection and Inference. A Practical Information-Theoretic Approach. Springer, New York.
  • Chagnon, Y. C., T. Rankinen, E. E. Snyder, S. J. Weisnagel, L. Perusse et al., 2003. The human obesity gene map: the 2002 update. Obes. Res. 11 313–367. [PubMed]
  • Churchill, G. A., and R. W. Doerge, 1994. Empirical threshold values for quantitative trait mapping. Genetics 138 963–971. [PMC free article] [PubMed]
  • Dawson, E., G. R. Abecasis, S. Bumpstead, Y. Chen, S. Hunt et al., 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418 544–548. [PubMed]
  • Dempster, A. P., N. M. Laird and D. B. Rubin, 1977. Maximum likelihood from incomplete data via EM algorithm. J. R. Stat. Soc. Ser. B 39 1–38.
  • Evans, W. E., and H. L. McLeod, 2003. Pharmacogenomics drug disposition, drug targets, and side effects. N. Engl. J. Med. 348 538–549. [PubMed]
  • Frary, A., T. C. Nesbitt, S. Grandillo, E. Knaap, B. Cong et al., 2000. fw2.2: a quantitative trait locus key to the evolution of tomato fruit size. Science 289 85–88. [PubMed]
  • Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., 2002. The structure of haplotype blocks in the human genome. Science 296 2225–2229. [PubMed]
  • Green, S. A., J. Turki, I. P. Hall and S. B. Liggett, 1995. Implications of genetic variability of human 2-adrenergic receptor structure. Pulm. Pharmacol. 8 1–10. [PubMed]
  • International HapMap Consortium, 2003. The international HapMap project. Nature 426 789–794. [PubMed]
  • Judson, R., J. C. Stephens and A. Windemuth, 2000. The predictive power of haplotypes in clinical response. Pharmacogenomics 1 5–16. [PubMed]
  • Lander, E. S., and D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RELP linkage maps. Genetics 121 185–199. [PMC free article] [PubMed]
  • Large, V., L. Hellstrom, S. Reynisdottir, F. Lonnqvist, P. Eriksson et al., 1997. Human beta-2 adrenoceptor gene polymorphisms are highly frequent in obesity and associate with altered adipocyte beta-2 adrenoceptor function. J. Clin. Invest. 100 3005–3013. [PMC free article] [PubMed]
  • Lou, X.-Y., G. Casella, R. C. Littell, M. K. C. Yang and R. L. Wu, 2003. A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis in natural populations. Genetics 163 1533–1548. [PMC free article] [PubMed]
  • Louis, T. A., 1982. Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B 44 226–233.
  • Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA.
  • Mackay, T. F. C., 2001. The genetic architecture of quantitative traits. Annu. Rev. Genet. 35 303–339. [PubMed]
  • Mason, D. A., J. D. Moore, S. A. Green and S. B. Liggett, 1999. A gain-of-function polymorphism in a G-protein coupling domain of the human β1-adrenergic receptor. J. Biol. Chem. 274 12670–12674. [PubMed]
  • Patil, N., A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi et al., 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294 1719–1723. [PubMed]
  • Weinshilboum, R., 2003. Inheritance and drug response. N. Engl. J. Med. 348 529–537. [PubMed]
  • Wu, R. L., and G. Casella, 2005 Statistical Genomics: A Quantitative Trait Loci Perspective. Springer, New York.
  • Wu, R. L., C.-X. Ma and G. Casella, 2002. Joint linkage and linkage disequilibrium mapping of quantitative trait loci in natural populations. Genetics 160 779–792. [PMC free article] [PubMed]
  • Zaykin, D. V., P. H. Westfall, S. S. Young, M. A. Karnoub, M. J. Wagner et al., 2002. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum. Hered. 53 79–91. [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America
PubReader format: click here to try


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


  • MedGen
    Related information in MedGen
  • PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...