
# Accounting for Unmeasured Population Substructure in Case-Control Studies of Genetic Association Using a Novel Latent-Class Model

^{1}Centers for Disease Control and Prevention and

^{2}Department of Epidemiology, Emory University, Atlanta

## Abstract

We propose a novel latent-class approach to detect and account for population stratification in a case-control study of association between a candidate gene and a disease. In our approach, population substructure is detected and accounted for using data on additional loci that are in linkage equilibrium within subpopulations but have alleles that vary in frequency between subpopulations. We have tested our approach using simulated data based on allele frequencies in 12 short tandem repeat (STR) loci in four populations in Argentina.

## Introduction

Although the case-control study is one of the primary tools of epidemiology, it has fallen out of favor in studies of the association of a candidate gene with occurrence of disease, because of the possible effect of population stratification (Li ^{1972}; Lander and Schork ^{1994}; Ewens and Spielman ^{1995}). Population stratification occurs when the population under study is assumed to be homogeneous with respect to allele frequencies but in fact comprises subpopulations that have different allele frequencies for the candidate gene. If these subpopulations also have different risks of disease, then subpopulation membership is a confounder (Kleinbaum et al. ^{1982}), and an association between the candidate gene and disease may be incorrectly estimated without properly accounting for population structure.

Unfortunately, the relevant population structure may not be known. Epidemiologic studies may measure crude indicators of subpopulation membership such as race, but the relevant subpopulations may, in fact, be more finely stratified. As a result, genetic epidemiologists have developed methods based on case-parent triads and using the transmission/disequilibrium test (TDT) to measure the association between a candidate gene and disease status (Self et al. ^{1991}; Spielman et al. ^{1993}). However, these approaches require genotyping both of case patients and of their parents (resulting in both an increase in required sequencing and the requirement that at least one parent is available). Worse, some case-parent triads are not informative. Although alternative approaches exist using other relatives (Spielman and Ewens ^{1998}) or a single parent (Sun et al. ^{1999}), all such approaches require some additional ascertainment of relatives and some additional genotyping. Finally, it should be recognized that effects of population stratification may be reintroduced into TDT-related methods that allow for missing parental data. In particular, the assumption that the distribution of genotypes of the sampled parents can be used to make inferences about the missing parents is analogous to the assumption that gene frequencies among case patients can be compared with those among control patients.

Recently, however, several factors have led to a resurgence of interest in case-control studies of gene-disease association (Risch and Merikangas ^{1996}; Morton and Collins ^{1998}; Risch and Teng ^{1998}). Researchers have begun collecting specimens for genetic analysis in large epidemiologic studies and surveys (National Center for Health Statistics ^{1994}; Surguchov et al. ^{1996}; Daly et al. ^{2000}), and these specimens can be used to study a variety of gene-disease and gene-environment associations. Many case-control studies can be conducted using the same stored specimens, without requiring genotypes of relatives of case subjects. Although Wacholder et al. (^{2000}) argue that population stratification of an extent large enough to distort results is unlikely to occur in many realistic situations, it is still important to develop methods that allow for control of population stratification when analyzing case-control studies.

Fortunately, if population substructure affects allele frequencies of the candidate gene, then it should also affect allele frequencies of other genes as well (Devlin and Roeder ^{1999}; Pritchard and Rosenberg ^{1999}). Markers—that is, genes that are markers of population substructure and that (1) segregate independently both from each other and from the candidate gene and (2) are not themselves associated with disease or in linkage disequilibrium with genes associated with disease—can be used to make inferences about the existence of population substructure in a sample (Pritchard and Rosenberg ^{1999}) and even to reconstruct the underlying population substructure in an observed sample (Pritchard et al. ^{2000a}). Additionally, binary markers (e.g., single-nucleotide polymorphisms) can be used to control for differences in relatedness between cases and controls that occur when population substructure confounds the relation between disease and a candidate gene (Devlin and Roeder ^{1999}; Bacanu ^{2000}; Devlin, ^{in press}).

In this study, we use a novel latent-class analysis to use data on markers to make inferences about the association between a candidate gene and the occurrence of disease in a population that may be subject to population stratification. Latent-class methods have been used extensively in sociology to analyze questionnaire data by using correlations in responses to related questions to make inferences about subgroups of people with common attitudes or beliefs (see, e.g., Henry ^{1983}). Inferences concerning population substructure in a single sample, using correlations in genotypes at loci that are unrelated to disease, can also be accomplished using latent-class analysis. However, a case-control study comprises two separate samples (one of case subjects and the other of control subjects); if different subpopulations have different disease risks, we can expect the proportions of case patients from each subpopulation (class probabilities) to differ from the corresponding proportions of control subjects. Two separate latent-class analyses, one using data from case subjects and the other using data from control subjects, can lead to logical inconsistency, because different population substructure might be inferred in each population. If this occurs, data from case subjects and control subjects could not be recombined to calculate the odds ratio for the association between the candidate gene and disease. The approach we take here properly accounts for the differences between the sample of case subjects and the sample of control subjects, while assuming that case subjects and control subjects derive from the same target population.

## Model

The quantities of primary interest are those that relate disease (denoted by the binary variable *D*) to a (possibly vector-valued) genetic risk factor *G*. This relation may be confounded by the existence of population stratification. Unfortunately, we may not know which subpopulations have the differential rates of disease or prevalence of the candidate gene *G* that, if not properly accounted for, will result in improper inference about the relation between *D* and *G.* In addition, separate sampling of cases and controls must be properly accounted for in any analysis.

As a heuristic approximation of the complex genetic history that may have led to the current population substructure, we assume that the overall population comprises *K* subpopulations, each having different frequencies of *G* and *D.* In the development below, we suppress an index *i* corresponding to the *i*th individual. We denote by *Z* the (unmeasured) covariable that indicates the subpopulation to which an individual belongs. Because different subpopulations may have different frequencies of other mutually independent marker genes that are unrelated to disease, we propose to use a novel latent-class approach to infer the population substructure while simultaneously estimating parameters relating *G* to *D.* Let *X*^{c}_{ℓ} denote the allele at marker ℓ on chromosome *c*=1, 2 (numbering of chromosomes is arbitrary) and let *X*=(*X*^{1}_{1}, *X*^{2}_{1}, …, *X*^{2}_{L}), where *L* is the number of marker loci. In the analysis that follows, we assume that Hardy-Weinberg equilibrium holds in each subpopulation. Relaxing this assumption by considering *X*_{ℓ} to represent genotype data is possible; however, human populations rarely show much divergence from Hardy-Weinberg equilibrium once population substructure has been accounted for (Committee on DNA Forensic Science ^{1996}, pp. 104 and references cited therein).

We assume that the genes at the marker loci are unrelated to disease, that is,

$$\Pr[D \mid G, X, Z] = \Pr[D \mid G, Z]. \tag{1}$$

We further assume that, for persons in the same subpopulation, the marker loci are in linkage equilibrium with the candidate gene *G,* so that

$$\Pr[G \mid X, Z] = \Pr[G \mid Z]. \tag{2}$$

This assumption can be met, for example, by choosing marker loci on different chromosomes from the chromosome where *G* is found. Finally, we assume that *Z* is a confounder but not an effect modifier; that is, that

$$\operatorname{logit} \Pr[D=1 \mid G, Z=k] = \mu + \delta_k + \beta \cdot G, \tag{3}$$

where we take δ_{1}=0 for identifiability. In a case-control study, we cannot usually expect to estimate μ, although we will see that the δ_{k}s are, in fact, estimable and that there is even some information on μ. An immediate consequence of equations (1) and (2) is that *Pr*[*X*|*G*,*Z*,*D*]=*Pr*[*X*|*Z*]. We assume Hardy-Weinberg equilibrium holds within each stratum, so that

$$\Pr[X \mid Z=k] = \prod_{\ell=1}^{L} \prod_{c=1}^{2} p_{k \ell X^{c}_{\ell}}, \tag{4}$$

where *p*_{kℓj}=*Pr*[*X*^{c}_{ℓ}=*j*|*Z*=*k*] is the frequency of allele *j* at marker locus ℓ in subpopulation *k*.

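Because case subjects and control subjects can be considered as representative samples from the segments of the population with and without disease, we base our inference on *Pr*[*X*,*G*|*D*]. To account for population stratification, we write

$$\Pr[X, G \mid D] = \sum_{k=1}^{K} \Pr[X \mid Z=k] \, \Pr[G \mid D, Z=k] \, \Pr[Z=k \mid D].$$

Assume that *G* takes *M*+1 values *g*_{0}≡0, …, *g*_{M}; let π^{d}_{k}=*Pr*[*Z*=*k*|*D*=*d*] be the proportions of persons in each subpopulation by disease status; let γ_{km}=*log*{*Pr*[*G*=*g*_{m}|*D*=0,*Z*=*k*]/*Pr*[*G*=*g*_{0}|*D*=0,*Z*=*k*]}; and let γ_{k}=(γ_{k1}, …, γ_{kM}). After some algebra, we find that

$$\Pr[X, G=g_m \mid D=d] = \sum_{k=1}^{K} \pi^{d}_{k} \left\{ \prod_{\ell=1}^{L} \prod_{c=1}^{2} p_{k \ell X^{c}_{\ell}} \right\} \frac{\exp(\gamma_{km} + d \, \beta \cdot g_m)}{\sum_{m'=0}^{M} \exp(\gamma_{km'} + d \, \beta \cdot g_{m'})}, \tag{5}$$

where γ_{k0}=0. Likelihood (5) is for a single individual; the likelihood for all individuals in the study is the product of terms such as (5) for each participant.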
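The single-individual likelihood can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the genotype distribution given disease status and subpopulation has a multinomial-logit form in γ and β, with genotype level 0 as the reference, and all function names, data layouts, and toy parameter values are hypothetical.

```python
import math


def marker_prob(x, p_k):
    """Pr[X | Z=k]: product of allele frequencies over loci and both chromosomes.

    x   -- list of (allele on chromosome 1, allele on chromosome 2), one pair per locus
    p_k -- list of dicts; p_k[l][j] is the frequency of allele j at locus l
    """
    prob = 1.0
    for (a1, a2), freqs in zip(x, p_k):
        prob *= freqs[a1] * freqs[a2]
    return prob


def genotype_prob(m, d, gamma_k, beta):
    """Pr[G=g_m | D=d, Z=k] under the assumed multinomial-logit form.

    gamma_k[m-1] and beta[m-1] refer to level m = 1..M; level 0 is the reference.
    """
    scores = [0.0] + [g + d * b for g, b in zip(gamma_k, beta)]
    denom = sum(math.exp(s) for s in scores)
    return math.exp(scores[m]) / denom


def individual_likelihood(x, m, d, pi, p, gamma, beta):
    """Pr[X, G=g_m | D=d], summing over the K latent subpopulations."""
    return sum(
        pi[d][k] * marker_prob(x, p[k]) * genotype_prob(m, d, gamma[k], beta)
        for k in range(len(p))
    )


# Toy parameters (hypothetical): K=2 classes, one biallelic marker, binary G.
pi = [[0.7, 0.3], [0.3, 0.7]]  # pi[d][k]
p = [[{"a": 0.9, "b": 0.1}], [{"a": 0.2, "b": 0.8}]]  # p[k][l][j]
gamma = [[-1.0], [0.5]]
beta = [1.0]

# Likelihood contribution of one case subject with marker alleles (a, b) and G=g_1.
lik = individual_likelihood([("a", "b")], 1, 1, pi, p, gamma, beta)
```

The likelihood for the whole sample would be the product of such terms over participants; in practice one would sum their logarithms to avoid numerical underflow when many loci are used.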

We may choose β, π^{0}_{k}, and π^{1}_{k} as separate parameters to be maximized; it is possible to show that choosing π^{0}_{k} and π^{1}_{k} as independent parameters is equivalent to a model in which we choose π^{0}_{k} and δ_{k} as parameters. The situation is more complicated with parameters γ_{km}. For example, if *G* has *r* alleles, then there are *r*(*r*+1)/2-1 values of γ_{km} for each *k.* However, if Hardy-Weinberg equilibrium holds in each subpopulation, then only *r*-1 parameters are required to specify all the γ_{km}s for a given *k.* Unfortunately, even if Hardy-Weinberg equilibrium holds in each subpopulation, it will not hold among control subjects if the candidate gene is, in fact, associated with disease (Sasieni ^{1997}). This is because the distribution of *G* among control subjects is given by

$$\Pr[G=g_j \mid D=0, Z=k] = \frac{\Pr[G=g_j \mid Z=k] \, / \, \{1+\exp(\mu+\delta_k+\beta \cdot g_j)\}}{\sum_{j'=0}^{M} \Pr[G=g_{j'} \mid Z=k] \, / \, \{1+\exp(\mu+\delta_k+\beta \cdot g_{j'})\}}.$$

Hence, the overall magnitude of the departures from Hardy-Weinberg equilibrium among control subjects is primarily determined by μ, as defined in equation (3). If we assume a rare disease (corresponding to μ being large and negative), then *Pr*[*G*=*g*_{j}|*D*=0,*Z*=*k*]≈*Pr*[*G*=*g*_{j}|*Z*=*k*], and we can maximize (5) directly with respect to parameters β, π^{0}_{k}, π^{1}_{k} and parameters in the model for *Pr*[*G*=*g*_{j}|*Z*=*k*]. Even if the disease is rare, the distribution of *G* among case subjects does not correspond to Hardy-Weinberg equilibrium unless β=0.
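The effect of μ on the departure from Hardy-Weinberg equilibrium among control subjects can be checked numerically. The sketch below works within a single subpopulation (so δ_{k} is absorbed into `mu`), uses a biallelic candidate gene, and takes the allele frequency and log odds ratios as hypothetical inputs; the function names are illustrative only.

```python
import math


def hw_genotypes(q):
    """Hardy-Weinberg frequencies for 0, 1, 2 copies of an allele with frequency q."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def control_genotypes(q, mu, beta):
    """Genotype distribution among controls in one subpopulation.

    Pr[G=g | D=0] is proportional to Pr[G=g] * Pr[D=0 | G=g], where the log odds
    of disease is mu for zero copies, mu + beta[0] for one, mu + beta[1] for two.
    """
    log_odds = [mu, mu + beta[0], mu + beta[1]]
    weights = [pg / (1 + math.exp(lo)) for pg, lo in zip(hw_genotypes(q), log_odds)]
    total = sum(weights)
    return [w / total for w in weights]


def max_departure(q, mu, beta):
    """Largest absolute gap between control and Hardy-Weinberg genotype frequencies."""
    controls = control_genotypes(q, mu, beta)
    return max(abs(c - h) for c, h in zip(controls, hw_genotypes(q)))


# The gap shrinks as mu becomes more negative, i.e., as the disease becomes rarer.
for mu in (-1.0, -3.0, -5.0):
    print(mu, max_departure(0.3, mu, (0.0, 1.0)))
```

When both log odds ratios are zero, the control distribution coincides with the Hardy-Weinberg proportions for any value of `mu`, which matches the statement that departures arise only when the candidate gene is associated with disease.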

In the absence of an approximation of rare disease, we can still proceed without difficulties, as long as *G* is binary (i.e., if certain genotypes correspond to low risk and others to high risk). In this case, there is a single γ_{k} for each *k*, which may be treated as an independent parameter in place of *Pr*[*G*=1|*Z*=*k*]. We feel that it is unlikely that a reasonable estimate of μ can be obtained using case-control data alone, and, hence, either the approximation of rare disease should be made or several analyses using various binary genotypes *G* should be undertaken.

Although the likelihood (5) can be evaluated directly, the large number of parameters suggests use of the E-M algorithm. In this approach, the subpopulation to which each individual belongs is treated as missing data. This is easily accomplished, because all calculations in the E step can be carried out in closed form and the values of π^{d}_{k} and *p*_{kℓj} can be estimated in closed form. To estimate the parameters β and γ_{k}, a simple maximization must be carried out, corresponding to fitting the model

$$\Pr[G=g_m \mid D=d, Z=k] = \frac{\exp(\gamma_{km} + d \, \beta \cdot g_m)}{\sum_{m'=0}^{M} \exp(\gamma_{km'} + d \, \beta \cdot g_{m'})} \tag{6}$$

to *K* tables of dimension 2×(*M*+1), using maximum likelihood. In this calculation, the “data” are the expected proportion of persons having *D*=*d*, *Z*=*k*, and *G*=*g*, available from the previous E step. If *M*=1 (i.e., if *G* is binary), then the calculation reduces to a logistic regression analysis in which *G* is considered the outcome and *D* and *Z* are explanatory variables. If *M*>1, then the approximation of rare disease should be made and an appropriate model for γ_{km} should be chosen to reflect Hardy-Weinberg equilibrium among the controls. For example, if *M*=2 and outcomes *G*=*g*_{0}, *g*_{1}, and *g*_{2} correspond to persons having zero, one, or two copies of a disease-causing allele, then we take γ_{k}=(*ln*2+α_{k}, 2α_{k}), where α_{k} is the log of the odds that a randomly chosen chromosome in the *k*th subpopulation carries the disease-causing allele.
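The closed-form parts of one E-M iteration can be sketched as follows. This is an illustrative outline under stated assumptions rather than the procedure given in the Appendix: it takes as input the per-class summands of each subject's mixture likelihood, computes posterior class weights, and then updates the class proportions and marker allele frequencies. The numerical maximization over β and γ_{k} is omitted, and all names and toy values are hypothetical.

```python
def e_step(class_terms):
    """Posterior class weights w[i][k] = Pr[Z=k | data for subject i].

    class_terms[i][k] is the k-th summand of subject i's mixture likelihood:
    pi[d_i][k] * Pr[X_i | Z=k] * Pr[G_i | D=d_i, Z=k].
    """
    weights = []
    for row in class_terms:
        total = sum(row)
        weights.append([t / total for t in row])
    return weights


def update_pi(weights, status):
    """pi[d][k]: mean posterior weight of class k among subjects with D=d."""
    K = len(weights[0])
    pi = []
    for d in (0, 1):
        rows = [w for w, s in zip(weights, status) if s == d]
        pi.append([sum(r[k] for r in rows) / len(rows) for k in range(K)])
    return pi


def update_allele_freqs(weights, markers, alleles):
    """p[k][l][j]: weighted allele count divided by weighted number of chromosomes.

    markers[i][l] is the pair of alleles carried by subject i at locus l;
    alleles[l] lists the possible alleles at locus l.
    """
    K = len(weights[0])
    p = []
    for k in range(K):
        chromosomes = 2 * sum(w[k] for w in weights)
        p_k = []
        for l, names in enumerate(alleles):
            counts = {j: 0.0 for j in names}
            for w, x in zip(weights, markers):
                for allele in x[l]:
                    counts[allele] += w[k]
            p_k.append({j: c / chromosomes for j, c in counts.items()})
        p.append(p_k)
    return p


# Toy inputs (hypothetical): four subjects, K=2 classes, one biallelic marker.
class_terms = [[0.08, 0.02], [0.01, 0.03], [0.05, 0.05], [0.002, 0.018]]
status = [0, 0, 1, 1]  # D for each subject
markers = [[("a", "a")], [("a", "b")], [("b", "b")], [("a", "b")]]

w = e_step(class_terms)
pi = update_pi(w, status)
p = update_allele_freqs(w, markers, [["a", "b"]])
```

Each subject contributes fractionally to every class, so the updated proportions and allele frequencies are weighted averages rather than counts within hard-assigned groups.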

Likelihood (5) can be maximized using the E-M algorithm for a fixed number of subpopulations *K.* To estimate the number of subpopulations, we propose to select the value of *K* that minimizes the Akaike information criterion (AIC), which is given by −2 log *L* + 2*P*, where *L* is the maximized likelihood and *P* is the number of parameters fit. If *P*_{G} is the number of parameters required to specify γ_{k} for a single stratum and *P*_{β} is the number of free parameters in β, then *P* = *K* × (*P*_{G} + total no. of marker alleles − no. of marker loci) + 2 × (*K* − 1) + *P*_{β}. To estimate *K,* we start with a single population (*K*=1) and increase *K* by 1 until the AIC begins to increase. This procedure assumes that the first minimum in the AIC corresponds to the global minimum. In some small-scale simulations, this appears to be the case (results not shown). Moreover, when the number of subpopulations *K* is greater than or equal to the number used to generate the data, the values of β appear to change very little (results not shown). Additional details on the E-M algorithm used are found in the Appendix.
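The parameter count and the stepwise search over *K* can be sketched as below. The `fit` argument is a hypothetical callable that returns the maximized log likelihood for a given *K* (for example, from an E-M run); the log-likelihood values in the usage example are invented purely to exercise the stopping rule.

```python
def n_parameters(K, P_G, P_beta, alleles_per_locus):
    """P = K * (P_G + total marker alleles - number of loci) + 2 * (K - 1) + P_beta."""
    free_marker = sum(alleles_per_locus) - len(alleles_per_locus)
    return K * (P_G + free_marker) + 2 * (K - 1) + P_beta


def aic(log_lik, n_params):
    """Akaike information criterion: -2 log L + 2 P."""
    return -2.0 * log_lik + 2.0 * n_params


def select_K(fit, P_G, P_beta, alleles_per_locus, max_K=10):
    """Increase K from 1 and stop as soon as the AIC no longer decreases.

    fit(K) is a hypothetical callable returning the maximized log likelihood.
    """
    best_K = 1
    best_aic = aic(fit(1), n_parameters(1, P_G, P_beta, alleles_per_locus))
    for K in range(2, max_K + 1):
        current = aic(fit(K), n_parameters(K, P_G, P_beta, alleles_per_locus))
        if current >= best_aic:
            break
        best_K, best_aic = K, current
    return best_K, best_aic


# Invented log likelihoods for two marker loci with 3 and 4 alleles.
log_liks = {1: -500.0, 2: -450.0, 3: -445.0, 4: -444.0}
chosen_K, chosen_aic = select_K(log_liks.get, P_G=1, P_beta=2, alleles_per_locus=[3, 4])
```

Because the search stops at the first increase, it returns the global minimum only if the AIC has no later, lower minimum, which is the assumption stated above.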

Because of the large number of parameters fit, we recommend that variance estimates be calculated using a parametric bootstrap procedure (Efron and Tibshirani ^{1998}), conditional on the total numbers of case subjects and control subjects. In this procedure, simulated data sets are constructed using the parameter estimates obtained from fitting the latent-class model. Specifically, for each observation, data on subpopulation are generated conditional on case or control status using the estimated values of π^{1}_{k}, for case subjects, or of π^{0}_{k}, for control subjects. Then, data on the candidate gene are simulated using (6) and the estimated values of β and the appropriate γ_{k}. Finally, marker values are simulated using the estimated values of *p*_{kℓj}. A total of *T* such data sets are generated, and the latent-class model is refit to each to obtain *T* bootstrap estimates of β. The variance of the estimator β̂ can then be estimated by the empirical variance of the bootstrap estimates, and confidence intervals can be calculated using the percentiles of the bootstrap estimates (Efron and Tibshirani ^{1998}).
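The data-generation order used in the bootstrap (subpopulation, then candidate genotype, then markers) and the percentile interval can be sketched as follows. This is a hedged illustration: the genotype model is assumed to be the multinomial-logit form in γ and β, the percentile rule rounds outward and is only one of several conventions, and all names and data layouts are hypothetical. Refitting the model to each simulated data set is not shown.

```python
import math
import random


def simulate_subject(d, pi, gamma, beta, p, rng):
    """Simulate one subject with disease status d from fitted parameter values."""
    # 1. Subpopulation, conditional on case or control status.
    k = rng.choices(range(len(pi[d])), weights=pi[d])[0]
    # 2. Candidate genotype level m, conditional on d and k (level 0 is the reference).
    scores = [0.0] + [g + d * b for g, b in zip(gamma[k], beta)]
    m = rng.choices(range(len(scores)), weights=[math.exp(s) for s in scores])[0]
    # 3. Two independent alleles per marker locus, from class-k allele frequencies.
    x = []
    for freqs in p[k]:
        names = list(freqs)
        a1, a2 = rng.choices(names, weights=[freqs[j] for j in names], k=2)
        x.append((a1, a2))
    return k, m, x


def percentile_interval(estimates, level=0.95):
    """Percentile interval from bootstrap estimates, rounding indices outward."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1.0 - tail) * last)]
    return lower, upper


# Toy fitted values (hypothetical): K=2 classes, binary G, one biallelic marker.
rng = random.Random(2001)
pi_hat = [[0.7, 0.3], [0.3, 0.7]]
gamma_hat = [[-1.0], [0.5]]
beta_hat = [1.0]
p_hat = [[{"a": 0.9, "b": 0.1}], [{"a": 0.2, "b": 0.8}]]
cases = [simulate_subject(1, pi_hat, gamma_hat, beta_hat, p_hat, rng) for _ in range(5)]
```

A full replicate would simulate the observed numbers of case and control subjects, discard the simulated subpopulation labels, and refit the latent-class model to obtain one bootstrap estimate of β.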

### Example 1: Discrete Subpopulations

A classic example of population substructure affecting a case-control study occurred in a population that was an admixture of European and Pima ancestry (Knowler et al. ^{1988}). In this study, an apparent association between a candidate gene and non-insulin-dependent (type 2) diabetes actually resulted from confounding caused by population substructure. To illustrate our approach, we considered an analogous scenario based on an admixture of Europeans and American Indians. Sala et al. (^{1998}, ^{1999}) have published allele frequency data on 12 short tandem repeat (STR) loci in Argentineans of European ancestry, as well as in three Argentinean American Indian groups (Mapuche, Tehuelche, and Wichi). We have used these allele frequencies to simulate a population that comprises four subpopulations that differ in disease risk and frequency of a candidate-gene allele that is associated with disease.

Because Sala et al. (^{1998}, ^{1999}) sampled ~10 times more persons of European ancestry than persons of any of the other three ethnic groups, we combined some STR alleles to reduce the number of alleles having zero frequency in one or more American Indian populations. As a general rule, we combined adjacent alleles until the allele frequency in at least one population was at least 5%. The resulting allele frequencies are shown in table 1. An exception was HPRTB, where allele frequencies of zero were allowed for small numbers of repeats in the American Indian groups, since there appears to be a consistent increase in number of repeats in the non-European groups. Occurrence of alleles in one population that are missing in another makes identification of population substructure easier; hence, our decision to combine alleles actually makes it more difficult to identify subpopulations. All STR loci but HPRTB are autosomal; to avoid having to simulate sex, we used the HPRTB allele frequencies to generate data as if HPRTB were an autosomal locus.

We generated 500 data sets using the allele frequencies in table 1, assuming that Argentinean Europeans constituted 70% of a hypothetical target population and that each American Indian group constituted 10%. In addition, data on a biallelic candidate gene was generated, which was assumed to be in Hardy-Weinberg equilibrium in each subpopulation. Persons who were homozygous for the disease-causing allele had an increased risk of disease corresponding to a log-odds ratio of 1.0 (relative risk =2.72). Persons who were heterozygous for the disease-causing allele had no increase in risk. The prevalence of the disease-causing allele was chosen to be 0.277, 0.341, 0.446, and 0.557 in the European, Mapuche, Tehuelche, and Wichi populations, respectively (the frequencies of allele 3 of locus D6S366). The log of the odds of disease among persons with zero or one copies of the disease-causing allele was −5, −4, −3, and −3 in the European, Mapuche, Tehuelche, and Wichi populations, respectively. These values correspond to a prevalence of disease among persons without the disease-causing allele of 0.7%, 1.8%, 4.7%, and 4.7%, respectively. Data were generated until 125 case patients and 125 control patients were obtained. Because the disease is rare, the distribution of ethnic groups among control patients was approximately that of the target population (70.5%, 10.1%, 9.6%, and 9.8% in the 500 simulated data sets). However, the distribution of ethnic groups in the case patients was noticeably different, with 26.1% European, 10.7% Mapuche, 29.8% Tehuelche, and 33.4% Wichi.
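The prevalences and the expected distribution of subpopulations among case subjects follow directly from the simulation settings and can be checked with a short calculation. The sketch below uses only the values stated above; the variable and function names are illustrative, and the result is the expected distribution, which the simulated averages should approximate.

```python
import math


def expit(x):
    """Inverse logit: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Simulation settings taken from the text.
shares = {"European": 0.70, "Mapuche": 0.10, "Tehuelche": 0.10, "Wichi": 0.10}
allele_freq = {"European": 0.277, "Mapuche": 0.341, "Tehuelche": 0.446, "Wichi": 0.557}
baseline_log_odds = {"European": -5.0, "Mapuche": -4.0, "Tehuelche": -3.0, "Wichi": -3.0}
beta_2 = 1.0  # log odds ratio for homozygotes; heterozygotes have no increase in risk


def subpopulation_risk(name):
    """Overall disease risk in one subpopulation under Hardy-Weinberg equilibrium."""
    homozygous = allele_freq[name] ** 2
    base = baseline_log_odds[name]
    return (1 - homozygous) * expit(base) + homozygous * expit(base + beta_2)


def case_distribution():
    """Expected proportion of case subjects drawn from each subpopulation."""
    joint = {name: shares[name] * subpopulation_risk(name) for name in shares}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


for name, fraction in case_distribution().items():
    print(name, round(100 * fraction, 1))
```

The expected proportions agree with the simulated averages reported above to within about half a percentage point, and show how strongly the case sample is enriched for the higher-risk subpopulations.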

In tables 2 and 3, we show the results of a number of analyses of these simulated data. The crude analysis corresponds to calculation of the association between disease and the candidate gene using a single 2×3 table. The second analysis is the latent-class analysis that estimates β_{1} and β_{2} simultaneously, assuming the disease is rare. The third and fourth analyses are the latent-class binary genotype model estimates of β_{1} (using data only from persons with zero or one copy of the disease-causing allele) and β_{2} (using data only from persons with zero or two copies of the disease-causing allele). Finally, we give results of two analyses that use the true subpopulation data, in which β is estimated by maximization of the likelihood for marker and candidate-gene data, given case/control status and knowledge of subpopulation. The first makes the rare-disease approximation (i.e., assumes Hardy-Weinberg equilibrium in control patients) and estimates β_{1} and β_{2} simultaneously. The second estimates β_{1} (using data only from persons with zero or one copy of the disease-causing allele) and β_{2} (using data only from persons with zero or two copies of the disease-causing allele) using only binary candidate-allele data. For all simulations, the average and empirical standard error of parameter estimates from the 500 simulations are presented.

Because β_{1} and β_{2} from the crude analysis differ markedly from the values used to generate the data (β_{1}=0 and β_{2}=1), the population substructure has a large effect. However, the results of the latent-class analysis are close to the true values, even though we used only 12 STR loci to reconstruct the population substructure. In addition, the standard errors of the rare-disease latent-class estimators are only slightly higher than that of the maximum-likelihood estimator obtained using information on true subpopulation membership (e.g., 0.453 for the latent-class rare-disease estimate of β_{2}, compared with 0.405 for the equivalent analysis using the true population substructure). This indicates that group membership is determined with fairly high precision. The standard error for the binary-genotype analyses is higher than the rare-disease approximation, because each analysis uses fewer data than the rare-disease model does. Given the estimate of β_{1}≈0 from either the rare-disease analysis or the binary-genotype analysis using only persons with zero or one copy of the disease allele, another valid analysis would be a comparison of persons with zero or one copy of the disease allele with persons with two copies in a binary-genotype analysis.

To examine the effect of the number of STR loci on our estimator, we also analyzed the simulation data sets using only the first six STR loci in table 1, by means of the rare-disease model (table 3). The estimator of β_{1} is still good, but β_{2} is noticeably further from its true value. However, even with only six STR loci, adequate performance can be achieved by increasing the sample size to 500 (250 case patients and 250 control patients).

The estimated number of subpopulations, K̂, was chosen to minimize the AIC, as described in the Model section. The value of K̂ obtained by our method was, on average, lower than the true value of 4, possibly because one subpopulation constitutes only 10% of cases and controls. When we increased the sample size to 250 case patients and 250 control patients, the average number of subpopulations detected increased to four (which was also the number of subpopulations most frequently selected).

We assessed the coverage (proportion of intervals containing the true value) of confidence intervals obtained using the parametric bootstrap procedure described in the previous section. For each of 200 data sets (each with 125 case patients and 125 control patients, and using all 12 markers), we generated 200 bootstrap replicates and calculated confidence intervals for β_{1} and β_{2} using the percentile method (Efron and Tibshirani ^{1998}). Figures 1 and 2 compare the nominal and actual coverage of these confidence intervals. Ideal behavior corresponds to a 45° line, along which nominal and actual coverage are equal. The 95% confidence interval for β_{1} contained the true value of 0 in 98% of the simulations, and the 95% confidence interval for β_{2} contained the true value 1.0 in 97% of the simulations. Ideally, >200 bootstrap replicates should be used to calculate a confidence interval, but we chose only 200 replicates per data set to reduce the computational burden of analyzing 200 data sets. In practice, at least 500 replicates should be used. The bootstrap can also be used to estimate the standard error of β̂. The average bootstrap estimators of the standard error of β̂_{1} and β̂_{2} for the rare-disease model are 0.313 and 0.498, close to the standard errors of the simulated data sets (0.293 and 0.453, respectively).

Figure 1. Coverage of confidence intervals for β_{1} (proportion of 100α% confidence intervals that contain the true value of β_{1}) for the discrete subpopulation data in example 1.

Figure 2. Coverage of confidence intervals for β_{2} (proportion of 100α% confidence intervals that contain the true value of β_{2}) for the discrete subpopulation data in example 1.

To assess the performance of our method when stratification was not present, we also generated case-control data as above, but sampled individuals exclusively from the European subpopulation. The results of analyses of 500 simulated data sets using the rare-disease model are summarized in table 4. The method performed well, properly identifying the true number of subpopulations (1) in 74% of the data sets. The averages of the estimates of parameters β_{1} and β_{2} are also close to their true values, and the variability of these estimates is close to the values obtained by maximum likelihood, using a model that ignores stratification.

### Example 2: Continuous Admixture of Ancestral Populations

The latent-class model we have described assumes the existence of discrete subpopulations, each with a set of characteristic allele frequencies. Although this model may accurately describe some populations, a more common situation may be many small, related subpopulations or a continuous mixture of ancestral populations. However, even if the underlying population is a continuous mixture, the discrete-subpopulation model may provide adequate inference on the odds ratio relating the candidate gene and disease. It is known that a stratified analysis with a few well-chosen strata often can control for confounding, even if the confounding is caused by continuous variables (Rosenbaum and Rubin ^{1984}). To assess this, we conducted a simulation study in which data were generated using a continuous mixture model (corresponding to an infinite number of subpopulations). Specifically, we assumed that the population was a continuous admixture of four ancestral populations. We assumed the four Argentinean populations described in example 1 were the ancestral populations. Following Pritchard et al. (^{2000a}), for each individual we generated a Dirichlet random variable *Y* with four components *y*_{1}, …, *y*_{4}. The *k*th component of *Y* represents the probability that an allele for this individual is from ancestral population *k.* As a result, the frequency of allele *j* at locus ℓ for an individual with random variable *Y* can be written as *y*_{1}*p*_{1ℓj}+…+*y*_{4}*p*_{4ℓj}, where, in a slight abuse of notation, *p*_{kℓj} denotes the frequency of allele *j* at locus ℓ in the *k*th ancestral population. The parameters of the Dirichlet distribution used were (0.7, 0.1, 0.1, 0.1), so that 70% of the total genome of the target population was of European origin, with a contribution of 10% from each of the American Indian populations.
This choice of parameters ensures a wide range of variability among individuals, and ~40% of persons had a plurality of their genome taken from one of the American Indian populations. We also assumed that the log odds of disease was a linear function of *Y.* Letting ν=(-5.0, -4.0, -3.0, -3.0), we took the log odds of disease for a person with Dirichlet vector *y* to be ν·*y*. Hence, the prevalence of disease among persons without the disease-causing allele ranged from 0.7%, for persons with entirely European ancestry, to 4.7%, for persons with exclusively Tehuelche or Wichi ancestry. Among cases, the proportions of persons with European, Mapuche, Tehuelche, and Wichi as the most prevalent ancestral component were 46%, 10%, 21%, and 23%, respectively, whereas the equivalent proportions among controls were 60%, 9%, 15%, and 16%.
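The admixture step can be sketched with the Python standard library by drawing a Dirichlet vector as normalized gamma variates. The Dirichlet parameters and ν are those given above; the single-locus allele frequencies in `p_locus` are made-up placeholders rather than the values in table 1, and the function names are illustrative.

```python
import math
import random


def dirichlet(alphas, rng):
    """Draw a Dirichlet vector by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]


def mixed_allele_freqs(y, p_locus):
    """Individual allele frequencies at one locus: sum over k of y_k * p_kj.

    p_locus[k][j] is the frequency of allele j in ancestral population k.
    """
    return {j: sum(y_k * p_k[j] for y_k, p_k in zip(y, p_locus)) for j in p_locus[0]}


def disease_log_odds(y, nu):
    """Baseline log odds of disease as a linear function of the admixture vector."""
    return sum(y_k * nu_k for y_k, nu_k in zip(y, nu))


def expit(x):
    """Inverse logit: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


alphas = (0.7, 0.1, 0.1, 0.1)  # European, Mapuche, Tehuelche, Wichi
nu = (-5.0, -4.0, -3.0, -3.0)
# Placeholder allele frequencies for a single biallelic locus (not from table 1).
p_locus = [{"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.4}, {"a": 0.4, "b": 0.6}, {"a": 0.3, "b": 0.7}]

rng = random.Random(2001)
y = dirichlet(alphas, rng)
freqs = mixed_allele_freqs(y, p_locus)
baseline_risk = expit(disease_log_odds(y, nu))
```

Because the Dirichlet parameters sum to 1, most draws are dominated by one or two components, which is the source of the wide individual-to-individual variability noted above.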

The results of fitting the latent-class model to these data are shown in table 5. Generally, the mean values of the estimates of β_{1} and β_{2} were comparable to those in example 1, in which the population was a discrete mixture. On average, three subpopulations were chosen using the AIC. For the relatively small sample size we considered, the bias in the estimated log-odds ratio for the latent-class model was ~0.08. The ratio of the standard error of the latent-class estimator of β over the standard error of the “full data” estimator of β obtained by maximizing the likelihood of the genotype given disease status and knowledge of the Dirichlet vector *Y* is larger than the equivalent comparison when the underlying population substructure is discrete. To determine how sample size affects performance of the discrete latent-class model with continuous admixture data, we increased the sample size to 500 cases and 500 controls. These results, also shown in table 5, indicate that the bias of the latent-class model decreases considerably when the sample size is increased. The increase in variability of the latent-class estimators over the full-data model is also reduced. Additionally, the estimated number of subpopulations increased. Although we have not considered it, it is reasonable to expect that an increase in the number of informative markers would also improve performance.

Because the number of subpopulations seemed small in light of the large variability of the Dirichlet distribution used to generate the data, it seemed possible that the coverage of confidence intervals calculated using the parametric bootstrap would be too low (recall that, for a given data set, each bootstrap replicate is generated assuming K̂ subpopulations, where K̂ is the estimated number of subpopulations obtained by minimizing the AIC for that data set). Surprisingly, this apparently was not the case. Coverage of bias-corrected (Efron and Tibshirani ^{1998}) confidence intervals for β_{1} and β_{2} for our simulations with 125 cases and 125 controls is shown in figures 3 and 4, respectively. The departure from linearity in figure 4 is not significant (Kolmogorov-Smirnov test, *P*>.15); that is, we fail to reject the hypothesis that the actual coverage is equal to the nominal coverage. The average bootstrap estimators of the standard error of β̂_{1} and β̂_{2} for the rare-disease model are 0.293 and 0.468, close to the standard errors of the simulated data sets (0.291 and 0.485, respectively).

Figure 3. Coverage of confidence intervals for β_{1} (proportion of 100α% confidence intervals that contain the true value of β_{1}) for the continuous admixture data in example 2.

## Hypothesis Testing

We have focused on parameter estimation in this paper. However, several approaches to hypothesis testing are also possible. One approach, corresponding to a Wald test, is to fit the latent-class model and obtain bootstrap confidence intervals for the log-odds-ratio parameters; the null hypothesis is rejected at the level of 100(1-α)% if the corresponding 100α% confidence interval excludes the null value.

An alternative would be a permutation test in which the case or control status was randomly reassigned (in such a way that the total number of cases and controls was preserved). Then, the latent-class model could be fit to the permuted data. A significant association at the level of 100(1-α)% would be found if the log odds ratio β_{k}, estimated from the observed data, were larger than the corresponding quantile of β_{k} values, estimated using the permuted data.
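The permutation scheme can be sketched generically as below. The `estimate` argument stands in for refitting the latent-class model and returning the estimated log odds ratio; the toy estimator used here (a difference in mean genotype between cases and controls) is a hypothetical placeholder chosen only so that the example runs, and the one-sided p-value with the +1 correction is one common convention rather than the authors' specification.

```python
import random


def permutation_p_value(status, estimate, n_perm=999, seed=0):
    """One-sided permutation p-value for an association statistic.

    status   -- list of 0/1 disease indicators
    estimate -- callable mapping a status list to a statistic (hypothetical stand-in
                for refitting the latent-class model to relabeled data)
    """
    rng = random.Random(seed)
    observed = estimate(status)
    permuted = list(status)  # shuffling a copy preserves the case and control totals
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if estimate(permuted) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy data: genotype perfectly aligned with disease status.
genotype = [1] * 10 + [0] * 10
status = [1] * 10 + [0] * 10


def mean_difference(labels):
    """Placeholder statistic: mean genotype in cases minus mean genotype in controls."""
    cases = [g for g, s in zip(genotype, labels) if s == 1]
    controls = [g for g, s in zip(genotype, labels) if s == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


p_value = permutation_p_value(status, mean_difference, n_perm=199)
```

In a real application each call to `estimate` would involve a full E-M fit, so the number of permutations is limited mainly by computing time.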

A final alternative would be to use a likelihood-ratio test, maximizing the likelihood as described above and then again, while holding the odds ratio fixed at its null value. Although this approach is computationally easier, the large number of nuisance parameters (the marker-allele frequencies) makes it somewhat suspect. Additionally, the null and alternative likelihood calculations would have to be constrained to have the same number of subpopulations, which is contrary to the spirit of our approach. Hence, the likelihood-ratio test probably should not be used without further simulation studies of its validity.

## Discussion

Differences in allele frequencies between subpopulations result from population genetic processes, including mutation, selection, genetic drift, and population dynamics (e.g., inbreeding or migration). As a result of these processes, a relation may exist between the differences in allele frequency across subpopulations in a candidate gene and differences in allele frequency in STR or other marker genes. If selection does not act on alleles of the candidate gene, then differences in allele frequency in a candidate gene between subpopulations and differences in allele frequency in STR or other marker genes should be comparable. As a consequence, the extent to which confounding can be caused by population substructure should be related to the ease with which it can be detected and accounted for, with larger effects being easier to detect (Pritchard and Rosenberg ^{1999}). The simulation example we considered had a great deal of population stratification and a correspondingly large amount of confounding. We were able to account for the effect of population stratification using only 12 STR loci (even after some pooling of alleles); even our analyses that used only 6 STR loci were successful. Presumably, a population with less stratification would require more marker loci (and, possibly, a larger sample size) to resolve the population structure; however, we would expect that the confounding, caused by population structure, of the association of a candidate gene with disease would be concomitantly smaller. Selection acting on alleles of the candidate gene may alter this relation. If selection reduces differences in allele frequencies of the candidate gene, then population substructure is still identifiable, but confounding is less than otherwise might be expected. If selection increases differences in allele frequency in the candidate gene, then the situation is more serious. 
However, in this case, the candidate gene is itself an informative marker of population substructure. If the candidate gene is the only gene with allele frequencies that differ between subpopulations, then our approach (and any other based on inferring population substructure using marker genes) will fail. However, this case is unlikely to arise in human populations. By selection pressure, we mean differences in reproductive fitness; the allele frequencies of genes that may be associated with adult-onset cancer, heart disease, or other chronic diseases are unlikely to be altered by selection.

Two general approaches exist to account for population stratification. One is to attempt to model the population substructure; this is the approach we took and is also the approach of Pritchard and colleagues (Pritchard et al. ^{2000a}, ^{2000b}). The other is the genomic control (GC) approach (Devlin and Roeder ^{1999}; Bacanu et al. ^{2000}; Devlin et al., ^{in press}). We believe that, within the modeling approach, our approach is superior to that of Pritchard et al. for four reasons. First, our model is a unified treatment of both occurrence of disease and population substructure, whereas that of Pritchard et al. is a two-step approach that estimates substructure first and then tests conditional on the imputed structure. Because our model is unified, we can provide parameter estimates rather than just test hypotheses; the approach of Pritchard et al. cannot be generalized easily to provide parameter estimates. Second, our procedure accounts for the variability in selection of the number of subpopulations, while the test of Pritchard et al. is conditional on the number of subpopulations that are inferred using only the marker data. The bootstrap procedure that we propose for calculation of confidence intervals accounts for variability in the estimated number of subpopulations by estimation of this parameter for each bootstrap replicate. Third, the procedure of Pritchard et al. requires a Gibbs sampler that changes the number of parameters in the model, and this type of sampler is notorious for convergence problems. Our model uses a straightforward likelihood approach. Finally, our model accounts for differences in subpopulation structure between cases and controls; these differences are ignored by Pritchard et al., who infer substructure without accounting for case and control status. 
For example, in our simulation, although case patients and control patients were simulated from a population that was 70% European, only 26% of case patients were from the European subpopulation. Because Pritchard et al. test for differences in allele frequencies conditional on population substructure, the candidate gene cannot contribute information about population substructure. In our approach, substructure and disease-gene association are calculated simultaneously; hence, the candidate gene can contribute to inferences about substructure. This is useful because, if population substructure results in confounding, the candidate gene is necessarily informative about population substructure.

Comparison with GC is more difficult, because the GC approach to population stratification is different. In a sense, the GC approach is more general, because it applies to any situation in which cases and controls might have differences in homozygosity (the other major situation being cryptic relatedness, which occurs when case patients may be more likely than control patients to be closely related through a common ancestor). However, the GC approach is limited to binary marker and candidate alleles and requires the additional assumption that the effect of population structure is constant over all loci. Our approach is likelihood-based and, hence, should have better power in situations where a latent-class model correctly describes underlying population substructure. However, an advantage to GC is that an underlying model of the population substructure does not have to be specified. Furthermore, both approaches reduce to the same unstratified analysis when there is no population substructure. It is likely that GC requires more marker loci (although large numbers of biallelic SNP markers should be available soon), whereas continued identification of loci that are highly informative of population substructure (e.g., Dean et al. ^{1994}; Shriver et al. ^{1997}; Parra et al. ^{1998}) should reduce the number of loci required for the latent-class analysis. Finally, GC provides hypothesis tests but not parameter estimates. Because of the differences between GC and our approach, a direct comparison of power, using the example we have considered, is not possible.

Both the approach of Pritchard et al. and GC are tests of association, not methods of estimating association parameters. Estimation of the magnitude of the association between a candidate gene and disease is important, even when population substructure is present. In addition, even though a “significant” amount of population substructure is present, the actual effect on the disease-gene odds ratio can easily be relatively small. Knowing the magnitude of the effect of population stratification on the odds ratio estimate may also be important in assessing the extent of bias in case-control studies in which this stratification may have been ignored.
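The size of the bias that stratification can induce in an odds ratio can be illustrated with a small worked example. The counts below are entirely hypothetical (they are not taken from our simulations), and the strata are treated as known, which is exactly what is unavailable in practice; the Mantel-Haenszel estimator is used here only as a familiar stratified summary, not as the latent-class estimator.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: case carriers, b: case non-carriers,
    c: control carriers, d: control non-carriers.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio over a list of 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


# Hypothetical counts: no gene-disease association within either
# subpopulation, but both carrier frequency and disease risk differ.
strata = [
    (40, 10, 16, 4),   # subpopulation 1: carriers common, many cases
    (5, 20, 20, 80),   # subpopulation 2: carriers rare, few cases
]

pooled = tuple(sum(cells) for cells in zip(*strata))

if __name__ == "__main__":
    for table in strata:
        print("within-stratum OR:", odds_ratio(*table))
    print("crude (pooled) OR:", odds_ratio(*pooled))
    print("stratified (MH) OR:", mantel_haenszel_or(strata))
```

With these counts the within-stratum odds ratios are both 1, whereas the analysis that ignores substructure gives an odds ratio of 3.5; less extreme differences between subpopulations would give a correspondingly smaller distortion.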

We used data on 12 STR loci to infer subpopulation membership. These loci were chosen because their use in forensic applications has resulted in publication of allele frequencies in various subpopulations. Forensic applications do not require (and, in fact, are complicated by) varying allele frequencies across subpopulations. Characterization of a set of loci that have maximum variability across human subpopulations to improve identification of the effect of population stratification on case-control studies would be useful. To a great extent, this parallels efforts to find markers that distinguish subpopulations for mapping by admixture linkage disequilibrium or for estimation of ethnic affiliation (Dean et al. ^{1994}; Stephens et al. ^{1994}; Shriver et al. ^{1997}; Parra et al. ^{1998}; Collins et al. ^{2000}). Our initial success and that of Pritchard et al. suggest that persons conducting case-control studies should consider obtaining genotype information from cases and controls at unrelated loci, such as the forensic STR loci we used here, to assess and control for the possible effects of population stratification.

## Appendix A

Because of the large number of parameters in our latent-class models, it is important to choose good starting values and to take steps to reduce the chance of the program finding a local (rather than global) maximizer of the likelihood. In this appendix, we discuss the algorithm we used to achieve these goals.

We generated starting values for the E-M algorithm as follows: We first identified a family of variables *t*^{(r)}_{i}, on the basis of a linear score for each allele. Variables *t*^{(r)}_{i} were chosen using principal components, so that they encompassed as much of the variability as possible in the allele-frequency data. To accomplish this, suppose that each allele *j* at each locus is assigned a numerical score *c*^{(r)}_{j}, and let *t*^{(r)}_{i}=Σ_{j}*c*^{(r)}_{j}*n*_{ij}, where *n*_{ij} is the number of copies of the *j*th allele (at its locus) in the *i*th study participant and the sum runs over all alleles at all loci. If the values of *n*_{ij} are taken to be the *i*th row of a matrix *Y,* values of *c*^{(r)}_{j} correspond to the values of the eigenvector corresponding to the *r*th largest eigenvalue of the matrix *Ỹ*^{T}*Ỹ*, where *Ỹ* is a centered version of *Y.*
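The principal-component step can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the program we used: the function name is hypothetical, *Y* is taken to be a participants-by-alleles matrix of allele counts, "centered" is read as subtracting each column mean, and the score for participant *i* is read as the sum over alleles of the allele score times the allele count.

```python
import numpy as np


def principal_component_scores(Y, n_components=5):
    """Allele scores and participant scores from principal components.

    Y            -- (participants x alleles) matrix of allele counts n_ij
    n_components -- number of leading directions r = 1, 2, ... to keep

    Returns (T, C): T[i, r] is the linear score t_i^(r) and C[j, r] is
    the allele score c_j^(r).
    """
    Y = np.asarray(Y, dtype=float)
    Y_centered = Y - Y.mean(axis=0)  # assumption: column-wise centering

    # Eigen-decomposition of the symmetric matrix Y~^T Y~.
    eigvals, eigvecs = np.linalg.eigh(Y_centered.T @ Y_centered)

    # eigh returns eigenvalues in ascending order; take the largest first.
    order = np.argsort(eigvals)[::-1][:n_components]
    C = eigvecs[:, order]

    # Linear score for each participant: sum_j c_j^(r) * n_ij.
    T = Y @ C
    return T, C


if __name__ == "__main__":
    # Hypothetical counts for six participants and four alleles.
    Y = [[2, 0, 2, 0]] * 3 + [[0, 2, 0, 2]] * 3
    T, C = principal_component_scores(Y, n_components=2)
    print(T[:, 0])
```

Only the ranks of the scores are used afterward, so the sign of each eigenvector and the choice of centered or uncentered counts in the last step do not affect the grouping of participants.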

Let ρ^{(r)}_{i} denote the rank of *t*^{(r)}_{i} among the study participants. Then, the initial probability that the *i*th individual was in stratum *k* was taken to be *f*^{(r;0)}_{ik}∝*e*^{-0.5[(ρ^{(r)}_{i}-μ_{k})/σ]^{2}} for *k*=1,…,*K*, where μ_{k}=*n*(*k*-0.5)/*K* and σ=*n*(*K*-0.5)/*K*. For this choice, note that the for any *r,* *i,* *k,* and *k*^{′}.
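A short sketch of this initialization is given below. It rests on two assumptions that should be checked against the formula above: the exponent is read as -0.5 times the square of (rank minus μ_{k}) divided by σ, and the weights are normalized so that each participant's probabilities sum to 1 over the *K* strata. The function name is hypothetical, and ties in the scores are broken by input order.

```python
import math


def initial_memberships(scores, K):
    """Initial stratum-membership probabilities from ranked scores.

    scores -- linear scores t_i for one principal-component direction
    K      -- number of strata
    Returns a list of n rows, each holding K probabilities.
    """
    n = len(scores)

    # Rank 1..n of each participant's score.
    order = sorted(range(n), key=lambda i: scores[i])
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position

    sigma = n * (K - 0.5) / K
    probs = []
    for i in range(n):
        weights = []
        for k in range(1, K + 1):
            mu_k = n * (k - 0.5) / K  # center of the k-th block of ranks
            weights.append(math.exp(-0.5 * ((rank[i] - mu_k) / sigma) ** 2))
        total = sum(weights)
        probs.append([w / total for w in weights])  # assumption: normalize
    return probs


if __name__ == "__main__":
    for row in initial_memberships([0.3, -1.2, 2.5, 0.9], K=2):
        print([round(p, 3) for p in row])
```

Because σ is large relative to the spacing of the μ_{k}, the starting probabilities are deliberately diffuse: participants with low ranks lean toward the first stratum and those with high ranks toward the last, but no participant is assigned with near certainty.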

To avoid excessive influence of the initial value *f*^{(r;0)}_{ik}, we adopted the following strategy. Let *f*^{(m)}_{ik} denote the estimate that the *i*th person is in subpopulation *k* after *m* steps of the E-M algorithm. Rather than using *f*^{(m)}_{ik} to determine new estimates of the parameters β, γ, π^{d}_{j}, and *p*_{kj}, we used . We used α_{0}=0.5 and selected α_{m} for *m*≥1, as follows. If β^{(m)} and γ^{(m)} denote the estimates of β and γ obtained after *m* steps, then we used α_{m}=α_{m-1}, unless , in which case we used α_{m}=α_{m-1}/2, where ||*x*|| denotes the Euclidean norm of the vector *x.* The algorithm was judged to have converged when δ_{m}<10^{-7}, as long as α_{m}<10^{-7}.

For one, two, or three subgroups, our algorithm invariably found the same maximum-likelihood estimates of β, γ, and π^{d}_{j} when the starting values were changed. However, for four or more subgroups, a change in the starting value sometimes resulted in small changes in the final parameter estimates. Hence, whenever the number of subgroups was four or more, we restarted the E-M algorithm five times, using the five largest principal component directions, as described above.

Although the steps described above do not guarantee that the parameter estimates we obtained are global maximizers of the likelihood, they do define the specific algorithm used to obtain our parameter estimates. It is possible that, in other situations (e.g., fewer or less-informative marker alleles or smaller differences in subpopulations), some of the choices we made should be altered.

## References

Pritchard JK, Stephens M, Donnelly P (2000*a*) Inference of population structure using multilocus genotype data. Genetics 155:945–959

Pritchard JK, Stephens M, Rosenberg NA, Donnelly P (2000*b*) Association mapping in structured populations. Am J Hum Genet 67:170–181

**American Society of Human Genetics**
