# Estimation of Haplotype Frequencies, Linkage-Disequilibrium Measures, and Combination of Haplotype Copies in Each Pool by Use of Pooled DNA Data

Suenori Chiku,^{1} Eisuke Inoue,^{1} Makoto Tomita,^{1} Takayuki Morisaki,^{3} Hiroko Morisaki,^{3} and Naoyuki Kamatani^{2}

^{1}Algorithm Team, Japan Biological Information Research Center, Japan Biological Informatics Consortium, and

^{2}Division of Genomic Medicine, Department of Applied Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women’s Medical University, Tokyo; and

^{3}Department of Bioscience, National Cardiovascular Center Research Institute, Osaka

## Abstract

Inference of haplotypes is important for many genetic approaches, including the process of assigning a phenotype to a genetic region. Usually, the population frequencies of haplotypes, as well as the diplotype configuration of each subject, are estimated from a set of genotypes of the subjects in a sample from the population. We have developed an algorithm to infer haplotype frequencies and the combination of haplotype copies in each pool by using pooled DNA data. The input data are the genotypes in pooled DNA samples, each of which contains the quantitative genotype data from one to six subjects. The algorithm infers by the maximum-likelihood method both frequencies of the haplotypes in the population and the combination of haplotype copies in each pool by an expectation-maximization algorithm. The algorithm was implemented in the computer program LDPooled. We also used the bootstrap method to calculate the standard errors of the estimated haplotype frequencies. Using this program, we analyzed the published genotype data for the *SAA* (*n*=156), *MTHFR* (*n*=80), and *NAT2* (*n*=116) genes, as well as the *smoothelin* gene (*n*=102). Our study has shown that the frequencies of major (frequency >0.1 in a population) haplotypes can be inferred rather accurately from the pooled DNA data by the maximum-likelihood method, although with some limitations. The estimated *D* and *D*′ values had large variations except when |*D*| values were >0.1. The estimated linkage-disequilibrium measure ρ^{2} for 36 linked loci of the *smoothelin* gene when one- and two-subject pool protocols were used suggested that the gross pattern of the distribution of the measure can be reproduced using the two-subject pool data.

## Introduction

Inference of haplotypes is important for many genetic approaches, including the process of assigning a phenotype to a genetic region (Risch et al. ^{1996}; Hodge et al. ^{1999}; Rieder et al. ^{1999}). Extended marker haplotypes may provide additional power in the detection of associations (Kruglyak ^{1999}; Templeton ^{1999}; Judson et al. ^{2000}; Martin et al. ^{2000}; Zöllner and von Haeseler ^{2000}).

In testing for the presence of linkage disequilibrium or in estimating its strength, the frequencies of haplotypes and the frequencies of alleles in a population should be evaluated. Thus, estimation of the haplotype frequencies in a population is the first step in analysis of linkage disequilibrium. On the one hand, when the family data are available, we can extract the phase data and either estimate or determine the haplotypes by using software such as Linkage Package (Lathrop et al. ^{1985}) and Genehunter (Kruglyak et al. ^{1996}). On the other hand, when the family data are not available, Hardy-Weinberg equilibrium is assumed for the population data, and the haplotype frequencies are estimated by the parsimony method (Clark ^{1990}), the expectation-maximization (EM) algorithm (Excoffier and Slatkin ^{1995}; Hawley and Kidd ^{1995}; Long et al. ^{1995}; Kitamura et al. ^{2002}), or the Phase algorithm, which is based on Bayesian inference (Stephens et al. ^{2001}). Fallin and Schork (^{2000}) have demonstrated high accuracy in haplotype-frequency estimation for biallelic diploid samples by use of the EM algorithm. We previously developed a program, LDSupport, that estimates both haplotype frequencies in a population and the diplotype configuration for each subject (Kitamura et al. ^{2002}). A diplotype configuration is a combination of two haplotype copies in a subject. Recently, Zhang et al. (^{2001}) and Xu et al. (^{2002}) compared Phase- and EM-algorithm–based methods and reported that the two methods exhibited similar performance, whereas Stephens et al. (^{2001}) argued that the Phase method outperformed the EM method.

In the present study, we extended the function of LDSupport and constructed a new algorithm so that our program can handle genotype data from pooled DNA samples. Using published and unpublished data, we tested the accuracy of haplotype frequencies estimated by the new algorithm implemented in the new program, LDPooled.

## Methods

### DNA Pools

Suppose that we have genomic DNA samples from many subjects. Because of limitations of either cost or time, we wish to reduce the total number of typings. We therefore make *N* DNA pools, each of which contains the samples from *M* different subjects. The selection of the samples for pooling of DNA is performed at random, and the sample from a subject is selected only once. We then perform quantitative DNA typing by using each DNA pool for *L* linked loci. The loci can be either biallelic or multiallelic. The numbers of allele copies for each locus are assumed to be accurately determined by the quantitative DNA typing. By the terms “an allele copy” and “a haplotype copy,” we refer to an allele or a haplotype carried at a particular locus by a particular subject or a particular pool. If a subject is homozygous at a locus, then he or she is interpreted as carrying one allele but two allele copies at that locus. Note that there should be 2*M* allele copies (not 2*M* alleles) at a locus in a single pool and that, when *M*=1 (i.e., a single-subject pool), the situation is equivalent to general DNA typing for each subject.

### EM Algorithm

*Step 1: Assignment of real values to haplotype frequencies.—*Let *A*_{i} be the number of alleles at the *i*th locus. The number of possible haplotypes for *L* loci is $U=\prod_{i=1}^{L} A_i$. We assign real values to the frequencies of haplotypes as the first step of the estimation. Let *p*_{i} be the frequency of the *i*th haplotype in a population, where $p_i \ge 0$ for *i*=1,2,…,*U*. Naturally, $\sum_{i=1}^{U} p_i = 1$.

*Step 2: Combination of haplotypes.—*A pool of DNA contains samples from *M* subjects. Therefore, 2*M* haplotype copies should be present in a pool. When 2*M* haplotype copies are selected (permitting repetitive sampling) from the total of *U* haplotypes, at least one of the combinations of haplotype copies should be consistent with the observed pooled genotype data at all *L* loci for the pool. Let *C*_{jm} be the *m*th combination of haplotype copies that is consistent with the observed genotype data for the *j*th pool for *L* loci, where *m*=1,2,…,*Q*_{j}. *Q*_{j} denotes the number of combinations of haplotype copies consistent with the observed genotype data for the *j*th pool.
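The enumeration in step 2 can be illustrated with a brute-force sketch. This is not the LDPooled implementation. It assumes biallelic loci with alleles coded 0 and 1, and the function name `consistent_combinations` and the input layout (one allele-1 copy count per locus) are hypothetical choices made for illustration.

```python
from itertools import combinations_with_replacement, product


def consistent_combinations(pool_counts, n_subjects):
    """List every multiset of 2M haplotype copies matching the pooled counts.

    pool_counts[l] is the observed number of copies of allele 1 at biallelic
    locus l in one pool; n_subjects is M. Both names are illustrative.
    """
    n_loci = len(pool_counts)
    # All U = 2**L possible haplotypes for L biallelic loci.
    haplotypes = list(product((0, 1), repeat=n_loci))
    matches = []
    # Sampling with repetition: unordered selections of 2M haplotype copies.
    for combo in combinations_with_replacement(haplotypes, 2 * n_subjects):
        totals = [sum(h[l] for h in combo) for l in range(n_loci)]
        if totals == list(pool_counts):
            matches.append(combo)
    return matches


# A two-subject pool (M = 2) typed at two loci, with two copies of
# allele 1 observed at each locus.
combos = consistent_combinations([2, 2], n_subjects=2)
for combo in combos:
    print(combo)
```

Exhaustive enumeration like this is only practical for a handful of loci and small *M*, which is in line with the memory limits reported in the “Time and Memory Required for Calculation” subsection.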

*Step 3: Likelihood calculation.—*Under the assumption of Hardy-Weinberg equilibrium, the prior probability of *C*_{jm} is

$$P(C_{jm}) = \frac{(2M)!}{\prod_{i=1}^{T_{jm}} R_{jmi}!}\,\prod_{i=1}^{T_{jm}} p_i^{R_{jmi}},$$

where *R*_{jmi} denotes the number of the copies of the *i*th haplotype within *C*_{jm} and *T*_{jm} denotes the number of different haplotypes within *C*_{jm}; the products run over the haplotypes present in *C*_{jm}. Note that $\sum_{i=1}^{T_{jm}} R_{jmi} = 2M$ for any *j* and *m*. The likelihood of the data for the *j*th pool given the haplotype frequencies is calculated as $L_j = \sum_{m=1}^{Q_j} P(C_{jm})$. The overall likelihood for all the *N* pools should be $L_{\text{all}} = \prod_{j=1}^{N} L_j$, since the events of combinations of haplotype copies for different pools should be independent when Hardy-Weinberg equilibrium is assumed.

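*Step 4: Expectation.—*The posterior probability of *C*_{jm} for the *j*th pool is calculated by Bayes’ theorem, as follows:

$$B_{jm} = \frac{P(C_{jm})}{\sum_{m'=1}^{Q_j} P(C_{jm'})}, \qquad (1)$$

where $P(C_{jm})$ is the prior probability of *C*_{jm} from step 3. Therefore, the expected number of copies of the *i*th haplotype in all the pools is $E_i = \sum_{j=1}^{N}\sum_{m=1}^{Q_j} B_{jm} R_{jmi}$, with *R*_{jmi}=0 when the *i*th haplotype is absent from *C*_{jm}.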

*Step 5: Maximization.—*Maximization is performed by substituting *E*_{i}/(2*MN*) for *p*_{i} for all *i.*

*Step 6: Iteration.—*Steps 2–5 are repeated until *L*_{all} converges. *L*_{max} denotes the value of *L*_{all} when it converged. The value of *p*_{i} after the final step of iteration is interpreted as $\hat{p}_i$, the maximum-likelihood estimate of *p*_{i}.
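Steps 1–6 can be put together in a compact sketch. It is an illustration under stated assumptions rather than the LDPooled code: loci are biallelic (alleles coded 0 and 1), the starting frequencies are uniform, the prior of each combination is taken to be the multinomial probability implied by Hardy-Weinberg equilibrium, and the consistent combinations are enumerated once up front because they do not depend on the current frequencies. The name `em_pooled` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement, product
from math import factorial, log, prod


def em_pooled(pools, n_subjects, n_loci, tol=1e-9, max_iter=1000):
    """Estimate haplotype frequencies from pooled biallelic allele counts.

    pools[j][l] is the number of copies of allele 1 at locus l in pool j.
    Returns (frequencies, log-likelihood at the last E step).
    """
    copies = 2 * n_subjects
    haps = list(product((0, 1), repeat=n_loci))
    p = {h: 1.0 / len(haps) for h in haps}  # step 1: uniform start (assumed)

    # Step 2: combinations consistent with each pool, stored as haplotype
    # copy counts (R_jmi) plus the multinomial coefficient.
    pool_combos = []
    for counts in pools:
        found = []
        for combo in combinations_with_replacement(haps, copies):
            if all(sum(h[l] for h in combo) == counts[l] for l in range(n_loci)):
                r = Counter(combo)
                coef = factorial(copies) // prod(factorial(v) for v in r.values())
                found.append((r, coef))
        pool_combos.append(found)

    previous = None
    loglik = float("-inf")
    for _iteration in range(max_iter):
        expected = dict.fromkeys(haps, 0.0)
        loglik = 0.0
        for found in pool_combos:
            # Step 3: prior of each combination and the pool likelihood L_j.
            priors = [coef * prod(p[h] ** v for h, v in r.items())
                      for r, coef in found]
            l_j = sum(priors)
            loglik += log(l_j)
            # Step 4: posterior-weighted haplotype copy counts E_i.
            for (r, _coef), prior in zip(found, priors):
                for h, v in r.items():
                    expected[h] += prior / l_j * v
        # Step 5: new frequencies are E_i / (2MN).
        p = {h: expected[h] / (copies * len(pools)) for h in haps}
        # Step 6: stop when the log-likelihood no longer changes.
        if previous is not None and abs(loglik - previous) < tol:
            break
        previous = loglik
    return p, loglik


# Four single-subject pools at two loci: two homozygotes and two double
# heterozygotes, whose phase is ambiguous.
freqs, loglik = em_pooled([[0, 0], [2, 2], [1, 1], [1, 1]], n_subjects=1, n_loci=2)
for hap, freq in freqs.items():
    print(hap, round(freq, 4))
```

Working with the log-likelihood instead of *L*_{all} itself avoids numerical underflow when many pools are multiplied together; the convergence criterion is otherwise the same.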

### Calculation of Posterior Probability of *C*_{jm} Given the Maximum-Likelihood Estimates

The posterior probability of *C*_{jm} for the *j*th pool, given that the population frequencies of the haplotypes are $\hat{p}_i$ for *i*=1,2,…,*U*, is obtained by applying the obtained estimates to steps 2–4. Thus, *B*_{jm} as obtained by equation (1) yields the posterior probability of *C*_{jm} for the *j*th pool, given that the population frequencies of the haplotypes are $\hat{p}_i$ for *i*=1,2,…,*U*.

### Likelihood under the Assumption of No Linkage Disequilibrium

Calculation of the likelihood of data under the assumption of no linkage disequilibrium was performed as follows: Let *q*_{ik} be the frequency of the *k*th allele at the *i*th locus in the population, let *V*_{ijk} be the number of copies of the *k*th alleles at the *i*th locus in the *j*th pool, and let *W*_{ij} be the number of different alleles at the *i*th locus in the *j*th pool. Note that $\sum_{k=1}^{W_{ij}} V_{ijk} = 2M$ for any *i* and *j*. The likelihood of the data for the *j*th pool at the *i*th locus under the assumption of no linkage disequilibrium is

$$S_{ij} = \frac{(2M)!}{\prod_{k=1}^{W_{ij}} V_{ijk}!}\,\prod_{k=1}^{W_{ij}} q_{ik}^{V_{ijk}}.$$

Since alleles at different loci are independent under the assumption of no linkage disequilibrium, the likelihood of the data at all the loci should be $\prod_{i=1}^{L} S_{ij}$, and the likelihood of the data at all the loci in all the pools should be $L_{\text{independent}} = \prod_{j=1}^{N}\prod_{i=1}^{L} S_{ij}$.
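A short sketch of this calculation follows. The text does not state how the allele frequencies *q*_{ik} are obtained; the sketch assumes they are the observed proportions of allele copies across all pools, and it treats each per-locus likelihood as a multinomial probability of the observed allele-copy counts. The function name and the nested-list input layout are hypothetical.

```python
from math import factorial, log, prod


def log_likelihood_independent(pools, n_subjects):
    """Log-likelihood of pooled allele counts assuming no linkage disequilibrium.

    pools[j][i] is the list of allele-copy counts (V_ijk over alleles k) at
    locus i in pool j; every list at a locus must sum to 2M.
    """
    copies = 2 * n_subjects
    n_loci = len(pools[0])
    total_copies = copies * len(pools)
    loglik = 0.0
    for i in range(n_loci):
        n_alleles = len(pools[0][i])
        # Assumed estimate of q_ik: proportion of allele copies over all pools.
        q = [sum(pool[i][k] for pool in pools) / total_copies
             for k in range(n_alleles)]
        for pool in pools:
            v = pool[i]
            coef = factorial(copies) // prod(factorial(x) for x in v)
            s_ij = coef * prod(qk ** vk for qk, vk in zip(q, v))
            loglik += log(s_ij)
    return loglik


# Two single-subject pools, two biallelic loci.
example = [
    [[2, 0], [1, 1]],  # pool 1: homozygous at locus 1, heterozygous at locus 2
    [[0, 2], [1, 1]],  # pool 2
]
print(log_likelihood_independent(example, n_subjects=1))
```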

### LOD Score

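LOD score was calculated as follows:

$$\text{LOD} = \log_{10}\frac{L_{\max}}{L_{\text{independent}}}. \qquad (2)$$

To exclude the null hypothesis of no linkage disequilibrium, we calculated the *P* value by converting the likelihood ratio in equation (2) to the statistic $2\ln(L_{\max}/L_{\text{independent}})$, that is, −2 ln of the ratio $L_{\text{independent}}/L_{\max}$, and assuming that this statistic asymptotically follows a χ^{2} distribution. The degrees of freedom should be

$$\left(\prod_{i=1}^{L} A_i - 1\right) - \sum_{i=1}^{L}\left(A_i - 1\right),$$

the number of free haplotype-frequency parameters minus the number of free allele-frequency parameters.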
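The test quantities can be computed from the two log-likelihoods as sketched below. Two points are assumptions here: the statistic is taken to be the standard likelihood-ratio statistic (twice the difference in natural log-likelihoods), and the degrees of freedom are taken to be the usual count of free haplotype-frequency parameters minus free allele-frequency parameters. The function name is hypothetical.

```python
from math import log, prod


def ld_test_statistics(loglik_max, loglik_independent, alleles_per_locus):
    """Return (LOD score, likelihood-ratio chi-square statistic, degrees of freedom).

    loglik_max and loglik_independent are natural-log likelihoods;
    alleles_per_locus lists the number of alleles A_i at each locus.
    """
    diff = loglik_max - loglik_independent
    lod = diff / log(10)  # log10 of the likelihood ratio
    chi_square = 2.0 * diff
    # Free parameters: (U - 1) haplotype frequencies versus sum of (A_i - 1)
    # allele frequencies, with U the product of the A_i.
    df = (prod(alleles_per_locus) - 1) - sum(a - 1 for a in alleles_per_locus)
    return lod, chi_square, df


# Six biallelic loci, with made-up log-likelihood values for illustration only.
lod, chi_square, df = ld_test_statistics(-350.0, -420.0, [2] * 6)
print(lod, chi_square, df)
```

If SciPy is available, the *P* value would follow from `scipy.stats.chi2.sf(chi_square, df)`.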

### Variation of Haplotype Frequencies, *D, *and *D*′ Determined by the Pooling Method

The estimated haplotype frequencies from the pooled genotype data exhibit variation due to different combinations of samples. To examine such variation, we made different combinations of the DNA samples from different subjects, to estimate haplotype frequencies. Thus, if there are a total of *MN* subjects and the samples from *M* different subjects should be in each pool, then *N* pools should be made. There are then $(MN)!/[(M!)^{N}\,N!]$ different combinations of the samples. This number is so large that we cannot examine all cases. We therefore used a Monte Carlo method to sample the combinations of *N* pools while assuming an equal probability for each of the combinations. From each sample, haplotype frequencies and pairwise linkage-disequilibrium measures *D* and *D*′ were estimated, as described below. From the estimates from 1,000 different randomly selected samples, means and SDs were calculated.
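One way to draw such a random pooling is to shuffle the subjects and cut the shuffled list into consecutive groups of *M*, which gives every partition into pools the same probability. The sketch below assumes biallelic genotypes stored as allele-1 copy counts (0, 1, or 2) per locus; the function name and data layout are hypothetical.

```python
import random


def random_pools(subject_genotypes, n_subjects, rng):
    """Randomly assign subjects to pools of size M and sum their genotypes.

    subject_genotypes[s][l] is the number of copies of allele 1 (0, 1, or 2)
    carried by subject s at locus l. Returns pooled allele-1 counts per locus.
    """
    if len(subject_genotypes) % n_subjects != 0:
        raise ValueError("number of subjects must be a multiple of the pool size")
    n_loci = len(subject_genotypes[0])
    order = list(range(len(subject_genotypes)))
    rng.shuffle(order)  # uniform random ordering of the subjects
    pools = []
    for start in range(0, len(order), n_subjects):
        members = order[start:start + n_subjects]
        pools.append([sum(subject_genotypes[s][l] for s in members)
                      for l in range(n_loci)])
    return pools


genotypes = [[0, 1], [1, 1], [2, 0], [1, 2]]  # four subjects, two loci
rng = random.Random(0)
for replicate in range(3):
    print(random_pools(genotypes, n_subjects=2, rng=rng))
```

Repeating the draw (1,000 times in the text) and re-estimating the frequencies each time yields the means and SDs reported in the tables.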

### Nonparametric Bootstrap Method to Estimate SEs

The nonparametric bootstrap method was used to estimate empirically the SE of the estimated frequency of the *i*th haplotype, that is, the SE of $\hat{p}_i$. The original pools of DNA consisted of *N* pools, each of which contained DNA from *M* subjects. A bootstrap sample was constructed by drawing a new set of *N* pools from the original *N* pools with replacement. The data in the new set of pools were then applied to the algorithm for the estimation of the frequencies of the haplotypes, that is, *p*_{i} for *i*=1,2,…,*U*. Let $\hat{p}_i^{(b)}$ be such an estimate of the frequency of the *i*th haplotype from the *b*th bootstrap sample. When the bootstrap sampling was repeated *B* times, the mean of the estimates was calculated as

$$\bar{p}_i = \frac{1}{B}\sum_{b=1}^{B}\hat{p}_i^{(b)}.$$

Then, the empirical SE for *p*_{i} was calculated as

$$\widehat{\mathrm{SE}}(\hat{p}_i) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{p}_i^{(b)}-\bar{p}_i\right)^{2}}.$$

Bootstrap sampling was usually repeated 10,000 times (i.e., *B*=10,000) to calculate the empirical SE for each *p*_{i}.
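The resampling scheme can be sketched generically: pools, not individual subjects, are the resampling unit. In the sketch below, `estimate` stands for any function that maps a list of pools to a single estimated frequency; the simple allele-frequency estimator used in the example is a stand-in chosen to keep the block short, not the haplotype estimator of the paper. All names are hypothetical, and the SE uses the *B*−1 denominator.

```python
import random
from statistics import mean, stdev


def bootstrap_se(pools, estimate, n_boot, rng):
    """Bootstrap mean and SE of estimate(pools), resampling whole pools.

    estimate is any callable returning one number from a list of pools.
    """
    replicates = []
    for _ in range(n_boot):
        # Draw N pools from the original N pools with replacement.
        resample = [rng.choice(pools) for _ in pools]
        replicates.append(estimate(resample))
    return mean(replicates), stdev(replicates)


def allele_frequency_locus0(pools, n_subjects=2):
    """Stand-in estimator: frequency of allele 1 at the first locus."""
    return sum(pool[0] for pool in pools) / (2 * n_subjects * len(pools))


# Five two-subject pools typed at two biallelic loci (illustrative counts).
pools = [[1, 2], [0, 1], [3, 3], [2, 2], [1, 0]]
boot_mean, boot_se = bootstrap_se(pools, allele_frequency_locus0,
                                  n_boot=2000, rng=random.Random(1))
print(boot_mean, boot_se)
```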

### Estimation of *D, D*′ and ρ^{2}

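The measures of linkage disequilibrium for two biallelic loci, *D* and *D*′ (Lewontin ^{1964}) and ρ^{2}, were estimated by two different methods, as follows: Let *f*_{ij} be the frequency of a haplotype containing the *i*th and *j*th alleles at the first and second loci, respectively, and write the allele frequencies as $f_{i\cdot}=f_{i1}+f_{i2}$ at the first locus and $f_{\cdot j}=f_{1j}+f_{2j}$ at the second locus. *D, D*′, and ρ^{2} were calculated from the estimated values $\hat{f}_{ij}$ for *i*=1,2 and *j*=1,2 as

$$D = \hat{f}_{11}\hat{f}_{22} - \hat{f}_{12}\hat{f}_{21},$$

$$D' = \frac{D}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min\left(\hat{f}_{1\cdot}\hat{f}_{\cdot 2},\ \hat{f}_{2\cdot}\hat{f}_{\cdot 1}\right) & \text{if } D \ge 0, \\ \min\left(\hat{f}_{1\cdot}\hat{f}_{\cdot 1},\ \hat{f}_{2\cdot}\hat{f}_{\cdot 2}\right) & \text{if } D < 0, \end{cases}$$

and

$$\rho^{2} = \frac{D^{2}}{\hat{f}_{1\cdot}\,\hat{f}_{2\cdot}\,\hat{f}_{\cdot 1}\,\hat{f}_{\cdot 2}},$$

respectively. In the first method, the maximum-likelihood estimates, $\hat{f}_{ij}$, of the haplotype frequencies were calculated using the data for all available loci. In the second method, however, genotypic data only at the two relevant loci were used to estimate $\hat{f}_{ij}$.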
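For reference, the three pairwise measures can be computed from the four two-locus haplotype frequencies as follows. The sketch uses the standard definitions (*D*′ as *D* divided by its maximum attainable magnitude given the allele frequencies, keeping the sign of *D*); the function name is hypothetical, and returning 0 when a locus is monomorphic is a convention adopted here only to avoid division by zero.

```python
def ld_measures(f11, f12, f21, f22):
    """Return (D, D', rho^2) for two biallelic loci.

    f_ij is the frequency of the haplotype carrying allele i at the first
    locus and allele j at the second; the four values should sum to 1.
    """
    p1, p2 = f11 + f12, f21 + f22  # allele frequencies at the first locus
    q1, q2 = f11 + f21, f12 + f22  # allele frequencies at the second locus
    d = f11 * f22 - f12 * f21
    # Largest |D| attainable given the allele frequencies.
    d_max = min(p1 * q2, p2 * q1) if d >= 0 else min(p1 * q1, p2 * q2)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p1 * p2 * q1 * q2
    rho2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, rho2


print(ld_measures(0.5, 0.0, 0.0, 0.5))      # complete disequilibrium
print(ld_measures(0.25, 0.25, 0.25, 0.25))  # linkage equilibrium
```

Because *D*_{max} shrinks as a minor-allele frequency approaches zero, small errors in the estimated haplotype frequencies are magnified in *D*′, which is consistent with the instability reported for loci with rare alleles.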

## Results

### Estimation of Haplotypes for the *SAA* Gene

The haplotype data from 156 subjects for six SNP loci on the *SAA* gene (Moriguchi et al. ^{2001}) were used to perform haplotype estimation with our algorithm. In this data set, diplotype configurations of all the individuals have been determined, and these data were interpreted as reflecting the real data. So that this data set could be used for our algorithm, the haplotype data from different subjects were mixed together, and the phase data were removed. Using the phase-unknown genotype data at multiple linked loci, we performed haplotype estimation using our LDPooled program.

Tables 1 and 2 show the results of estimation in which each pool contained one, two, or four subjects. Table 1 shows the central-processing-unit time required, the number of iterations before convergence, the LOD scores, the χ^{2} values, and the *P* values. Each *P* value represents the risk for excluding the null hypothesis of independence between all six loci. As the number of subjects in each pool increased, the LOD scores and the χ^{2} values decreased while the *P* values increased (table 1). These results are probably due to the decrease in information content because of pooling.

Table 1. *SAA* data and results of test of independence for all six loci

Table 2 shows the estimated haplotype frequencies obtained using different estimation protocols in which each pool contained genotype data from one, two, or four subjects. When each pool contained DNA from more than one subject, means and SDs of the frequencies estimated using different combinations of the subjects sampled from the original data for each subject are also shown. The results of estimation when each pool contained only one subject were exactly the same as those noted in the previous study, in which no DNA was pooled (Kitamura et al. ^{2002}). Although the estimated haplotype frequencies varied with the numbers of subjects in a pool, they were still good estimates of the frequencies of the major haplotypes—ACTGCC, ACCGTC, and AGCGCT—as long as the number of subjects in a pool did not exceed four (table 2). For the three major haplotypes, the SDs of the estimated frequencies obtained using different random samplings to make the pools were typically <10% of the means (table 2). For minor haplotypes (frequency <0.1), however, estimation was not accurate. For example, the frequency of haplotype ACTGTC was estimated to be 0.0 by the four-subject pool estimation but 0.013 by the single-subject pool estimation; this haplotype should appear only 4 times among 156 individuals (or 312 haplotype copies) if the latter estimation is accurate.

### Estimation of *D* and *D*′

Table 3 shows *D* and *D*′ values calculated from the estimated haplotype frequencies. In this case, $\hat{f}_{ij}$ values, the estimated frequencies of the two-locus haplotypes, were calculated from the estimated frequencies of the six-locus haplotypes, as described in the “Methods” section. When $\hat{f}_{ij}$ values were estimated from the genotype data for the two loci, *D* and *D*′ values were very similar in some cases, but there were some cases in which the two methods yielded quite different values (data not shown). Means and SDs of the values estimated using different combinations of the subjects are shown when each pool contained more than one subject (table 3). Although the values varied between estimation protocols, they were still rather consistent, as long as the number of subjects in a pool did not exceed four and |*D*| was >0.1. However, in some cases, the SDs of the estimated *D* and *D*′ values obtained using different random samplings to make the pools were almost 50% of the means, and the values estimated by different estimation protocols differed greatly. The accuracy of the estimation of *D*′ is heavily dependent on the allele frequency. The minor-allele frequencies were 0.11 at locus 1 and 0.05 at locus 4. The SDs of the estimated *D*′ values for the locus pairs including one of these loci were larger than those for the other locus pairs. Thus, when the minor-allele frequency is low, the estimated *D*′ value obtained using pooled genotype data is not accurate.

Table 4 shows the results of estimation of the combination of haplotype copies in each DNA pool. For each protocol, only the portion of the data corresponding to the first 12 subjects is shown. The results indicate that, in many of the pools, the posterior probabilities of the combinations of haplotype copies with the highest probabilities were 1 or nearly 1. When the contents of the estimated combinations of haplotype copies were carefully compared, they were found in many cases to be consistent between different estimation protocols. For example, the contents of pool number 1 in the two-subject pool protocol should be the same as the combination of pool numbers 1 and 2 in the one-subject pool protocol. Table 4 shows that this was indeed the case. In other cases, however, the contents of a pool estimated by a protocol were inconsistent with those in the pools estimated by a different protocol.

### Bootstrap Method to Calculate SEs of the Estimated Haplotype Frequencies

Since the estimated haplotype frequencies exhibited errors, we implemented in LDPooled the algorithm to calculate SEs by the bootstrap method. Using the same data for the *SAA* gene from 156 subjects, we made one-, two-, or four-subject pools, as described above (see the “Estimation of Haplotypes for the *SAA* Gene” subsection). We then applied the bootstrap method to such data, as described in the “Methods” section. Figure 1 shows the means and SEs of the estimated haplotype frequencies obtained using different estimation protocols (one-, two-, and four-subject pools). Bootstrap sampling was repeated 10,000 times for each estimation protocol. As shown, the estimated frequencies were rather stable for haplotypes ACTGCC, ACCGTC, and AGCGCT, irrespective of the number of subjects in a pool. The lengths of the error bars were rather short, compared with those of the mean values for these haplotypes. In addition, the means of the haplotype frequencies estimated by different protocols were approximately the same for the same haplotypes. For the minor haplotypes (*p*_{i}<0.1), however, the frequencies estimated by the different protocols varied significantly, and the error bars were rather long, compared with the mean values (fig. 1). For some minor haplotypes, the error bars were too large to be tolerated when two- and four-subject pools were used. When the total numbers of the subjects were the same, the estimated haplotype frequencies obtained using two- and four-subject pools for such minor haplotypes were less accurate than the frequencies estimated using one-subject pools.

### Time and Memory Required for Calculation

The time and memory required for each calculation were recorded. When a computer with a Pentium III 1-GHz processor and a memory of 1.5 GB was used, the number of subjects within a pool was, at maximum, six when the number of loci was six. If the number of loci was 13, then the maximum number of subjects in a pool could be only two. In contrast, 25 loci were possible when no pooling was performed on DNA samples. This is because the algorithm implemented in LDPooled uses possible combinations of haplotype copies in each DNA pool and this step consumes a large amount of memory. The number of combinations increases by a power function of the number of alleles at a locus, and it increases by a factorial of the number of subjects in a pool. Therefore, the dependence of time and memory on those factors is due to the requirement of space for combinations.
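To get a feel for this growth, one can count the unordered selections of 2*M* haplotype copies from *U* possible haplotypes before any filtering against the observed genotypes. This multiset count is an illustrative upper bound chosen here, not a description of what LDPooled actually stores, and the function name is hypothetical.

```python
from math import comb, prod


def n_unrestricted_combinations(alleles_per_locus, n_subjects):
    """Number of multisets of 2M haplotype copies drawn from U haplotypes.

    U is the product of the allele counts per locus. This ignores the
    observed genotypes, so it is only an upper bound on Q_j.
    """
    u = prod(alleles_per_locus)
    copies = 2 * n_subjects
    return comb(u + copies - 1, copies)


# Configurations mentioned in the text, all with biallelic loci.
for n_loci, m in [(6, 1), (6, 6), (13, 2), (25, 1)]:
    print(n_loci, m, n_unrestricted_combinations([2] * n_loci, m))
```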

### Estimation of Haplotypes for the *MTHFR* Gene

The *MTHFR* gene encodes the methylenetetrahydrofolate reductase enzyme, which is related to folate metabolism. We have published data, for a total of 80 subjects, in which two linked loci of the gene are involved (Urano et al. ^{2002}). These published data were used to make DNA pools and to estimate parameters from the pooled data. Table 5 shows estimated haplotype frequencies and means and SDs of the estimated haplotype frequencies obtained using different combinations of the subjects sampled from the original data for each subject (when the protocol for pools of two or more subjects is used) for the *MTHFR* gene. These data indicate that the estimated haplotype frequencies were rather accurate, even when the four-subject protocol was used. This is probably because linkage disequilibrium is very strong for this pair of loci. The data also indicate that estimation was rather accurate for haplotypes whose relative frequencies were >0.1.

Figure 2 shows the results of the application of the bootstrap method to the haplotype data for the *MTHFR* gene. These results suggest that our method can accurately estimate haplotype frequencies for the *MTHFR* gene when haplotype frequencies are rather high.

Figure 2. Results of the bootstrap method applied to the haplotype data for the *MTHFR* gene. Means and SEs of estimated frequencies were calculated as described in the “Methods” section.

*D* and *D*′ values calculated from the estimated haplotype frequencies for the *MTHFR* gene are shown in table 6. For the data from two- and four-subject pool protocols, means and SDs for different random samplings are also shown. These data also show that the variability of estimated values is large. If the minor-allele frequency at one of the two loci was low, then the estimated *D*′ value was not accurate, even when one-subject pools were used for estimation. Since the minor-allele frequencies were 0.39 and 0.16 at loci 1 and 2 for the *MTHFR* gene, SDs for *D*′ showed large values.

### Estimation of Haplotypes for the *NAT2* Gene

The *NAT2* gene encodes the N-acetyltransferase 2 enzyme, which is related to transfer of N-acetyl residues. We have published data, for a total of 116 subjects, in which seven linked loci of the gene are involved (Tanaka et al. ^{2002}). These published data were used to make DNA pools and to estimate parameters from the pooled data. Table 7 shows estimated haplotype frequencies and the means and SDs of the estimated haplotype frequencies obtained using different combinations of the subjects sampled from the original data for each subject (when the protocol for pools of two or more subjects is used) for the *NAT2* gene. Figure 3 shows the results of the application of the bootstrap method to the haplotype data for the *NAT2* gene. These results show that our method can accurately estimate haplotype frequencies when the frequencies of the haplotypes for the *NAT2* gene are rather high (>0.1). For minor haplotypes (<0.1), when two- and four-subject pools were used, the error bars were too large to be tolerated.

Figure 3. Results of the bootstrap method applied to the haplotype data for the *NAT2* gene. Means and SEs of estimated frequencies were calculated as described in the “Methods” section.

*D* and *D*′ values calculated from the estimated haplotype frequencies for the *NAT2* gene are shown in table 8. Compared with the results of the one-subject pool, some *D*′ values for two- or four-subject pools showed the opposite signs. The minor-allele frequencies at loci 3, 4, 6, and 7 were <0.1. When the minor-allele frequency is very low, estimated values obtained using pooled genotype data are not accurate for *D*′.

### Estimation of Haplotypes for the *Smoothelin* Gene

Estimation of haplotype frequencies and combination of haplotype copies within each pool was performed using data for the *smoothelin* gene. The set of all data includes the genotype data from 32 black, 90 white, and 102 Japanese subjects; however, this set will be published later, and the present article used the data from Japanese subjects only, to test the function of LDPooled.

The typed loci spanned an ~300-kb region and contained both SNP and microsatellite polymorphisms. The total number of SNP loci, including insertion/deletion polymorphisms, was 36. Since the number of loci was too large to estimate the haplotypes by using the EM algorithm, the loci were selected to reduce the number as follows: First, the SNP loci with minor-allele frequencies of ≥0.2 were selected; the number of SNP loci was still too large for the haplotype estimation with pooled DNA. Then, linkage-disequilibrium measures *D*′ and ρ^{2} were calculated for each of those SNP pairs, a haplotype block containing nine loci was determined, and the nine loci were used for the haplotype estimation; the same haplotype block was determined for two different protocols, one by the one-subject pool method and the other by the two-subject pool method. Finally, the haplotype estimation and SE calculation by the bootstrap method were performed for the nine-locus data; bootstrap sampling was repeated 10,000 times for each estimation protocol. All these processes were performed using our LDPooled program.

Figure 4 shows a comparison of the estimated haplotype frequencies, as well as the SEs, by the application of the bootstrap method between two different protocols. As shown, the haplotype frequencies estimated by the two protocols were quite similar as long as haplotype frequencies were >0.1. Even for minor haplotypes, the frequencies estimated using two different protocols were similar, except when the haplotype frequencies were <0.01. However, the error bars were rather long, compared with the mean values for these minor haplotypes. Thus, the calculation of SEs by the bootstrap method was useful for the evaluation of the inaccuracy of estimated haplotype frequencies.

### Estimation of Pairwise Linkage-Disequilibrium Measure ρ^{2}

The strength of linkage disequilibrium is usually measured in pairwise fashion. We calculated ρ^{2} as described in the “Methods” section. Figure 5 shows estimated ρ^{2} values for all pairs. In this case, $\hat{f}_{ij}$ values were estimated using only two-locus genotype data. Although slight differences were observed, the ρ^{2} values estimated by the two different protocols (i.e., single-subject pool and two-subject pool) were quite similar. The mean and the SD for the estimated ρ^{2} values for different SNP pairs were 0.114 and 0.224, respectively, with the one-subject pool protocol, and 0.118 and 0.227, respectively, with the two-subject pool protocol. The mean and the SD of the absolute value of the difference between estimated ρ^{2} values were 0.011 and 0.022, respectively. On average, the absolute value of the difference in estimated ρ^{2} values obtained by the two protocols was ~9.5% of the mean.

## Discussion

Genotyping is usually performed on DNA samples obtained from single subjects. In special cases, however, samples from different subjects are mixed, and genotyping is performed on the mixed samples. For example, DNA samples are mixed, and the number of allele copies in each mixed sample is determined for case and control samples, to reduce the cost of genotyping. Recently, case-control studies in which allele frequencies are compared between different groups have been performed at thousands of loci in numerous subjects (Barcellos et al. ^{1997}; Collins et al. ^{2000}). If the objective of studies is only to detect differences in the frequencies of alleles between case and control groups, then the pooled DNA method can be efficient, as long as the frequencies can be determined accurately for the pooled samples. However, phase information becomes ambiguous with pooling.

The question that we addressed when beginning the present study was the following: If we wish to know either the phase of each subject, haplotype frequencies in the population, numbers of haplotype copies in the sample, or the strength of linkage disequilibrium for a group of linked loci, how accurately can we make such estimations when using pooled DNA data? More specifically, how accurately can we estimate either the phase of each subject, haplotype frequencies in the population, numbers of haplotype copies in the sample, or the strength of linkage disequilibrium when only data for the allele copies in each pooled sample containing *M* different subjects are available? If *M*=1, then this is equivalent to the procedure for estimating the frequencies of haplotypes by the EM algorithm when using nonpooled samples.

In certain cases, the regular EM algorithm to estimate the frequencies of haplotypes under the assumption of Hardy-Weinberg equilibrium by using genotypic data is also considered to be a method to estimate the parameters by using data from pooled samples. In the regular method, however, pooling of haplotypes is performed during fertilization, whereas, in our method, it is performed in vitro. Our method is useful for the estimation of various linkage-disequilibrium parameters on the basis of incomplete information from pooled haplotype samples, each of which contains 2*M* haplotype copies. It should be noted that Hardy-Weinberg equilibrium is always assumed in our method, as well as in the regular method. In addition, our method requires that pooling of DNA be performed at random.

As expected, information becomes degenerated as the number of haplotype copies in a pool increases. Thus, LOD scores and χ^{2} values for testing the independence of all the loci decreased when the number of subjects whose DNA was mixed in a pool increased. Naturally, *P* values for testing of independence increased.

We used real data from subjects, rather than simulated data, to test our method, since no standard simulation method is available for haplotypes and linkage disequilibrium. Naturally, the number of sets of samples that we tested is insufficient to extend the results obtained in the present study to other sets of data.

Our results from a limited number of sets of samples suggested that the estimation of frequencies of haplotypes was rather accurate when frequencies were >0.1 and when data from no more than four subjects were in a pool. As total numbers of subjects in a data set, we tested the data from 156, 80, 116, and 102 subjects. In addition, our data were for 2–13 loci. Many other cases featuring various total numbers of subjects, various numbers of loci, and various degrees of linkage disequilibrium should be examined. However, our data suggest that the frequencies of haplotypes with frequencies of >0.1 can be accurately estimated from pooled DNA data under some conditions.

We examined the variability of estimated frequencies by using different combinations of the original samples. This method was useful for estimating such variability, but, of course, it cannot be applied to the real pooled DNA data since, in the latter case, data for single subjects are not available. To evaluate variability in haplotype frequencies estimated from real pooled DNA data, we used the bootstrap method to calculate means and SEs. The SE for each haplotype frequency was consistent with the SD obtained from the frequencies estimated from different combinations of the DNA samples—that is, it was small when the haplotype frequency was >0.1. Therefore, our bootstrap method is likely to be useful for the estimation of variability in estimated haplotype frequency, although it features some limitations.

Note that the objective of the haplotype-frequency estimation is often not to estimate the proportions of the haplotypes in the sample but to estimate the population frequencies. Therefore, there are two different sources of inaccuracy for the haplotype frequency estimation: one is from the sampling of the haplotypes from the population, and the other is from the estimation using the sample data. The inaccuracy from the sampling is high when the sample size is small. Thus, the sample with a larger size represents more accurately the information in the population than a sample with a smaller size. Therefore, there are situations in which the analysis based on *MN-*subject data, from *N* pools composed of *M *subjects, is better than the analysis based on *N-*subject data, from *N *pools composed of a single subject.

Estimation of linkage-disequilibrium measures such as *D* and *D*′ was performed using pooled DNA data. Compared to the results of the haplotype-frequency estimation, variability in the estimated *D*′ value was large, especially when the minor-allele frequencies were <0.16 at one of the loci. When the minor-allele frequencies were high, the estimated *D*′ values obtained using two- or four-subject pools were in good agreement with the results of one-subject pools, and the SDs of the *D*′ values were rather small, compared with the means. In contrast, ρ^{2} was better than *D* and *D*′ when the data from a large number of linked loci were analyzed for the purpose of observing the gross pattern of linkage disequilibrium (unpublished data). The gross pattern of the linkage-disequilibrium measure ρ^{2} could be reliably reproduced when the two-subject protocol was used.
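The sensitivity of *D*′ to low minor-allele frequencies can be illustrated with a short calculation. The sketch below assumes the standard two-locus definitions, with ρ^{2} taken as the squared correlation between alleles at the two loci; the function name `ld_measures` and the numerical inputs are hypothetical and are not taken from the analyzed data sets.

```python
def ld_measures(h_ab, p_a, p_b):
    """Return (D, D', rho^2) for two biallelic loci.

    h_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    Assumes the standard definitions; D' keeps the sign of D.
    """
    d = h_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    rho2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, rho2


# The same error of 0.01 in the estimated haplotype frequency is applied
# at a common-allele pair and at a pair with a rare allele.
for p_a, p_b, h_ab in [(0.5, 0.5, 0.40), (0.05, 0.5, 0.04)]:
    base = ld_measures(h_ab, p_a, p_b)
    shifted = ld_measures(h_ab + 0.01, p_a, p_b)
    print(f"p_a={p_a}: D' {base[1]:.2f} -> {shifted[1]:.2f}, "
          f"rho2 {base[2]:.3f} -> {shifted[2]:.3f}")
```

Because the normalizing term of *D*′ becomes small when one allele is rare, a small error in the estimated haplotype frequency is magnified, which is consistent with the large variability observed at low minor-allele frequencies.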

Our LDPooled program has a limitation in that it cannot handle missing data, even though such a function has been included in other haplotype-estimation programs. The reason is that, for pooled DNA data, the number of possible events becomes too large to be handled by current machines. We are still working on the implementation of this function, but it will require substantial increases in computer memory and calculation speed.
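A rough count shows how quickly the number of candidate events grows. The sketch below is only an upper bound under stated assumptions: it counts every unordered combination of haplotype copies that a pool could contain when no genotype information constrains it, assuming biallelic loci. The function name `n_candidate_combinations` is hypothetical, and the count is not the quantity enumerated internally by LDPooled.

```python
from math import comb


def n_candidate_combinations(n_loci, subjects_per_pool):
    """Upper bound on unordered combinations of haplotype copies in a pool.

    Assumes biallelic loci (2**n_loci possible haplotypes) and 2 copies
    per subject; counts multisets of that size with repetition allowed.
    """
    n_haplotypes = 2 ** n_loci
    n_copies = 2 * subjects_per_pool
    return comb(n_haplotypes + n_copies - 1, n_copies)


for n_loci in (2, 6, 13):
    for subjects in (1, 2, 4):
        count = n_candidate_combinations(n_loci, subjects)
        print(f"loci={n_loci:2d} subjects/pool={subjects}: {count:.3e}")
```

Observed genotypes normally eliminate most of these combinations; when a genotype is missing, that constraint is lost at the affected loci, and the set to be enumerated moves toward this bound.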

Thus, although more data sets should be analyzed, our methods for the estimation of haplotype frequencies and linkage-disequilibrium parameters may be useful when genotype data from pooled DNA samples are available. The present method may also be useful when the haplotype is to be estimated in a sample in which DNA from more than one person is mixed. Such cases may occur in the fields of forensic medicine and archaeology. As shown, the posterior distribution of combinations of haplotype copies can be estimated given the estimated population haplotype frequencies. In the near future, the frequencies of the major haplotypes for each haplotype block (Gabriel et al. 2002) in ethnic groups may be determined. By use of such data, the haplotype copies in pooled DNA samples may be estimated.
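The posterior distribution of combinations of haplotype copies in a single pool can be sketched by direct enumeration. This is a minimal illustration under stated assumptions, not the LDPooled code: haplotype copies in a pool are assumed to be drawn independently from the population (so a combination has a multinomial probability), the pooled genotype is assumed to be an exact allele count at each locus, and the function name `pool_posterior` and the example frequencies are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod


def pool_posterior(freqs, allele_counts, n_copies):
    """Posterior over unordered combinations of haplotype copies in a pool.

    freqs: dict mapping a haplotype (tuple of 0/1 alleles) to its frequency
    allele_counts: observed count of allele 1 at each locus in the pool
    n_copies: number of haplotype copies in the pool (2 per subject)
    """
    target = tuple(allele_counts)
    weights = {}
    for combo in combinations_with_replacement(sorted(freqs), n_copies):
        observed = tuple(sum(h[i] for h in combo) for i in range(len(target)))
        if observed != target:
            continue  # inconsistent with the pooled genotype
        counts = Counter(combo)
        # Multinomial coefficient: number of orderings of this combination.
        orderings = factorial(n_copies) // prod(
            factorial(c) for c in counts.values()
        )
        weights[combo] = orderings * prod(
            freqs[h] ** c for h, c in counts.items()
        )
    total = sum(weights.values())
    if total == 0:
        return {}
    return {combo: w / total for combo, w in weights.items()}


# Hypothetical two-locus haplotype frequencies.
freqs = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

# One-subject pool (2 copies), heterozygous at both loci.
posterior = pool_posterior(freqs, allele_counts=(1, 1), n_copies=2)
for combo, p in posterior.items():
    print(combo, round(p, 4))
```

With known block haplotype frequencies for a population, the same calculation could in principle be applied to a mixed sample, although exhaustive enumeration of this kind is only practical for few loci and small pools.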

In summary, we have devised a new method to estimate haplotype frequencies, combinations of haplotype copies, and the *D*, *D*′, and ρ^{2} values from pooled DNA data, and we have implemented the algorithm in the computer program LDPooled. We have also used the bootstrap method to calculate SEs of the estimated frequencies. Although the frequencies of haplotypes can be estimated rather accurately when the frequencies are >0.1, the estimates for haplotypes with lower frequencies were not reliable, as shown by the large error bars calculated by the bootstrap method. Estimated *D* and *D*′ values exhibited large variation except when |*D*| values were >0.1. The gross pattern of the linkage-disequilibrium measure ρ^{2} may be reproduced using the two-subject pool protocol for numerous linked loci.

## Acknowledgment

The present study was supported by grants from the New Energy and Industrial Technology Development Organization.

## References

*APOE* region. Genomics 63:7–12

**American Society of Human Genetics**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (302K) |
- Citation

- Determination of probability distribution of diplotype configuration (diplotype distribution) for each subject from genotypic data using the EM algorithm.[Ann Hum Genet. 2002]
*Kitamura Y, Moriguchi M, Kaneko H, Morisaki H, Morisaki T, Toyama K, Kamatani N.**Ann Hum Genet. 2002 May; 66(Pt 3):183-93.* - Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data.[Am J Hum Genet. 2000]
*Fallin D, Schork NJ.**Am J Hum Genet. 2000 Oct; 67(4):947-59. Epub 2000 Aug 22.* - Characterisation of SNP haplotype structure in chemokine and chemokine receptor genes using CEPH pedigrees and statistical estimation.[Hum Genomics. 2004]
*Clark VJ, Dean M.**Hum Genomics. 2004 Mar; 1(3):195-207.* - A comprehensive literature review of haplotyping software and methods for use with unrelated individuals.[Hum Genomics. 2005]
*Salem RM, Wessel J, Schork NJ.**Hum Genomics. 2005 Mar; 2(1):39-66.* - Haplotyping methods for pedigrees.[Hum Hered. 2009]
*Gao G, Allison DB, Hoeschele I.**Hum Hered. 2009; 67(4):248-66. Epub 2009 Jan 27.*

- Resequencing of Pooled DNA for Detecting Disease Associations with Rare Variants[Genetic epidemiology. 2010]
*Wang T, Lin CY, Rohan TE, Ye K.**Genetic epidemiology. 2010 Jul; 34(5)492-501* - An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data[BMC Genetics. ]
*Kuk AY, Li X, Xu J.**BMC Genetics. 1482* - Maximum-parsimony haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA[BMC Bioinformatics. ]
*Jajamovich GH, Iliadis A, Anastassiou D, Wang X.**BMC Bioinformatics. 14270* - Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data[Molecular Biology and Evolution. 2013]
*Kessner D, Turner TL, Novembre J.**Molecular Biology and Evolution. 2013 May; 30(5)1145-1158* - Fast and accurate haplotype frequency estimation for large haplotype vectors from pooled DNA data[BMC Genetics. ]
*Iliadis A, Anastassiou D, Wang X.**BMC Genetics. 1394*

Estimation of Haplotype Frequencies, Linkage-Disequilibrium Measures, and Combination of Haplotype Copies in Each Pool by Use of Pooled DNA Data. American Journal of Human Genetics. Feb 2003; 72(2):384