# Partition-Ligation–Expectation-Maximization Algorithm for Haplotype Inference with Single-Nucleotide Polymorphisms

^{1}Department of Statistics, Harvard University, Cambridge, MA; and

^{2}Program for Population Genetics, Harvard School of Public Health, Boston

^{*}The first two authors contributed equally to this work.

*To the Editor:*

The mapping of SNPs in human genomes has generated a lot of interest from both the biomedical research community and industry. In conjunction with SNP mapping, researchers have shown that haplotypes possess considerably greater potential than the traditional single-SNP approach in disease-gene mapping and in our understanding of complex landscapes of linkage disequilibrium (LD) (Goldstein 2001). *In silico* methods for haplotype reconstruction have attracted much attention because of their cost-effectiveness and accuracy (Tishkoff et al. 2000) and have played an important role in the definition of human haplotype block structure and in candidate-gene studies of complex traits (Tabor et al. 2002). In a recent publication, Niu et al. (2002) proposed a partition-ligation (PL) strategy and implemented it together with Gibbs sampling, to estimate haplotype phases for a large number of SNPs. Although the resulting program, HAPLOTYPER, has been in high demand from many research groups, a significant portion of researchers are also strongly interested in using an expectation-maximization (EM)–based algorithm. In the present letter, we describe how to combine the PL strategy with the EM algorithm and how to handle the local-mode problem. We also present a fast and robust method of computing the variance of the estimated haplotype frequencies. Some related issues concern the handling of missing data and the multiple imputations of haplotype phases.

The EM algorithm is arguably the most popular statistical algorithm, because of its interpretability and stability. Compared with the Gibbs sampler, the EM approach is a deterministic procedure, requires less computing time, and makes convergence easier to check. The output of the EM algorithm, if not trapped in a local mode, is the maximum-likelihood estimate (MLE), which possesses well-established statistical properties. However, most EM-based approaches are restricted to approximately one dozen loci, because of memory constraints. A recently developed program, SNPHAP (see David Clayton's Web site [SNPHAP: A Program for Estimating Frequencies of Large Haplotypes of SNPs]), is an exception that, although different from the PL strategy, can handle many more linked loci by using a progressive-extension technique.

The essential steps of the PL strategy (Niu et al. 2002) are as follows: One first breaks down all of the marker loci into stretches of “atomistic” units and then uses either the EM algorithm or the Gibbs sampler to construct haplotypes for each unit and to rebuild the phase hierarchically, through a bottom-up approach. For example, an individual represented in the lipoprotein lipase (LPL) gene SNP data set (Nickerson et al. 1998) has the genotype (01200001000000000100010), where 0 stands for heterozygote and 1 and 2 stand for wild-type and mutant homozygotes, respectively. Since there are 18 heterozygous loci, the standard EM algorithm has to consider 2^{18} possible haplotypes, making it extremely costly for haplotype estimation. Using the PL strategy, we divide the linked loci into four “atomistic” units—(012000), (010000), (000001), and (00010)—and use the EM algorithm to estimate partial haplotypes within each unit. Afterward, two adjacent partial haplotypes are “ligated” by using the EM algorithm again, just like phasing two linked multiallelic markers. The ligation process is repeated until the complete phase is determined.
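The partition step for the LPL example above can be sketched in a few lines of Python. This is only an illustration: the function names `partition_genotype` and `count_heterozygous` are hypothetical and are not taken from the PL-EM program, and the unit sizes are simply those quoted in the text.

```python
def partition_genotype(genotype, unit_sizes):
    """Split a genotype string into consecutive atomistic units of the given sizes."""
    if sum(unit_sizes) != len(genotype):
        raise ValueError("unit sizes must add up to the number of loci")
    units, start = [], 0
    for size in unit_sizes:
        units.append(genotype[start:start + size])
        start += size
    return units


def count_heterozygous(genotype):
    """Count heterozygous loci, coded 0 in the 0/1/2 scheme used in the text."""
    return genotype.count("0")


genotype = "01200001000000000100010"  # LPL example individual from the text
units = partition_genotype(genotype, [6, 6, 6, 5])
het_per_unit = [count_heterozygous(u) for u in units]
print(units)         # the four atomistic units
print(het_per_unit)  # heterozygous loci within each unit
```

With at most five heterozygous loci per unit, each within-unit EM run for this individual deals with no more than 2^{5} = 32 candidate haplotypes, rather than the 2^{18} = 262,144 that arise when all loci are handled at once.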

It is well known that the EM algorithm can be trapped in a local mode. This problem becomes a more serious issue for the PL-EM strategy, because every atomistic haplotype construction or ligation step involves a complete EM algorithm implementation. A naive implementation of the ligation step considers only the partial haplotypes that have nonzero estimated frequencies in the previous EM step. However, it can happen that one phase configuration (and the corresponding haplotypes with nonzero estimated frequencies) is more likely when only a partial set of loci is examined, whereas a different configuration is more likely when all loci are taken into consideration. For example, consider the set of individuals with the following genotype data on four loci: (A/A A/A T/T T/T), (A/A A/A T/T T/T), (A/A G/G T/T T/T), (A/A G/G C/C C/C), (A/A G/G C/C C/C), and (A/G A/G T/T T/T). If just the first two loci are considered, then the EM algorithm estimates the haplotype frequencies as 7/12, 4/12, and 1/12, for (AG), (AA), and (GA), respectively. When all four loci are considered together, however, the EM algorithm gives rise to four haplotypes, (AATT), (AGCC), (AGTT), and (GGTT), with frequencies 5/12, 4/12, 2/12, and 1/12, respectively. Thus, had we thrown away the (GG) haplotype prematurely when only the first two SNP markers were analyzed, we would not have been able to reach the MLE.
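The six-individual example can be checked with a small, generic EM routine for haplotype frequencies. The sketch below is not the PL-EM implementation; it enumerates every compatible haplotype pair by brute force (feasible only for a handful of loci), starts from uniform frequencies, and runs a fixed number of iterations. The names `compatible_pairs` and `em_haplotype_freqs` and the iteration count are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a sequence of two-character strings, one per locus, e.g. "AG".
    """
    pairs = set()
    for flips in product((0, 1), repeat=len(genotype)):
        h1 = "".join(locus[f] for locus, f in zip(genotype, flips))
        h2 = "".join(locus[1 - f] for locus, f in zip(genotype, flips))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=200):
    """Plain EM for haplotype frequencies, started from uniform frequencies."""
    configs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for c in configs for pair in c for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in configs:
            # E-step: posterior weight of each compatible pair under current theta
            weights = [(1 if a == b else 2) * theta[a] * theta[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return theta


genotypes = [
    ("AA", "AA", "TT", "TT"),
    ("AA", "AA", "TT", "TT"),
    ("AA", "GG", "TT", "TT"),
    ("AA", "GG", "CC", "CC"),
    ("AA", "GG", "CC", "CC"),
    ("AG", "AG", "TT", "TT"),
]
two_locus = em_haplotype_freqs([g[:2] for g in genotypes])
four_locus = em_haplotype_freqs(genotypes)
```

Under these assumptions, the two-locus run drives the frequency of (GG) toward zero, whereas the four-locus run assigns (GGTT) a frequency of about 1/12, in agreement with the values quoted above.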

To overcome this difficulty, we devised a “backup-buffering” strategy during the ligation step. In brief, in addition to keeping in a buffer those partial haplotypes that have EM-algorithm–estimated frequencies greater than a threshold value (e.g., ε=10^{-5}), we also retain in the buffer some partial haplotypes whose estimated frequencies are below ε. The criterion for choosing such a backup partial haplotype is based on the rank of its average estimated frequency over all the EM iterations. The buffer size—that is, the total number of candidate partial haplotypes in a buffer—is kept as a constant in the PL process. Not surprisingly, our simulation study based on the cystic fibrosis data showed that, the larger the buffer size is, the more accurate the phasing results are (for details, see fig. A1 [online only, at J. S. Liu's Web site]).
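One possible reading of the backup-buffering rule is sketched below. The text does not spell out the details, so several points are assumptions: the function name `select_buffer` is hypothetical, haplotypes above the threshold are taken in order of estimated frequency, and if more of them exceed the threshold than the buffer can hold, only the most frequent ones are kept.

```python
def select_buffer(final_freq, avg_freq, buffer_size, eps=1e-5):
    """Pick partial haplotypes to carry into the next ligation step.

    final_freq: haplotype -> frequency estimated at the end of the EM run
    avg_freq:   haplotype -> frequency averaged over all EM iterations
    """
    # Haplotypes whose estimated frequency exceeds the threshold
    above = sorted((h for h in final_freq if final_freq[h] > eps),
                   key=lambda h: final_freq[h], reverse=True)
    kept = above[:buffer_size]
    # Backups: below-threshold haplotypes ranked by their average frequency
    backups = sorted((h for h in final_freq if final_freq[h] <= eps),
                     key=lambda h: avg_freq[h], reverse=True)
    kept.extend(backups[:buffer_size - len(kept)])
    return kept


# Illustrative numbers only: final frequencies echo the two-locus example,
# and the averages over iterations are made up for this sketch.
final_freq = {"AG": 7 / 12, "AA": 4 / 12, "GA": 1 / 12, "GG": 0.0}
avg_freq = {"AG": 0.57, "AA": 0.34, "GA": 0.07, "GG": 0.02}
buffer_four = select_buffer(final_freq, avg_freq, buffer_size=4)
buffer_three = select_buffer(final_freq, avg_freq, buffer_size=3)
```

With a buffer of size four, the (GG) partial haplotype survives as a backup even though its final estimated frequency is zero, which is what the four-locus example requires.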

Niu et al. (2002) observed a modest performance improvement when recombination hotspots were used as the partition sites. Recently, hotspot-detection algorithms, such as a greedy algorithm (Patil et al. 2001) and a dynamic programming approach (Zhang et al. 2002), have been developed. Our PL-EM program can incorporate the information revealed by such algorithms by allowing the user to specify desirable partition points (for details and download of the PL-EM program, see J. S. Liu's Web site). We also conducted an empirical study on the effects that different partition sizes, *K*, have when hotspot information is absent. Although little difference in phasing performance was observed when three different partition sizes were used (3–4, 5–8, or 9–16; see fig. A2 [online only, at J. S. Liu's Web site]), we found that the computation time increased sharply when the coarsest partition was used. Overall, *K* = 5–8 appeared to be a good choice for the atomistic unit size.

Several EM-based algorithms, including HAPLO (Hawley and Kidd 1995), Arlequin (Schneider et al. 2000), and the Mx program (Neale et al. 1999), provide variance estimates for the estimated haplotype frequencies. However, since these methods handle no more than ~20 loci, their variance-estimation methods cannot be directly used by the PL-EM program. Instead, we implemented in the PL-EM program a simple and robust approach to estimate the variances or SEs of the frequencies of those haplotypes that were selected at the final ligation stage.

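Let *Y* be the observed genotype data, *Z* be the missing phase information, and θ be the vector of haplotype frequencies. As noted by Louis (1982), the negative Hessian matrix of the observed-data log-likelihood with respect to θ can be computed via an identity analogous to the variance-decomposition rule,

$$
-\frac{\partial^{2}\log P(Y\mid\theta)}{\partial\theta\,\partial\theta^{T}}
= E\left[-\frac{\partial^{2}\log P(Y,Z\mid\theta)}{\partial\theta\,\partial\theta^{T}}\,\middle|\,Y,\theta\right]
- \operatorname{var}\left[\frac{\partial\log P(Y,Z\mid\theta)}{\partial\theta}\,\middle|\,Y,\theta\right], \tag{1}
$$

and the variance-covariance matrix of the MLE, $\hat{\theta}$, is the inverse of this matrix evaluated at $\hat{\theta}$. The first term on the right-hand side of equation (1) can be computed as

$$
E\left[\operatorname{diag}\left(\frac{n_{1}}{\theta_{1}^{2}},\ldots,\frac{n_{m-1}}{\theta_{m-1}^{2}}\right)
+ \frac{n_{m}}{\theta_{m}^{2}}\,\mathbf{1}\mathbf{1}^{T}\,\middle|\,Y,\hat{\theta}\right],
$$

where *m* is the number of all candidate haplotypes, $\theta_{m}=1-\sum_{i=1}^{m-1}\theta_{i}$ is determined by the $m-1$ free parameters, $\mathbf{1}$ is the vector of ones of length $m-1$, *n*_{i} is the number of occurrences of haplotype *i* in *Z,* and the expectation is taken for the *n*_{i} (which is a function of *Z*) with θ fixed at the MLE. The second term on the right-hand side needs the variance-covariance matrix of

$$
\left(\frac{n_{1}}{\theta_{1}}-\frac{n_{m}}{\theta_{m}},\;\ldots,\;\frac{n_{m-1}}{\theta_{m-1}}-\frac{n_{m}}{\theta_{m}}\right).
$$

The calculation of $\operatorname{cov}(n_{i},n_{j}\mid Y,\hat{\theta})$, for example, can be achieved by observing in each individual the probability of the joint occurrence of haplotypes *i* and *j.*

In the presence of many heterozygous loci, some rare haplotypes with very low frequencies are likely to occur. Then, the inversion of the Hessian matrix becomes computationally burdensome and numerically unstable. Since scientists are mostly concerned with the variance of each $\hat{\theta}_{i}$ instead of covariances among the $\hat{\theta}_{i}$s, we introduce a new, robust method of computing these marginal variances. Take $\operatorname{var}(\hat{\theta}_{1})$, for example: by applying equation (1) to a reparameterization of the model with θ^{′}=(θ_{1},1-θ_{1}) and θ^{′′}=(θ_{2},…,θ_{m})/(1-θ_{1}), we have

$$
-\frac{\partial^{2}\log P(Y\mid\theta)}{\partial\theta_{1}^{2}}
= \left[\frac{E(n_{1}\mid Y)}{\theta_{1}^{2}}+\frac{E\left(\sum_{i=2}^{m}n_{i}\mid Y\right)}{(1-\theta_{1})^{2}}\right]
- \frac{\operatorname{var}(n_{1}\mid Y)}{\theta_{1}^{2}(1-\theta_{1})^{2}}, \tag{2}
$$

evaluated at $\hat{\theta}$. Thus, $\operatorname{var}(\hat{\theta}_{1})$ is equal to the reciprocal of the above quantity. Note that the new method and Louis’s method give identical variance estimates if the inversion of the Hessian matrix (eq. [1]) is accurate. Intuitively, the first (bracketed) term on the right-hand side of equation (2) is the information that yields the standard variance estimate when there is no uncertainty in phasing, and the second term accounts for the loss of information because of unknown phases.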
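The marginal-variance calculation of equation (2) can be illustrated numerically. The sketch below is a reading of the reparameterization argument, not code from the PL-EM program: it takes the information for one haplotype frequency to be a complete-data term minus var(n | Y) divided by θ^{2}(1-θ)^{2}, and returns its reciprocal. The function name `marginal_variance` and the input format (one list of phase configurations with posterior probabilities per individual) are assumptions, and the posterior probabilities are assumed to have been evaluated at the MLE.

```python
def marginal_variance(posteriors, hap):
    """Approximate var(theta_hat) for one haplotype from per-individual posteriors.

    posteriors: one list per individual of ((hap_a, hap_b), probability) entries,
                with probabilities summing to 1 within each individual.
    Assumes the estimated frequency of `hap` lies strictly between 0 and 1.
    """
    total = 2 * len(posteriors)  # number of chromosomes in the sample
    exp_n = 0.0                  # E(n | Y) for this haplotype
    var_n = 0.0                  # var(n | Y); individuals are independent given Y
    for configs in posteriors:
        mean = sum(p * pair.count(hap) for pair, p in configs)
        second = sum(p * pair.count(hap) ** 2 for pair, p in configs)
        exp_n += mean
        var_n += second - mean ** 2
    theta = exp_n / total        # equals the MLE when the EM run has converged
    complete_info = exp_n / theta ** 2 + (total - exp_n) / (1 - theta) ** 2
    lost_info = var_n / (theta ** 2 * (1 - theta) ** 2)
    return 1.0 / (complete_info - lost_info)


# Six individuals with known phase: two carry (h, h) and four carry (x, x).
known = [[(("h", "h"), 1.0)]] * 2 + [[(("x", "x"), 1.0)]] * 4
var_known = marginal_variance(known, "h")

# Same sample, but one individual's phase is ambiguous with probability 0.5.
ambiguous = known[:5] + [[(("h", "x"), 0.5), (("x", "x"), 0.5)]]
var_ambiguous = marginal_variance(ambiguous, "h")
```

When every phase is known, the result reduces to the binomial variance θ(1-θ) divided by the number of chromosomes; with an ambiguous individual, the variance is larger than that binomial value, reflecting the information lost to unknown phase.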

An example of the SE calculation for estimated haplotype frequencies is shown, in table 1, for the LPL data from Nickerson et al. (1998). This example also illustrates that haplotypes can shed new light on population migration and admixture. To better understand the properties of the estimated SEs, we conducted a simulation study using the 12 distinct haplotypes from the β_{2}-adrenergic receptor (β_{2}AR) data set. Assuming that the 12 haplotypes have equal frequencies (1/12), we simulated 100 data sets, each consisting of 90 hypothetical individuals. The PL-EM algorithm was applied to each of the data sets, and a 95% CI for each θ_{i} was constructed on the basis of the estimated frequencies and SEs (i.e., $\hat{\theta}_{i}\pm 1.96\,\mathrm{SE}(\hat{\theta}_{i})$). The number of times (in 100 trials) that the 95% CI covered the true frequency (θ=1/12) for the 12 haplotypes was 92, 88, 93, 96, 97, 96, 88, 93, 94, 92, 95, and 94, which average to 93.2%. For the purpose of calibration, we note that the average coverage of the true θ was only 93.1% when the haplotype phase information was given.

The presence of a significant portion of missing genotypes is a common problem when a great number of linked loci are under investigation. This missing-data problem poses a serious challenge to the existing EM haplotype-inference algorithms, even when the total number of SNP loci is moderate. When both allele calls at one locus are missing, for example, all three different genotype configurations, (AA), (Aa), and (aa), have to be accounted for by the algorithm, which greatly inflates the space of candidate haplotypes. As a consequence, the standard EM algorithm not only needs a lot more memory but also converges much more slowly. The PL-EM algorithm resolves this difficulty seamlessly because of its adoption of the divide-conquer-combine strategy.
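The growth in candidate configurations caused by missing calls can be made concrete with a short sketch. It reuses the 0/1/2 genotype coding introduced earlier in the text; the `?` symbol for a locus with both allele calls missing and the function name `expand_missing` are assumptions made for illustration.

```python
from itertools import product


def expand_missing(genotype, missing="?"):
    """List every fully specified genotype compatible with the missing calls."""
    # Each missing locus may be heterozygous (0) or either homozygote (1 or 2)
    slots = [("0", "1", "2") if g == missing else (g,) for g in genotype]
    return ["".join(choice) for choice in product(*slots)]


one_missing = expand_missing("0?1")
three_missing = expand_missing("??0?")
print(one_missing)
print(len(three_missing))
```

Each missing locus multiplies the number of genotype configurations by three, so handling the loci in small units keeps this expansion local to the unit in which the missing call occurs.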

It often occurs that, for some individuals with a large number of heterozygous loci, numerous haplotype pairs (each with a nonzero probability) are compatible with their genotype data. In this case, generating all compatible haplotype phases with nontrivial probabilities is more desirable than outputting only the best phase. There is some evidence (X. Lu and J. S. Liu, unpublished data) showing that, by accounting for the phasing uncertainty, one can gain accuracy in LD mapping when using the algorithm BLADE (Liu et al. 2001; this algorithm employs a semi-hidden Markov model and a Markov-chain Monte Carlo method, for inference of the location of the disease mutation among a given set of linked markers with known genetic distances in a case-control setting). To accommodate this need for multiple-haplotype imputation, the PL-EM program lets the user choose to display either the top *f* most likely phases (if that many exist) for each individual or all phases with probabilities >0.1.
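The two reporting options described above can be sketched as follows. The function name `phases_to_report`, its parameters, and the example posterior probabilities are hypothetical and do not reflect the PL-EM program's actual interface.

```python
def phases_to_report(posterior, top_f=None, min_prob=0.1):
    """Select phase configurations for one individual.

    posterior: dict mapping a haplotype pair to its posterior probability.
    If top_f is given, return the top_f most likely phases (fewer if fewer exist);
    otherwise return every phase whose probability exceeds min_prob.
    """
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    if top_f is not None:
        return ranked[:top_f]
    return [(phase, p) for phase, p in ranked if p > min_prob]


# Made-up posterior for a single individual, for illustration only
posterior = {
    ("AATT", "GGTT"): 0.70,
    ("AGTT", "GATT"): 0.25,
    ("AATT", "GATT"): 0.05,
}
best = phases_to_report(posterior, top_f=1)
likely = phases_to_report(posterior)
```

Reporting several phases with their probabilities, rather than a single best guess, lets downstream analyses weight each configuration by its posterior probability.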

We evaluated the performance of PL-EM, HAPLOTYPER (Niu et al. 2002), and an enhanced version of PHASE (Stephens et al. 2001), using the angiotensin I–converting enzyme (ACE) data set, the β_{2}AR gene data set, the cystic fibrosis transmembrane conductance regulator (CFTR) gene data set, and data sets produced by coalescence model–based haplotype-simulation software (see the Long Lab's Web site [Tools: Statistical Analysis and Molecular Biology Tools]). All these data sets were constructed in the same way as described by Niu et al. (2002). The results are summarized in the left panels of figure 1. The PL-EM program’s error rate for individuals’ phasing is comparable to that of HAPLOTYPER but lower than that of PHASE in the first three cases, which is consistent with the studies described by Niu et al. (2002). For the coalescence simulation, PL-EM and HAPLOTYPER respectively made 35% and 11% more errors than PHASE. Note that Stephens et al. (2001) reported that the EM algorithm made ~100% more errors than PHASE, indicating that PL-EM performed significantly better than the standard EM algorithm when the coalescence assumption is appropriate.

**Figure 1** … (*open bars*) or the proportion of incorrectly inferred loci (*shaded bars*), for ACE (*A*), β_{2}AR (*B*), CFTR (*C*), and coalescence-simulation (*D*) data. For the ACE data, there …

To investigate further how the inference errors were made by the three algorithms, we looked into the following two aspects: (1) how the incorrectly inferred haplotypes differ from the true ones and (2) whether different algorithms made errors on the same individuals. For the first three data sets, PL-EM appeared to produce the fewest incorrectly inferred loci among the wrongly inferred haplotypes, whereas, for the coalescence-based simulated data, PL-EM and HAPLOTYPER respectively produced 36% and 7% more incorrectly inferred loci than did PHASE (fig. 1, *right panels*). In the first three cases, most of the errors made by HAPLOTYPER and PL-EM appeared to be a subset of the errors made by PHASE (see fig. A3 [online only, at J. S. Liu's Web site]).

In summary, the PL-EM algorithm can deal with a large number of linked loci that have moderate levels of LD. It is capable of variance estimation, multiple imputation, and the handling of incomplete genotype data. In addition, PL-EM was faster than HAPLOTYPER in these examples, even with the variance estimation. Hence, in practice, if a coalescence model for the population haplotypes is too strong to assume, then PL-EM can be an attractive alternative to HAPLOTYPER, further helping scientists in the haplotype-reconstruction endeavor.

## Acknowledgments

We are grateful to Chi-Hse Teng and the two anonymous reviewers for insightful comments. This research was supported in part by the National Science Foundation grants DMS-0094613 and DMS-0104129 and National Institutes of Health grant R01 HG02518-01.

## References

… β_{2}-adrenergic receptor haplotypes alter receptor expression and predict *in vivo* responsiveness. Proc Natl Acad Sci USA 97:10483–10488

**American Society of Human Genetics**


American Journal of Human Genetics. 2002 Nov; 71(5):1242
