Bioinformatics. 2010 Jun 1; 26(11): 1437–1445.
Published online 2010 Apr 20. doi:  10.1093/bioinformatics/btq157
PMCID: PMC2913665

Inferring combined CNV/SNP haplotypes from genotype data


Motivation: Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, the haplotypic phase of CNVs and SNPs is inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate.

Results: We generated diploid phase-known CNV–SNP genotype datasets by pairing male X chromosome CNV–SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset—a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets.

Availability: Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin

Contact: l.coin@imperial.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


Copy number variations (CNVs) are pervasive in the human genome (Feuk et al., 2006; Redon et al., 2006) and could play a key role in human diversity and disease susceptibility (Conrad et al., 2009; McCarroll and Altshuler, 2007). Despite this, the population genetics of CNVs, and of duplications in particular, remains relatively poorly understood. Several analytical tools that are standard in SNP-based population genetics, such as haplotype analysis, have yet to be adapted to complex multi-allelic CNVs.

Several technologies enable high-throughput CNV detection, including array comparative genomic hybridization (aCGH) and SNP genotyping arrays. Many algorithms have been proposed to detect CNV regions and to estimate the integer copy-number (CN) genotypes in each region using these technologies (Colella et al., 2007; Fiegler et al., 2006; Korn et al., 2008; Lai et al., 2005; Olshen et al., 2004; Wang et al., 2007). In particular, using SNP genotyping arrays to simultaneously produce estimates of integer CN and SNP genotype has become popular, particularly as a means to identify both SNPs and CNVs associated with disease. CNV association analyses are conducted either on estimates of integer CN genotype (Barnes et al., 2008; Korn et al., 2008), or using normalized continuous intensity data measurements (Barnes et al., 2008).

As a result of intensive genotyping efforts worldwide for genome-wide association studies, there are many datasets containing inferred CNV regions, CNV genotypes in these regions, as well as SNP genotypes. However, CNV–SNP haplotypes are rarely determined in these datasets, due largely to a lack of algorithmic development in this area. Hence, haplotype-based approaches that have been shown to be more powerful than single-marker analyses (Liu et al., 2007; Mailund et al., 2006; Su et al., 2008a) are not fully exploited in CNV association studies.

Apart from improving the sensitivity of association studies, CNV–SNP haplotypes are also invaluable for studying the evolutionary history of CNVs. In particular, many techniques for detecting positive selection rely on accurate phasing (Sabeti et al., 2007). Similarly, CNV–SNP phasing will improve the accuracy of estimates of linkage disequilibrium (LD) between SNPs and CNVs (Conrad et al., 2009; de Smith et al., 2008), particularly for multi-allelic CNVs. Identifying the haplotypic background(s) of a given CNV will also help to distinguish single from recurrent deletion/amplification events and will shed light on the age of the CNV (from the size of the extended haplotype containing the CNV).

Methods for inferring haplotypic phase from diploid genotypes are well developed and provide accurate inference of haplotypic phase (Browning and Browning, 2009, 2007; Kimmel and Shamir, 2005; Scheet and Stephens, 2006; Stephens and Scheet, 2005; Su et al., 2008a). For polyploid organisms, two phasing programs, SATlotyper (Neigenfind et al., 2008) and polyHap(v1.0) (Su et al., 2008b), have been proposed. By treating a CN region as a region of variable ploidy, polyHap(v1.0) can also be used for phasing CNV regions, provided that the ploidy is fixed for the entire genomic region under investigation (although it can differ between individuals). Thus, phasing complex CNV regions, in which each individual can have different CNV breakpoints, is a problem that has yet to be fully addressed.

To properly define what is meant by CNV haplotyping, we consider in Figure 1 how a CNV might arise in an ancestral genome, and subsequently be transmitted from one generation to the next. We consider separately the cases of deletion, local and dispersed duplication as illustrated in Figure 1. For both a deletion and local duplication, LD can accumulate between flanking SNPs and the CN state itself. Studies have estimated that the majority of common genotypeable CNVs are well tagged by SNPs (Conrad et al., 2009; de Smith et al., 2008). Exploiting this LD pattern to infer which haplotype contains the CNV is called non-internal phasing in our study. For bi-allelic SNPs within duplications, non-internal phasing also provides an estimate of allelic configuration—e.g. distinguishing two possible configurations AA/B and AB/A for a genotype AAB. On the other hand, the duplicated regions can themselves contain SNPs, which either arose prior to the duplication event, during the duplication event (through imperfect copying) or subsequently. We refer to exploiting the LD patterns within duplicated regions in order to identify the haplotypes comprising the duplication as internal phasing. The difference between internal and non-internal phasing—as applied to diploid genomes—is further illustrated in Figure 2. Non-internal phasing reconstructs haplotypes consisting of CNV–SNP alleles {−, A, B, AA, AB, BB, …} from diploid genotypes, but does not phase within duplicated alleles. Internal phasing, on the other hand, reconstructs all of the underlying SNP haplotypes present in the dataset, including those within duplicated regions. Thus, it considers diploid genomes as locally polyploid, where the ploidy is given by the maximum CN. MOCSphaser (Kato et al., 2008b) was the first program developed for inferring (non-internal) CNV–SNP haplotypes using an expectation maximization (EM) algorithm. 
However, it only accommodates CNs in CNV regions (it does not consider variant bases in these regions) and SNP genotypes in non-CNV regions. Another recently proposed non-internal phasing program for CNV haplotype inference, CNVphaser, employs EM and partition–ligation (PL) algorithms to infer haplotypic phase given identified CNV regions and CNs (Kato et al., 2008a).

Fig. 1.
Illustration of the process of forming a deletion and a local/dispersed duplication. The light grey box represents deleted or duplicated region. The deleted or duplicated region is transmitted over generations. The pre-existing SNPs (represented in white ...
Fig. 2.
Illustration of non-internal and internal phasing with a deletion and a single copy amplification. Non-internal phasing considers the genotypes as diploids and treats the duplication and deletion as extra different alleles, whereas, internal phasing considers ...

In this article, we describe an algorithm for both internal and non-internal CNV haplotype inference from CNV/SNP genotype data, which takes account of the shared haplotype structure between individuals in a population. Our method, polyHap(v2.0) extends the model of polyHap(v1.0) (Su et al., 2008b) to phase complex CNV regions by allowing arbitrary changes of CN within individuals and along the genomic sequence.

To investigate the effectiveness of our approach, we took SNP/CNV genotype data on male X chromosomes and randomly paired these into phase-known diploid and triploid haplotypes. We then investigated how well we could reconstruct the known phase and allelic configuration as well as infer missing CNV genotypes. The results show that our method provides accurate estimates of missing CNV genotypes, allelic configuration and haplotypic phase. We applied polyHap to a region on Chromosome 2 encompassing a short deletion. The results show that polyHap correctly detected a haplotype comprising this deletion.


Our method employs a hidden Markov model (HMM) to infer an ancestral haplotype for each haplotype at each marker, reflecting the idea that similar haplotypes are likely to have descended from the same ancestral haplotype. Assume we observe the genotypes, g = (g_1, g_2, …, g_M), at M SNPs for each individual. g_m = {g_m1, …, g_mN} is an unordered list of the individual's alleles at marker m, where N is the ploidy. For non-internal phasing, we infer haplotypic phase on diploid chromosomes, where N equals 2. For internal phasing, we consider genotypes as polyploids, where N is set to the maximum CN (ploidy) observed on the individual (Fig. 2). Also, s_m = {s_m1, …, s_mN} and s′_m = [s_m1, …, s_mN] are the unordered and ordered lists of ancestral haplotypes at marker m, respectively. We write π(s_m) = [s_mπ(1), …, s_mπ(N)] for a permutation of s_m, and Π(s_m) for the set of all such permutations. Thus, for example, if s_m = [1, 2] there are two permutations, namely [2, 1] and [1, 2], whereas if s_m = [1, 1] there is only one permutation.
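To make the permutation set Π concrete, the following minimal sketch enumerates the distinct orderings of a list of ancestral states. The function name `distinct_permutations` is illustrative and is not part of polyHap.

```python
from itertools import permutations

def distinct_permutations(states):
    """Return the distinct orderings of a list of states, in sorted order."""
    # permutations() treats equal elements at different positions as distinct,
    # so duplicates are collapsed with set() before sorting.
    return sorted(set(permutations(states)))

two_states = distinct_permutations([1, 2])   # two orderings
same_state = distinct_permutations([1, 1])   # a single ordering
```

For a list with repeated states the number of distinct orderings is smaller than N!, which is why the sums over permutations used later run over distinct orderings only.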

2.1 Emission probability

In our method, each allele is assumed to be descended from one of z ancestral haplotypes, which are the hidden states (haplotype states) in the HMM. The program first learns the ancestral haplotype structure from genotypes jointly for all individuals. Based on this structure, allelic configuration, missing CNV genotypes and CNV haplotypic phase are then inferred. This relationship between the allele and the haplotype hidden state is modelled by the emission probability. In this study, we allow a deletion and a single copy amplification. Thus, the set of possible alleles is {-, A, B, AA, AB, BB} underlying a diploid model when non-internal phasing is considered, while the set of possible alleles is {-, A, B} underlying a polyploid model for internal phasing.

First, we define the emission probability of each genotype given a haplotype state. Let θ_{m l_n}(h) denote the emission probability of allele h at marker m given the haplotype state l_n in a haploid model, where h ∈ {-, A, B, AA, AB, BB} for non-internal phasing and h ∈ {-, A, B} for internal phasing.

We first obtain the emission probability of an unordered list of haplotypes, given an unordered list of haplotype states {l_1, …, l_N}, by

[Equation (1): image not reproduced]

where θ_{m l_n}(h_π(n)) = p(h_π(n) | l_n).
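The displayed equation is not reproduced in this text. A form consistent with the definitions above sums, over the distinct orderings of the haplotypes, the product of the per-copy emission probabilities; the sketch below assumes that form. The names `unordered_emission` and `theta` and all numerical values are illustrative and are not taken from polyHap.

```python
from itertools import permutations

def unordered_emission(haps, states, theta):
    """Probability of an unordered haplotype list given a list of hidden states.

    Assumed form: sum over distinct orderings of `haps` of the product over
    copies n of theta[states[n]][allele placed on copy n].
    """
    total = 0.0
    for ordering in set(permutations(haps)):
        prob = 1.0
        for state, allele in zip(states, ordering):
            prob *= theta[state][allele]
        total += prob
    return total

# Made-up emission probabilities for two hidden states at one marker.
theta = {0: {"A": 0.9, "B": 0.1}, 1: {"A": 0.2, "B": 0.8}}
p_het = unordered_emission(["A", "B"], [0, 1], theta)  # 0.9*0.8 + 0.1*0.2
p_hom = unordered_emission(["A", "A"], [0, 1], theta)  # 0.9*0.2
```

Because the sum runs over all distinct orderings, the result does not depend on the order in which the hidden states are listed.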

For non-internal phasing, a given CNV genotype (e.g. AAB) may be consistent with more than one unordered haplotype pair (e.g. AA/B and AB/A). In this case, the observed data are represented as a probability distribution p*_m over unordered haplotype pairs (e.g. such that p*_m(AB/A) = p*_m(AA/B) = 0.5). We then write

[Equation (2): image not reproduced]

for the emission probability, using Equation (1) to calculate the terms in this sum. We note that a normal copy genotype, e.g. AA, is also consistent with two different unordered haplotype pairs, namely A/A as well as AA/-; however, we currently exclude the AA/- haplotype pair from our analysis. Equation (2) can also be used to accommodate uncertain CN genotypes, in which case p*_m reflects the probability of each CNV/SNP genotype as calculated by the CNV genotyping algorithm used. If g_m is missing, we set p*_m to be the uniform distribution over all CNV/SNP genotypes.
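As a hedged illustration of this weighted sum for the diploid (non-internal) case, the sketch below treats the genotype emission as a mixture, with weights p*, of the probabilities of the consistent haplotype pairs. The function names and the emission values are made up for illustration, and the exact form of the published equation is assumed rather than quoted.

```python
def pair_probability(pair, states, theta):
    """Probability of an unordered haplotype pair given two hidden states."""
    (h1, h2), (l1, l2) = pair, states
    if h1 == h2:
        return theta[l1][h1] * theta[l2][h2]
    # Two distinct orderings when the two alleles differ.
    return theta[l1][h1] * theta[l2][h2] + theta[l1][h2] * theta[l2][h1]

def genotype_emission(p_star, states, theta):
    """Mixture over the haplotype pairs consistent with the observed genotype."""
    return sum(w * pair_probability(pair, states, theta)
               for pair, w in p_star.items())

# Made-up emission probabilities over the non-internal allele set.
theta = {
    0: {"-": 0.0, "A": 0.2, "B": 0.1, "AA": 0.6, "AB": 0.1, "BB": 0.0},
    1: {"-": 0.0, "A": 0.3, "B": 0.5, "AA": 0.1, "AB": 0.1, "BB": 0.0},
}
# Genotype AAB: the two configurations are equally weighted a priori.
p_star_aab = {("AA", "B"): 0.5, ("AB", "A"): 0.5}
e_aab = genotype_emission(p_star_aab, (0, 1), theta)
```

An uncertain CN call would be handled in the same way by spreading the weights over the pairs of every candidate genotype, and a missing genotype by using uniform weights.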

2.2 Transition probability

We first briefly describe a basic transition model for internal phasing. We then introduce the extension of this model for non-internal phasing by considering the transitions between the CNs and between the haplotypes. In this extension, a given haplotype hidden state has a fixed CN and there can be multiple haplotype states underlying each CN. In this study, we use eight ancestral haplotype states for internal phasing and nine haplotype states for non-internal phasing of which one haplotype state has the underlying CN = 0 (deletion), four haplotype states have the underlying CN = 1 (normal copy) and four haplotype states have the underlying CN = 2 (a single copy amplification). Note that the CN states are the super states which categorize haplotype states according to their underlying CNs.

2.2.1 A basic haplotype transition model

First, we define the transition probability in an HMM from haplotype state k_n to l_n between markers m − 1 and m by

[Equation (3): image not reproduced]

where J_m is the probability of a jump occurring at marker m − 1, and α_{m l_n} is the probability that this jump results in the haplotype l_n. For tightly linked markers, J_m is small so that haplotype state changes occur infrequently, but are allowed between any pair of markers. Here, the parameter J_m is independent of the state and α_{m l_n} depends only on the l_n (the ‘to’) state.
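The jump model described here can be sketched as follows. The displayed equation is not reproduced in this text, so the form used below (stay in the current state with probability 1 − J, otherwise redraw the state from α, which may select the same state again) is an assumption based on the definitions above; the function name and values are illustrative.

```python
def basic_transition(k, l, jump, alpha):
    """Transition probability from state k to state l under the assumed jump model."""
    stay = 1.0 - jump if k == l else 0.0
    # With probability `jump` the state is redrawn from alpha, regardless of k.
    return stay + jump * alpha[l]

alpha = [0.5, 0.3, 0.2]   # made-up 'to'-state probabilities at one marker
row = [basic_transition(0, l, jump=0.1, alpha=alpha) for l in range(3)]
```

Each row of the resulting transition matrix sums to one as long as `alpha` does, and a small `jump` keeps most of the mass on the diagonal, matching the behaviour described for tightly linked markers.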

2.2.2 A modified haplotype transition model

We further modify this model to allow different models for the transition between CN states, for the transition between haplotype states that have the same CN state, and for the transition between haplotype states that have different CN states. To incorporate the CN state in the transition model, we introduce a hierarchical transition model: the first level is the transition between the CN states and the second is the transition between the haplotype states given the CN states (Fig. 3). The purpose of this model is to capture the favoured transitions between CNs.

Fig. 3.
Illustration of two levels of the transitions based on the haploid model. Each box represents the CN state and the numbers in the box are the assigned haplotype states. The first level of the transition (which is between the CN states) can be considered ...

The transition probability from haplotype state k_n to l_n is then the product of the transition probability between CN states and the transition probability between haplotype states given the CN states

[Equation (4): image not reproduced]

where c(l_n) and c(k_n) are the underlying CN states for haplotype states l_n and k_n, respectively; and c_i(l_n) and c_i(k_n) are the indices of haplotype states l_n and k_n within the CN states c(l_n) and c(k_n). Both transition probabilities (the two terms of the product in the equation) are calculated based on Equation (3) with different parameters.
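The displayed equation is not reproduced in this text. Read literally, the description above corresponds to a product of two factors, which can be written as follows; the symbols $t^{\mathrm{CN}}_m$ and $t^{\mathrm{hap}}_m$ are labels introduced here for the two levels and are not taken from the original.

```latex
t_m(k_n \to l_n)
  \;=\;
  t^{\mathrm{CN}}_m\bigl(c(k_n) \to c(l_n)\bigr)
  \;\times\;
  t^{\mathrm{hap}}_m\bigl(c_i(k_n) \to c_i(l_n) \,\bigm|\, c(k_n),\, c(l_n)\bigr)
```

The first factor moves between CN super states and the second selects a haplotype state within the destination CN state.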

In this modification, we allow the parameter J_m to depend on the k_n (‘from’) state, denoted J_{m k_n}, and α_m to depend on both the k_n and l_n (‘from’ and ‘to’) states, denoted α_{m k_n l_n}. To capture linkage disequilibrium between duplication states and flanking SNPs, we use Equation (3) with parameters J_{m k_n} and α_{m k_n l_n} to compute the transition probability between CN states, and between haplotype states when the transition occurs between different CN states. We use the basic transition model (in which J_m is independent of the state and α_{m l_n} depends only on the l_n state) to calculate the transition probability between haplotype states within the same CN state.

2.2.3 Polyploid transition model

We use the modified haplotype transition model for non-internal phasing and the basic haplotype transition model for internal phasing. Based on these transition models, the transition probability between unordered lists of haplotype states k = k_1, …, k_N and l = l_1, …, l_N at marker m is given by

[Equation (5): image not reproduced]
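The displayed equation is not reproduced in this text. One form consistent with the treatment of unordered lists elsewhere in this section sums, over the distinct orderings of the destination states, the product of single-copy transition probabilities. The sketch below assumes that form and uses a simple jump model with made-up parameters for the single-copy transitions; none of the names come from polyHap.

```python
from itertools import combinations_with_replacement, permutations

def single_transition(k, l, jump=0.1, alpha=(0.5, 0.3, 0.2)):
    """Single-copy jump model with illustrative parameters."""
    return (1.0 - jump if k == l else 0.0) + jump * alpha[l]

def unordered_transition(ks, ls, single=single_transition):
    """Assumed form: sum over distinct orderings of `ls` of prod_n single(k_n, l_n)."""
    total = 0.0
    for ordering in set(permutations(ls)):
        prob = 1.0
        for k, l in zip(ks, ordering):
            prob *= single(k, l)
        total += prob
    return total

# All unordered destination lists for N = 2 copies and three hidden states.
targets = list(combinations_with_replacement(range(3), 2))
row_sum = sum(unordered_transition((0, 1), ls) for ls in targets)
```

Summing over every unordered destination list recovers a total probability of one, which is a useful sanity check on any implementation of this step.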

2.3 The prior and computation

We use Dirichlet priors on all of our parameters. We let θ_{m l} ∼ Dirichlet(u_θ m_θ), where m_θ is the uniform vector with each element equal to 1/H (H is the size of the allele space), and α_{m ·} ∼ Dirichlet(u_α m_α), where m_α is the uniform vector with each element equal to 1/z (z is the number of ancestral haplotypes). We let J_m ∼ Beta(u_J(1 − e^{−d_m r}), u_J e^{−d_m r}), where d_m is the physical distance between consecutive markers and r = 10^−8 per base pair in the population, reflecting the background recombination rate. We use u_θ = u_α = 1 and u_J = 10^5 for initialization of the EM algorithm and u_θ = u_α = u_J = 0.1 for the maximization step.
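The prior on the jump probability can be illustrated numerically. The sketch below assumes the exponent is negative (exp(−d·r)), which is required for both Beta parameters to be positive; the function name, the marker spacing and the value of `u_j` in the example are arbitrary choices for illustration.

```python
import math

def jump_prior_parameters(distance_bp, u_j, r=1e-8):
    """Beta parameters (a, b) for the jump probability between adjacent markers."""
    # 1 - exp(-d * r), computed with expm1 for accuracy at small distances.
    p_jump = -math.expm1(-distance_bp * r)
    return u_j * p_jump, u_j * (1.0 - p_jump)

# Two markers 1.4 kb apart, with an arbitrary prior strength.
a, b = jump_prior_parameters(distance_bp=1400, u_j=100.0)
prior_mean = a / (a + b)   # equals 1 - exp(-1400 * 1e-8)
```

The prior mean is the recombination-based jump probability, and `u_j` only controls how strongly the prior is concentrated around it.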

Although our HMM has many parameters, approximate posterior mode estimates are readily obtained using the Baum–Welch algorithm, which is a form of the EM algorithm. The parameters in the model are updated at each step of the EM algorithm given the observed genotype data. The training process might converge to a local maximum of the likelihood function, which is a typical problem for the EM algorithm. To deal with this problem, we combine the results from 10 repetitions of the EM algorithm with different start values. In our model, a first-order Markov chain is employed to model ancestral haplotypes across the sequence. Thus, the number of EM iterations does not depend on the number of markers. The default number of iterations is 25 for each repetition of the training process; this can be changed in our parameter file.

After obtaining the estimates of parameters at each repetition, a specified number of haplotypes are sampled from the posterior distribution conditional on the genotype data of a given individual (Su et al., 2008b). Here, we obtain 100 samples for each repetition. The most likely haplotype is then inferred from all the sampled haplotypes across the 10 repetitions of the EM algorithm. The certainty rate of this estimate is the fraction of times it is sampled. Because we consider only a small number (e.g. 10) of local modes of the posterior distribution for the HMM parameters, the certainty value is not the probability of the imputed genotype under the model, which would require integration over the posterior distribution, but it may serve as a reasonable approximation to this probability.
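The certainty rate described above is a simple frequency. A minimal sketch, assuming each sampled haplotype configuration is represented as a hashable value such as a tuple (the function name is illustrative):

```python
from collections import Counter

def most_likely_with_certainty(samples):
    """Return the most frequently sampled configuration and its sampled fraction."""
    counts = Counter(samples)
    best, n_best = counts.most_common(1)[0]
    return best, n_best / len(samples)

# Pooled samples from all EM repetitions (toy data: 10 samples instead of 1000).
samples = [("AB", "A")] * 7 + [("AA", "B")] * 3
estimate, certainty = most_likely_with_certainty(samples)
```

With 10 repetitions of 100 samples each, the denominator in the study would be 1000 pooled samples per individual.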


In this section, we present the details of the simulation study to evaluate the performance of our method for inferring allele configurations, CNV–SNP haplotypes and missing CNV genotypes. We simulated phase-known datasets based on data obtained from French and Finnish population cohorts, respectively, with different technologies for obtaining the CN status and using different genotyping chips. The French dataset contains fewer samples but denser CNV–SNP genotypes, while the Finnish dataset contains more samples but less dense genotypes.

3.1 The French samples

We obtained data for X chromosomes from 48 males of northern French origin who were genotyped both on the Illumina 1M platform and 244K aCGH platform. The 244K aCGH chips, custom-designed for focussed investigation of putative CNV regions, provide information on the locations of CNV regions as well as CNs in these regions across the entire genome (de Smith et al., 2007). In aCGH, test and reference DNA samples, which are labelled differentially with fluorescent tags, are competitively hybridized to genomic arrays. The fluorescence ratio of test and reference hybridization signal is then determined at different positions along the genome, which provides information on the difference in CNs between test and reference samples.

CNV regions on non-pseudo-autosomal regions of the male X chromosome were identified from 244K aCGH chip data using the ADM2 algorithm developed by Agilent Technologies (Santa Clara, CA, USA), which recursively searches for CNV intervals based on log R ratios (LRRs) of fluorescent signals from probes between test and reference DNA samples (de Smith et al., 2007). A single sample from the Coriell Cell Repository (NA15510) was used as reference. The boundary and size of the CNV intervals are defined on the basis of the positions of the first and last array probes identified as lying within the CNV. The integer CN of the CN region was set to 0 if the average LRR of probes within the region was less than −0.5 (i.e. deletion) and was set to 2 otherwise (i.e. amplification on male haploid background). Haploid SNP genotypes in non-CNV regions were obtained from BeadStudio, using the Illumina 1M chip. Within amplified regions, two-copy SNP genotypes were estimated from a Gaussian mixture model using the B-allele frequency from BeadStudio. For this dataset, we analysed a 2.7 Mb non-pseudo-autosomal region of the X chromosome (151 881 226–154 588 828 bp based on NCBI build 36). This region has 1904 aCGH probes (equivalent to 1 probe for every 1.4 kb) and 1058 Illumina SNP probes.
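The thresholding rule for the integer CN can be written in a few lines. This sketch covers only the final step, applied to a region that has already been reported as a CNV interval (it does not reproduce ADM2); the function name and the example LRR values are illustrative.

```python
def call_male_x_copy_number(lrrs, threshold=-0.5):
    """Integer CN for a detected CNV interval on a haploid male X chromosome.

    Returns 0 (deletion) if the mean log R ratio of the probes in the interval
    is below the threshold, and 2 (amplification) otherwise.
    """
    mean_lrr = sum(lrrs) / len(lrrs)
    return 0 if mean_lrr < threshold else 2

deletion_cn = call_male_x_copy_number([-1.2, -0.9, -0.8])
amplification_cn = call_male_x_copy_number([0.4, 0.6, 0.5])
```

Regions outside detected CNV intervals are not passed to this rule and keep the haploid CN of 1.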

3.2 The Finnish samples

We also assessed our method using a larger dataset from the Northern Finland Birth Cohort (NFBC), from which we obtained non-pseudo-autosomal X-chromosome genotype data on 695 Finnish males assayed on Illumina Hap370 chips. aCGH data were unavailable for this cohort; however, Illumina's BeadStudio software generates the log ratio of observed to expected fluorescent signal intensity (LRR), as well as a normalized measure of relative signal intensity between the two SNP alleles, the B allele frequency (BAF), both of which can be used to detect CNV regions and infer CN genotypes (Colella et al., 2007; Wang et al., 2007). Haploid SNP genotypes were obtained from BeadStudio, while two-copy SNP genotypes within amplifications were obtained on the basis of BAF. For this dataset, we analysed a 20.9 Mb region on the X chromosome (19 502 220–40 491 848 bp based on NCBI build 36), which contains 2149 markers.

3.3 Simulation of phase-known genotypes

We randomly combined SNP/CNV genotypes on male X chromosomes into pairs to create diploid genomes with up to four copies for non-internal phasing (Fig. 2). We created 24 ‘non-internal’ phase-known diploid genomes in the French dataset and 347 genomes in the Finnish dataset. These samples were inappropriate for internal phasing as the ‘internal’ haplotypes comprising the amplifications on each X-chromosome copy are not known. Thus, to evaluate internal phasing, we masked X-chromosome amplifications, and randomly grouped these X chromosomes into 15/231 French/Finnish triploid genomes, so that we obtained internal+external phase-known SNP/CNV genotype data with up to three copies.
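The construction of phase-known genomes can be sketched as a random grouping of haploid sequences, followed by discarding the order of alleles at each marker. This is a minimal illustration under the assumption that every haplotype is a list of CNV/SNP alleles over the same markers; the function names and toy data are not from the study.

```python
import random

def group_haplotypes(haplotypes, ploidy=2, seed=0):
    """Randomly partition haploid sequences into groups of size `ploidy`."""
    rng = random.Random(seed)
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    n_groups = len(shuffled) // ploidy      # leftover haplotypes are dropped
    return [tuple(shuffled[g * ploidy:(g + 1) * ploidy]) for g in range(n_groups)]

def unordered_genotypes(genome):
    """Per-marker unordered allele lists; the grouped haplotypes remain the truth."""
    return [tuple(sorted(alleles)) for alleles in zip(*genome)]

# Toy male X haplotypes over three markers ('-' deletion, 'AB' duplication).
haps = [["A", "AB", "-"], ["B", "A", "A"], ["A", "B", "B"], ["B", "BB", "A"]]
genomes = group_haplotypes(haps, ploidy=2)
genotypes = [unordered_genotypes(g) for g in genomes]
```

Pairing 48 or 695 haplotypes in this way gives 24 or 347 diploid genomes, matching the counts reported above for non-internal phasing.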

3.4 Switch error rate

The switch error rate for each individual is defined as ψ/(n − 1), where n denotes the number of heterozygous sites for that individual and ψ the minimal number of switches needed to recover the true haplotypes. We assumed that at most one switch could occur between consecutive heterozygous sites.

For each individual, we determined if there was a switch by comparing the inferred haplotypes to the true haplotypes. If a discrepancy is identified at a heterozygous marker m, a switch error is counted and a switch is introduced in the inferred haplotypes to ensure that they match the true haplotypes up to marker m. To identify a discrepancy, it is only necessary to compare haplotype sets as far back as is needed to distinguish N distinct preceding haplotypes (N is the ploidy), which in diploids requires looking back only to the previous heterozygous marker.
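For the diploid case, the switch error rate can be computed by recording, at each heterozygous site, which inferred haplotype carries the allele of the first true haplotype, and counting how often this orientation changes. The sketch below is a diploid-only illustration that assumes the inferred genotype matches the true genotype at every site; it is not the polyploid procedure used by polyHap.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Diploid switch error rate psi / (n - 1) over heterozygous sites."""
    t1, t2 = true_pair
    h1, h2 = inferred_pair
    orientations = []
    for a, b, x, y in zip(t1, t2, h1, h2):
        if a == b:
            continue                      # homozygous site: no phase information
        if (x, y) == (a, b):
            orientations.append(0)
        elif (x, y) == (b, a):
            orientations.append(1)
        else:
            raise ValueError("inferred genotype differs from the true genotype")
    if len(orientations) < 2:
        return 0.0
    switches = sum(o1 != o2 for o1, o2 in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)

true_pair = (["A", "AB", "B", "-"], ["B", "A", "B", "A"])
inferred_pair = (["A", "A", "B", "A"], ["B", "AB", "B", "-"])
rate = switch_error_rate(true_pair, inferred_pair)   # one switch over two intervals
```

Counting orientation changes between consecutive heterozygous sites is equivalent to correcting the inferred haplotypes after each detected switch, as described above.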


4.1 Missing data imputation

We first examined the accuracy of our method for missing data imputation with both French and Finnish data. In each dataset, 5% and 10% of genotypes with one to four copies of alleles were set as missing at random, respectively. We report the proportion of missing genotypes for each CN that were estimated incorrectly (imputation error rate). Table 1 shows the imputation error rate in the French and Finnish datasets, respectively. Overall, our method provides accurate estimates of missing genotypes. For both missing rates (5% and 10%), our method gives an imputation error rate <0.09.

Table 1.
Error rate for estimation of missing genotype

4.2 Allelic configuration inference

We assess the performance of our method for inferring allelic configuration on a pair of haplotypes (such as AA/B versus A/AB). Table 2 presents the distribution of CNs observed on all markers and the error rate of estimated allele configurations. In the French data, there are 6317 and 754 3-CN and 4-CN genotypes of which 1075 and 94 are heterozygous, respectively. The allelic configuration is ambiguous for all of these heterozygous 3-CN and 4-CN genotypes (excluding genotypes AAAB and ABBB for which AA/AB and AB/BB, respectively, are the only possible configurations). The allelic configuration error rate amongst these ambiguous 3-CN and 4-CN genotypes is 0.119 and 0.0, respectively. In the Finnish data, the corresponding error rates are 0.016 and 0.188, based on 24 106 and 1572 heterozygous 3-CN and 4-CN genotypes.
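The ambiguity in allelic configuration can be made explicit by enumerating the unordered pairs of CNV/SNP alleles that are consistent with a genotype. The sketch below does this for the allele set {-, A, B, AA, AB, BB}; the function name and the `exclude_del_dup` flag (which drops pairs such as AA/-, mirroring the exclusion described in Section 2.1) are illustrative.

```python
from collections import Counter
from itertools import combinations_with_replacement

ALLELES = ["-", "A", "B", "AA", "AB", "BB"]

def configurations(genotype, exclude_del_dup=True):
    """Unordered allele pairs whose combined bases match the genotype string."""
    target = Counter(genotype)
    result = []
    for h1, h2 in combinations_with_replacement(ALLELES, 2):
        if exclude_del_dup and "-" in (h1, h2) and max(len(h1), len(h2)) == 2:
            continue                      # e.g. AA/- for a two-copy genotype
        if Counter((h1 + h2).replace("-", "")) == target:
            result.append((h1, h2))
    return result

ambiguous_3cn = configurations("AAB")     # A/AB and B/AA
unique_4cn = configurations("AAAB")       # only AA/AB
```

Heterozygous 3-CN genotypes and the 4-CN genotype AABB have two candidate configurations, whereas AAAB and ABBB have one, consistent with the exclusions noted above.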

Table 2.
The distribution of CN and error rate of estimation of allele configuration at heterozygous sites

4.3 Inference of haplotypic phase of CN state relative to flanking SNPs (non-internal phasing)

We assessed the performance of our method for haplotypic phase inference using the switch error rate. In this case, the CNV/SNP alleles consist of {−, A, B, AA, AB, BB} and we do not distinguish the order of alleles within an amplification. Hence, when calculating the switch error rate, homozygous 3-CN genotypes (A/AA or B/BB) are considered as heterozygous sites as the CNV/SNP alleles are different for each haplotype. Homozygous 4-CN genotypes (AA/AA or BB/BB) are still considered as homozygous sites. In calculating the switch error, we excluded the sites where the allelic configurations were incorrectly inferred.

For the French data, the overall switch error rate is 0.015. We then classified transitions by the ‘from’ and ‘to’ genotype CN (denoted by N1 → N2) to obtain the error rates in Table 3. The number of observed N1 → N2 transitions is shown in brackets. Overall, the switch error rates are <0.34, apart from two cases where the error rates are 0.57 and 0.41 at heterozygous sites with CNs 1 → 3 and 2 → 3, respectively. The reduced accuracy in the French data at such sites is due to the small number of observations, which moreover may not all occur at the same CN breakpoints. Accuracy for these CN transitions is improved by increasing the population size, as can be seen in the corresponding results for the Finnish data.

Table 3.
Switch error rate for non-internal phasing

Figure 4 shows the error rate for each transition in the Finnish data distributed according to certainty score. In general, an estimate with a higher certainty rate provides a more reliable inference. However, we observed a low proportion of estimates that have a high certainty rate (>0.9) in some cases, such as CNs 4 → 3. Supplementary Figure S1 shows that the level of LD between SNPs and CNVs, as measured by r^2, is inversely correlated with switch error.

Fig. 4.
Histogram of certainty scores and switch error rate in each bin from the Finnish dataset. The circles indicate average switch error rates within each histogram bin. The error bar of each switch error rate is based on a 95% equal-tailed Bayesian interval ...

To compare the results with those from CNVphaser and MOCSphaser, we chose three and eight sites in two different CNV regions from the French data. The maximum number of CNV sites used in the original CNVphaser article (Kato et al., 2008a) is eight. MOCSphaser could not be run on the eight site data because it ran out of memory on a 32 GB machine. We also attempted to run CNVphaser using the same number of sites as presented for polyHap in Table 3, but we found that the scale of our simulated dataset was not computationally feasible for CNVphaser.

CNVphaser and MOCSphaser both return a posterior probability distribution over possible haplotypes given the observed genotypes. We selected the haplotype with the highest probability as the inferred haplotype. We show the number of individuals whose genotypes are not correctly phased at any heterozygous sites in Table 4. The CNV genotypes at three sites are all correctly phased by both polyHap and CNVphaser/MOCSphaser. For the genotypes at eight sites, the results from polyHap show that only one individual has a single switch error over all the sites, while most of the inferred haplotypes from CNVphaser are incorrect. The allele configurations are incorrectly inferred in most heterozygous sites by CNVphaser.

Table 4.
Comparison between polyHap and CNVphaser/MOCSphaser

Previous studies have also used fastPhase and Beagle for CNV phasing (Conrad et al., 2009). This approach is limited to phasing bi-allelic CNVs relative to flanking SNPs not in CNV regions, which is achieved by recoding bi-allelic CNV genotypes as SNP genotypes. To compare our method to this approach, we removed multi-allelic CNVs from the Finnish dataset, and also masked SNPs within CNV regions, and finally encoded CNV genotypes as SNP genotypes. We then ran each of fastPhase/Beagle and polyHap on this dataset (Supplementary Table 1). Comparing with Table 3, we see that switch error rates for polyHap have markedly increased in most cases due to loss of information from masking SNPs. Comparing algorithms on the masked dataset (Supplementary Table 1), we see that polyHap and fastPhase had comparable switch error rates between SNPs and CNVs with different CN transitions, while Beagle had higher error rates on these CN transitions except for CN 3 → 2.

Finally, to test polyHap's accuracy on a non-simulated dataset, we phased a region on chromosome 2 containing a known short (<3 kb) deletion at 229.467 Mb, using data generated by a 244K Agilent array CGH chip (de Smith et al., 2007). A consistent haplotype including this deletion was detected (Supplementary Fig. S2). This deletion has been previously verified by polymerase chain reaction (PCR) across the breakpoints followed by sequencing (de Smith et al., 2007).

4.4 Inference of haplotypic phase of SNPs within CN states (internal phasing)

Internal phasing can be considered as a tool for further investigating haplotypic phase of duplicated alleles locally. Thus, we report the switch error rate at sites that have the same CN. Note that here we only consider up to a single copy amplification at the genotype level. Table 5 gives the switch error rate between a pair of consecutive heterozygous sites, which have the same CN. The count of each pair of CN is shown in parentheses. For both French and Finnish datasets, the error rates are ≤0.08 for locally inferring haplotypic phase of duplicated alleles.

Table 5.
Switch error rate for internal phasing with same CN


We have presented a method for inferring haplotypic phase for CNV/SNP genotype data among unrelated individuals. Our method allows CNV regions and ploidy to vary along the sequence and between the individuals. Our program accommodates both CNV and SNP genotype data and infers missing genotypes and haplotypic phase for both types of data. Our method allows uncertainty in the CN assignment by representing the CNV genotype as a probability distribution over multiple CNV genotypes.

It is necessary to calculate CNV genotypes before running our program. In particular, polyHap does not accommodate a continuous measurement in place of the integer CN genotypes. polyHap can, in principle, include an arbitrary maximum number of copies. However, the computational complexity scales roughly as (# copies)^2 for non-internal phasing and as e^(# copies) for internal phasing, so internal phasing is feasible for up to 6 copies and non-internal phasing for up to 20 copies. In addition, polyHap cannot model complicated structural rearrangements, including inversions and translocations.

polyHap requires a pre-defined number of ancestral haplotypes. In this study, we use eight ancestral haplotype states for internal phasing (two CN = 0 and six CN = 1 states) and nine haplotype states for non-internal phasing (one CN = 0, four CN = 1 and four CN = 2 states). We have also tried different numbers of ancestral haplotypes and found that the results are comparable. We would suggest using a higher number of ancestral haplotypes when dealing with rare variants. The choice of ancestral haplotype number usually does not depend on the sample size but rather on the number of haplotypes present in the population. Thus, if a very diverse, heterogeneous population or a mixture of several populations were being analysed, then it would be advisable to include more states.

The results from the simulation study demonstrate that our program provides accurate estimates of missing genotypes, allele configuration and haplotypic phase for both CNV and SNP data. Our method gives an imputation error rate <0.09 for imputing missing genotypes with one to four copies of alleles. Also, our method provides accurate estimates of allele configurations on a pair of haplotypes, with an error rate <0.19. Furthermore, polyHap successfully identified a haplotype comprising a short deletion on chromosome 2. Our method gives encouraging results for inferring CNV haplotypic phase over different CNs at heterozygous sites. Although there are several situations where the switch error rate is >0.3, this might result from rare haplotypes in the dataset, and the accuracy here would be improved by using a larger population sample. Also, reliable phase inferences can be distinguished using the uncertainty estimates. In general, a higher certainty rate indicates higher accuracy of the estimate.

polyHap outperforms two existing methods for phasing CNV-SNP haplotypes—CNVphaser and MOCSphaser—in terms of accuracy and capacity of dealing with large-scale datasets. Comparing our method with fastPhase/Beagle for phasing bi-allelic CNV, polyHap is comparable to fastPhase and gives more accurate estimates than Beagle in most cases of CN transitions. One advantage of our new method over fastPhase/Beagle for phasing CNV–SNP haplotypes is that polyHap is designed for inferring CNV–SNP haplotypes and is able to accommodate some properties of CNV that differ from SNPs and to deal with multi-allelic CNV.

Our program provides two different levels of CNV phasing: non-internal and internal. With internal phasing, the individual is considered as polyploid, and thus the phasing process is similar to that described for polyHap (Su et al., 2008b). Internal phasing enables inference of the duplicated and original haplotype, but does not say which chromosome copy contains the amplification. Non-internal phasing, on the other hand, provides information about which chromosome copy contains the CNV, but not the internal structure of duplications. By providing both options, our program enables the researcher to choose a suitable level of phasing for the specific purposes of the study.

Our method is faster than CNVphaser, and is feasible for genome-wide analyses using a computing cluster. The computing time for the French dataset (1106 markers on each of 24 individuals), with nine ancestral haplotype states and 10 repetitions of the EM-training algorithm, was ∼0.8 h on an 8 GB computer, while the Finnish dataset (2149 markers on each of 347 individuals) took 1.5 h on a 16 GB computer. The computing time increases linearly with the number of markers and individuals.

Modelling the haplotypic background of CNVs will provide a better understanding of the evolutionary processes affecting CNVs. Moreover, it will help us to better model CNV–phenotype associations—to make CNV–disease associations more robust by simultaneously identifying the underlying haplotype harbouring the CNV and to disentangle associations between CNVs and phenotype from associations with flanking SNPs.

Supplementary Material

[Supplementary Data]


We thank Rob Sladek for providing Illumina data and Adam de Smith for providing aCGH data. The DNA extractions, sample quality controls, biobank up-keeping and aliquotting for the NFBC were performed in the National Public Health Institute, Biomedicum Helsinki, Finland. Genotyping of the NFBC samples was supported by the National Institute of Mental Health.

Funding: Research Council UK fellowship (to L.J.M.C.); Genome Canada and Genome Quebec funded genotyping on the French samples; the NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 132797, and Center of Excellence in Complex Disease Genetics); University Hospital Oulu, Biocenter, University of Oulu, Finland; the European Community's Fifth/Seventh Framework Programme (EURO-BLCS, QLG1-CT-2000-01643, FP7/2007-2013); NHLBI grant 5R01HL087679-02 through the STAMPEED program (1RL1MH083268-01); ENGAGE project (HEALTH-F4-2007-201413); the Medical Research Council (centre grant G0600705); the Wellcome Trust (project grant GR069224), UK; the National Institute of Health Research (NIHR) Biomedical Research Centre Programme at Imperial College; the DNA extractions, sample quality controls, biobank up-keeping and aliquotting were performed in the National Public Health Institute, Biomedicum Helsinki, Finland, and supported financially by the Academy of Finland and Biocentrum Helsinki.

Conflict of Interest: none declared.


  • Barnes C, et al. A robust statistical method for case-control association testing with copy number variation. Nat. Genet. 2008;40:1245–1252.
  • Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 2009;84:210–223.
  • Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097.
  • Colella S, et al. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–2025.
  • Conrad DF, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2009;464:704–712.
  • de Smith AJ, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum. Mol. Genet. 2007;16:2783–2794.
  • de Smith AJ, et al. Small deletion variants have stable breakpoints commonly associated with Alu elements. PLoS ONE. 2008;3:e3104.
  • Feuk L, et al. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97.
  • Fiegler H, et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res. 2006;16:1566–1574.
  • Kato M, et al. An algorithm for inferring complex haplotypes in a region of copy-number variation. Am. J. Hum. Genet. 2008a;83:157–169.
  • Kato M, et al. MOCSphaser: a haplotype inference tool from a mixture of copy number variation and single nucleotide polymorphism data. Bioinformatics. 2008b;24:1645–1646.
  • Kimmel G, Shamir R. A block-free hidden Markov model for genotypes and its application to disease association. J. Comput. Biol. 2005;12:1243–1260.
  • Korn JM, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 2008;40:1253–1260.
  • Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770.
  • Liu J, et al. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies. PLoS Genet. 2007;3:0421–0430.
  • Mailund T, et al. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics. 2006;7:454.
  • McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat. Genet. 2007;39(Suppl. 7):S37–S42.
  • Neigenfind J, et al. Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT. BMC Genomics. 2008;9:356.
  • Olshen AB, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–572.
  • Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454.
  • Sabeti PC, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–918.
  • Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644.
  • Stephens M, Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 2005;76:449–462.
  • Su S.-Y, et al. Disease association tests by inferring ancestral haplotypes using a hidden Markov model. Bioinformatics. 2008a;24:972–978.
  • Su S.-Y, et al. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics. 2008b;9:513.
  • Wang K, et al. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–1674.

Articles from Bioinformatics are provided here courtesy of Oxford University Press