PLoS Comput Biol. Jan 2011; 7(1): e1001060.
Published online Jan 27, 2011. doi:  10.1371/journal.pcbi.1001060
PMCID: PMC3029233

Estimation of Parent Specific DNA Copy Number in Tumors using High-Density Genotyping Arrays

Scott Markel, Editor

Abstract

Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate only the total copy number, and thus do not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from the current high density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy neutral loss of heterozygosity, and the characterization of regions of simultaneous changes of both inherited chromosomes.

Author Summary

Many genetic diseases are related to copy number aberrations of some regions of the genome. Each chromosome normally has two copies. However, under some circumstances, for some regions, the copy number of one or both of the chromosomes changes. Genotyping microarray data provide the copy number of the two alleles of polymorphic sites along the chromosomes, which makes the inference of the copy number aberrations of the chromosome feasible. One difficulty is that genotyping microarray data cannot provide the haplotype of the two copies of a chromosome. In this paper, we model the copy number along the chromosome as a two-dimensional Markov chain. Using the observed copy number of both alleles at all the sites, we can determine the parent specific copy number along the chromosome as well as infer the haplotypes of the two copies of the inherited chromosomes in regions where there is allelic imbalance. Simulation results show high sensitivity and specificity of the method. Applying this method to glioblastoma samples from the Cancer Genome Atlas data illustrates the insights gained from allele-specific copy number analysis.

Introduction

DNA copy number aberrations (CNAs), defined as gains or losses of specific chromosomal segments, are an important type of genetic change in tumors. Various microarray based experimental platforms [1]–[7] have made possible the fine scale measurement of CNAs. Whereas the earlier platforms such as comparative genome hybridization arrays were designed to measure the total copy number of both inherited chromosomes, other platforms such as high density genotyping microarrays [6]–[8] can measure allele specific DNA quantity. For alleles that represent known variants of genes, it would be of biological interest to know which allele has undergone copy number change [9]. Also, some genetic mechanisms, such as gene conversion, mitotic recombination, and uniparental disomy, cause loss of heterozygosity (LOH) without change in total DNA copy number, and thus cannot be detected through conventional analysis methods relying only on total copy number. Even in the case where the total DNA copy number changes, it would be informative to know whether one or both of the inherited parental chromosomes are involved. Thus, to construct a more detailed molecular portrait of tumors, we need to distinguish between the underlying quantities of the two inherited chromosomes, which we call the parent specific copy numbers.

This paper addresses the problem of parent specific copy number estimation using allele-specific raw copy number data from high-density genotyping arrays. We will describe the data in more detail in the next section. Here, we clarify the differences between total copy number analysis and parent specific copy number analysis, and review the background of the computational treatment of this problem.

The genome of each somatic human cell normally contains two copies of each of the 22 autosomes, one inherited from each biological parent. At any genome location, one or both of these two chromosomes may gain or lose copies, thus creating a change in total copy number at that location. Microarray experiments for measuring total copy number produce a sequence of continuous valued measurements mapping to ordered locations along the chromosomes. Computational methods can be applied to segment this noisy sequence of measurements into regions of homogeneous copy number [10]–[21]; see Lai and Park [22] and Willenbrock and Fridlyand [23] for reviews. Since chromosomes are gained and lost in contiguous segments, the true total copy number should be piecewise constant. This is why change-point models and hidden Markov models have been very useful for total copy number estimation.

Total copy number estimates do not reveal which (or both) of the two inherited chromosomes have been gained or lost, and if a locus is polymorphic, which (or both) of the alleles have been affected. This information is now available in data produced by high density genotyping platforms, which give, at selected single nucleotide polymorphisms (SNPs), a bivariate measurement quantifying the two alleles, which we arbitrarily label A and B, as shown in the left panel of Figure 1. Some platforms output the total raw copy number (logR), which is the sum of A and B, and the B-allele frequency (BAF), which is the percentage of B allele raw copy number among the total allele raw copy number, i.e., B/(A+B). The logR quantifies the total copy number, while the BAF quantifies the imbalance between the two alleles. The right panel of Figure 1 shows logR, the sum of A and B allele intensities, and BAF. Unlike the total copy number, the allele-specific measurements are mixtures that depend on the unknown genotype at each location. For this reason, conventional change-point models cannot be applied to allele specific copy number estimation.
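The relationship between the two allele signals, their total, and the BAF can be illustrated with a minimal sketch. The function name `total_and_baf` and the use of NaN for SNPs with no signal are assumptions made for illustration; platforms typically report the total on a log scale relative to a reference, and that normalization is not shown here.

```python
import math

def total_and_baf(a, b):
    """Return (totals, bafs) computed from per-SNP A and B allele signals.

    Hypothetical helper: the total is the plain sum A + B and the BAF is
    B / (A + B), as described in the text. No log transform is applied.
    """
    totals, bafs = [], []
    for a_t, b_t in zip(a, b):
        total = a_t + b_t
        totals.append(total)
        # BAF is undefined when there is no signal at all; use NaN there.
        bafs.append(b_t / total if total > 0 else math.nan)
    return totals, bafs
```

Under this definition, a heterozygous SNP with equal allele signals has a BAF of 0.5, while homozygous SNPs have BAF values of 0 or 1, which is why the BAF sequence is a mixture that depends on the genotype.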

Figure 1
An example data sequence taken from a stretch of a TCGA glioblastoma sample (first 10000 SNPs of TCGA sample 02-0258 chromosome 2) assayed using the Illumina HumanHap 550k SNP array.

This problem can be formulated statistically as follows: The observed A and B intensities form a bivariate sequence whose underlying distribution undergoes abrupt changes. The distributions at each location are mixtures. The change-points, the mixture components, and the cluster memberships at each data point are all unknown and must be estimated from the data.

There has been much effort to extend existing genotyping and total copy number segmentation procedures to analyze allele-specific data. At the probe level, CNAT [24], CN5 [24], CRMA [25], dChipSNP [26], [27], PLASQ [28], and PICR [29] can be applied to Affymetrix data to produce allele-specific probe-set summaries at each SNP location. However, just as in the estimation of total copy number, the allele-specific intensities for adjacent SNPs should be smoothed to infer the underlying parent-specific copy numbers. LaFramboise et al. [28] first segmented the total copy number using Circular Binary Segmentation [30], and then estimated the parent-specific copy numbers for each segment. This early approach misses copy neutral loss-of-heterozygosity (LOH) events, defined as the simultaneous gain of one chromosome and balanced loss of the other chromosome resulting in loss of heterozygosity but no change in total copy number. Many other existing approaches rely on discrete-state hidden Markov models [27], [31]–[34], which are hidden Markov models assuming a pre-specified finite set of underlying states. For example, PennCNV [32] and QuantiSNP [33] assume that the underlying copy numbers belong to the integer classes [formula e015], and that the allele-specific copy numbers can be described by “generalized genotypes” AA, AB, BB, A-, B-, AAB, ABB, etc. While these types of models are very useful for detecting germline copy number variants in normal tissue, they do not generalize well to genetically heterogeneous samples. This is because by requiring a fixed set of pre-defined discrete states, they do not account for the heterogeneity of cells within the sample, which produces data with apparently fractional copy number changes rather than the idealized unit-copy changes. This is especially problematic for tumor samples, which are usually heterogeneous mixtures of cells with different genetic profiles. Through titration studies, Staaf et al. [35] showed that methods relying on idealized genotype states lose sensitivity when tumors are diluted with normal cells.

The fractional changes in tumors inspired recent approaches [35] that segment both the logR and BAF simultaneously. Since BAF is a mixture of homozygous and heterozygous SNPs, it cannot be processed using existing segmentation procedures. Current methods solve this problem through a pre-processing step that removes the homozygous SNPs. However, identifying the “homozygous SNPs” is nontrivial when the regions of CNA are unknown. Unless a matched normal sample is available, a segmentation procedure that simultaneously genotypes each SNP while inferring the underlying parental copy numbers is desirable.

In light of these recent developments, we need a systematic stochastic model for parent specific copy number which can accommodate fractional copy number changes. We propose a general two-chromosome hidden Markov model for this problem. The hidden states of the model represent the copy numbers of each of the two inherited chromosomes, and take values in the continuous space of real numbers. Thus, unlike discrete state space HMMs, this model is not limited to idealized unit-copy changes. Computationally efficient fitting algorithms are given that scale well to data obtained from the current high density genotyping arrays. The estimation procedure based on the two chromosome model, which we call Parent-Specific-Copy-Number (PSCN), extends the framework developed in Lai et al. [37] for total copy number analysis.

After segmenting the genome into regions of constant parent-specific copy number, we identify, for each region, whether both or only one of the parental chromosomes have changed copies. We also determine, in regions containing simultaneous gain of one chromosome and loss of the other, whether the changes are balanced. Thus, we classify the regions into six different types of aberrations depending on the status of the two parental chromosomes: gain of both chromosomes (gain/gain), gain of only one chromosome (gain/normal), gain of one chromosome and balanced loss of the other chromosome (balanced gain/loss), gain of one chromosome and unbalanced loss of the other chromosome (unbalanced gain/loss), loss of only one chromosome (normal/loss) and loss of both chromosomes (loss/loss). To our knowledge, this is the most detailed classification available among methods for allele-specific analysis. The PSCN method outputs the copy number for both chromosomes in each segment.

We evaluate the accuracy of the proposed procedure on a series of simulated tumor titration data provided by Staaf et al. [35], as well as a new set of simulation data containing a larger variety of chromosomal aberrations. We then apply the new approach to 223 glioblastoma samples from the Cancer Genome Atlas project [38], and illustrate through case studies some of the insights gained from an analysis of allele-specific data.

Results

The Two Chromosome Hidden Markov Model

Let [formula e016] be the allele-specific signals for alleles A and B at [formula e017] SNPs ordered by their locations in a reference genome. The way of obtaining [formula e018] depends on the experimental platform (see “Data Transformation” in Methods). Our goal is to infer the quantities of the parent specific copy numbers, which we denote by [formula e019]. By parent-specific, we distinguish between the chromosomes inherited from the two parents, which we treat as exchangeable and do not label as maternal or paternal. Let [formula e020] be the configuration at SNP [formula e021] specifying the alleles carried by the inherited chromosomes. Let [formula e022] be the true copy numbers of alleles A and B at SNP [formula e025]. The relationship between [formula e026], [formula e027], and [formula e028] is shown in Table 1.

Table 1
Relationship between the inherited allele configuration [formula e029] and the true allele specific copy numbers [formula e030].

Note that when a somatic event causes a change in copy number of one or both parental chromosomes at SNP [formula e046], the allele-specific copy numbers [formula e047] change, but [formula e048] remains fixed. For example, if the inherited genotype is [formula e049], and if [formula e050] is amplified two-fold, then the true copy number of allele [formula e051] would also be amplified two-fold, but [formula e052] would still be [formula e053]. The observed allele specific signals [formula e054] are assumed to be equal to the true allele specific quantities plus an independent measurement error,

equation image
(1)

where [formula e056] and [formula e057] are state specific error covariance matrices. The model that relates [formula e058] to [formula e059], [formula e060] and [formula e061] is illustrated in Figure 2.

Figure 2
Overview of the stochastic segmentation model.

To model the gains and losses of the two inherited chromosomes, we assume that [formula e072] is a Markov jump process with state space [formula e073]. Conceptually, each time [formula e074] jumps, it can choose between two states: the normal state (one copy each of maternal and paternal chromosome), where [formula e075] must assume a known baseline value [formula e076], or the variant state, where [formula e077] picks a new random value from the bivariate Gaussian [formula e078]. The prior mean [formula e079] and prior covariance [formula e080], along with the other hyperparameters of the prior, will be estimated by maximum likelihood. To allow the possibility of the copy number changing from a variant state to a different variant state, for example, [formula e081] to [formula e082], we technically need two identically distributed variant states in our formulation of the Markov chain. Hence we let the states be [formula e083]. Then, the dynamics of the Markov model can be described by the transition matrix

equation image
(2)

The matrix [formula e085] specifies that if [formula e086] is in the normal state at SNP [formula e087], then at SNP [formula e088], [formula e089] stays in the normal state with probability [formula e090], or jumps to a variant state with probability [formula e091]. If [formula e092] is in a variant state, then at SNP [formula e093], it would stay at the same variant state with probability [formula e094], or jump to a different variant state with probability [formula e095], or jump back to the normal state with probability [formula e096]. One can verify that this formulation of the Markov chain, with one baseline state and two variant states, allows for a model with a baseline state and generic “variant” states as desired. This model extends the one used for the analysis of total copy number in Lai et al. [37]. This Markov chain has the stationary distribution [formula e097]. The three-state Markov chain with transition probability matrix [formula e098] and initialized at the stationary distribution is reversible, which provides substantial simplification for the estimation of [formula e099]. Practically, the reversibility of the Markov model implies that we would obtain the same segmentation going from right to left as we do going from left to right. Biologically, this seems logical, as there is no known directionality of copy number aberration events.
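The reversibility property can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the paper's parametrization: the symbols `p`, `a`, `b`, `c` and the split of the jump probability equally between the two variant states are hypothetical choices that are merely consistent with the verbal description of the three-state chain (one normal state, two identically distributed variant states).

```python
def transition_matrix(p, a, b, c):
    """Three-state chain over (normal, variant 1, variant 2).

    Assumed (hypothetical) parametrization:
      p : probability of leaving the normal state, split equally
          between the two variant states
      a : probability of staying in the same variant state
      b : probability of jumping to the other variant state
      c : probability of returning to the normal state
    """
    if abs(a + b + c - 1.0) > 1e-12:
        raise ValueError("a + b + c must equal 1")
    return [
        [1.0 - p, p / 2.0, p / 2.0],
        [c, a, b],
        [c, b, a],
    ]

def stationary_distribution(P, iterations=10000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def is_reversible(P, pi, tol=1e-9):
    """Check detailed balance: pi[i] * P[i][j] == pi[j] * P[j][i]."""
    n = len(P)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )
```

For this assumed parametrization, detailed balance holds with stationary weights c/(p+c) on the normal state and p/(2(p+c)) on each variant state. The fixed iteration count is adequate for the small example used here but would need a convergence check for slowly mixing chains.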

We assume that the inherited allele configurations [formula e100] are independent multinomial with prior parameters

equation image

which can be obtained from the genotyping data of a set of normal control samples. Note that [formula e102] and [formula e103] cannot be distinguished in normal samples, so we can set [formula e104] and [formula e105] to one-half of the proportion of heterozygotes for SNP [formula e106]. When these figures are not available, we have found that a uniform prior usually works reasonably well. This is because the main purpose of the model is to estimate the parent-specific copy numbers, with [formula e107] as surrogate information. With the large number of data points obtained from the high density arrays, the posterior for the parent-specific copy numbers is usually quite insensitive to the prior on [formula e108]. Note that some platforms, such as the Affymetrix 6.0 array, have markers that are non-polymorphic copy number markers rather than SNP markers. For those markers, the prior for [formula e109] can be set to [formula e110]. In this way, the posterior will always remain at [formula e111] and only the total copy number information at these markers would contribute to the overall segmentation.

Note that this model contains many assumptions, including Gaussianity of the allele specific intensities and Markovicity of the underlying copy number states. These assumptions allow fast and explicit analytic formulas to be derived, thus avoiding the need for Monte Carlo based estimates. For most platforms, the allele-specific intensities deviate from Gaussianity, despite careful normalization. Also, there has never been proof that chromosomal breakages are Markovian. These assumptions are made for modeling convenience, just as in the total-copy number estimation problem [11], [16], [30], [37]. It is reassuring that the estimation method is robust to deviations from both the Gaussian and Markov assumptions, as we show using the titration data from Staaf et al. [35] and through our own spike-in studies.

Our primary objective is to estimate the parent specific copy numbers [formula e112], which depend on the observed signals through the unobserved inherited allele configurations [formula e113]. Let [formula e114] and [formula e115] be the set of all possible realizations for [formula e116] and [formula e117], respectively. We describe below an iterative algorithm to estimate [formula e118] and [formula e119].

Allele-specific iterative smoothing

Fix stopping threshold [formula e120]. Initialize [formula e121] and [formula e122] through an initial 4-group clustering of [formula e123]. Repeat:

  1. Expectation step: Given [formula e124], set [formula e125] to its posterior mean
    equation image
    (3)
    Computationally efficient formulas for (3) are given in Methods.
  2. Maximization step: Given [formula e127], set [formula e128] to its maximum a posteriori value
    equation image
    (4)
    This can be done easily because given [formula e130], [formula e131] is a four-component mixture of Gaussians at each [formula e132], and [formula e133] is simply the identifier for each mixture component. The exact formula for (4) is given in Methods.
  3. If [formula e134], stop and report [formula e135], [formula e136]. Otherwise, set [formula e137] and go back to step 1.

In each iteration of the above algorithm, the expectation step estimates [formula e138] by its posterior mean given the data and the current estimate of the configuration states [formula e139]. Then, [formula e140] is set to its posterior mode given the data and the current estimate of [formula e141]. Computationally efficient forward-backward equations for (3) and formulas for (4) are given in Methods, where we also describe an expectation maximization procedure for estimating the hyperparameters [formula e142], and [formula e143] from the data, so that they do not need to be specified a priori.
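The maximization step can be illustrated with a minimal per-SNP sketch. Several things here are assumptions made for illustration rather than the paper's formulas: the four configuration labels and the way each one maps the two parent-specific copy numbers to expected A and B copies (the contents of Table 1 are not reproduced in this text), the single shared isotropic noise level `sigma` in place of the state specific covariance matrices, and the names `expected_allele_copies` and `map_configuration`.

```python
CONFIGS = ("AA", "AB", "BA", "BB")

def expected_allele_copies(theta1, theta2):
    """Expected (A, B) copies under each assumed configuration.

    Assumption: the first letter is the allele on chromosome 1 and the
    second letter the allele on chromosome 2.
    """
    return {
        "AA": (theta1 + theta2, 0.0),
        "AB": (theta1, theta2),
        "BA": (theta2, theta1),
        "BB": (0.0, theta1 + theta2),
    }

def map_configuration(y, theta, log_prior=None, sigma=0.2):
    """Pick the configuration with the highest posterior score at one SNP.

    y         : observed (A, B) signal pair
    theta     : current estimate of the two parent-specific copy numbers
    log_prior : optional dict of log prior weights per configuration
    sigma     : assumed isotropic measurement noise (hypothetical value)
    """
    y_a, y_b = y
    means = expected_allele_copies(*theta)
    best, best_score = None, float("-inf")
    for config in CONFIGS:
        m_a, m_b = means[config]
        # Gaussian log-likelihood up to a constant shared by all components.
        score = -0.5 * ((y_a - m_a) ** 2 + (y_b - m_b) ** 2) / sigma ** 2
        if log_prior is not None:
            score += log_prior[config]
        if score > best_score:
            best, best_score = config, score
    return best
```

When the two parent-specific copy numbers are equal, the AB and BA components coincide and the sketch returns whichever comes first, which reflects the fact that the two heterozygous configurations are only distinguishable in regions of allelic imbalance.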

The above algorithm returns a soft segmentation of [formula e144] in the form of a Bayesian estimate [formula e145] for the parent specific copy numbers at each location. A hard segmentation is sometimes desirable, for example, to give a sparse representation of the data. A hard segmentation can be obtained from the soft segmentation as follows: Compute for each [formula e146] the one-step Euclidean distance [formula e147]. Estimate the change-points to be the locations where [formula e148] are larger than the threshold, with the constraint that they must be separated by a pre-chosen minimum number of SNPs (e.g. 20). The segmentation algorithm starts with the set [formula e149] containing only the end points of the sequence. Change-points are added recursively to the set by maximizing [formula e150] under the separation constraint, until no more change-points can be added. We start with a low threshold for [formula e151] allowing some false positives, with most of the false positives eliminated by a subsequent Wilcoxon Rank-Sum test (p-value threshold of [formula e153]) that combines adjacent segments with no significant difference in mean. We found this to be more accurate than a one-step procedure using a more stringent threshold on [formula e154].
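The first stage of the hard segmentation, before the rank-sum merging, can be sketched as a greedy selection on the one-step distances. This is a simplified illustration: the function name `greedy_change_points`, the convention that a change-point is the index where a new segment starts, and the threshold value used in any call are assumptions, and the subsequent Wilcoxon Rank-Sum merging step is not included.

```python
import math

def greedy_change_points(theta_hat, threshold, min_sep=20):
    """Select change-points from a soft segmentation.

    theta_hat : sequence of (copy1, copy2) estimates, one pair per SNP
    threshold : minimum one-step Euclidean distance to call a change
    min_sep   : minimum number of SNPs between change-points (and ends)
    Returns the sorted indices at which a new segment starts.
    """
    n = len(theta_hat)
    # d[t] is the distance between the estimates at SNP t and SNP t + 1.
    d = [math.dist(theta_hat[t], theta_hat[t + 1]) for t in range(n - 1)]
    boundaries = [0, n]  # start with only the end points of the sequence
    # Visit candidates from largest to smallest distance (stable on ties).
    for t in sorted(range(n - 1), key=lambda t: -d[t]):
        if d[t] <= threshold:
            break  # all remaining candidates are below the threshold
        cp = t + 1
        if all(abs(cp - other) >= min_sep for other in boundaries):
            boundaries.append(cp)
    return sorted(boundaries)[1:-1]
```

Because the set of accepted change-points only grows, a candidate rejected for violating the separation constraint can never become admissible later, so one pass in decreasing order of distance gives the same result as repeatedly maximizing over the admissible locations.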

Identifying the Type of Aberration

The segmentation divides the genome into regions where the copy numbers of the two inherited chromosomes are constant. It is often useful to know, for each region, whether the copy numbers of one or both parental chromosomes deviate from the normal level. This involves classifying each region into one of the following six types of chromosomal change: gain/gain, gain/normal, balanced gain/loss, unbalanced gain/loss, normal/loss and loss/loss.

For each segmented region, we define the major copy number to be the normalized raw copy number of the more abundant chromosome, and the minor copy number to be the normalized raw copy number of the less abundant chromosome. If the two chromosomes have equal copy numbers, then the major and minor chromosome labels are assigned arbitrarily. The major and minor copy numbers are estimated after the hard-segmentation using a mixture model on the heterozygous SNPs in each region (which can be identified using [formula]). Then, a [formula]-test is used to compare the estimated major and minor copy numbers of each region to the estimated allele copy number of the normal level in the unchanged segments. The Bonferroni correction is used to adjust for multiple testing. The technical details are given in Methods. This procedure allows us to discover and distinguish all six types of CNAs.
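
As a rough illustration of how a pair of major and minor copy number estimates maps onto the six types, the sketch below replaces the statistical test and Bonferroni correction with a fixed tolerance. The function name `classify_region`, the tolerance `tol`, and the normal level of 1.0 per chromosome are assumptions made for this example only.

```python
def classify_region(major, minor, normal=1.0, tol=0.1):
    """Label a region from its major/minor copy number estimates.

    A fixed tolerance stands in for the significance test used in the
    paper (an illustrative simplification, not the authors' procedure).
    """
    if major < minor:
        major, minor = minor, major  # major is the more abundant chromosome

    def status(x):
        if x > normal + tol:
            return "gain"
        if x < normal - tol:
            return "loss"
        return "normal"

    hi, lo = status(major), status(minor)
    if (hi, lo) != ("gain", "loss"):
        # gain/gain, gain/normal, normal/normal, normal/loss or loss/loss
        return hi + "/" + lo
    # Gain of one chromosome and loss of the other: balanced when the two
    # deviations from normal have equal magnitude (copy neutral LOH).
    balanced = abs((major - normal) - (normal - minor)) <= tol
    return "balanced gain/loss" if balanced else "unbalanced gain/loss"
```

For example, estimates of (2, 0) give a balanced gain/loss, while (3, 0) give an unbalanced gain/loss.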

An additional caveat is that when both parental chromosomes carry the same haplotype, a balanced gain/loss would be called if the region were long enough. Without matched data from normal tissue, it is impossible to distinguish with certainty between inherited and somatic LOH. However, we rely on the fact that long regions of LOH are infrequent, and thus the minor allele frequency of SNPs and the linkage disequilibrium between them can be used to conduct a test for the probability that an inherited LOH appears by chance. This haplotype correction only takes care of the unique common haplotypes, i.e., when a region is dominated by one haplotype. If a haplotype is not common in that region, or if there are several haplotypes in that region, this test loses sensitivity. In this case, paired normal cell information would be useful. More details are given in Methods.

Results on Simulated Dilution Data from Staaf et al. [35]

Staaf et al. [35] performed a systematic comparison of existing methods for allele-specific copy number estimation. They created a simulated dilution data set based on experimental 550k Illumina data for HapMap sample NA06991. To the diploid HapMap sample, ten regions of aberrant copy number were added at increasing fractions to mimic a tumor sample that is contaminated with normal cells. Here, a given percentage of normal cell contamination means that this fraction of the sample consists of normal cells and the remainder consists of tumor cells. The aberrant regions vary by type and length, and represent regions of hemizygous gains and losses and copy neutral LOH. Since the locations of the true aberrant regions are known, the specificity and sensitivity of the detection methods can be evaluated.

We applied PSCN, the R package we developed based on our method, to this dilution data set and compared it with existing approaches in an analysis that parallels the insightful analysis in Staaf et al. [35]. The sensitivity and specificity of results from PSCN at varying contamination ratios are shown in Figures 3 and 4, overlaid onto plots reproduced from Staaf et al. [35]. In order to compare with the sensitivity analysis of other models done in the paper by Staaf et al. [35], we define a “correct detection” to mean that a true CNA region has been called, but do not require that the type of CNA (e.g. gain/loss, normal/loss) has been correctly identified. All the other current procedures only categorize the CNAs into Gain, Loss and LOH, which are the three types of CNAs used in the dilution data in Staaf et al. [35]. We assess the accuracy of PSCN in a more detailed classification of identified CNAs based on the six types of chromosomal change in a separate data set that contains a wider diversity of chromosomal events (see next section). In the simulated dilution data, the regions vary in length, magnitude, and type of aberration, with some regions harder to detect than others. There is a separate sensitivity plot for each of the 10 aberrant regions created by Staaf et al. [35]. As expected, for all regions, sensitivity is maintained at a high level up to a certain contamination ratio, then drops sharply. Since both Staaf et al. and we used very stringent detection thresholds, the specificity is maintained near 1 for all contamination ratios, as shown in Figure 4. The sensitivity of PSCN is comparable to that of SOMATICs [36], but the latter method has much lower specificity, as shown in the analysis of Staaf et al. (see Figure 4). PSCN achieves good accuracy compared to the other existing methods, especially compared to methods based on discrete-state hidden Markov models at high levels of contamination.
The discrepancy between the two specificity plots in Figure 4 is due to the fact that when an aberration is called, it may be labeled as an incorrect type (for example, a copy neutral LOH may be labeled as a single copy gain). When the correct calling of aberration type is required, the specificity of PSCN is maintained through a higher level of contamination than that of existing models. The new model can identify the correct aberration type if the normal cell contamination is below 80%. Above 80%, PSCN gains significantly in sensitivity compared to existing methods but sacrifices slightly in specificity.

Figure 3
Sensitivity versus normal cell contamination for 10 regions in the dilution data set of Staaf et al. [35]
Figure 4
Specificity versus normal cell contamination in the dilution data set of Staaf et al. [35]

Accuracy of Aberration Type Identification

The dilution data set from Staaf et al. [35] contains only three types of aberrations: hemizygous loss (normal/loss), single copy gain (gain/normal), and copy neutral LOH (balanced gain/loss). We created a simulated data set containing all six types of aberrations: gain/gain, gain/normal, balanced gain/loss, unbalanced gain/loss, normal/loss and loss/loss. To make the simulation resemble real data, we started with the 550k Illumina data for chromosome 1 of HapMap sample NA06991. On this normal sequence we imposed six different signal types on six regions. The positions and magnitudes of the added signals are shown in Table 2. The top panel of Figure 5 (first row) shows the logR and BAF before the signals are imposed. The middle and bottom panels show the logR and BAF after the signals have been imposed, at 0% and 80% contamination respectively, with true signals indicated by black lines. Signals become weaker as normal cell contamination increases, and are thus harder to detect. The estimated parent-specific copy numbers are shown in Figure 6. We can see from the plots that the estimated parent-specific copy numbers are very close to the true allele copy numbers. Table 3 shows the largest normal cell contamination under which the signals are detectable by PSCN. When normal cell contamination is less than 80%, our model can detect most of the signals with both alleles assigned to the correct type. When the normal cell contamination rises to 90%, our model can still detect three out of the six CNA regions, but assigns the correct type to only one of the two alleles. For example, at a high contamination level of 90%, there is a tendency for a fractional loss of both chromosomes to be mistaken for a fractional loss of only one of the two chromosomes. From this study, we see that the correct type of aberration can be identified robustly for all but the highest levels of normal cell contamination.

Figure 5
Signal of the simulated data by imposing six types of aberrations on chromosome 1 of HapMap sample NA06991.
Figure 6
Copy number estimation of PSCN on the simulated data by imposing six types of aberrations on chromosome 1 of HapMap sample NA06991.
Table 2
Signals imposed onto chromosome 1.
Table 3
The largest tolerable percentage for normal cell contamination under which the type of aberration can be correctly detected (left column), and under which the type of aberration can be correctly identified for one of the two alleles when both alleles ...

Accuracy of Estimation of Genotype States

Using the dilution data set created from HapMap sample NA06991, we can also assess the accuracy of PSCN in identifying the genotype states [formula]. Since the genotypes for the SNPs on this sample are known, we simply compared the estimated [formula] with the true values.

Table 4 shows the percentage of homozygous SNPs that are misclassified as heterozygous, and vice versa. When the SNP is classified as homozygous, the determination between the states AA and BB is trivial, and no errors are made. When normal cell contamination is extremely low, less than 10%, genotyping errors are common in regions of loss of heterozygosity (either normal/loss or gain/loss). This is expected, since in a region with complete LOH and zero contamination, only one of the two parental alleles is left, and thus it would be impossible to distinguish between the homozygous configurations [formula] and the heterozygous configurations [formula]. Fortunately, these types of genotyping errors would not affect the accurate estimation of [formula], since the mean levels for the heterozygous and homozygous tracks merge for LOH regions under zero contamination. It is slightly unintuitive that the correct estimation of [formula] depends on the fact that there is normal cell contamination! This is reflected in Table 4, where accuracy quickly improves as normal cell contamination increases, with a total misclassification rate of [formula] at [formula] normal cell contamination.
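
The reason that some normal cell contamination helps genotype calling in LOH regions can be seen from a small numerical sketch. It assumes a linear mixing of normal and tumor cells and uses the intuitive allele fraction B/(A+B) rather than Illumina's reported BAF; the function name `expected_baf` is hypothetical.

```python
def expected_baf(tumor_a, tumor_b, contamination):
    """Expected B/(A+B) at a germline-heterozygous (AB) SNP.

    Normal cells contribute one copy of each allele; tumor cells
    contribute tumor_a copies of A and tumor_b copies of B.
    Linear mixing is an assumption made for illustration.
    """
    a = contamination * 1.0 + (1.0 - contamination) * tumor_a
    b = contamination * 1.0 + (1.0 - contamination) * tumor_b
    return b / (a + b)


# Copy neutral LOH in which the B allele is lost and the A allele is duplicated.
pure_tumor = expected_baf(2.0, 0.0, contamination=0.0)
contaminated = expected_baf(2.0, 0.0, contamination=0.2)
```

With zero contamination the heterozygous SNP has an allele fraction of 0, the same as a homozygous AA SNP, so the two cannot be told apart. With 20% contamination the fraction moves to about 0.1, which separates the heterozygous track from the homozygous one.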

Table 4
The number of misclassifications of each type in the identification of [formula] on the NA06991 dilution data set, at different levels of normal cell contamination.

A complete analysis of the misclassification rates of [formula] is given in the Supporting Information file (Text S1).

Analysis of TCGA Glioblastoma Samples

We applied PSCN to 223 glioblastoma samples from the TCGA project [38]. These samples were assayed using Illumina HumanHap 550k SNP arrays.

Almost all of the 223 samples analyzed contain substantial copy number aberrations. Table 5 shows the distribution of the types of copy number events found in the samples. Gain/loss events comprise 45.4% of all events: 22.8% of all events are copy neutral LOH and 22.5% are unbalanced gain/loss. We see from this table that, among these glioblastoma samples, single chromosome losses or single chromosome gains comprise 49.6% of all events, which means that slightly more than half of the events involve changes of both inherited chromosomes.

Table 5
Distribution of types of copy number aberrations across all events found in the 223 glioblastoma samples.

We now zoom in on two example regions to illustrate the additional insights gained from parent-specific copy number analysis. These regions are shown in Figure 7. The figures in the left panel correspond to the entire chromosome 3 of TCGA glioblastoma sample 02-0332, while those in the right panel correspond to the first 10000 SNPs on chromosome 2 of TCGA glioblastoma sample 02-0258. The top two plots in each panel show the logR and BAF values. The color scheme for these plots shows the segmentation obtained using PSCN. We transformed the logR and BAF values back to the A and B raw copy number values, and fitted two dimensional densities separately to each region in the segmentation. The contours of the two dimensional density estimates, delineating the locations of the clusters, are shown in the third plot from the top in each panel. The color scheme for the contours is the same as the color scheme for the logR and BAF plots. Finally, the bottom plot of each panel shows the estimated major and minor copy numbers for each region (we will call this type of plot the mm-plot). The color scheme of the mm-plot reflects the gain/loss status of each region, where red represents gain, blue represents loss, and green represents normal. It is usually difficult to discern the relative magnitudes of gains and losses from the logR and BAF plots, especially when both inherited chromosomes have undergone copy number changes. Such relative changes in parent specific copy numbers can be quantified more easily by examining the contour plots and mm-plots.

Figure 7
Example regions from TCGA sample 02-0332 chromosome 3 (left) and TCGA sample 02-0258 chromosome 2 (first 10000 SNPs) (right).

Copy neutral LOH (Balanced Gain/Loss)

First, consider the example region from TCGA sample 02-0332 on the left panel. There are three instances of copy neutral LOH, colored in purple. Based on the BAF plot, the loss seems to be complete, that is, it is carried by almost all of the cells in the sample. The mm-plot also gives this information, as the estimated major copy number (red line) is close to 2, and the estimated minor copy number (blue line) is close to 0. These LOH regions do not change the total copy number, and thus would not have been detected if the segmentation were based on the logR profile alone. On the other hand, an analysis based only on the BAF plot would not have revealed that the LOH is copy neutral; e.g. in the TCGA sample 02-0258, the LOH region (purple) with a similar pattern in BAF is not copy neutral. The estimates in the mm-plot can only be obtained through a joint analysis of both the logR and the BAF profiles.

Fractional single chromosome gains and losses

Following the copy neutral LOH regions in chromosome 3 of sample 02-0332, there is a stretch of alternating gains and losses, colored respectively in red and blue. As seen from the mm-plot, all of these regions contain changes that affect only one of the two inherited chromosomes. The changed chromosome may differ across segments. For example, the paternal chromosome may have been altered in one segment, and the maternal chromosome in the next. The copy number of the other chromosome in these regions remains at the normal level of one. This fact cannot be deduced from total copy number analysis, as an increase in logR can be due to gains of both inherited chromosomes, or an unbalanced gain of one chromosome and loss of the other; see the next example (TCGA 02-0258). The contour plot discriminates between these possible cases. If we examine the cluster centers corresponding to the heterozygotes in the red and blue segments, we see that for any one cluster, only one of the A and B coordinates is significantly shifted from the corresponding coordinate of the normal AB cluster (coded in gray). This is evidence that the copy number of only one of the chromosomes has changed in these regions. The positions of the heterozygote cluster centers of the red and blue regions indicate only a partial gain and loss, as their shifts from normal are only a fraction of that expected in a complete event. The estimated major and minor copy numbers in the mm-plot quantify the partial change explicitly, with the major copy numbers at around 1.5 for the gain and the minor copy numbers at around 0.5 for the loss. Assuming a linear signal response curve for the Illumina platform in the range between 0 and 3 fold change in DNA quantity, this translates to about 50% of the cells in the tumor sample carrying the aberrations coded in blue and red.
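
The arithmetic behind the 50% figure can be written out as a short sketch. It assumes, as stated in the text, a linear signal response, and further assumes that a complete event takes the affected chromosome from 1 copy to 2 (gain) or to 0 (loss). The function name `carrier_fraction` is hypothetical.

```python
def carrier_fraction(observed, full_event, normal=1.0):
    """Fraction of cells carrying an event under linear mixing.

    observed   : estimated copy number of the affected chromosome
    full_event : copy number if every cell carried the event
    normal     : copy number in cells without the event
    """
    return (observed - normal) / (full_event - normal)


gain_fraction = carrier_fraction(observed=1.5, full_event=2.0)  # single-copy gain
loss_fraction = carrier_fraction(observed=0.5, full_event=0.0)  # complete loss
```

Both calls return 0.5, matching the estimate that about half of the cells carry the gains and losses in these segments.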

The same reasoning can be applied to the red and pink regions of chromosome 2 of TCGA sample 02-0258 (right panel), which contain a fractional gain. By teasing apart the copy numbers of each inherited chromosome, we are now able to characterize and quantify these fractional changes.

Simultaneous unbalanced gain and loss of both chromosomes (unbalanced gain/loss)

Now consider the example region color coded in purple from TCGA sample 02-0258 in the right panel. The logR plot suggests that there is a gain in total copy number. However, the BAF plot reveals that there also seems to be an almost complete loss of heterozygosity in this region. Loss of one of the inherited chromosomes is necessary for loss of heterozygosity. Thus we conclude that the region colored in purple contains both a gain of one and an almost complete loss of the other inherited chromosome. Indeed, as the mm-plot shows, the estimated major and minor copy number fold changes for this region have values of 3 and 0, respectively. The gain and loss of the two inherited chromosomes are thus unbalanced, suggesting that this region may have experienced multiple mutations. This region is immediately followed by a gain of only one of the two inherited chromosomes (see the mm-plot), of magnitude roughly equal to the difference between the deviations of the major and minor copy numbers from normal. This suggests the hypothesis that this sample first experienced a gain of one of the inherited chromosomes that covered the purple and red regions, then an LOH which caused a gain of the already amplified chromosome and a simultaneous loss of the other inherited chromosome. Our analysis of the TCGA data shows that these types of unbalanced gain and loss events are quite common.

Discussion

We have developed a method for simultaneous estimation of parent-specific DNA copy number and inherited genotypes for tumor samples using allele-specific raw copy number data. The model and estimation procedure start with transforming allele-specific data into A and B intensities, which may vary across experimental platforms. The model assumes that the A and B allele intensities are roughly symmetric, roughly variance stabilized, and have approximately bivariate Gaussian errors. In practice, the model is quite robust to violations of the bivariate Gaussian error assumption, and gives satisfactory results even when this assumption is heavily violated. More details are shown in the Supporting Information file (Text S1). We illustrated the method and evaluated its performance on both published and newly generated dilution data sets on the Illumina platform.

A rigorous assessment using in silico titration data provided by Staaf et al. [35] shows that PSCN has good accuracy. The proposed method does not require paired normal samples. However, if such samples are available, they can be used to further improve accuracy and to distinguish between inherited LOH and somatic LOH. In such cases, [formula] can simply be set to the genotypes inferred from the normal samples.

PSCN is not platform specific, and we have also applied it to data from the Affymetrix Genotyping 6.0 array, with an example analysis given in the Supporting Information file (Text S1). The segmentation accuracy of PSCN seems to be reasonable for Affymetrix data, but can potentially be improved significantly by better probe-level normalization. This is because the BAF of Affymetrix data is much noisier than the BAF of Illumina data, which makes the estimation of [formula] much more difficult. Bengtsson et al. [39] have shown that much of the variation in the BAF of Affymetrix data is due to probe-specific effects that can be removed if a matched normal sample is available. Another promising method for probe-level normalization of Affymetrix data is the probe intensity composite representation (PICR) model of Wan et al. [29], which uses probe sequence information and physico-chemical modeling to estimate binding affinity. However, since the PICR model relies on mismatch probes, it is only applicable to Affymetrix platforms prior to the 6.0 array. Thus, better probe-level normalization of Affymetrix 6.0 data for unmatched samples is still an important problem for further investigation.

An overview of an analysis of the TCGA glioblastoma samples reveals that a substantial fraction of copy number changes are copy-neutral loss of heterozygosity events. These events would not have been found using analyses based only on total copy number. Cases of unbalanced simultaneous changes in the copy numbers of both inherited chromosomes were also found. It would be of interest to quantify the frequency of such changes among different cancer subtypes and in other types of tumors.

A final point that we would like to emphasize is the quantification of fractional changes, as exemplified by the two case studies on the TCGA glioblastoma samples. Since this requires teasing apart the quantities of the two inherited chromosomes, it can only be achieved through allele-specific estimates. The fraction of cells that carry each copy number event is important for downstream analyses, such as quantifying normal cell contamination and studying tumor microevolution. The parent-specific copy number estimates obtained from the proposed method provide a starting point for these types of investigations.

The R package for PSCN is registered on R-Forge (http://r-forge.r-project.org/) under project name PSCN.

Methods

Data Transformation

The proposed model is not platform specific, and can theoretically be applied to any type of allele-specific copy number data where the errors on the raw copy number values of the alleles can be normalized to approximately adhere to a bivariate Gaussian distribution. As we show below, the Gaussian error assumption allows for explicit analytic formulas for the posterior mean of the underlying inherited chromosome copy numbers, thus bypassing the need for computationally intensive Monte Carlo methods. For most platforms, the allele-specific raw copy number values must be properly normalized for this error model to be a good approximation. However, as we mentioned in the Discussion section, the model is quite robust to violations of the Gaussian error assumption.

A unified approach that gives satisfying results for data from both Illumina and Affymetrix platforms is as follows. Since

equation image

we have

equation image

Note that the “BAF” given by the Illumina platform [6] is not the intuitive quantity ([formula]), but the arc-tangent of the ratio of the B allele raw copy number to the A allele raw copy number, scaled to [0,1]. Let [formula] denote the so-called BAF given by Illumina; then

equation image

For PSCN we use [formula].
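
The relation between an arc-tangent-scaled allele ratio and the intuitive fraction B/(A+B) can be sketched as follows. This follows the verbal description above and is an assumption rather than the paper's exact formula: the scaling factor 2/π, the symbol `theta`, and both function names are hypothetical, and the value actually reported by a given platform may involve further normalization.

```python
import math


def theta_from_alleles(a, b):
    """Arc-tangent of the B-to-A ratio, scaled to [0, 1] (assumed form)."""
    return (2.0 / math.pi) * math.atan2(b, a)


def ratio_baf_from_theta(theta):
    """Recover B/(A+B) from the scaled arc-tangent value.

    Since tan(pi * theta / 2) = B/A, it follows that
    B/(A+B) = t / (1 + t) with t = tan(pi * theta / 2).
    """
    if theta >= 1.0:
        return 1.0  # A = 0, so the B fraction is 1
    t = math.tan(math.pi * theta / 2.0)
    return t / (1.0 + t)


theta = theta_from_alleles(3.0, 1.0)
baf = ratio_baf_from_theta(theta)
```

For A = 3 and B = 1 the round trip recovers an allele fraction of about 0.25, and a balanced heterozygote (A = B) maps to 0.5 under both definitions.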

Explicit formulas for [formula] given [formula] and [formula]

We give here exact formulas for the conditional expectation (3). Let [formula] denote the probability distribution that assigns probability 1 to the value [formula]. Denote by [formula], and [formula]. A brief outline of the estimation procedure is as follows. First, conditioned on all data to the left of [formula], [formula] is distributed as a mixture of Gaussians:

equation image
(5)

where the formulas for computing the parameters of the mixture [formula], [formula], and [formula] are given below. We call (5) the forward filter. Since by our model [formula] is a reversible Markov chain, we can reverse time and obtain a backward filter that is analogous to (5):

equation image
(6)

where the parameters [formula], [formula], and [formula], as for the forward filter, are given in explicitly computable form below. Bayes' theorem can then be used to combine the forward filter (5) and backward filter (6) to derive the posterior distribution of [formula] given the complete sequence [formula], which is a mixture of normal distributions

equation image
(7)

whose parameters can be derived from the forward and backward filters as described below. This forward-backward procedure can be reduced to [formula] computation time by the BCMIX algorithm [40]. From (7), it follows that the conditional expectation in Equation (3) can be computed as

equation image
(8)

The forward filter

Let [formula] be allele assignment matrices depending on [formula]:

equation image

Let [formula] denote the nearest change-point at a location less than or equal to [formula]. Define

equation image

for [formula]. The conditional distribution of [formula], given [formula] and the event that [formula] and [formula], is [formula], where

equation image

for [formula]. It follows that the posterior distribution of [formula] given [formula] is the mixture of normal distributions and a point mass at [formula] given by (5). Let [formula] denote the density function of the [formula] distribution, i.e.,

equation image

Making use of [formula], it is possible to show as in Lai et al. [37] that the conditional probabilities [formula] and [formula] can be determined by the recursions

equation image
(9)
equation image

where [formula], [formula], [formula] and [formula] for [formula]. Specifically, the mixture probabilities in (5) are [formula] and [formula].

The smoothing estimate

Since [formula e268] is a reversible Markov chain, we can reverse time and apply the same steps as in the forward equations to obtain (6), in which the weights [formula e269] can be obtained by backward induction using the time-reversed counterpart of (9):

equation image
(10)
equation image

where [formula e272]. Since for any set [formula e273], [formula e274], it follows from (6) and the reversibility of [formula e275] that

equation image

The recursions for deriving the components of the mixture for (7) are exactly the same as those for the earlier model limited to total copy number in Lai et al. [37]:

equation image

and we refer the reader to Lai et al. [37] for their derivation.

Estimation of [formula e278]

The variables [formula e279] are assumed to be i.i.d., with

equation image

The inherited allele configurations [formula e281] are assumed to be independent of [formula e282], so

equation image
(11)

where [formula e284] is a constant. Each component of the above sum can be maximized separately to give, for each [formula e285],

equation image

Region Characterization

Let [formula e287] be the [formula e288] and [formula e289] intensities of heterozygous SNPs for segments in the normal state, and let [formula e290] be the [formula e291] and [formula e292] intensities of heterozygous SNPs for the segment being tested. Then [formula e293] and [formula e294] follow the model:

equation image

For the normal state, we can easily estimate the parameters as

equation image

For the target segment, [formula e297], [formula e298], [formula e299] and [formula e300] can be estimated by the EM algorithm:

Step 1: Initialize: [formula e301]

Step 2: Set

equation image

Step 3: Set

equation image

Step 4: Stop if [formula e304], where [formula e305] is a pre-chosen threshold (PSCN has default value [formula e306]). Otherwise, set [formula e307], [formula e308], [formula e309], [formula e310], and go back to Step 2.

The motivation for the initial and default settings is as follows. For a segment with a changed state, the goal is to estimate the minor and major copy numbers. The minor copy number is expected to be less than or equal to 1 and the major copy number greater than or equal to 1, so the initial values for [formula e311] and [formula e312] are set to 0.9 and 1.1, respectively. Although it is possible that both chromosomes in a segment are gained or lost, a small discrepancy between the initial values of [formula e313] and [formula e314] is still a good starting point. Also, the numbers of AB and BA states in a segment are expected to be similar, so the initial value of [formula e315] is set to 0.5. The initial values for [formula e316] and [formula e317] can be quite arbitrary, with 1 being a reasonable value to use. [formula e318] is set to [formula e319], which is small enough to indicate convergence of the iterative algorithm.
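
The update equations of Steps 2 and 3 are not recoverable from this text, so the following is only a minimal sketch of the iteration pattern, not the PSCN implementation. It assumes a simplified stand-in model: a one-dimensional mixture of two normal components whose means play the role of the minor and major copy numbers. The function names, the interpretation of the two "arbitrary" initial values as standard deviations, the largest-parameter-change stopping rule, and the threshold 1e-8 are all assumptions made for illustration; only the initial values 0.9, 1.1, 0.5 and 1 come from the text.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_component(xs, tol=1e-8, max_iter=10000):
    """EM for a 1-D two-component normal mixture (illustrative stand-in only)."""
    n = len(xs)
    # Step 1: initial values as motivated in the text.
    mu_lo, mu_hi = 0.9, 1.1   # stand-ins for minor / major copy number
    p = 0.5                   # mixing proportion of the "lo" component
    sd_lo, sd_hi = 1.0, 1.0   # assumed here to be standard deviations
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Step 2 (E-step): posterior probability that each point is "lo".
        resp = []
        for x in xs:
            dens_lo = p * normal_pdf(x, mu_lo, sd_lo)
            dens_hi = (1.0 - p) * normal_pdf(x, mu_hi, sd_hi)
            resp.append(dens_lo / (dens_lo + dens_hi))
        # Step 3 (M-step): weighted means, variances and mixing proportion.
        w_lo = sum(resp)
        w_hi = n - w_lo
        new_mu_lo = sum(r * x for r, x in zip(resp, xs)) / w_lo
        new_mu_hi = sum((1.0 - r) * x for r, x in zip(resp, xs)) / w_hi
        var_lo = sum(r * (x - new_mu_lo) ** 2 for r, x in zip(resp, xs)) / w_lo
        var_hi = sum((1.0 - r) * (x - new_mu_hi) ** 2 for r, x in zip(resp, xs)) / w_hi
        new = (new_mu_lo, new_mu_hi, w_lo / n,
               math.sqrt(max(var_lo, 1e-12)), math.sqrt(max(var_hi, 1e-12)))
        old = (mu_lo, mu_hi, p, sd_lo, sd_hi)
        # Step 4: stop once the largest parameter change falls below tol.
        change = max(abs(a - b) for a, b in zip(new, old))
        mu_lo, mu_hi, p, sd_lo, sd_hi = new
        if change < tol:
            break
    return {"mu_lo": mu_lo, "mu_hi": mu_hi, "p": p,
            "sd_lo": sd_lo, "sd_hi": sd_hi, "iterations": iterations}
```

Calling `em_two_component(xs)` on a list of intensities returns the fitted parameters and the number of iterations used. Because the two initial means are close together, the first iterations change the estimates only slightly, so a loose threshold could stop the loop before the components separate; this is one reason to choose a small threshold.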

Denote the estimated parameters by [formula e320], [formula e321], [formula e322], [formula e323], [formula e324]. To test the hypothesis [formula e325], the standard [formula e326]-statistic is

equation image

Under [formula e328], the distribution of [formula e329] is [formula e330] with [formula e331] degrees of freedom, so the p-value can be calculated and compared with the level of the test. The null hypothesis that [formula e333] also needs to be tested, by replacing [formula e334] with [formula e335] in the above equation.

Supporting Information

Text S1

Supporting materials for PSCN.

(0.28 MB PDF)

Acknowledgments

We thank Pierre Neuvial and Henrik Bengtsson for helpful discussions, and an anonymous reviewer for the many helpful comments during the paper revision stage.

Footnotes

The authors have declared that no competing interests exist.

Hao Chen's research was supported in part by NIH Merit Award R37EB02784. Haipeng Xing's research was supported by the National Science Foundation grant DMS-0906593. Nancy R. Zhang's research was supported in part by the National Science Foundation grant DMS-0906394. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet. 1998;20:207–11. [PubMed]
2. Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet. 2001;29:263–264. [PubMed]
3. Pollack J, Perou C, Alizadeh A, Eisen M, Pergamenschikov A, et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet. 1999;23:41–46. [PubMed]
4. Matsuzaki H, Dong S, Loi H, Di X, Liu G, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods. 2004;1:109–111. [PubMed]
5. Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS. A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet. 2005;37:549–554. [PubMed]
6. Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, et al. High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 2006;16:1136–1148. [PMC free article] [PubMed]
7. Wang Y, Moorhead M, Karlin-Neumann G, Falkowski M, Chen C, et al. Allele quantification using molecular inversion probes (MIP). Nucleic Acids Res. 2005;33:e183. [PMC free article] [PubMed]
8. Bignell GR, Huang J, Greshock J, Watt S, Butler A, et al. High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res. 2004;14:287–295. [PMC free article] [PubMed]
9. Hanahan D, Weinberg R. The hallmarks of cancer. Cell. 2000;100:57–70. [PubMed]
10. Broët P, Richardson S. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics. 2006;22:911–918. [PubMed]
11. Fridlyand J, Snijders A, Pinkel D, Albertson DG, Jain A. Application of hidden Markov models to the analysis of the array-CGH data. J Multivar Anal. 2004;90:132–153.
12. Hsu L, Self S, Grove D, Randolph T, Wang K, et al. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005;6:211–226. [PubMed]
13. Venkatraman E, Olshen A. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–663. [PubMed]
14. Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R. A method for calling gains and losses in array-CGH data. Biostatistics. 2005;6:45–58. [PubMed]
15. Tibshirani R, Wang P. Spatial smoothing and hot spot detection for CGH data using the fused LASSO. Biostatistics. 2008;9:18–29. [PubMed]
16. Hupé P, Stransky N, Thiery JP, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–3422. [PubMed]
17. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. [PMC free article] [PubMed]
18. Engler D, Mohapatra G, Louis D, Betensky R. A pseudolikelihood approach for simultaneous analysis of array comparative genomic hybridizations. Biostatistics. 2006;7:399–421. [PubMed]
19. Daruwala RS, Rudra A, Ostrer H, Lucito R, Wigler M, et al. A versatile statistical analysis algorithm to detect genome copy number variation. Proc Natl Acad Sci U S A. 2004;101:16292–16297. [PMC free article] [PubMed]
20. Wen CC, Wu YJ, Huang YH, Chen WC, Liu SC, et al. A Bayes regression approach to array-CGH data. Stat Appl Genet Mol Biol. 5:3. [PubMed]
21. Xing B, Greenwood CMT, Bull SB. A hierarchical clustering method for estimating copy number variation. Biostatistics. 2007;8:632–653. [PubMed]
22. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770. [PMC free article] [PubMed]
23. Willenbrock H, Fridlyand J. A comparison study: applying segmentation to arrayCGH data for downstream analyses. Bioinformatics. 2005;21:4084–4091. [PubMed]
24. Affymetrix. Copy number and loss of heterozygosity estimation algorithms for the GeneChip human mapping array sets. 2006. Whitepaper, http://www.affymetrix.com.
25. Bengtsson H, Irizarry R, Carvalho B, Speed T. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008;24:759–767. [PubMed]
26. Li C, Wong W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A. 2001;98:31–36. [PMC free article] [PubMed]
27. Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, et al. dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics. 2004;20:1233–1240. [PubMed]
28. LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, et al. Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol. 2005;1:e65. [PMC free article] [PubMed]
29. Wan L, Sun K, Ding Q, Cui Y, Li M, et al. Hybridization modeling of oligonucleotide SNP arrays for accurate DNA copy number estimation. Nucleic Acids Res. 2009;37:e117. [PMC free article] [PubMed]
30. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–572. [PubMed]
31. Beroukhim R, Lin M, Park Y, Hao K, Zhao X, et al. Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays. PLoS Comput Biol. 2006;2:e41. [PMC free article] [PubMed]
32. Wang K, Li M, Hadley D, Liu R, Glessner J, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–1674. [PMC free article] [PubMed]
33. Colella S, Yau C, Taylor JM, Mirza G, Butler H, et al. QuantiSNP: an objective Bayes hidden Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–2025. [PMC free article] [PubMed]
34. Li C, Beroukhim R, Weir BA, Winckler W, Garraway LA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC bioinformatics. 2008;9:204+. [PMC free article] [PubMed]
35. Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 2008;9:R136+. [PMC free article] [PubMed]
36. Assié G, LaFramboise T, Platzer P, Bertherat J, Stratakis C, et al. SNP arrays in heterogeneous tissue: highly accurate collection of both germline and somatic genetic information from unpaired single tumor samples. Am J Hum Genet. 2008;82:903–915. [PMC free article] [PubMed]
37. Lai TL, Xing H, Zhang NR. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics. 2008;9:290–307. [PubMed]
38. The Cancer Genome Atlas. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. [PMC free article] [PubMed]
39. Bengtsson H, Neuvial P, Speed T. TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays. BMC Bioinformatics. 2010;11:245+. [PMC free article] [PubMed]
40. Lai T, Liu H, Xing H. Autoregressive models with piecewise constant volatility and regression parameters. Stat Sin. 2005;15:279–301.

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science