
- Journal List
- J Comput Biol
- PMC2612042

# Estimating Genome-Wide Copy Number Using Allele-Specific Mixture Models

^{1}Benilton Carvalho,

^{1}Nathaniel D. Miller,

^{2}Jonathan Pevsner,

^{2}Aravinda Chakravarti,

^{3}and Rafael A. Irizarry

^{}

^{1}

^{1}Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland.

^{2}Department of Neurology, Kennedy Krieger Institute, Baltimore, Maryland.

^{3}McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland.

^{}Corresponding author.

*Dr. Rafael A. Irizarry, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205.* E-mail: rafa@jhu.edu

## Abstract

Genomic changes such as copy number alterations are one of the major underlying causes of human phenotypic variation among normal and disease subjects. Array comparative genomic hybridization (CGH) technology was developed to detect copy number changes in a high-throughput fashion. However, this technology provides only a >30-kb resolution, which limits the ability to detect copy number alterations spanning small regions. Higher resolution technologies such as single nucleotide polymorphism (SNP) microarrays allow detection of copy number alterations at least as small as several thousand base pairs. Unfortunately, strong probe effects and variation introduced by sample preparation procedures have made single-point copy number estimates too imprecise to be useful. Various groups have proposed statistical procedures that pool data from neighboring locations to successfully improve precision. However, these procedures need to average across relatively large regions to work effectively, thus greatly reducing resolution. Recently, regression-type models that account for probe effects have been proposed and appear to improve accuracy as well as precision. In this paper, we propose a mixture model solution, specifically designed for single-point estimation, that provides various advantages over the existing methodology. We use a 314-sample database to motivate and fit models for the conditional distribution of the observed intensities given allele-specific copy number. We can then compute posterior probabilities that provide a useful prediction rule as well as a confidence measure for each call. Software to implement this procedure will be available in the Bioconductor oligo package (*www.bioconductor.org*).

**Key words:** algorithms, computational molecular biology, DNA arrays

## 1. Introduction

High-resolution measurements for chromosomal copy number estimates can be obtained using SNP microarray platforms such as those developed by Illumina and Affymetrix or array CGH platforms from Nimblegen and Agilent (Peiffer et al., 2006; Gribble et al., 2007; Sharp et al., 2006). Statistical methodology has been developed for Affymetrix SNP arrays to provide copy number estimation algorithms (Zhao et al., 2004; Bignell et al., 2004; Huang et al., 2004, 2006; Nannya et al., 2005; Ishikawa et al., 2005; Komura et al., 2006; Laframboise et al., 2007). An advantage of the SNP chip technology is that we can obtain genotype calls, which permit allele-specific copy number estimation. We can then use these to predict parent-specific copy number, which is useful in detecting uniparental disomy (Laframboise et al., 2007).

The genotyping platform provided by Affymetrix interrogates hundreds of thousands of human single nucleotide polymorphisms (SNPs) on a microarray. DNA is obtained and fragmented at known locations so that the SNPs are far from the ends of these fragments; the fragmented DNA is amplified with a polymerase chain reaction (PCR); and the sample is labeled and hybridized to an array containing probes designed to interrogate the resulting fragments. We refer to the measurements obtained from these probes as *feature intensities*. There are currently three products available from Affymetrix: an array covering approximately 10,000 SNPs (GeneChip Human Mapping 10K), a pair of arrays covering approximately 100,000 SNPs (GeneChip Human Mapping 50K Xba and Hind Array), and a pair of arrays covering approximately 500,000 SNPs (GeneChip Human Mapping 250K Nsp Array and Sty Array). These are referred to as the 10K, 100K, and 500K chips, respectively. Affymetrix has recently launched SNP array 6.0 that contains over 900K SNPs, as well as over 900K non-polymorphic probes for the detection of copy number variation.

To motivate the model and estimation procedures described here we need to understand the basics of the feature-level data. We provide the essential details here and refer readers to Kennedy et al. (2003) for a complete description. Each SNP on the array is represented by a collection of probe quartets. As with Affymetrix expression arrays, the probes are defined by 25-mer oligonucleotide molecules referred to as perfect match (*PM*) probes. There are also mismatch (*MM*) probes, which we ignore completely because the manufacturer plans to stop using them. *PM* probes for SNP arrays differ from expression arrays in three important ways. First, two alleles are interrogated (for most SNPs only two alleles are observed in nature). These are denoted by *A* and *B* and divide the probes into two groups of equal size. For each *PM* probe representing the *A* allele there is an allele *B* probe that differs by just one base pair (the SNP). Second, features are included to represent the sense and antisense strands. This difference divides the probes into two groups that are not necessarily of the same size. Finally, for each allele/strand combination, various features are added by shifting the position of the SNP within the probe. The position shift ranges only from −4 to 4 bases; therefore, within each strand the probes are relatively similar.

Most copy number algorithms can be divided into three main steps, which we refer to as (1) the *preprocessing* step, (2) the copy number *estimation* step, and (3) the *smoothing* across the chromosome step. In the preprocessing step, we summarize feature intensities into two quantities, representative of alleles *A* and *B*. In this paper, we use the following notation to denote the preprocessed data: $\theta_{A,i,j}$ and $\theta_{B,i,j}$ are the logarithms (base 2) of quantities proportional to the amount of DNA in target sample $j$ associated with alleles *A* and *B* for SNP $i$. In the estimation step, we use these $\theta$s to estimate the true copy number, which we denote with $C_{i,j}$. The allele-specific copy numbers are denoted with $C_{A,i,j}$ and $C_{B,i,j}$. Notice that the total copy number is the sum of the allele-specific copy numbers, i.e., $C_{i,j} = C_{A,i,j} + C_{B,i,j}$. As we demonstrate here (Fig. 1), estimates of $\theta_{A}$ and $\theta_{B}$ are, in general, not precise enough to provide useful copy number calls. Therefore, most copy number estimation algorithms include the smoothing step, in which estimates from neighboring regions are averaged to improve the signal-to-noise ratio. These techniques range from simple methods such as a running median to more complicated ones such as hidden Markov models (HMM). In Section 3, we review some of the existing methods and motivate our mixture model approach. In Section 4, we describe the mixture model approach. In Sections 5 and 6, we present results and discussion, respectively. Throughout this paper, we use data obtained from collaborators and public repositories, which we briefly describe in Section 2.
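As an illustration of the simplest smoothing technique mentioned above, the following is a minimal sketch of a centered running median. The function name `running_median`, the window size, and the example values are hypothetical and are not taken from any of the cited packages.

```python
from statistics import median

def running_median(values, window=5):
    """Smooth a sequence with a centered running median.

    The window is truncated at both ends of the sequence, so the
    output has the same length as the input.
    """
    half = window // 2
    smoothed = []
    for k in range(len(values)):
        lo = max(0, k - half)
        hi = min(len(values), k + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Hypothetical single-point copy number estimates along a chromosome;
# a gain starts at index 5.
noisy = [2.1, 1.7, 2.4, 1.9, 2.0, 3.2, 2.6, 3.3, 2.9, 3.1]
print(running_median(noisy, window=5))
```

Widening the window reduces noise further but blurs the location of the breakpoint, which is the resolution cost discussed in the text.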

## 2. Control Data

In Section 4, we describe a model that is trained using a reference set of 314 normal samples hybridized to Affymetrix's 100K array. We screened out samples not achieving the quality standard described by Carvalho et al. (2007). Our reference set consists of 86 Hapmap samples, 124 samples from the Coriell Repositories (42 African American, 20 Asians, 40 Caucasians, and 22 samples from the polymorphisms discovery panel) (Collins et al., 1998), and 104 from Chakravarti's lab. The test data was sampled from 20 *trisomy 21* samples from Pevsner's lab.

## 3. Previous Work and Motivation

The first algorithms we describe do not provide allele-specific results. We therefore define a total copy number quantity $S_{i,j}$. Ideally, $S_{i,j}$ is proportional to the true log-scale copy number. Figure 1A shows data for a male sample with Down syndrome. The $S_{i,j}$ values are highly noisy, and differences between chromosomes 21 and X are hard to detect unless we smooth along the chromosome (in Fig. 1, we show the results of a running median). The lessons learned from expression arrays help us understand this problem. Various authors have proposed an additive background/multiplicative model for gene expression microarrays (Rocke and Durbin, 2001; Huber et al., 2002; Wu et al., 2004). Furthermore, for Affymetrix arrays, various authors (Irizarry et al., 2003; Li and Wong, 2001) have clearly shown the existence of a strong multiplicative probe-specific effect. Probe-specific background noise, attributed to nonspecific binding, has also been described (Wu et al., 2004). Others (Rabbee and Speed, 2006; Laframboise et al., 2007; Carvalho et al., 2007) demonstrate that similar sized effects are seen with SNP chips. These effects are strong enough to be clearly seen even after averaging the various feature intensities associated with each SNP. Extending these findings to the copy number case results in the following model:

$$2^{\theta_{a,i,j}} = \beta_{a,i}\,\delta_{a,i,j} + \phi_{a,i}\,C_{a,i,j}\,\varepsilon_{a,i,j} \qquad (1)$$

with $a = A, B$ denoting the allele, $i$ identifying the SNP, $j$ identifying the sample, $\beta$ representing a SNP-specific background level, $\delta$ representing background variability, $\phi$ representing a SNP-specific probe effect, and $\varepsilon$ a multiplicative measurement error that usually follows a log-normal distribution with mean 1. Assuming this model holds, relatively simple calculations demonstrate that large values of $\beta_{a,i}$ result in attenuation of real differences and that large variability of $\phi_{a,i}$ across SNPs explains the large variance seen in Figure 1A.
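The attenuation caused by the background term can be checked numerically. The sketch below assumes that Eq. (1) has the additive-background, multiplicative-signal form intensity = β·δ + φ·C·ε and sets both noise terms to 1, so it shows expected values only. The function names and the numeric values of `beta` and `phi` are hypothetical.

```python
import math

def expected_log2_intensity(beta, phi, copy_number):
    """Noise-free log2 intensity: background plus probe effect times copy number."""
    return math.log2(beta + phi * copy_number)

def log2_gain(beta, phi, c_low=1, c_high=2):
    """Observed log2 difference between two copy numbers for one allele."""
    return (expected_log2_intensity(beta, phi, c_high)
            - expected_log2_intensity(beta, phi, c_low))

# With no background, doubling the copy number doubles the intensity
# (a log2 difference of 1). A larger background shrinks that difference.
for beta in (0.0, 100.0, 1000.0):
    print(beta, round(log2_gain(beta, phi=500.0), 3))
```

The same function also shows why SNP-to-SNP variation in the probe effect inflates variance: two SNPs with the same copy number but different `phi` have different expected intensities.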

Affymetrix's Copy Number Analysis Tool (CNAT) (Bignell et al., 2004; Huang et al., 2004) deals with the probe effect using a simple yet effective technique. CNAT does not provide allele-specific results and concentrates on estimating the overall copy number. For the preprocessing step, all feature intensities related to the SNP are therefore averaged to form $S_{i,j}$. Using dozens of control subjects, CNAT defines a SNP-specific average $\hat{\mu}_{i,g}$ and standard deviation $\hat{\sigma}_{i,g}$ for each genotype $g = AA, AB, BB$. Values $S_{i,j'}$ from any new sample $j'$ that are called genotype $g$ are standardized in the usual manner: $(S_{i,j'} - \hat{\mu}_{i,g})/\hat{\sigma}_{i,g}$. A predefined regression equation is then used to transform these standardized values to the copy number scale. The standardized $S$ values are used to obtain p-values for the null hypothesis that $S = 0$ ($C = 2$). Figure 1B shows de-meaned (observed minus mean) values for the same data shown in Figure 1A. The improvement is clear and is due to the fact that the probe effect $\phi$ is partially removed from the de-meaned values. However, notice that the signal-to-noise ratio still appears to be small: the separation between chromosomes with known differences is far from perfect. To avoid false positives, the third step in CNAT involves looking for strings of consecutive p-values that are smaller than some predefined cut-off. Other smoothing approaches have been used. For example, Zhao et al. (2004) proposed the use of hidden Markov models (HMM) to define the procedure implemented by dChip.
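The genotype-specific standardization described above can be sketched as follows. This is an illustration of the idea only, not CNAT's code: the function names, the data layout, and the control values are hypothetical.

```python
from statistics import mean, stdev

def genotype_reference(control_values):
    """Per-genotype mean and standard deviation for one SNP.

    control_values maps a genotype label to the list of S values
    observed for that genotype across control samples.
    """
    return {g: (mean(v), stdev(v)) for g, v in control_values.items()}

def standardize(s_new, genotype, reference):
    """Standardize a new sample's S value against controls of the same genotype."""
    mu, sd = reference[genotype]
    return (s_new - mu) / sd

# Hypothetical control data for a single SNP.
controls = {
    "AA": [10.1, 10.3, 9.9, 10.2],
    "AB": [10.6, 10.4, 10.8, 10.5],
    "BB": [9.8, 10.0, 9.7, 10.1],
}
ref = genotype_reference(controls)
print(round(standardize(11.2, "AB", ref), 2))
```

Subtracting a SNP-specific mean removes a shift in location, which is why this step removes only part of the probe effect when an additive background is also present.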

Other authors have noted that further improvements can be obtained by reducing the variance at the preprocessing step. For example, several groups (Nannya et al., 2005; Ishikawa et al., 2005; Komura et al., 2006) have used probe-sequence information, mainly GC content, and fragment length to predict and remove some of the probe-effect–related variability. However, even after accounting for such factors, the signal-to-noise ratio remains too low to make single-point copy number calls; thus, these authors propose their own versions of the smoothing step.

Huang et al. (2006) noted that CNAT's mean removal approach does not fully remove the probe effect because it does not properly deal with the additive background effect $\beta$. They propose the Copy Number Analysis with Regression And Tree (CARAT) algorithm, which uses a non-linear regression model, based on Eq. (1), to account for the probe-specific effects. To estimate model parameters they use a control dataset composed of dozens of arrays. First, genotype calls are obtained and treated as known. This permits estimation of allele-specific parameters. For example, we have known values of $c_A = 0$, $c_B = 2$ (*BB* genotype), $c_A = 1$, $c_B = 1$ (*AB* genotype), and $c_A = 2$, $c_B = 0$ (*AA* genotype), and thus we can estimate the background $\beta_{a,i}$ and the probe effect $\phi_{a,i}$ with, for example, least squares, for each SNP $i$ and each allele $a = A, B$. For a new sample, we can predict $c_A$ and $c_B$ using the fitted parameters and Eq. (1). Calls can then be based on cut-offs for the predicted total $\hat{c}_A + \hat{c}_B$. Huang et al. (2006) suggest using [0, 1.5), [1.5, 2.5], (2.5, ∞) for total copy number < 2, = 2, > 2, respectively. Figure 2A shows the data used in the regression for allele *A* from a given SNP. The figure demonstrates that the model works reasonably well but that the signal-to-noise ratio is not large enough to provide perfect accuracy (the boxplots overlap). CARAT utilizes a regression tree approach in the smoothing step.

**Figure 2.** (**A,B**) Allele-specific prediction of copy number (log base 2) as compared to the real copy number (log base 2). ...
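The regression-then-threshold idea behind CARAT can be sketched as below. This is not the CARAT implementation: it assumes a noise-free form of Eq. (1), intensity = β + φ·c, which can be inverted directly; flooring the prediction at zero is an added assumption; and the function names, call labels, and parameter values are hypothetical. Only the cut-offs come from the text.

```python
import math

def predict_copy_number(theta, beta, phi):
    """Invert the noise-free model: c = (2**theta - beta) / phi, floored at 0."""
    return max(0.0, (2.0 ** theta - beta) / phi)

def call_total_copy_number(c_a_hat, c_b_hat):
    """Apply the cut-offs [0, 1.5), [1.5, 2.5], (2.5, inf) to the predicted total."""
    total = c_a_hat + c_b_hat
    if total < 1.5:
        return "loss"
    if total <= 2.5:
        return "normal"
    return "gain"

# Hypothetical fitted parameters, shared by both alleles for simplicity.
beta, phi = 200.0, 400.0
theta_a = math.log2(beta + 2 * phi)  # an AA sample: two copies of A
theta_b = math.log2(beta)            # and no copies of B
c_a_hat = predict_copy_number(theta_a, beta, phi)
c_b_hat = predict_copy_number(theta_b, beta, phi)
print(c_a_hat, c_b_hat, call_total_copy_number(c_a_hat, c_b_hat))
```

Because the prediction is a deterministic function of a noisy intensity, any overlap between the intensity distributions of adjacent copy numbers translates directly into miscalls near the cut-offs.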

The Probe-level allele-specific quantitation (PLASQ) (Laframboise et al., 2007) procedure is similar to CARAT. Two major differences are that PLASQ fits Eq. (1) to the feature-level data and that it does not rely on external genotype calls. Although PLASQ provides a model-based framework superior to that of any other approach, it is computationally challenging to implement. This is because a non-linear estimation procedure is performed at the feature level for every SNP. Furthermore, it is difficult to adapt it to be robust to outliers and to take probe sequence and fragment size into account.

Model-based approaches such as CARAT and PLASQ provide a great advantage over previous ones: reliable confidence intervals can be computed for single-point copy number estimates. Huang et al. (2006) point out that their uncertainty assessment permits one to call a relatively large group of SNPs and keep the false positive rate relatively low. We now briefly describe a simple adaptation of these methods that provides further improvements.

Notice that all of the above described algorithms use regression-type approaches to give a continuous prediction of copy number. The current approaches rely on three assumptions that we believe are not exactly true. The first is the linear relationship predicted by Eq. (1). Figure 2A,B shows that there are small but significant deviations from these models. Other SNPs (not shown) show slightly larger deviations. The second is that $\theta_{A}$ and $\theta_{B}$ are independent. This assumption is clearly not true, as demonstrated by Figures 2C and 3. The third assumption is that the variance of the measurement error term does not depend on allele-specific copy number values. Figure 3 also shows this is not the case. In general, these are convenience assumptions regarding the conditional probabilities of $(\theta_{A}, \theta_{B})'$ given allele-specific copy number, and they hurt bottom-line results.

**Figure 3.** [$\boldsymbol{\theta}_{i,j} \mid \mathbf{C}_{i,j} = \mathbf{c}$] for a SNP on chromosome X and a SNP on chromosome 21. The *x*-axis is the log base 2 intensity for allele B. The *y*-axis is the log base 2 intensity for allele A. The dots are from 45 new samples ...

Theoretically, one can show that the best predictor of discrete classes given continuous covariates is the Bayes classifier. The Bayes classifier is a function of the conditional distribution of the predictors given the classes, which we cannot always estimate effectively. In the next section we describe how we can use the large amount of public data and Eq. (1) to obtain useful estimates of these conditional distributions and therefore improved copy number calls.

## 4. Allele-Specific Mixture Model

Figure 2C shows a scatterplot for $\theta_{A}$ versus $\theta_{B}$ across many individuals. Notice that the three genotypes are clearly seen and that the data for each cluster appear to be bivariate normal. This result can be motivated by Eq. (1). First we assume that data summarized by SNP-RMA can be modeled similarly to probe-level data. As long as we keep sense and antisense probes separate, this assumption should hold, as the probes have very similar sequences. Now, to see how Eq. (1) is in agreement with Figure 2C, notice that for $C = 0$, Eq. (1) reduces to $\theta = \log(\beta) + \log(\delta)$, which follows a normal distribution. If the probe-effect term $\phi C$ is much larger than the background $\beta$, then the approximation $\theta \approx \log(\phi C) + \log(\varepsilon)$, with $\varepsilon$ the multiplicative measurement error, suggests that $\theta$ is normally distributed. Notice as well that the cluster related to the *AB* genotype appears to show correlation. This implies that $\varepsilon_{A}$ and $\varepsilon_{B}$ are correlated. This is in agreement with the fact that PCR should have a similar effect on DNA fragments related to the different alleles, as they have the same length and almost identical sequences. These results motivated the use of a normal mixture model defined in the following way:

$$[\boldsymbol{\theta}_{i,j} \mid \mathbf{C}_{i,j} = \mathbf{c}] = \boldsymbol{\gamma}_{\mathbf{c},i} + \boldsymbol{\varepsilon}_{\mathbf{c},i,j} \qquad (2)$$

with $\mathbf{C}_{i,j} = (C_{A,i,j}, C_{B,i,j})'$ representing the unobserved true copy number of alleles *A* and *B* for SNP $i$ on sample $j$, $\mathbf{c} = (c_A, c_B)'$ the possible values $\mathbf{C}_{i,j}$ can take, $\boldsymbol{\gamma}_{\mathbf{c},i} = (\gamma_{A,c_A,i}, \gamma_{B,c_B,i})'$ accounting for the shifts in location caused by the probe effect, and $\boldsymbol{\varepsilon}_{\mathbf{c},i,j}$ a bivariate normal error with mean 0 and copy-number-specific covariance matrix $\Sigma_{\mathbf{c},i}$, which is defined by the allele-copy-number-specific standard deviations $\sigma_{A,c_A,i}$ and $\sigma_{B,c_B,i}$ and the copy-number-pair-specific correlation $\rho_{\mathbf{c},i}$.
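Each mixture component described above is a bivariate normal determined by a mean vector, two standard deviations, and a correlation. A minimal sketch of the corresponding log density follows; the function name, argument names, and example parameter values are hypothetical, and the code is not taken from the oligo package.

```python
import math

def bivariate_normal_logpdf(theta, mean, sd_a, sd_b, rho):
    """Log density of a bivariate normal at theta = (theta_A, theta_B).

    mean is the component location (gamma_A, gamma_B); sd_a, sd_b, and rho
    define the covariance matrix
        [[sd_a**2,           rho * sd_a * sd_b],
         [rho * sd_a * sd_b, sd_b**2          ]].
    """
    za = (theta[0] - mean[0]) / sd_a
    zb = (theta[1] - mean[1]) / sd_b
    one_minus_r2 = 1.0 - rho * rho
    quad = (za * za - 2.0 * rho * za * zb + zb * zb) / one_minus_r2
    log_norm = math.log(2.0 * math.pi * sd_a * sd_b) + 0.5 * math.log(one_minus_r2)
    return -0.5 * quad - log_norm

# Hypothetical AB cluster with positively correlated errors: a point displaced
# along the correlation direction is more likely than one displaced against it.
ab_mean = (10.0, 10.0)
print(bivariate_normal_logpdf((10.3, 10.3), ab_mean, 0.3, 0.3, 0.8))
print(bivariate_normal_logpdf((10.3, 9.7), ab_mean, 0.3, 0.3, 0.8))
```

The second call returns a much lower value than the first, which is the information a model assuming independent alleles would discard.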

For this model to be useful we need reliable parameter estimates. We also need a computationally practical solution, as the model is fit for each SNP (100K, 500K). To do this, we took advantage of the large database of normal individuals described in Section 2. These were genotyped using the CRLMM algorithm (Carvalho et al., 2007), which has an error rate well below 1%. We therefore assume that, for these individuals, $C_{A}$ and $C_{B}$ are known. This makes the estimation of the parameters in Eq. (2) straightforward: because we know $\mathbf{C}$ for all these samples we can estimate the $\gamma$s by simply using:

$$\hat{\gamma}_{A,c_A,i} = \frac{1}{N_{c_A,i}} \sum_{j:\,C_{A,i,j} = c_A} \theta_{A,i,j}$$

with $N_{c_A,i}$ the number of samples with genotypes implying $C_{A,i,j} = c_A$. The covariance matrix $\Sigma_{\mathbf{c},i}$ is computed in a similar way, namely using the sample covariance matrix of $\boldsymbol{\theta}_{i,j}$ for samples $j$ implying $\mathbf{C}_{i,j} = \mathbf{c}$.^{1} Because we assume the $\varepsilon$s are normal, these sets of parameter estimates define the conditional distributions for $\mathbf{C}$ = (2, 0), (1, 1), and (0, 2). Next we assume that the behavior of the $\theta_{A}$ values for $C_{A} = c_{A}$ is similar for all values of $C_{B}$, and vice versa (no cross-hybridization). We then infer the conditional means for $\mathbf{C}$ = (0, 0), (0, 1), (1, 0). For example, the conditional mean for SNP $i$ when $\mathbf{C}$ = (0, 1) will be $(\gamma_{A,0,i}, \gamma_{B,1,i})$. The covariance matrix is inferred in a similar way.
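The estimation of the location parameters from genotyped training samples, and their recombination for unobserved pairs such as (0, 1), can be sketched as follows. Plain means are used here for brevity, whereas the paper reports using robust versions; the function names, data layout, and training values are hypothetical.

```python
from statistics import mean

# Allele-specific copy numbers (c_A, c_B) implied by each diploid genotype.
GENOTYPE_TO_COPIES = {"AA": (2, 0), "AB": (1, 1), "BB": (0, 2)}

def estimate_gammas(samples):
    """Estimate allele- and copy-number-specific means for one SNP.

    samples is a list of (genotype, theta_A, theta_B) tuples.
    Returns a dict keyed by (allele, copy_number).
    """
    pooled = {}
    for genotype, theta_a, theta_b in samples:
        c_a, c_b = GENOTYPE_TO_COPIES[genotype]
        pooled.setdefault(("A", c_a), []).append(theta_a)
        pooled.setdefault(("B", c_b), []).append(theta_b)
    return {key: mean(vals) for key, vals in pooled.items()}

def conditional_mean(gammas, c_a, c_b):
    """Mean of (theta_A, theta_B) for a copy number pair, assuming no cross-hybridization."""
    return (gammas[("A", c_a)], gammas[("B", c_b)])

# Hypothetical training data for one SNP.
training = [
    ("AA", 11.0, 8.0), ("AA", 11.2, 8.2),
    ("AB", 10.0, 10.1), ("AB", 10.2, 9.9),
    ("BB", 8.1, 11.1), ("BB", 7.9, 10.9),
]
gammas = estimate_gammas(training)
print(conditional_mean(gammas, 0, 1))  # a pair never observed in normal samples
```

The pair (0, 1) borrows its A coordinate from the BB samples and its B coordinate from the AB samples, which is exactly the no-cross-hybridization assumption stated in the text.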

The above procedure permits us to predict the distributions of $(\theta_{A}, \theta_{B})$ for cases with total copy number 0, 1, or 2. Equation (1) becomes particularly useful when trying to predict these distributions for allele-specific copy numbers greater than 2. We do this by first using the estimates of $\gamma_{A,0,i}$, $\gamma_{A,1,i}$, and $\gamma_{A,2,i}$ as outcomes in Eq. (1) for values of $C_{A}$ = 0, 1, 2, respectively. We then fit Eq. (1) and obtain estimates of the background $\beta_{A,i}$ and the probe effect $\phi_{A,i}$, which permit us to predict $\gamma_{A,3,i}$ for $C_{A} = 3$.
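One way to carry out this extrapolation is sketched below. It assumes the noise-free form of Eq. (1), in which the intensity 2^γ is linear in copy number with intercept β and slope φ, and fits that line by ordinary least squares on the intensity scale. The fitting criterion is an assumption made for this sketch, since the text does not specify one, and the function names and numeric values are hypothetical.

```python
import math

def fit_background_and_probe_effect(gammas_log2):
    """Least-squares line through (c, 2**gamma_c) for c = 0, 1, 2.

    Returns (beta, phi): the intercept is the background and the
    slope is the probe effect, both on the intensity scale.
    """
    cs = [0.0, 1.0, 2.0]
    ys = [2.0 ** g for g in gammas_log2]
    c_bar = sum(cs) / len(cs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((c - c_bar) * (y - y_bar) for c, y in zip(cs, ys))
    sxx = sum((c - c_bar) ** 2 for c in cs)
    phi = sxy / sxx
    beta = y_bar - phi * c_bar
    return beta, phi

def extrapolate_gamma(beta, phi, copy_number=3):
    """Predicted log2 location for a copy number outside the training range."""
    return math.log2(beta + phi * copy_number)

# Hypothetical estimated locations for allele A at copy numbers 0, 1, 2.
gamma_a = [math.log2(200.0), math.log2(600.0), math.log2(1000.0)]
beta_a, phi_a = fit_background_and_probe_effect(gamma_a)
print(beta_a, phi_a, extrapolate_gamma(beta_a, phi_a, 3))
```

Because the prediction for copy number 3 rests entirely on the assumed linearity, any curvature in the true intensity response shows up as a shifted location for the extrapolated clusters.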

We now describe how we infer $\Sigma_{\mathbf{c},i}$ for cases other than $\mathbf{C}$ = (2, 0), (1, 1), (0, 2) using the estimates we already have. For the A and B variance components (the diagonal entries) of the covariance matrix, we simply assume they depend only on $c_{A}$ and $c_{B}$, respectively. For $c_{A} > 2$ and $c_{B} > 2$ we assume the same variance as for $c_{A} = 2$ and $c_{B} = 2$, respectively. We therefore use the estimates of the six standard deviation parameters, $\sigma_{A,0,i}$, $\sigma_{A,1,i}$, $\sigma_{A,2,i}$, $\sigma_{B,0,i}$, $\sigma_{B,1,i}$, and $\sigma_{B,2,i}$, and do not need to predict any new values. The correlation component is a bit more difficult. We assume that the correlation coefficient when $C_{A} > 0$ and $C_{B} > 0$ is the same as for $\mathbf{C}$ = (1, 1). The rationale for this is that correlations are due to PCR effects being different from sample to sample. Thus, if both allele fragments are present, the resulting quantities will be similar regardless of the starting quantities. When one of the two alleles is not present (PCR no longer makes it grow), we assume that the correlation for the case where $C_{A} > 0$ but $C_{B} = 0$ is the same as for $\mathbf{C}$ = (2, 0), and that for $C_{A} = 0$ but $C_{B} > 0$ it is the same as for $\mathbf{C}$ = (0, 2). For $\mathbf{C}$ = (0, 0) we simply assume independence. With these assumptions in place we can produce conditional expectations for any value of $\mathbf{C}$ given the observed $\boldsymbol{\theta}$s, described as follows.
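The rules above for filling in the covariance of unobserved copy number pairs amount to a small lookup. The sketch below encodes them directly; the function name, the dictionary layout, and the numeric estimates are hypothetical.

```python
def infer_sd_and_correlation(c_a, c_b, sd_a, sd_b, rho):
    """Return (sd_A, sd_B, correlation) for any copy number pair.

    sd_a and sd_b map copy numbers 0, 1, 2 to estimated standard deviations.
    rho maps the observed pairs (2, 0), (1, 1), (0, 2) to estimated correlations.
    """
    s_a = sd_a[min(c_a, 2)]   # copy numbers above 2 reuse the estimate for 2
    s_b = sd_b[min(c_b, 2)]
    if c_a > 0 and c_b > 0:
        r = rho[(1, 1)]       # both alleles present
    elif c_a > 0:
        r = rho[(2, 0)]       # only allele A present
    elif c_b > 0:
        r = rho[(0, 2)]       # only allele B present
    else:
        r = 0.0               # neither allele present: independence
    return s_a, s_b, r

# Hypothetical estimates for one SNP and one strand.
sd_a = {0: 0.40, 1: 0.25, 2: 0.20}
sd_b = {0: 0.45, 1: 0.30, 2: 0.22}
rho = {(2, 0): 0.10, (1, 1): 0.80, (0, 2): 0.15}
print(infer_sd_and_correlation(2, 1, sd_a, sd_b, rho))
print(infer_sd_and_correlation(3, 0, sd_a, sd_b, rho))
```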

With the model parameter estimates in place, we are able to provide posterior probabilities for allele-specific copy number. Furthermore, we can compute these posterior probabilities for total copy number:

$$\Pr(\mathbf{C}_{i,j} = \mathbf{c} \mid \boldsymbol{\theta}_{i,j}) = \frac{[\boldsymbol{\theta}_{i,j} \mid \mathbf{C}_{i,j} = \mathbf{c}]\,\Pr(\mathbf{C}_{i,j} = \mathbf{c})}{\sum_{\mathbf{c}'} [\boldsymbol{\theta}_{i,j} \mid \mathbf{C}_{i,j} = \mathbf{c}']\,\Pr(\mathbf{C}_{i,j} = \mathbf{c}')}, \qquad \Pr(C_{i,j} = c \mid \boldsymbol{\theta}_{i,j}) = \sum_{\mathbf{c}:\,c_A + c_B = c} \Pr(\mathbf{C}_{i,j} = \mathbf{c} \mid \boldsymbol{\theta}_{i,j}),$$

where $[\boldsymbol{\theta}_{i,j} \mid \mathbf{C}_{i,j} = \mathbf{c}]$ is the bivariate normal distribution defined by the mixture model with the estimated parameters. The marginal probability of the $\mathbf{C}$ pair can be pre-specified and used to control specificity and sensitivity for any copy number value. We can obtain meaningful values by decomposing the probability into: $\Pr(C_{A,i,j} = c_A, C_{B,i,j} = c_B) = \Pr(C_{A,i,j} = c_A, C_{B,i,j} = c_B \mid C_{A,i,j} + C_{B,i,j} = c)\,\Pr(C_{A,i,j} + C_{B,i,j} = c)$. The first component relates to the proportion of each genotype in the population and can be computed using the Hardy-Weinberg equilibrium for diploids ($C_{A,i,j} + C_{B,i,j} = 2$). The second component relates to the probability of each alteration, which is unknown. We recommend that the user define these probabilities to control specificity and sensitivity. For the examples shown in this paper we assigned equal probabilities to each total copy number value considered.^{2} Once we have calculated the probabilities above we can provide estimates of copy number by, for example, computing the expected value of $C = C_{A} + C_{B}$.
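The posterior calculation described above is an application of Bayes' rule over the set of copy number pairs. A minimal sketch follows, assuming the component log likelihoods have already been evaluated; the function names, the log likelihood values, and the prior values are hypothetical.

```python
import math

def posterior_over_pairs(log_likelihood, prior):
    """Normalized posterior over copy number pairs (c_A, c_B).

    log_likelihood and prior are dicts with the same keys.
    """
    shift = max(log_likelihood.values())  # avoids underflow in exp
    weights = {c: math.exp(log_likelihood[c] - shift) * prior[c]
               for c in log_likelihood}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

def posterior_total(pair_posterior):
    """Posterior of the total copy number, summing over pairs with the same total."""
    total = {}
    for (c_a, c_b), p in pair_posterior.items():
        total[c_a + c_b] = total.get(c_a + c_b, 0.0) + p
    return total

def expected_total(pair_posterior):
    """Posterior mean of C_A + C_B."""
    return sum((c_a + c_b) * p for (c_a, c_b), p in pair_posterior.items())

def hardy_weinberg_given_diploid(freq_a):
    """Pr(pair | total = 2) under Hardy-Weinberg with A-allele frequency freq_a."""
    q = 1.0 - freq_a
    return {(2, 0): freq_a ** 2, (1, 1): 2.0 * freq_a * q, (0, 2): q ** 2}

# Hypothetical log likelihoods for one SNP in one sample, with a flat prior.
loglik = {(1, 1): -3.0, (2, 1): -1.0, (1, 2): -6.0}
flat = {c: 1.0 / len(loglik) for c in loglik}
post = posterior_over_pairs(loglik, flat)
print(posterior_total(post), expected_total(post))
```

Raising the prior weight on the diploid pairs makes the procedure more conservative about calling alterations, which is the specificity and sensitivity control mentioned in the text.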

A summary of the algorithm:

- For each array, we obtain the pre-processed probe-level log intensities from SNP-RMA, the preprocessing algorithm used by CRLMM. The resulting measurements are $\theta_{A,+}$, $\theta_{A,-}$, $\theta_{B,+}$, $\theta_{B,-}$ for each SNP.
- We estimate the conditional probability of these measurements, given allele-specific copy number. We assume a bivariate normal for the *A* and *B* alleles at each copy number pair. This reduces the number of parameters greatly, and we can estimate them precisely using a large training set. We use genotype calls to treat the allele-specific copy number as known. We do this independently for sense (+) and antisense (−).
- We assume the prior probability for the joint distribution of $C_{A}$ and $C_{B}$ is a uniform distribution.
- For a new dataset, we use the above estimates to calculate the posterior probability for $C_{A}$ and $C_{B}$ being each admissible pair $(c_{A}, c_{B})$ ($K$ is the maximum copy number permitted). We average the sense and antisense results.
- Finally, we compute the posterior probability of $C_{A} + C_{B}$ taking each possible value and the posterior mean of $C_{A} + C_{B}$.
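The last two steps of the summary, combining the strand-specific posteriors and summarizing the total copy number, can be sketched as follows. The function names and the posterior values are hypothetical, and the per-strand posteriors are assumed to have been computed already.

```python
def average_strands(posterior_sense, posterior_antisense):
    """Average two posteriors over copy number pairs, one per strand."""
    keys = set(posterior_sense) | set(posterior_antisense)
    return {k: 0.5 * (posterior_sense.get(k, 0.0) + posterior_antisense.get(k, 0.0))
            for k in keys}

def posterior_mean_total(pair_posterior):
    """Posterior mean of C_A + C_B from a posterior over pairs."""
    return sum((c_a + c_b) * p for (c_a, c_b), p in pair_posterior.items())

# Hypothetical strand-specific posteriors for one SNP in one sample.
sense = {(1, 1): 0.8, (2, 1): 0.2}
antisense = {(1, 1): 0.6, (2, 1): 0.4}
combined = average_strands(sense, antisense)
print(combined, posterior_mean_total(combined))
```

A plain average gives both strands equal weight; a strand that misbehaves for a given SNP therefore pulls the combined result with it.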

## 5. Results

We now describe some of the applications of the hierarchical model described above. In general we refer to our procedure as the Copy Number–Robust Linear Model and Mixture Model (CN-RLMM) procedure. The robustness is attained by using medians and robust variance estimates in place of means and variances.

Figure 3 gives the SNP-specific bivariate normal distribution of $\boldsymbol{\theta}$ for $\mathbf{C}$ = (0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0), (0, 3), (1, 2), (2, 1), (3, 0), depicted as ellipses (95% confidence regions). These are estimated from the control data described in Section 2. Figure 3A,B gives the sense- and antisense-specific distributions for a SNP on chromosome X. We observe that the extrapolated distributions of $\boldsymbol{\theta}$ given $\mathbf{C}$ = (0, 1), (1, 0) coincide with the observed $\boldsymbol{\theta}$s from 45 male samples that were not used in training. Similarly, Figure 3C,D gives the sense- and antisense-specific distributions for a SNP on chromosome 21, and we observe that our extrapolated distributions coincide with the observed $\boldsymbol{\theta}$s from 20 *trisomy 21* samples. This demonstrates that our assumptions seem to provide reasonable estimates of the conditional distributions given copy number in the cases predicted with Eq. (1) ($\mathbf{C}$ = (0, 0), (0, 1), (1, 0), (0, 3), (1, 2), (2, 1), (3, 0)).

In Figure 4A,B, we demonstrate how our results have much better precision than CNAT and values with and without probe-sequence and fragment length corrections. We achieve this precision without any loss of accuracy. Note that we could not get PLASQ to work with our data and no software is available to implement CARAT. The preprocessing used by dChip is very similar to CNAT and thus we expect results to be the same. Keep in mind that the smoothing step is not being assessed. We observe that the degree of improvement is not equivalent for copy number 3 and copy number 1. This is expected because it is easier to detect a 2-fold difference (copy number 2 versus 1) than to detect a 1.5-fold difference (copy number 3 versus 2).

**Figure 4.** (**A**) Expected copy number given preprocessed log intensities for 817 SNPs on chromosome 21 of 20 Down syndrome patients (with identified trisomy 21). (**B**) Expected copy number given preprocessed log intensities for 807 SNPs on chromosome ...

The most useful application of our results is that we provide improved single-point copy number estimates with reliable uncertainty assessment without the need to re-calibrate for new samples. Note that we can easily control our false positive rate by simply restricting calls to SNPs with posterior probabilities close to 1. Figure 4C demonstrates that we can get usable single-point copy number estimates for a large number of SNPs. Notice that the worst performance is observed for CN = 3. This is likely due to the fact that we used Eq. (1) to extrapolate (as done by CARAT).
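Restricting calls to SNPs with a posterior probability close to 1 can be sketched as a simple thresholding rule. The function name and the threshold of 0.99 are hypothetical choices for illustration; the paper does not prescribe a specific cut-off.

```python
def confident_call(total_posterior, threshold=0.99):
    """Return the most probable total copy number, or None when uncertain.

    total_posterior maps each total copy number to its posterior probability.
    """
    best = max(total_posterior, key=total_posterior.get)
    return best if total_posterior[best] >= threshold else None

# Hypothetical posteriors for two SNPs.
print(confident_call({1: 0.001, 2: 0.004, 3: 0.995}))
print(confident_call({1: 0.05, 2: 0.55, 3: 0.40}))
```

Raising the threshold lowers the false positive rate at the cost of leaving more SNPs without a call.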

## 6. Discussion

We have presented a mixture model approach that permits us to obtain improved copy number estimates as well as reliable single-point copy number calls. A major advantage of our methodology over the best existing ones (e.g., CARAT and PLASQ) is that we explicitly model the conditional joint distribution of the intensities given the copy number values. This permits us to model the strong correlation that sometimes exists between A and B and exploit this information to improve bottom-line results. This advantage is best exemplified by Figure 3C, where the clusters for **C** = (2, 1), (1, 1), and (1, 2) are usefully separable only if we take this correlation into account. Furthermore, avoiding the linearity assumption made by these other procedures seems to help as well. This is best demonstrated by the fact that we perform worst in cases where we rely on this assumption (i.e., making calls for CN = 3). Finally, because we use training data to fit the mixture models, the procedure is entirely linear. Other procedures, such as CARAT and PLASQ, rely on non-linear algorithms that present many practical problems.

We have plans to extend and improve our approach in various ways. First, we plan to implement it for the more recent chips. Second, we believe this approach can be used with Illumina's SNP array and thus plan to try it with data from this platform. Third, we plan to add another level to the model that will permit us to borrow strength across the thousands of SNPs to better estimate the parameters of the conditional probabilities; for this, we plan to use an approach similar to that of CRLMM. Fourth, we plan to look for ways to avoid using the linearity assumption to infer the parameters of conditional distributions when *C* > 2; we plan to use general regression approaches that predict these parameters from the known parameters for *C* ≤ 2. We can train this regression model with *trisomy 21* data (*C* = 3) and design experiments to be able to train for *C* > 3. Fifth, we plan to look for better ways of combining the results from sense and antisense probes. It is desirable to detect and ignore misbehaving strands. Sixth, and finally, we have observed correlation between parameter estimates coming from proximal locations on the chromosome. This could be due to the fact that various SNPs are on each of the fragments that are amplified. We will explore ways to exploit this finding.

It is possible that the reference set we use has an influence on our results. In the near future, we will study this problem in more detail, and we will substantially increase the size of the reference set to reduce the effect of outlier samples. By combining various publicly available assessment experiments, we will develop a comparison protocol for analysis methods. This will help us determine not only which methods work better, but also whether subsets of the reference set provide better results.

Notice that we did not offer any solutions for the smoothing step, as we are more interested in developing techniques for single-point estimates. We expect some of the existing techniques to work well when applied to our estimates of copy number. However, because we explicitly model the conditional probabilities, it is possible to develop new methods that impose the across-chromosome correlation through those probabilities instead of the actual copy number estimates.

## Footnotes

^{1}In this paper, we actually use robust (to outliers) versions of these sample means and covariances.

^{2}Remember that we perform the above calculation separately for the sense and antisense values. A final estimate of the posterior probability simply averages these two values.

## Acknowledgments

We thank Terry Speed, Giovanni Parmigiani, and Ingo Ruczinski for discussion and suggestions. The work of Wenyi Wang was partially funded by grant no. R01CA105090-01A1. The work of Rafael A. Irizarry was partially funded by 1P41HG004059 and P50 HL73994 (Core E). Benilton Carvalho was funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES/Brazil). Aravinda Chakravarti was supported by the NIH (grants HG02757, MH60007) and the D.W. Reynolds Foundation.

## Disclosure Statement

No competing financial interests exist.

## References

- Bignell G.R., Huang J., Greshock J., et al. High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res. 2004;14:287–295.
- Carvalho B., Bengtsson H., Speed T.P., et al. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007;8:485–499.
- Collins F.S., Brooks L.D., Chakravarti A. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 1998;8:1229–1231.
- Gribble S.M., Kalaitzopoulos D., Burford D.C., et al. Ultra-high-resolution array painting facilitates breakpoint sequencing. J. Med. Genet. 2007;44:51–58.
- Huang J., Wei W., Zhang J., et al. Whole genome DNA copy number changes identified by high-density oligonucleotide arrays. Hum. Genomics. 2004;1:287–299.
- Huang J., Wei W., Chen J., et al. CARAT: a novel method for allelic detection of DNA copy number changes using high-density oligonucleotide arrays. BMC Bioinform. 2006;7:83.
- Huber W., von Heydebreck A., Sueltmann H., et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18(Suppl. 1):S96–S104.
- Irizarry R., Hobbs B., Collin F., et al. Exploration, normalization, and summaries of high-density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264.
- Ishikawa S., Komura D., Tsuji S., et al. Allelic dosage analysis with genotyping microarrays. Biochem. Biophys. Res. Commun. 2005;333:1309–1314.
- Kennedy G.C., Matsuzaki H., Dong S., et al. Large-scale genotyping of complex DNA. Nat. Biotechnol. 2003;21:1233–1237.
- Komura D., Nishimura K., Ishikawa S., et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biol. 2006;6:9.
- Laframboise T., Harrington D., Weir B.A. PLASQ: A generalized linear model-based procedure to determine allelic dosage in cancer cells from SNP array data. Biostatistics. 2007;8:323–336.
- Li C., Wong W.H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA. 2001;98:31–36.
- Nannya Y., Sanada M., Nakazaki K., et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65:6071–6079.
- Peiffer D.A., Le J.M., Steemers F.J., et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 2006;16:1136–1148.
- Rabbee N., Speed T.P. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics. 2006;22:7–12.
- Rocke D.M., Durbin B. A model for measurement error for gene expression arrays. J. Comput. Biol. 2001;8:557–569.
- Sharp A.J., Hansen S., Selzer R.R., et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 2006;38:1038–1042.
- Wu Z. Irizarry R.A. Gentleman R., et al. A model based background adjustment for oligonucleotide expression arrays. J. Am. Statist. Assoc. 2004;99:909–917.
- Zhao X. Li C. Paez J.G., et al. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004;64:3060–3071. [PubMed]

**Mary Ann Liebert, Inc.**

Estimating Genome-Wide Copy Number Using Allele-Specific Mixture Models. Journal of Computational Biology. Sep 2008; 15(7):857.
