- Journal List
- Biostatistics
- PMC3006124

# A multilevel model to address batch effects in copy number estimation using SNP arrays

^{*}

^{*}To whom correspondence should be addressed.

## Abstract

Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs in the genome. Genomewide association studies (GWAS) may simultaneously screen for copy number phenotype and single nucleotide polymorphism (SNP) phenotype associations as part of the analytic strategy. However, genomewide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involves multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient *post hoc* quality control procedures to exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in the quantile-normalized intensities, while the latter illustrates the robustness of our approach to a data set in which approximately 27% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. The software is open source and implemented in the R package *crlmm* at Bioconductor (http:www.bioconductor.org).

**Keywords:**Bioinformatics, Hierarchical models, DNA copy number variations, Single nucleotide polymorphism array

## 1. INTRODUCTION

Segmental changes in DNA copy number arise through genomic rearrangements that cause insertions or deletions of genomic fragments. Such rearrangements are thought to arise most commonly via nonallelic homologous recombination in regions that contain low copy repeats (Gu *and others*, 2008), and can occur in the germline during meiosis as well as during mitosis in somatic cells. Many of the genomic rearrangements that affect DNA copy number are likely to be neutral with respect to phenotype. For instance, an extensive list of deletions and amplifications has been catalogued in apparently normal HapMap individuals (Redon *and others*, 2006, Kidd *and others*, 2008). However, genomic rearrangements that occur in regions that disrupt gene function or alter the copy number of genes that are dosage sensitive can affect phenotypes. See Lupski for a recent review (Lupski, 2009). Alterations of DNA copy number are implicated in many diseases, including autism spectrum disorders (Autism Genome Project Consortium, 2007, Marshall *and others*, 2008), bipolar disease (Zhang *and others*, 2008), autoimmune disorders such as type I diabetes (McKinney *and others*, 2008), and cancer (Ma *and others*, 2009, Cappuzzo *and others*, 2009, Woo *and others*, 2009). For other heritable diseases such as schizophrenia, the role of recurrent copy number variants in disease remains elusive (Sutrala *and others*, 2007, Need *and others*, 2009).

Copy number variants spanning regions of the genome greater than one megabase (Mb) are detectable by cytogenetic techniques such as spectral karyotyping and fluorescence *in situ* hybridization. However, many changes to DNA copy number are thought to involve smaller segments of the genome that are below the level of resolution attainable by cytogenetic methods. High-throughput genotyping arrays enable the measurement of genotype and copy number across the genome. The resolution for detecting copy number variant in current platforms is on the order of thousands of base pairs and can therefore be used to identify segmental changes that are not detectable by spectral karyotyping (resolution: 5–10 Mb). Screening for alterations in copy number has identified genomic regions known to be involved in disease, such as the neurexins in autism (Autism Genome Project Consortium, 2007), as well as novel targets that suggest a role of less well-understood pathways in disease etiology. High-throughput genotyping platforms provide a useful genomic screen whereby loci exhibiting patterns of variation between normal and disease individuals can be identified and followed. While most genotype calling algorithms are highly concordant for the vast majority of single nucleotide polymorphisms (SNPs), copy number estimation is more sensitive to technological artifacts and differences in the preprocessing and normalization steps.

This paper is organized as follows. Section 2 outlines our motivation for developing a multilevel model to address batch effects and improve estimates of uncertainty. Section 3 defines a theoretical framework for copy number in hybridization-based platforms and the challenges of adapting this model to high-throughput genotyping arrays. Section 4 describes an estimation algorithm that is motivated by many of the fundamental features of standard approaches, including maximum likelihood and empirical Bayes. In Section 5, we illustrate the main innovations of our approach using two experimental data sets and compare our results with software recommended by the array manufacturer. Concluding remarks are provided in Section 6.

## 2. MOTIVATION

The focus of this paper is the estimation of allele-specific copy number at polymorphic loci and the total copy number at nonpolymorphic loci as well as the corresponding uncertainty. Statistical methods that smooth locus-level estimates, such as segmentation algorithms or hidden Markov models (HMMs), are downstream of the work considered here and stand to benefit from improvements to the locus-level summaries. Additional background regarding our decision to implement a locus-level model for copy number and a more detailed discussion of downstream methods for copy number estimation are reviewed in Section A of the supplementary material (available at *Biostatistics* online). Figure 1 in Section H of the supplementary material (available at *Biostatistics* online) illustrates the difficulty of copy number estimation as batch (here, laboratory) and true copy number changes are likely to have very similar effects on the data. In general, our work is motivated by the observation of large batch effects in several genomewide data sets and the need for improved estimates of copy number uncertainty that can be propagated to the downstream algorithms reviewed in Sections A and F of the supplementary material (available at *Biostatistics* online).

The European ancestry controls for bipolar disease were run on 29 plates; we excluded 6 plates that had fewer than 20 samples after removing duplicates and samples with low quality (signal to noise ratio less than 5). (a) Scatter plots of the quantile **...**

### 2.1. Batch effects

Batch effects that arise from laboratory, temporal, or other experimental variation can be addressed in several ways. One approach is to consider batch effects as part of the quality control step in the analysis of genomewide arrays. For instance, Zhang *and others* (2008) excluded regions of copy number alteration that were associated with batch in their Genomewide association studies (GWAS) for bipolar disease. This approach is sensible if a relatively small number of loci are affected by batch. In such instances, smoothing the locus-level estimates using a HMM or a segmentation procedure may reduce the impact of batch effects on downstream analyses. An alternative approach is to apply a correction to the signal intensities that effectively gives each batch the same mean signal intensity, as in the GISTIC algorithm (Beroukhim *and others*, 2007). While GISTIC uses date as a surrogate for batch, one could also estimate batch using principal components analysis applied to the raw intensities (Golden Helix, 2009). Figure 1(a) and (b) illustrate the batch effect at one locus on chromosome 15 among the European ancestry controls in a data set for bipolar disease. The distribution of all *F*-statistics for chromosome 15 (Figure 1c) demonstrates that batch effects persist after quantile normalization and are not isolated to a small number of loci. The batch effects observed on chromosome 15 were typical of the other autosomes in this data set and intermediate to the relatively small batch effects observed in 270 HapMap samples and the very large batch effects we have observed in studies comprised of more than 5000 individuals (data not shown).

### 2.2. Shrinkage

Shrinkage of the variance estimates is likely to be useful for several reasons. First, the technology used to estimate the amount of DNA hybridized to the array affects the measured fluorescence of many probes in similar ways. Secondly, many SNPs have a low minor allele frequency or “unobserved” biallelic genotypes that complicate the estimation procedure discussed in Section 4. Third, shrinkage reduces the sensitivity of our approach to extreme values, such as variance estimates near zero. When propagated to HMMs or segmentation algorithms, the shrunken estimates of uncertainty may provide a less *ad hoc* mechanism for handling outliers and a resolution for detecting copy number alterations that depends both on the density of markers and the precision of the estimates.

## 3. MODEL

Batch effects appear to be an unavoidable feature of studies involving a large number of arrays, one that copy number estimation algorithms should take into account. As batch can be easily identified (Figure 1), we argue that batch can be successfully modeled. Here, we introduce a model for copy number estimation based loosely on an approach described by Wang *and others* (2008). Our method differs from Wang and others in several important ways. First, we model batch as a fixed effect. More generally, one can think of batch as a variable selection problem to be inferred at each locus (see Section 6). For the purpose of this paper, we treat batch as known. Secondly, we avoid using training data to estimate model parameters. Instead, we estimate model parameters using an algorithm that relies only on the experimental data (Section 4). Third, we provide prediction regions for loci with low minor allele frequencies and potentially unobserved genotypes. Fourth, we provide a solution for estimating copy number for nonpolymorphic probes in the most recent generation of genotyping platforms. Fifth, we shrink locus-level estimates of the variance and correlation across alleles that are often very noisy though a hierarchical model. Sixth, we propose a correction step that provides less biased estimates of copy number when many subjects have nonnormal copy number. Finally, software for fitting this model to the Affymetrix 5.0 and 6.0 platforms, as well as several Illumina platforms, is publicly available from Bioconductor.

### 3.1. A multilevel model

##### Polymorphic loci.

We propose a multilevel model for the locus-level intensities that is motivated by past work with other hybridization-based technologies. In particular, a general framework for modeling the normalized fluorescence intensities in gene expression arrays has been recently described (Wu and Irizarry, 2007) and is discussed in greater detail in Section A of the supplementary material(supplementary material available at *Biostatistics* online) . We model the observed intensity *I* for allele *k* at locus *i*, sample *j*, and batch *p* as follows:

The average fluorescence arising from optical background and nonspecific hybridization are collectively parametrized by *ν* and referred to as background. The slope, *φ*, in model (3.1) provides an estimate of the change in the average intensity at a given locus per each integer increase in the allelic copy number. Both the background and slope are allowed to depend on the SNP *i* and the batch *p*. See Figure 1a for an illustration of these parameters in the context of an *A* versus *B* intensity scatterplot of a single SNP.

The errors *δ* and ε in model (3.1) account for array-to-array variation within a batch of the background and slope terms, respectively. These terms are each approximately log normal and assumed to be independent across loci and independent of each other:

Note, however, that the error terms are not independent across alleles. In particular, Figure 1(a) and (b) suggest that the correlation of the *A* and *B* intensities is most pronounced for samples with allelic copy numbers greater than 0. The correlation reflects cross-hybridization of the *A* allele probe to the *B* allele target sequence (and vice versa). As in Wang *and others* (2008), we assume that the joint distribution of the log intensities conditional on the allelic copy number is approximately bivariate normal:

The diagonal elements of *Σ* are as follows:

The correlation of the *A* and *B* intensities, *ρ*_{ip}, is SNP and batch specific.

At the next level of the model specification, prior distributions are selected for Σ. A commonly used prior is an inverse Wishart. However, we view this prior as too restrictive as a single degree of freedom is required for the variances. As Σ is a 2×2 matrix, we have considerable flexibility for exploring different priors for the standard deviations and correlation. We use independent inverse chi-squared priors with *d*_{A} and *d*_{B} degrees of freedom for the background,

and slope variances,

The terms *t*_{A,p}^{2},*t*_{B,p}^{2},*s*_{A,p}^{2}, and *s*_{B,p}^{2} in (3.4) and (3.5) correspond to the median variance of the background and slope terms, respectively. Note that these values are the same for all loci and depend only on the batch *p*. For the correlation structure, we use the prior *ρ*_{ip}∼Beta(*α*,*β*),where*α* and *β* are estimated empirically and place more mass at typical values. The motivation for an informative prior on the correlation is that cross-hybridization of the *A* and *B* alleles gives rise to positive correlations. In our experience, negative correlations (after conditioning on the allelic copy number) are spurious and usually occur when an insufficient number of observations are available to estimate the correlation.

##### Nonpolymorphic loci.

For nonpolymorphic probes, only one allele is interrogated at each locus. We generically denote this allele as *T*. Again, we propose a theoretical model for the observed intensity for allele *T* at locus *i*, sample *j* and batch *p* as a convolution of fluorescence from optical background and nonspecific binding of other probes, *ν*_{T}, and fluorescence arising from specific hybridization of the probe to the target sequence, *φ*_{T}. Explicitly,

Again, the background and signal parameters are allowed to depend on both the nonpolymorphic locus *i* and batch *p*. The error terms corresponding to background and signal account for array to array variation within a batch and are assumed to be log normal, independent across loci, and independent of each other. Inverse chi-squared priors for *δ*_{T} and ε_{T} variances complete the specification of the hierarchical model:

##### Challenges.

There are several challenges to fitting models (3.1) and (3.6). First, the parameters **ν**and **φ** cannot be reliably estimated from training data because of batch effects. Therefore, *ν*_{A}, *ν*_{B}, *ν*_{T}, *φ*_{A}, *φ*_{B}, *φ*_{T}, *c*_{A}, *c*_{B}, and *c*_{T} are allowed to depend on both the locus and batch and must be estimated from the experimental dataset. Secondly, the error terms *δ*_{A}, *δ*_{B}, *δ*_{T}, ε_{A}, ε_{B}, and ε_{T} that capture within-batch variation of the background and signal intensities across arrays are not Gaussian. In principle, these parameters can be estimated using maximum likelihood or empirical Bayes. However, least squares and method of moments approaches to parameter estimation are well known to be biased, particularly when the variance of these parameters is large. The standard approach is a generalized linear model with an exponential link function, as employed by LaFramboise *and others* (2007). Such an approach requires an iterative estimation procedure that we view as impractical for platforms that interrogate millions of loci. Thirdly, for polymorphic loci, the covariance matrix is a function of the allelic copy number. Finally, outliers are common and robust-to-outlier approaches are needed. Again, least squares and method of moments are not robust to outliers. Taken together, the size of current genotyping platforms, the inevitability of batch effects in studies involving a large number of arrays, errors that are non-Gaussian, and the need for robustness has led us to develop an *ad hoc* approach motivated by the fundamental features of the standard approaches.

## 4. COPY NUMBER ESTIMATION ALGORITHM

We prescribe a general strategy for copy number estimation that (i) develops naive estimates of the allelic copy number that are taken to be known, (ii) uses a linear model to estimate batch- and locus-specific parameters for the background and slope terms, and (iii) updates the naive estimates of allelic copy number. Robust-to-outlier procedures for preprocessing and copy number estimation are emphasized. Several problems remain after steps (i)-(iii), including unobserved genotypes at many polymorphic loci and variance estimates that are based on a small number of observations. We propose solutions to each of these problems that take advantage of the large number of observations available from other loci.

Prior to copy number estimation, all arrays are quantile normalized to a target reference distribution. The normalized intensities are summarized to the level of the locus. For example, the Affymetrix 6.0 platform has 3 and occasionally 4 identical probes for each allele at polymorphic loci and 1 probe for each nonpolymorphic locus. For the polymorphic loci, we quantile normalize the raw intensities and then summarize the normalized values by the median. For the nonpolymorphic probes, the Affymetrix 6.0 platform has only one probe per target sequence and we use the quantile-normalized intensities directly.

### 4.1. Allele-specific copy number

The parameters for *ν*_{k},*φ*_{k},and *c*_{k} in model (3.1) are unknown for allele *k*∈{*A*,*B*}. As a first step, we genotype all the samples on the array using the *crlmm* software (Carvalho *and others*, 2007, Carvalho *and others*, 2010), obtaining genotype calls of AA, AB, and BB for the polymorphic loci. The genotype calls provide a naive estimate of the allele-specific copy number—an integer value of 0, 1, or 2 for each allele. We denote the naive estimates for the A and B alleles by *c*_{A}^{☆} and *c*_{B}^{☆}, respectively. We use quantile-based estimators, the median and the median absolute deviation (MAD), to obtain robust estimates of the mean () and variance () on the intensity scale for genotype *GT*. (The median is typically more robust to outliers than a trimmed mean.) For example, is computed as the median of intensities *I*_{B,ijp} for samples *j* with genotype *AA*. Inverse chi-squared priors with degrees of freedom , are used to shrink locus-specific estimates of the variance to the batch-specific median variance, . Note that the within-genotype centers are approximately normal regardless of the distribution of ** I**. We (and others) have observed that the relationship of the within-genotype centers is approximately linear with the integer copy number (Huang

*and others*, 2006, Wang

*and others*, 2008). Using the naive estimate of the integer copy number in the design matrix, we use weighted least squares regression to estimate

*ν*

_{A}and

*φ*

_{A}:

The errors *m*_{A,ijp} are approximately independent multi-Gaussian. We repeat the procedure for the B-allele to obtain batch- and locus-specific estimates of *ν*_{B} and *φ*_{B}. See Figure 2 of the supplementary material (available at *Biostatistics* online) for an illustration of the linear model. The naive estimates of allele-specific copy number are updated by subtracting the estimated background from the observed intensity and scaling by the slope coefficient. Specifically,

As discussed in Section 6, the assumption that the median intensity is linear with copy number appears reasonable for a limited range. We have observed departures from linearity for larger copy numbers as the fluorescence becomes more saturated. In practice, we constrain . The above prescription for copy number estimation is predicated on the assumption that at any given locus the median copy number is 2. We explore the robustness of this approach to misspecification of the initial values for allele-specific copy number in an experimental data set (Section 5) and through simulation in Section D of the supplementary material (available at *Biostatistics* online).

### 4.2. Unobserved genotypes and nonpolymorphic loci

For many polymorphic loci, the minor allele is rare and one or more of the three possible biallelic genotypes are not observed. For SNPs with genotype GT not observed, we impute *μ*_{A}^{GT} and *μ*_{B}^{GT} via regression. For example, to impute *μ*_{A}^{AA} for SNPs with genotype AA unobserved, we regress on and using a set of SNPs for which all 3 genotypes were observed. With estimates of the coefficients for and , we predict the value of *μ*_{A}^{AA} from the observed and at this locus. We repeat the procedure for the B allele to impute *μ*_{B}^{AA}. For a polymorphic locus with 2 genotypes not observed, we impute the unobserved within-genotype medians *via* regression again using SNPs with all 3 genotypes observed. The variance terms for the unobserved genotype GT, , and are obtained from the prior . A straightforward extension of our approach is to scale the prior variance by the uncertainty of the imputed values estimated using techniques such as multiple imputation.

For nonpolymorphic loci, the parameters for the background, *ν*_{T}, and the slope, *φ*_{T}, in model (3.6) are difficult to estimate as there are no genotype clusters to guide the estimation procedure. For each nonpolymorphic probe, we assume that the median of the observed intensities across samples in the batch corresponds to normal copy number. The current approach in the R package *crlmm* (see version number in Section I of the supplemental material available at *Biostatistics* online) is to impute *φ*_{T} from polymorphic loci in which all 3 genotypes were observed. Briefly, for SNPs with 3 biallelic genotypes observed, we fit a linear model using and as the explanatory variables and the corresponding slopes, and , as the response variables. The coefficients from this model are used to impute the slope parameter for nonpolymorphic loci, . Note that the background fluorescence, *ν*_{T}, is determined by the relationship . Transforming the nonpolymorphic intensities to the copy number scale is achieved by .

##### Contamination.

In many applications, DNA is isolated from a mixture of 2 or more cell types that may have different somatic alterations. As the DNA in the cell populations may differ, noninteger copy numbers are plausible. HMMs that assume integer copy number states are not appropriate. The transformation in (4.2) for polymorphic loci, and the analogous transformation for nonpolymorphic loci, allow one to plot as a function of the physical position to assess contamination. When contamination is likely to have occurred, a variety of nonparametric segmentation approaches are available that can be used to identify noninteger copy number gain and loss.

### 4.3. Uncertainty

Estimates of the uncertainty are important for downstream algorithms that smooth estimates of the copy number as a function of the physical position. As mentioned previously, a critical choice governing the suitability of a smoothing algorithm is the presence of a mixture of cell populations that can result in noninteger copy number. In the absence of cell contamination, we advocate an HMM that can be fit directly to bivariate normal scatterplots of the log A and log B intensities. (The approach of fitting a HMM directly to the log A and log B intensities was first described by Korn *and others* (2008).) When cell contamination is likely, we prefer nonparametric segmentation algorithms that can identify any noninteger shift in copy number. Our focus in this section is improving estimates of the uncertainty for the prediction regions of allele-specific copy number.

##### Integer copy number.

As HMMs can incorporate locus-specific estimates of the location and scale in the emission probabilities, HMMs for detecting copy number alterations can therefore be applied directly to the bivariate normal scatterplots without first transforming the intensities to the copy number scale (Korn *and others*, 2008). Conditional on the allelic copy number, the logarithm of the *I*_{A} and *I*_{B} intensities is approximately bivariate normal with a mean and covariance that is locus and batch specific, as in model (3.2). Again, our procedure utilizes naive estimates of the allelic copy number from the biallelic genotype calls to provide an estimate of Σ. To illustrate this approach, we describe the estimation of Σ for an SNP with biallelic genotype AA (*c*_{A}^{☆} = 2,*c*_{B}^{☆} = 0). From (3.3), the diagonal elements of Σ_{ip} are (Σ_{ip})_{11} = *σ*_{A,ip}^{2} and (Σ_{ip})_{22} = *τ*_{B,ip}^{2}. The background variance *τ*_{B,ip}^{2} is estimated as the MAD of the log intensities for the B allele across all subjects with genotype AA at locus *i*. The signal variance *σ*_{A,ip}^{2} is estimated as the MAD of the log intensities for the A allele across all subjects with genotype AA at locus *i*. Implicitly, we assume that the variance of *δ*_{A} is small relative to the variance of ε_{A} such that . The assumption that the variance is constant for *c*_{A}^{☆} greater than zero appears reasonable on the log scale. Similarly, an initial estimate for the correlation of the log intensities for the A and B alleles, *ρ*_{ip}, is estimated empirically among subjects with genotype AA. The within-genotype empirical estimates for the variance terms and the correlation parameter provide an initial estimate of Σ. These estimates can be very noisy when based on a small number of observations. Therefore, we shrink the initial estimates of Σ using inverse chi-squared priors as described in Section 3. Specifically, shrinkage estimates for the background and signal variances for (*c*_{A}^{☆} = 2,*c*_{B}^{☆} = 0) are obtained by

The count *N*_{GT,ip} denotes the number of subjects with genotype *GT* at locus *i* in batch *p*. The degrees of freedom for the priors, *d*_{A} and *d*_{B}, can be estimated as described in Lönnstedt and Speed (2001). Median values of the background and signal variance, denoted by *t* and *s*, respectively, are estimated across all loci and allowed to depend on the batch. In addition to the variances, we also shrink the empirical estimate of the correlation *ρ*_{i,p}. As motivated in Section 3, we suggest a Beta prior that puts most of the mass on typical values. The resulting covariance matrix, , can be used to plot prediction regions for any (*c*_{A}^{☆} > 0,*c*_{B}^{☆} = 0). For instance, see the ellipses for in Figure 1(b). The covariance matrices for (*c*_{A}^{☆} = 0,*c*_{B}^{☆} > 0), (*c*_{A}^{☆} > 0,*c*_{B}^{☆} > 0) and (*c*_{A}^{☆} = 0,*c*_{B}^{☆} = 0) are obtained using a similar procedure. By scaling by a sample-specific estimate of the variance across all loci, the variance estimate can incorporate information on the overall noise of the sample relative to other samples.

### 4.4. Common copy number variants

Our approach for estimating copy number uses robust estimates of the within-genotype location and scale of the intensities. In particular, we use medians for the location and MADs for the variance to limit the influence of outliers on parameter estimates. However, many regions of the genome appear to contain common variants in apparently normal individuals (McCarroll *and others*, 2008, Kidd *and others*, 2008), and many diseases may have regions that are commonly altered. For genomic locations where a large number of subjects harbor a copy number alteration, estimates of *ν* and *φ* can be biased. We propose an update for the background and slope parameters that provides additional robustness to regions with a large number of alterations. This procedure does not require *a priori* knowledge of the genomic locations of the common variants, an important feature as these are often not well characterized or highly variable across different populations.

The general strategy is to estimate the parameters *ν*_{A},*φ*_{A},*ν*_{B},*φ*_{B},*ν*_{T}, and *φ*_{T} as described previously. Using initial estimates of these parameters, we calculate the posterior probability of belonging to a prediction region corresponding to an aggregate copy number of 0, 1, 2, or 3. Given the total copy number, we assume that any of the integer (*c*_{A},*c*_{B}) combinations are equally likely *a priori*. At each locus, we tabulate the frequency for which the posterior probability of an amplification or deletion is greater than the posterior probability of normal copy number. If the frequency is uneven for amplifications and deletions, we recompute the within-genotype location and scale parameters after trimming observations from the tail of the distribution that has a greater frequency. See Section 5 for an application.

## 5. RESULTS

We illustrate our approach for copy number estimation using 2 data sets that were assayed on the Affymetrix 6.0 genotyping platform: a data set containing 26 individuals with chromosome 21 trisomy and 70 apparently healthy controls (referred to as the Chakravarti data set), and a GWAS for bipolar disease (dbGaP accession number phs000017.v3.p1). For the Chakravarti data set, we analyzed the 96 samples as a single batch and assessed the robustness of our estimation procedure to a large fraction of samples with nonbiallelic genotypes. For the bipolar data set, we use chemistry plate as a surrogate for batch and restrict our analysis to the 1094 European ancestry controls that were processed on 29 plates over a 2-month period. For both analyses, we used default settings for the priors as described in Section E of the supplementary material (available at *Biostatistics* online). A difficulty in comparing our method to the Birdseye and Canary algorithms in the Birdsuite software suggested on the Affymetrix Web site for copy number analysis is that these algorithms do not provide locus-level estimates of copy number. Rather, Birdsuite provides output from a HMM, Birdseye, for the detection of *de novo* copy number variants and a separate algorithm, Canary, for the detection of common copy number variants (Korn *and others*, 2008, McCarroll *and others*, 2008).

### 5.1. Common copy number variants

We applied our algorithm to the Chakravarti data set to explore the robustness of our approach to the assumption that the typical copy number is 2 at any given locus. As discussed in Section 4, our algorithm uses biallelic genotype calls to develop naive estimates of copy number and parameter estimates for the prediction regions. After 1 iteration of the algorithm, the estimates of copy number are biased toward small values as the trisomy samples skew the within-genotype estimates of location and scale (Figure 3). However, the posterior probability that a trisomy sample has 2 chromosomal copies at a given locus tends to be low, and these samples can be discounted during a second iteration of the algorithm that updates the linear model parameters as described in Section 4.4. Subtle shifts in the prediction regions following the update can be observed for 3 example SNPs in Figure 3. Boxplots of the copy number for all SNPs on chromosome 21 demonstrate the bias reduction (Figure 3d and e). A simulation to assess when the fraction of subjects with altered copy number overwhelms the assumptions of the linear model is included in Section D of the supplementary material (supplementary material available at *Biostatistics* online). We estimate that the breaking point is approximately 30%, beyond which the copy number estimates for the trisomy subjects begins to approach the noise level for copy number 2.

(a–c) The ellipses denote prediction regions for copy number 1, 2, and 3 before (dashed lines) and after (solid lines) bias adjustment for 3 SNPs on chromosome 21 in the Chakravarti data set. Boxplots of the copy number estimates for SNPs on chromosome **...**

While locus-specific copy number estimates are not available from the Birdsuite software, an overall copy number estimate for each chromosome is provided by the Birdseye algorithm for the purpose of assessing mosaicism. The overall chromosome 21 estimates plotted in Figure 3 suggest that both the normal and trisomy samples are mosaic in copy number. The reason for the apparent mosaicism is that the Birdseye HMM (Table 1) is not particularly robust to departures from the assumption that most individuals at a given locus have normal copy number. To more formally compare our approach to the Birdseye HMM, we fit a HMM available in the R package *VanillaICE* to the polymorphic loci (Scharpf *and others*, 2008). For both HMMs, we used the default settings for the software versions indicated in Section I of the supplementary material (supplementary material available at *Biostatistics* online). Assuming that the true copy number is 2 in the normal samples and 3 for the trisomy samples, we calculated the proportion of correct calls for each HMM. Our approach maintains high sensitivity and specificity for detecting alterations despite approximately 27% of the samples having a known copy number alteration. That is, naive estimates of copy number provided by the biallelic genotype calls can be incorrect in a large number of samples, but still provide unbiased estimates of copy number in regions that are commonly variant. In comparison to the Birdsuite calls based on the blended output of the Birdseye HMM and Canary fit to the full marker set (Table 1), our approach has both higher sensitivity in the trisomy subjects (0.955 versus 0.918) and comparable specificity in the normal subjects (0.996 versus 0.991, data not shown). When we fit the HMM to the full marker set, for which the copy number estimates at nonpolymorphic loci remain biased toward 2, the VanillaICE and merged results from Birdseye and Canary again perform comparably in terms of specificity for the normal subjects (0.991 versus 0.996), but the sensitivity for copy number estimates among the trisomy subjects decreases from 0.96 to 0.89, revealing the need for improved univariate prediction regions at the nonpolymorphic loci in future versions of *crlmm*. As the entire chromosome is assumed to be either copy number 2 or copy number 3, an inherent limitation of this comparison is the lack of a penalty for over-smoothing that would limit the detection of smaller microdeletions and amplifications. The comparable specificities and sensitivities in the analysis with the full set of markers (SNPs + nonpolymorphic markers [NPs]) suggests that the default settings for the two HMMs provide comparable smoothing. A more detailed explanation of the technical considerations for this comparison is provided in Section B of the supplementary material (available at *Biostatistics* online).

### 5.2. Batch effects

The batch effect for SNPs on chromosome 15 in Figure 1(c) was comparable to the batch effects observed at polymorphic and nonpolymorphic loci on other chromosomes in the European ancestry bipolar controls (data not shown), demonstrating that quantile normalization alone was insufficient for removing the batch effect. Figure 4(a) plots the normalized log intensities for the A and B alleles for SNP_A-4251622, highlighting samples from 2 of the 29 chemistry plates used for the bipolar controls. The corresponding estimates of total copy number from a linear model that ignores the batch effect (here, plate effect) demonstrates that plate explains much of the variability in the copy number estimates (*F*-statistic 53.93). As much of the variation in the log A versus log B scatterplots is attributable to plate, the ellipses representing our uncertainty of the copy number prediction in Figure 4(a) are inflated. Allowing the parameters *ν*_{A},*ν*_{B},*ν*_{T},*φ*_{A},*φ*_{B}, and *φ*_{T} to depend on batch, the resulting prediction regions in Figure 4(a) more accurately reflect the uncertainty of the copy number estimates and are more robust to differences across batch (Figure 4b). A comparison with the Birdsuite software on this data is is provided in Section C of the supplementary material (available at Biostatistics online).

## 6. DISCUSSION

In this paper, we propose a multilevel model that provides absolute estimates of allele-specific copy number at polymorphic loci and total copy number at nonpolymorphic loci. Our observation that batch effects and copy number changes are often indistinguishable in their effects on the data has led to an estimation procedure that defines batch-specific prediction regions for rare, nonbiallelic genotypes from more commonly observed biallelic genotypes. In particular, biallelic genotype calls of samples in the experimental data set provide naive estimates of allele-specific copy number that are used to derive robust-to-outlier parameters for the background and slope in a linear model fit on the intensity scale. Conditional on the naive estimates of copy number, the log A and log B intensities are correlated due to cross-hybridization of the probes for these alleles. As locus-specific estimates of the covariance are often based on a small number of observations, shrinking these estimates toward typical values provides additional robustness to unusually small or large variance estimates. The copy number estimates obtained from this approach are robust to batch effects and robust to a large proportion of individuals having a copy number alteration. Copy number estimates plotted versus physical position can be used to assess issues such as cell contamination.

Our procedure for locus-level estimation of copy number reduces the impact of batch effects in several ways. First, we quantile normalize the raw intensities for each sample to a target reference distribution. Quantile-normalizing samples to a target distribution ensures that the distribution of intensities is the same across batches and can reduce the occurrence of *in silico* batch effects incurred by how the samples were grouped by the software. Secondly, we do not rely on external data for training the model or a reference set for estimating ratios. In our experience, external data sets are unlikely to provide useful extrapolations due to (i) batch effects and (ii) latent biological differences between the external samples and the test samples. Third, we fit a multilevel model to the summarized intensities that allows locus-specific parameters for background and signal to depend on batch. Finally, we provide an option to iteratively estimate model parameters and compute copy number such that the bias of the estimated copy number in regions of the genome that are commonly altered is relatively robust to the assumption that the median copy number is 2. This approach does not require prior knowledge of the locations of regions that are commonly altered.

The R package *crlmm* may not be appropriate for some data sets. For the study of germline traits, our model is most useful when 25 or more samples have been processed together in a batch. Batches with fewer than 25 samples tend to have a large number of SNPs with unobserved biallelic genotypes. For such SNPs, the additional uncertainty from the imputation of the within-genotype moments further reduces the resolution for detecting copy number alterations. For the study of somatic cell diseases such as cancer, copy number alterations in a substantial fraction of the genome are not uncommon. The ability to accurately estimate the absolute copy number in cancer samples will depend to a large degree on an appropriate experimental design, in particular, whether normal controls were processed alongside the cancer samples throughout the experiment. In such a setting, we have demonstrated through simulation that copy number estimates for the test samples can be computed from model parameters that were estimated from only the normal controls with relatively small bias.

Of the models previously proposed in the literature, the models of Wang *and others* (2008) and Korn *and others* (2008) are the most similar to ours as both develop bivariate normal prediction regions for altered copy number. The Wang model provides allele-specific estimates of copy number that accounts for the correlation of A and B allele intensities. However, the Wang model is designed for an earlier version of the Affymetrix platform that contained only SNP probes and relies on training data to estimate model parameters. In addition, the Wang model does not address batch effects or explore shrinkage for improving variance estimates. The Korn model is similar to the Wang model with a few important differences. First, Korn *and others* recommend fitting their software by plate to address batch effects. In our experience, batch effects persist in by plate analyses with Birdsuite. Secondly, Birdsuite does not provide locus-specific estimates of copy number. Rather, Birdsuite houses separate algorithms for calling rare and common copy number variants that each borrow strength from neighboring loci to identify regions of copy number gain or loss. As locus-level estimates are not available, options to explore alternative smoothing algorithms, such as segmentation for samples that are mosaic in copy number, are not available. By contrast, *crlmm* advocates an approach in which the decision to explore segmentation or HMMs can be evaluated from visualizations of the locus-level summaries. In particular, mixtures of cell populations that give rise to noninteger copy number can be assessed by plotting the locus-level summaries.

Our model can be extended in several ways to improve the prediction regions for biallelic and nonbiallelic genotypes. First, our model assumes that the average intensity increases linearly with allelic dosage. The linearity assumption appears reasonable for many SNPs and can be used to help discriminate between outliers and sparsely populated non-biallelic genotypes. However, the linearity is not apparent for many SNPs and departures from linearity become more pronounced for allelic copy numbers greater than 2. Approaches that allow departures from linearity are a future direction of this work. Secondly, improvements to the univariate prediction regions at nonpolymorphic loci are needed. Again, approaches that relax the assumption of linearity are needed. Thirdly, we currently model batch as a fixed effect. A compromise between a fixed-effect and random-effect model that borrows strength across batch is likely to be very effective, particularly for small batches and SNPs with low minor allele frequencies. Finally, adjusting for sequence characteristics such as GC content and fragment length can be helpful for reducing the variance associated with the probe effect. We will explore methods that adjust for these factors along with batch effects in the future.

Our results provide a strong indication that a model-based approach for estimation of absolute, allele-specific copy number can be effective in large studies with pronounced batch effects, and that borrowing strength across loci can be useful for estimating the variance. Estimates of copy number and the corresponding uncertainty will be useful for downstream assessments of copy number–phenotype association.

## FUNDING

National Institutes of Health (1K99HG005015); CTSA grant to the Johns Hopkins Medical Institutions; and National Heart, Lung, and Blood Institute (5T32HL007024) to R.B.S. National Institutes of Health (R01GM083084) from the National Institute of General Medicine and National Center for Research Resource (5R01RR021967) to R.A.I.

## SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

**Supplementary Data:**

## Acknowledgments

The bipolar data set used for the analyses described in this manuscript was obtained from the database of Genotype and Phenotype (dbGaP) at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000017.v3.p1, and was provided through the Genetic Association Information Network (GAIN). We thank Erin M. Ramos and Lisa J. McNeil from the National Institutes of Health for help navigating dbGap. *Conflict of Interest*: None declared.

## References

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (553K) |
- Citation

- Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform.[BMC Bioinformatics. 2011]
*Eckel-Passow JE, Atkinson EJ, Maharjan S, Kardia SL, de Andrade M.**BMC Bioinformatics. 2011 May 31; 12:220. Epub 2011 May 31.* - Estimating genome-wide copy number using allele-specific mixture models.[J Comput Biol. 2008]
*Wang W, Carvalho B, Miller ND, Pevsner J, Chakravarti A, Irizarry RA.**J Comput Biol. 2008 Sep; 15(7):857-66.* - Using the R Package crlmm for Genotyping and Copy Number Estimation.[J Stat Softw. 2011]
*Scharpf RB, Irizarry RA, Ritchie ME, Carvalho B, Ruczinski I.**J Stat Softw. 2011 May 1; 40(12):1-32.* - New technologies provide insights into genetic basis of psychiatric disorders and explain their co-morbidity.[Psychiatr Danub. 2010]
*Rudan I.**Psychiatr Danub. 2010 Jun; 22(2):190-2.* - Genomic copy number variation, human health, and disease.[Lancet. 2009]
*Wain LV, Armour JA, Tobin MD.**Lancet. 2009 Jul 25; 374(9686):340-50. Epub 2009 Jun 15.*

- Sample processing obscures cancer-specific alterations in leukemic transcriptomes[Proceedings of the National Academy of Scie...]
*Dvinge H, Ries RE, Ilagan JO, Stirewalt DL, Meshinchi S, Bradley RK.**Proceedings of the National Academy of Sciences of the United States of America. 2014 Nov 25; 111(47)16802-16807* - Removing batch effects for prediction problems with frozen surrogate variable analysis[PeerJ. ]
*Parker HS, Corrada Bravo H, Leek JT.**PeerJ. 2e561* - Copy number polymorphisms near SLC2A9 are associated with serum uric acid concentrations[BMC Genetics. ]
*Scharpf RB, Mireles L, Yang Q, Köttgen A, Ruczinski I, Susztak K, Halper-Stromberg E, Tin A, Cristiano S, Chakravarti A, Boerwinkle E, Fox11 CS, Coresh J, Linda Kao WH.**BMC Genetics. 1581* - Tackling the widespread and critical impact of batch effects in high-throughput data[Nature reviews. Genetics. 2010]
*Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA.**Nature reviews. Genetics. 2010 Oct; 11(10)10.1038/nrg2825* - Monitoring of Technical Variation in Quantitative High-Throughput Datasets[Cancer Informatics. ]
*Lauss M, Visne I, Kriegner A, Ringnér M, Jönsson G, Höglund M.**Cancer Informatics. 12193-201*

- A multilevel model to address batch effects in copy number estimation using SNP ...A multilevel model to address batch effects in copy number estimation using SNP arraysBiostatistics (Oxford, England). 2011 Jan; 12(1)33

Your browsing activity is empty.

Activity recording is turned off.

See more...