- Journal List
- Bioinformatics
- PMC2732310

# Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA

^{1}Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, EEB 400, 3740 McClintock Ave, Los Angeles, CA 90089-2564 and

^{2}Departments of Pediatrics and Pathology, Saban Research Institute, Childrens Hospital Los Angeles, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA

## Abstract

**Motivation:** The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, thus making it more difficult to detect them in the presence of significant measurement noise. In this scenario, separate normalization and segmentation is prone to lead to many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy number (noise) hybridization effects across multiple samples will potentially lead to more accurate results.

**Methods:** In this article, the genome alteration detection analysis (GADA) approach introduced in our previous work is extended to a multiple sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but identical on all samples. The expectation maximization (EM) algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias.

**Results:** The new proposed approach is compared with the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. Using the new approach, copy number and reference intensity estimates are significantly less variable if the reference set changes; and a higher consistency on copy numbers detected within HapMap family trios is obtained. Finally, the running time to fit the model grows linearly in the number samples and probes.

**Availability:**http://biron.usc.edu/~piquereg/GADA

**Contact:** gro.eeei@euqipr; ude.csu.alhc@bahahs

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

Copy number variations (CNVs) represent deviations from the normal number of DNA copies generally found in the genome of organisms (e.g. two for diploid cells). In recent years, advances in microarray technology have revealed the presence of short copy number variants that are frequently repeated across normal genomes (i.e. copy number polymorphism) (Feuk *et al.*, 2006; Fredman *et al.*, 2004; Freeman *et al.*, 2006; Iafrate *et al.*, 2004; Redon *et al.*, 2006) constituting a new form of natural genetic variation. Currently, some known CNV regions (CNVRs) cataloged in the database of genome variants (DGV) (Iafrate *et al.*, 2004) tend to be large and not very clearly delimitated regions and also miss smaller CNV (~10 kb) due to the lower resolution of the technologies used in the past (McCarroll *et al.*, 2008; Perry *et al.*, 2008). Moreover, the copy number structure of some of the most highly polymorphic regions have a much higher complexity than what was initially thought (Perry *et al.*, 2008). In order to understand the role of CNVs as a genetic determinant we still require a better characterization and delimitation of these polymorphisms along the entire genome using higher resolution arrays and more accurate detection algorithms. There are several array platforms that can be used to measure CNV. In this article, we focus on the latest high density array platforms that use millions of short oligonucleotide probes distributed along the genome. The high resolution of these arrays is particularly well suited to detect new short CNVs and more accurately delimitate the position and structure of known CNVs. Additionally, some of these probes are also used to target the possible allelic variants of a single nucleotide polymorphism (SNP), making it possible to obtain with the same experiment the SNP and CNV genetic profile of a sample. Commercially available platforms with these characteristics include Affymetrix SNP 6.0 and Illumina Human 1M microarrays. The basic premise for copy number detection is that the hybridization intensities of probes falling under a CNV region will have higher (or lower) values than those expected on a non-copy number variant (non-CNV) region. However, probe hybridization intensities also depend on other non-copy number-related events, which can be regarded as experimental noise that makes CNV detection a challenging problem. Correct estimation of a *reference* hybridization intensity expected on a non-CNV probe is essential to consider that the noise has zero mean.

Existing CNV detection algorithms assume that the measurements are unbiased (i.e. the noise distribution is centered at zero) and require a separate *preprocessing* step for normalization and reference extraction (Fig. 1A). Different normalization approaches have been proposed, including CNAT (Huang *et al.*, 2004), dChip (Zhao *et al.*, 2004), CNAG (Nannya *et al.*, 2005), GEMCA (Komura *et al.*, 2006), BeadStudio (Peiffer *et al.*, 2006), CRMA (Bengtsson *et al.*, 2008) and ITALICS (Rigaill *et al.*, 2008). These and other methods proposed by Marioni *et al.* (2007) and Diskin *et al.* (2008) can remove probe hybridization biases that have spatial correlation (i.e. ‘wave-like’) or correlation with the GC content or fragment length. While the spatially correlated portion of the bias can be removed by these preprocessing methods (Global effects Normalization in Figure 1A) there is still a probe-specific bias due to its own binding affinity that needs to be corrected. This is usually done by taking a robust average (trimmed mean, median, clustering) across a set of *reference* samples. However, if the aim is to identify CNVs that are frequent in population, the application of median normalization on a set of reference samples, where the non-CNVRs are not known a priori, would lead to biased results. This has been already pointed out as potentially problematic by several authors (Bengtsson *et al.*, 2008; Komura *et al.*, 2006; Rigaill *et al.*, 2008), and there is no clearly defined methodology currently available to establish a good reference in regions of the genome with complex CNV patterns.

**A**) the typical workflow used to analyze copy number with separate preprocessing, and (

**B**) the new proposed workflow using a joint estimation model for CNVs and the probe hybridization intensities.

In this article, we propose a new model for joint estimation of CNVs and the *reference* hybridization intensity associated with the non-CNV state. This new model extends our previous work with the genome alteration detection analysis (GADA) approach (Pique-Regi *et al.*, 2007; 2008b) that achieved excellent accuracy in normalized samples. Specifically, we incorporate in the model a parameter vector for the *reference* intensities in addition to the CNV component. The copy number component, as in our previous work, is represented by a sparse Bayesian learning (SBL) prior, which favors estimates where each sample has a small number of CNVRs, but is uninformative with regard to the position and magnitude of these regions. Extending this representation, our proposed hybridization reference intensity has a flat prior, but the effects are shared among all samples. The EM algorithm is used to fit the model, and simultaneously estimates the *copy number* and the *reference* parameters in a given set of observed samples.

The new approach is evaluated in both simulated datasets and microarray data obtained from a pool of HapMap specimens (HapMap Consortium, 2003) using the Affymetrix SNP 6.0 array platform. We compared the new GADA with joint reference normalization (GADA-JRN), with the currently used approach of using the median to compute the reference hybridization (GADA with separate median normalization, GADA-SMN), and with the Affymetrix Genotyping Console GTC3.0.1. with GC correction. The presented results demonstrate that the detected CNV are significantly more consistent within the HapMap family trios. CNV detected by the new approach are also less variable if we change the set of samples used to create the reference. Finally, the computation time to fit the new model GADA-JRN is very competitive, since the resulting algorithm (as was the case for GADA-SMN) has complexity that grows linearly with the number of samples and probes.

## 2 MATERIALS AND METHODS

All the samples used to evaluate the methods described in this article are publicly available as the Affymetrix SNP 6.0 Dataset (Affymetrix, 2007). The implementation for the new GADA method is publicly available at http://biron.usc.edu/ piquereg/GADA. The algorithm has been implemented in C, and tested using Matlab. There is also an R package in preparation that will include the new model.

### 2.1 Observation model and simulation data

An artificial dataset in which the underlying model is known (i.e. copy number, array artifacts and noise) was created to assess the proposed approach under specific scenarios. The hypothesized model assumes that for a collection of microarray experiments the following holds true:

where *y*_{mn} represents the log_{2} of the hybridization intensity observed by probe *m* on array *n*; *x*_{mn} represents change in hybridization intensity due to altered copy number, *r*_{m} is the *reference* probe hybridization intensity expected for a non-CNV state and _{mn} is a zero-mean array noise.

Using this model, we created an artificial microarray dataset with *N* = 20 samples and one single chromosome with *M* = 10 000 probes. We generated a copy number profile *x*_{mn} containing two CNVRs of different type (Fig. 4A). The first one is a region with aligned CNVs (breakpoints occurring at same position across samples). The second type is a region with non-aligned CNVs. The non-CNV portion of the genome, has an expected log_{2}-ratio *x*_{mn} = 0. The first variation is chosen to represent a loss *x*_{mn}=−1 (copy number 1), and the second variation is chosen to have copy number 3 [*x*_{mn}=log_{2}(3)−1=0.58]. The noise introduced in the dataset is white i.i.d Gaussian _{mn}~ (0,1). The major difference of this simulation model as compared with others (Lai *et al.*, 2005; Willenbrock and Fridlyand, 2005) is the inclusion of the probe hybridization bias effect *r*_{m}=0.5sin(2π0.001*m*)+(0,0.25) with two main components: (i) a sinusoidal wave with spatial correlation similar to what has been observed in some actual array experiments (Marioni *et al.*, 2007), and (ii) a noise wave without spatial correlation simulating each probe specific affinity. Supplementary Materials includes additional probe hybridization bias effects including scenarios where *r*_{m} changes in amplitude or across different batches.

### 2.2 Affymetrix SNP 6.0 data description and normalization

The Affymetrix dataset contains 270 samples from the International HapMap Project consisting of 30 CEPH trios (CEU), 30 Yoruban trios (YRI), 45 unrelated Han Chinese samples (CHB) and 45 unrelated Japanese samples (JPT). The Affymetrix Genome-Wide Human SNP Array 6.0 integrates about 1.9 million probesets (931 946 SNPs, 946 000 CN).

The preprocessing software packages that are available for the Affymetrix SNP Array 6.0 platform include: the Affymetrix Genotyping Console 3.0 (GTC3) with a GC content normalization (see Supplementary Materials and Affymetrix (2008)), dChip with invariant set normalization (Schadt *et al.*, 2001) and Aroma.Affymetrix which implements the CRMA (Bengtsson *et al.*, 2008). The CRMA method was chosen, since it has been shown to be more robust and accurate than other methods (Bengtsson *et al.*, 2008). The Aroma.Affymetrix package performs the following four correction steps: (1) allelic crosstalk calibration, *ACC*, for SNP probes; (2) probe-level modeling *PLM*, which gives a single signal for the SNP probes; (3) fragment length normalization, *FLN*, which corrects the differences in the PCR reaction due to the length and GC content of the fragment; and finally, (4) log ratio extraction, *LRE*, which calculates the log hybridization intensity relative to the expected diploid signal reference intensity using the median. In Figure 1A, the steps ACC, PLM and FLN are grouped together as *global effects normalization*, and LRE is the *probe bias correction*.

In this article, we argue that while Steps 1–3 may be performed safely during preprocessing, finding the non-CNV reference intensity is problematic in regions rich in copy number polymorphisms. In normal samples, most of the probes (>90%) will fall on regions without CNV which implies that the normalization model parameters which have a global effect on all probes can be safely estimated using robust strategies as those employed by CRMA (Bengtsson *et al.*, 2008). On the other hand, in any given region of the genome containing a highly polymorhic CNV it is not known a priori which samples do not have a CNV; and this CNV effect should be removed before estimating the probe hybridization intensity associated with the non-CNV state. In this article, we will extract the corrected probe intensities after Steps 1–3, and use our new proposed model (GADA-JRN) to jointly estimate the reference and the copy number component (Fig. 1B). Results from Affymetrix GTC software with GC correction are also obtained for comparison.

### 2.3 GADA with separate median normalization

In our previous work (Pique-Regi *et al.*, 2007, 2008b), we introduced GADA, a novel framework for detecting copy number altered segments of the genome. As in many other existing copy number calling approaches, it was assumed that there was no remaining bias after *LRE*; i.e.

In other words, comparing the models in (1) and (2), if the reference *r*_{m} were known, we could move it to the left side and would become the log-ratio intensities with only two remaining variations: the CNV effect *x*_{mn}, and the zero-mean hybridization noise _{mn}. The *n* in the notation can now be dropped because each sample *n* can be analyzed separately. Figure 2 shows an example to illustrate that *x*_{m} is PWC.

*r*

_{m}=0 on a chromosome section containing two variations as an example. The underlying mean hybridization intensity

*x*

_{m}is piece-wise constant (PWC) and discrete valued

**...**

In GADA, a compact linear algebra representation for these PWC vectors *x*=(*x*_{1},…, *x*_{M})^{t} was introduced:

which represents a linear combination of step vectors *f*_{i} (columns of **F**) as in Figure 3. Using this representation, any PWC vector *x* with *K* breakpoints can be represented by **w**=(…,0,*w*_{i1},0,…,0,*w*_{iK},0,…)^{t} with exactly *K* non-zero coefficients. More details on this maximally sparse representation can be found in our previous work (Pique-Regi *et al.*, 2008b).

In GADA, the PWC representation was combined with a two step detection approach. In the first step, SBL (Tipping, 2001; Wipf and Rao, 2004) is used to model that the number of breakpoints delimiting CNV *K* is very small compared with the number of probes, making it possible to identify the most likely regions of CNVs. The second step uses a backward elimination (BE) procedure to statistically rank the identified CNVs, allowing a flexible control of the false discovery rate (FDR). Compared with other state-of-the-art methods, using standard evaluation datasets and benchmarks (Willenbrock and Fridlyand, 2005), GADA obtained the highest accuracy and was 100 times faster.

### 2.4 GADA model with joint reference normalization

The new proposed model does not assume that the probe hybridization bias for each probe *r*_{m} in (1) has been removed, it is instead estimated jointly with the copy number from a large number samples. In vector form, the observation model for the log-hybridization of each probe on sample *n* can be rewritten as:

There are two basic premises that allow the joint estimation of the reference **r** and the copy number *x*^{n} component. First, the copy number component *x*^{n} on each sample is PWC with a small number of breakpoints *K* and can be efficiently represented with as *x*^{n}=*Fw*^{n}. It should be noted that the number and position of those breakpoints is in general different for each sample. Second, the probe hybridization bias (or non-CNV reference intensity) ** r** is not necessarily PWC, but is exactly the same across multiple arrays. Our model could also be extended for cases in which the amplitude of

**may change, i.e.**

*r**y*

_{mn}=

*x*

_{mn}+ρ

_{n}

*r*

_{m}+

_{mn}(See Supplementary Materials).

The copy number component is modeled using and independent SBL hierarchical prior for each breakpoint *m* and sample *n*:

The properties of the SBL prior are detailed in Tipping (2001), Wipf and Rao (2004) and Pique-Regi *et al.* (2008b). Setting *b*=0 and *a* to be small encourages a sparse number of breakpoints but is uninformative with respect the position and magnitude of the corresponding CNVRs. Compared with the single sample model *N* = 1, this model includes an SBL prior independent for each sample *n*, which means that independent CNV locations can be chosen across samples. The only parameter that is shared is the hyperparameter *a* that controls the expected degree of sparseness (number of CNV) in each sample. The noise , as in previous work, is assumed normal *p*(^{n}) ~ (0, σ^{2}_{n}**I**) using a different variance parameter (with a flat prior) for each array experiment. Finally, the new ** r** component is modeled using an uninformative flat prior.

Typically, the normalization models, estimate ** r** as a robust mean (e.g. the median, GADA-SMN) across a large cohort of normal reference samples, assuming that, at any given position

*m*, the majority of samples are non-CNV. In GADA-JRN,

**is considered unknown and will be jointly estimated with the copy number estimate.**

*r*### 2.5 Fitting the model with the EM algorithm

The model parameters are estimated by finding the maximum *a posteriori* (MAP) using a similar evidence maximization procedure as was used in our previous work. The EM algorithm starts by setting α_{m} and *r*_{m} to zero; then, it iterates the following E and M steps:

where *P*(*w*^{n}|*y*^{n}, **α**^{n}, ** r**, σ

^{2}

_{n})= (

*w*^{n}|

**μ**

^{n}, Σ

^{n}) exploiting the conjugacy properties between the gamma and normal distributions. The same notation as in our previous work is used and the super/sub-scripts

*n*and

*m*are added to identify the parameters that correspond to each sample and probe, respectively. For example, Σ

^{n}

_{mm}refers to the diagonal terms of the covariance matrix for the breakpoint weights

*w*^{n}posterior distribution of sample

*n*. Convergence of the model is reached with very few EM iterations; and all required operations in each iteration can be performed in a linear number of steps (MN) exploiting the properties of our PWC representation (i.e. the matrix structure for

**F**).

In relation to the previous GADA model (Pique-Regi *et al.*, 2008b), if the new ** r** component was modeled by a fixed point estimate (e.g. the median across samples), then the entire model will be completely equivalent to processing each sample independently using GADA (i.e. GADA-SMN). This can also be seen on the EM steps, if

**is a fixed point estimate then (7-11) can be updated separately for each**

*r**n*. Therefore, in GADA-JRN the CNV are placed independently in each sample and the coupling is only through the estimation of

**.**

*r*During the EM algorithm most of the weight parameters ŵ_{m}=μ_{m} are driven to 0 to fulfill the sparseness constraints imposed by the hierarchical prior. Upon convergence to zero, the corresponding weight parameter and hyperparameter can be eliminated from the model and the EM algorithm can continue with a model of reduced dimensions. This elimination makes the algorithm run faster since it has to update fewer parameters (i.e. only those that are non-zero).

### 2.6 Backward elimination

The sensitivity versus FDR tradeoff of our model is controlled by the hyper-parameter *a*. For higher values of *a* a more sparse solution with fewer CNVs is obtained, reducing both the FDR and the sensitivity. In order to efficiently explore solutions with different level of sparseness without having to run the algorithm all-over again, the same BE strategy described in our previous work is employed. Separately, for each of the samples the breakpoints with the lowest scores are recursively removed:

The sensitivity versus FDR tradeoff is controlled by stopping the procedure when all the remaining breakpoints have a score higher than a given critical value *T*. Therefore, as the algorithm continues to remove all the breakpoints and keeps track of which score *T* a particular breakpoint is eliminated, the breakpoints can be rapidly adjusted to any desired level based on their rank. The final result is reported as a set of segment breakpoints and amplitudes that represent the CNVs. The parameter *T* is physically more informative than the parameter *a* because it can be interpreted as the standard error the user is willing to tolerate to call a CNV significant. More information on how to set *T* and *a* can be found in our previous work (Pique-Regi *et al.*, 2008b) and its supplementary material.

### 2.7 Performance metrics and evaluation methods

In order to compare the performance of the different approaches for combining normalization of the non-CNV probe reference intensities and copy number detection, the following methods and performance metrics are introduced. From the Affymetrix dataset 270 samples are randomly partitioned into reference sets of different sizes (10, 20, 30, 45, 70, 90 and 135 samples). In each partition, one given sample would have been grouped with a different set of samples. For example, if we create 10 random partitions into 9 groups of size 30, each sample will have been grouped randomly with 29 different samples of the remaining 269 samples.

The Affymetrix dataset used in this analysis were processed under ideal conditions (i.e. processed on the same day using three plates; personal communications). Thus, we expect copy number estimates would be the same regardless of the subgroup chosen as the reference. Although, significant changes in are not observed in this dataset, it is noteworthy that the will likely change with various laboratory conditions or across different batches. Section 4 will address possible methods to analyze samples from different batches. Variance in non-copy number and copy number estimates (*V*_{r} and *V*_{x}) across different subgroups *g* can be used to assess the performance, with smaller variance indicating better performance.

Since this dataset contains 180 samples that are related in family trios, we also propose an additional measure of trio consistency. Identified CNVs in each HapMap trio are classified for each probe as in Table 1. Then, the *failed trio consistency rate* (*FTCR*) metric is defined as the ratio of inconsistent CNV probes in a trio among all identified CNV probes (except those considered uninformative):

A smaller *FTCR* value indicates that copy numbers within a family trio are more consistent. This measure ignores less frequent scenarios; e.g. if both mother and father have one chromosome with a CNV gain and the other chromosome without variation, it will still be possible (25% of the time) for the son to not inherit the CNV. In order to assess the validity of the *FTCR* measure, the *FTCR* scores of true trios are compared with those obtained from randomly grouping unrelated samples into trios.

## 3 RESULTS

### 3.1 Results with simulated data

The artificially generated data (see Section 2.1) illustrates a scenario in which there are two relatively high-frequency CNVs (Fig. 4A) with a bias on the hybridization measurement from the array experiment (Fig. 4B). If the probe hybridization bias ** r** is not removed from the data, the results will be contaminated with a large number of spurious segments not associated with true CNVs (Fig. 4C). While other approaches (Bengtsson

*et al.*, 2008; Diskin

*et al.*, 2008; Marioni

*et al.*, 2007) can correct the ‘smooth-wave’ (GC correlated) part of the bias (see next section), GADA-JRN can also correct the non-smooth (uncorrelated) probe-specific bias. The currently used approach of separate preprocessing estimates

*r*

_{m}as the median across a set of reference samples (here the simulated samples themselves) before extracting the CNVs. This can eliminate

*r*

_{m}on the areas of the genome without CNVs (

*x*

_{mn}= 0 regions); but it is problematic on CNVRs containing a large amount of CNV across samples (Fig. 4D and E). The new joint reference normalization approach (GADA-JRN in Fig. 4F) can correctly delimitate the CNV on the samples with CNV on the region CNVR-2, while the separate median normalization (GADA-SMN in Fig. 4E) incorrectly reports CNV on samples

*n*= 1,…,10. Additionally, GADA-JRN can better detect the small CNV on CNVR-1, in which the separate median normalization tends to make the amplitude of the variation smaller and thus more difficult to detect.

### 3.2 Results with Affymetrix microarray data

The hybridization intensities are obtained after using ACC and applying FLN corrections from Aroma.Affymetrix, on the 270 HapMap samples analyzed with Affymetrix SNP 6.0 arrays. This preprocessing step appropriately scales and centers the data removing the spatially correlated part of the bias. We next compare GADA-JRN and GADA-SMN employing the evaluation methods introduced in Section 2.7.

Using the randomly chosen reference sets of different size, we evaluated the variability *V*_{r} and *V*_{x} in the reference intensities and the copy number estimates. Figure 5 shows that as the number of samples in the reference set increases the variability on CNV estimates *V*_{x} decreases. More importantly, we can see that using GADA-JRN achieves a considerably better performance when compared with GADA-SMN. In some cases GADA-JRN requires about half of the samples in order to obtain estimates of similar accuracy to those achieved with GADA-SMN. In terms of variance or the reference intensity estimates *V*_{r}, GADA-JRN also achieves a significantly smaller values (*P* < 1*E*− 7 Kolmogorov–Smirnov test, data not shown).

This improvement in performance can also be observed using the trio consistency measure (*FTCR*) described in Section 2.7. In Figure 6, the trio consistency improves (i.e. *FTCR* decreases) with the size of the reference set, and GADA-JRN also achieves significantly better consistency. The results are also similar with change on the sparseness parameters *a* and *T* that set the tradeoff between sensitivity and the FDR. Figure 7 illustrates for one of the reference sets (90 CEU samples) the consistency that is obtained for different settings of the parameter *T*, which controls the BE step. The *FTCR* measure improves (decreases) with increasing *T*, since the number of detected false CNVs (i.e. FDR) decreases; but for higher values of *T* > 8 true CNVs may also fail to be detected on the offspring (i.e. lower sensitivity), and thus *FTCR* stops decreasing. The *FTCR* obtained on randomly formed trios of unrelated samples assures that this metric is actually capturing the increase on shared CNVs within a family (*P* < 0.01) and can be used to compare different normalization and copy number detection approaches. In Table 2, GADA-JRN obtains a better *FTCR* (16.7%) than GADA-SMN (19.5%) and GTC (19.45%) when using 90 reference samples. On a larger reference set of 180 samples, GADA-SMN (*FTCR* = 18.3%) and GTC (*FTCR* = 17.31%) improve but GADA-JRN still retains a better *FTCR* (16.5%). Overall, the new approach is especially indicated for small reference sets and for regions with highly polymorphic CNVs. Figure 8 shows the copy number estimates obtained on an already known highly polymorphic region of chromosome 17. The predicted gains and losses of GADA-JRN are retained when the reference set of 90 reference (CEU) samples is enlarged to include 180 (CEU+YRI) samples.

*T*. The dashed and solid lines correspond to a 90 (CEU) and 180 (CEU+YRI) sample reference set, respectively. The cloud of points are the

*FTCR*values obtained from 100 randomly formed

**...**

**...**

Finally the computational time required for fitting the model is longer on the new approach, but still retains a very competitive linear complexity in the number of probes and samples (Fig. 9).

## 4 DISCUSSION

The application of the proposed GADA with joint reference estimation is not only limited to Affymetrix SNP arrays but it can also be applied to other platforms such as Illumina beadarrays or NimbleGen array comparative genomic hybridization (aCGH). In Illumina BeadStudio the probe hybridization intensities *R* (obtained after ACC) can be extracted instead of the *LRR* values. The extraction of the LRR values uses a cluster approach to compute the expected *R* values of non-CNVRs (Peiffer *et al.*, 2006) which have similar drawbacks as separate median normalization. In aCGH, the reference DNA from a single sample or a pool of samples is used as a reference, but these log-ratio intensities may still contain a remaining ‘wavy’ artifact (Marioni *et al.*, 2007) that our proposed approach could eliminate.

ITALICS (Rigaill *et al.*, 2008) is an iterative approach that alternates two separate steps of copy number detection and normalization. The iterative concept is similar to the EM algorithm employed to fit the GADA-JRN model in this article, but there are two fundamental differences. First, ITALICS operates on a *single* sample and only includes a small set of parameters correcting for global array effects such as the fragment length and GC content; ITALICS considers that the reference non-CNV probe intensity is fixed and extracted from a separate reference set of samples. Second, the copy number extraction and the normalization model are iterated in practice only twice and not integrated under an unifying probabilistic model as in GADA-JRN. In contrast, GADA-JRN first proposes a *multiple* sample probability model which include parameters for the CNV component and the reference non-CNV probe hybridization intensity of every position of the genome; and then an iterative approach to fit the model is derived using the EM algorithm, which guarantees that the parameters converge.

GEMCA (Komura *et al.*, 2006), only available for Affymetrix 500K platform, approaches the problem by finding the reference *after* copy number detection. First, the copy number is estimated on the difference between all the possible pairs. The CNVRs are defined as those identified in certain number of pairs. Then the largest subgroup of samples with no relative variations (found using a maximum clique algorithm) is used to establish the reference set for that particular region of the genome. The Canary algorithm in Birdsuite (Korn *et al.*, 2008) uses a Gaussian mixture model (GMM) mixture model to identify *a posteriori* the copy number variation state of already delimited regions of CNV. In both GEMCA and Canary, the underlying assumption that CNVs are tightly aligned across samples and do not overlap with other possible CNVs represents a challenge in dealing with complex polymorphic CNVRs of the genome (e.g. Fig. 8).

GADA-JRN may also prove to be a flexible algorithm in variety of other applications. For example, samples processed by different labs or on different days may not necessarily have the same non-CNV reference hybridization intensity even if the platform is exactly the same. GADA-JRN could be applied to assess if this effect has changed between different experimental batches. Moreover, one or more control samples replicated across experimental batches could be used to better characterize and correct the probe hybridization noise.

In addition to the probe hybridization reference, other parameters could also be introduced to the GADA framework to model other known sources of variation. For example, allowing changes in scale of the bias (see Supplementary Materials), using an independent bias correction for each SNP allele, and including probe-specific noise variances (heteroscedastic model). The residuals of the model can be used to assess the impact of these other sources of variation on the results In particular, correlation on the model residuals can be used to discover hidden batch effects indicating the need for subgroup analysis (see Supplementary Materials). Finally, the GADA-JRN model could be used in combination with the N-GADA model (Pique-Regi *et al.*, 2008a) that the authors proposed for modeling breakpoints of CNVs across multiple samples.

## 5 CONCLUSION

In this article, we introduce a new method in which the reference probe hybridization intensity is jointly estimated with the copy number component. This type of methods are essential for the characterization of CNV on normal population using the latest array technologies, in which the underlying genome CNVs are not known a priori. The currently used approach of separate preprocessing using the median to estimate the hybridization intensities may fail to accurately detect highly polymorphic regions of the genome. The new proposed method extends our previous GADA model by introducing a new vector of parameters that model a common hybridization bias shared across many arrays. Results demonstrate a significantly better performance with the new extended GADA model while maintaining the attractive linear computational complexity in number of probes and samples.

## ACKNOWLEDGEMENTS

The authors would like to thank the two anonymous reviewers for valuable comments and suggestions.

*Funding*: NIH's Child Health Research Career Development Award Program (SA) (TK12-CA60104); Alex's Lemonade Stand Foundation (SA).

*Conflict of Interest*: none declared.

## REFERENCES

- Affymetrix Genome-wide human snp array 6.0 sample data set. 2007 Available at http://www.affymetrix.com/support/technical/sample_data/genomewide_snp6\data.affx(last accessed date February 17, 2009)
- Affymetrix Genotyping Console 3.0.1 User Manual. 2008 Available at ftp://www.affymetrix.com/(last accessed date February 17, 2009)
- Bengtsson H, et al. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008;24:759–767. [PubMed]
- Diskin SJ, et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008;36:e126. [PMC free article] [PubMed]
- Feuk L, et al. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97. [PubMed]
- Fredman D, et al. Complex snp-related sequence variation in segmental genome duplications. Nat. Genet. 2004;36:861–866. [PubMed]
- Freeman JL, et al. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–961. [PubMed]
- Hap Map Consortium 2003 The International HapMap Project. Nature, 426, 789-796. Available at http://www.hapmap.org(last accessed date February 17, 2009).
- Huang J, et al. Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum. Genomics. 2004;1:287–299. [PMC free article] [PubMed]
- Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. [PubMed]
- Komura D, et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006;16:1575–1584. [PMC free article] [PubMed]
- Korn JM, et al. Integrated genotype calling and association analysis of snps, common copy number polymorphisms and rare cnvs. Nat. Genet. 2008;40:1253–1260. [PMC free article] [PubMed]
- Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770. [PMC free article] [PubMed]
- Marioni J, et al. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007;8:R228. [PMC free article] [PubMed]
- McCarroll SA, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 2008;40:1166–1174. [PubMed]
- Nannya Y, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65:6071–6079. [PubMed]
- Peiffer DA, et al. High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 2006;16:1136–1148. [PMC free article] [PubMed]
- Perry GH, et al. The fine-scale and complex architecture of human copy-number variation. Am. J. Hum. Genet. 2008;82:685–695. [PMC free article] [PubMed]
- Pique-Regi R, et al. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. Honolulu (Hawaii): 2007. Wavelet footprints and sparse bayesian learning for DNA copy number change analysis.
- Pique-Regi R, et al. IEEE International Workshop on Genomic Signal Processing and Statistics, 2008 (GENSiPS 2008). Phoenix (AZ): 2008a. Bayesian detection of recurrent copy number alterations across multiple array samples; pp. 1–4.
- Pique-Regi R, et al. Sparse representation and bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008b;24:309–318. [PMC free article] [PubMed]
- Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. [PMC free article] [PubMed]
- Rigaill G, et al. ITALICS: an algorithm for normalization and DNA copy number calling for affymetrix SNP arrays. Bioinformatics. 2008;24:768–774. [PubMed]
- Schadt EE, et al. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell Biochem. 2001;(Suppl. 37):120–125. [PubMed]
- Tipping ME. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001;1:211–244.
- Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005;21:4084–4091. [PubMed]
- Wipf D, Rao B. Sparse Bayesian learning for basis selection. IEEE-Trans-SP. 2004;52:2153–2164.
- Zhao X, et al. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004;64:3060–3071. [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (550K) |
- Citation

- Sparse representation and Bayesian detection of genome copy number alterations from microarray data.[Bioinformatics. 2008]
*Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S.**Bioinformatics. 2008 Feb 1; 24(3):309-18. Epub 2008 Jan 18.* - Segmentation and estimation for SNP microarrays: a Bayesian multiple change-point approach.[Biometrics. 2010]
*Tai YC, Kvale MN, Witte JS.**Biometrics. 2010 Sep; 66(3):675-83.* - Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios.[BMC Bioinformatics. 2008]
*Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, Höglund M, Borg A, Ringnér M.**BMC Bioinformatics. 2008 Oct 2; 9:409. Epub 2008 Oct 2.* - Oligonucleotide arrays for high-resolution analysis of copy number alteration in mental retardation/multiple congenital anomalies.[Genet Med. 2007]
*Shaikh TH.**Genet Med. 2007 Sep; 9(9):617-25.* - Integrated high-resolution genome-wide analysis of gene dosage and gene expression in human brain tumors.[Methods Mol Biol. 2007]
*Juric D, Bredel C, Sikic BI, Bredel M.**Methods Mol Biol. 2007; 377:187-202.*

- Structural polymorphism in the promoter of pfmrp2 confers Plasmodium falciparum tolerance to quinoline drugs[Molecular Microbiology. 2014]
*Mok S, Liong KY, Lim EH, Huang X, Zhu L, Preiser PR, Bozdech Z.**Molecular Microbiology. 2014 Mar; 91(5)918-934* - HaplotypeCN: Copy Number Haplotype Inference with Hidden Markov Model and Localized Haplotype Clustering[PLoS ONE. ]
*Lin YJ, Chen YT, Hsu SN, Peng CH, Tang CY, Yen TC, Hsieh WP.**PLoS ONE. 9(5)e96841* - A genome-wide study of de novo deletions identifies a candidate locus for non-syndromic isolated cleft lip/palate risk[BMC Genetics. ]
*Younkin SG, Scharpf RB, Schwender H, Parker MM, Scott AF, Marazita ML, Beaty TH, Ruczinski I.**BMC Genetics. 1524* - Genome wide analysis reveals single nucleotide polymorphisms associated with fatness and putative novel copy number variants in three pig breeds[BMC Genomics. ]
*Fowler KE, Pong-Wong R, Bauer J, Clemente EJ, Reitter CP, Affara NA, Waite S, Walling GA, Griffin DK.**BMC Genomics. 14784* - Novel multi-sample scheme for inferring phylogenetic markers from whole genome tumor profiles[IEEE/ACM transactions on computational biol...]
*Subramanian A, Shackney S, Schwartz R.**IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM. 2013; 10(6)1422-1431*

- PubMedPubMedPubMed citations for these articles

- Joint estimation of copy number variation and reference intensities on multiple ...Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADABioinformatics. 2009 May 15; 25(10)1223

Your browsing activity is empty.

Activity recording is turned off.

See more...