• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bioinfoLink to Publisher's site
Bioinformatics. May 15, 2009; 25(10): 1223–1230.
Published online Mar 10, 2009. doi:  10.1093/bioinformatics/btp119
PMCID: PMC2732310

Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA

Abstract

Motivation: The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, thus making it more difficult to detect them in the presence of significant measurement noise. In this scenario, separate normalization and segmentation is prone to lead to many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy number (noise) hybridization effects across multiple samples will potentially lead to more accurate results.

Methods: In this article, the genome alteration detection analysis (GADA) approach introduced in our previous work is extended to a multiple sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but identical on all samples. The expectation maximization (EM) algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias.

Results: The new proposed approach is compared with the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. Using the new approach, copy number and reference intensity estimates are significantly less variable if the reference set changes; and a higher consistency on copy numbers detected within HapMap family trios is obtained. Finally, the running time to fit the model grows linearly in the number samples and probes.

Availability:http://biron.usc.edu/~piquereg/GADA

Contact: gro.eeei@euqipr; ude.csu.alhc@bahahs

Supplementary information:Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Copy number variations (CNVs) represent deviations from the normal number of DNA copies generally found in the genome of organisms (e.g. two for diploid cells). In recent years, advances in microarray technology have revealed the presence of short copy number variants that are frequently repeated across normal genomes (i.e. copy number polymorphism) (Feuk et al., 2006; Fredman et al., 2004; Freeman et al., 2006; Iafrate et al., 2004; Redon et al., 2006) constituting a new form of natural genetic variation. Currently, some known CNV regions (CNVRs) cataloged in the database of genome variants (DGV) (Iafrate et al., 2004) tend to be large and not very clearly delimitated regions and also miss smaller CNV (~10 kb) due to the lower resolution of the technologies used in the past (McCarroll et al., 2008; Perry et al., 2008). Moreover, the copy number structure of some of the most highly polymorphic regions have a much higher complexity than what was initially thought (Perry et al., 2008). In order to understand the role of CNVs as a genetic determinant we still require a better characterization and delimitation of these polymorphisms along the entire genome using higher resolution arrays and more accurate detection algorithms. There are several array platforms that can be used to measure CNV. In this article, we focus on the latest high density array platforms that use millions of short oligonucleotide probes distributed along the genome. The high resolution of these arrays is particularly well suited to detect new short CNVs and more accurately delimitate the position and structure of known CNVs. Additionally, some of these probes are also used to target the possible allelic variants of a single nucleotide polymorphism (SNP), making it possible to obtain with the same experiment the SNP and CNV genetic profile of a sample. Commercially available platforms with these characteristics include Affymetrix SNP 6.0 and Illumina Human 1M microarrays. The basic premise for copy number detection is that the hybridization intensities of probes falling under a CNV region will have higher (or lower) values than those expected on a non-copy number variant (non-CNV) region. However, probe hybridization intensities also depend on other non-copy number-related events, which can be regarded as experimental noise that makes CNV detection a challenging problem. Correct estimation of a reference hybridization intensity expected on a non-CNV probe is essential to consider that the noise has zero mean.

Existing CNV detection algorithms assume that the measurements are unbiased (i.e. the noise distribution is centered at zero) and require a separate preprocessing step for normalization and reference extraction (Fig. 1A). Different normalization approaches have been proposed, including CNAT (Huang et al., 2004), dChip (Zhao et al., 2004), CNAG (Nannya et al., 2005), GEMCA (Komura et al., 2006), BeadStudio (Peiffer et al., 2006), CRMA (Bengtsson et al., 2008) and ITALICS (Rigaill et al., 2008). These and other methods proposed by Marioni et al. (2007) and Diskin et al. (2008) can remove probe hybridization biases that have spatial correlation (i.e. ‘wave-like’) or correlation with the GC content or fragment length. While the spatially correlated portion of the bias can be removed by these preprocessing methods (Global effects Normalization in Figure 1A) there is still a probe-specific bias due to its own binding affinity that needs to be corrected. This is usually done by taking a robust average (trimmed mean, median, clustering) across a set of reference samples. However, if the aim is to identify CNVs that are frequent in population, the application of median normalization on a set of reference samples, where the non-CNVRs are not known a priori, would lead to biased results. This has been already pointed out as potentially problematic by several authors (Bengtsson et al., 2008; Komura et al., 2006; Rigaill et al., 2008), and there is no clearly defined methodology currently available to establish a good reference in regions of the genome with complex CNV patterns.

Fig. 1.
Block diagram depicting (A) the typical workflow used to analyze copy number with separate preprocessing, and (B) the new proposed workflow using a joint estimation model for CNVs and the probe hybridization intensities.

In this article, we propose a new model for joint estimation of CNVs and the reference hybridization intensity associated with the non-CNV state. This new model extends our previous work with the genome alteration detection analysis (GADA) approach (Pique-Regi et al., 2007; 2008b) that achieved excellent accuracy in normalized samples. Specifically, we incorporate in the model a parameter vector for the reference intensities in addition to the CNV component. The copy number component, as in our previous work, is represented by a sparse Bayesian learning (SBL) prior, which favors estimates where each sample has a small number of CNVRs, but is uninformative with regard to the position and magnitude of these regions. Extending this representation, our proposed hybridization reference intensity has a flat prior, but the effects are shared among all samples. The EM algorithm is used to fit the model, and simultaneously estimates the copy number and the reference parameters in a given set of observed samples.

The new approach is evaluated in both simulated datasets and microarray data obtained from a pool of HapMap specimens (HapMap Consortium, 2003) using the Affymetrix SNP 6.0 array platform. We compared the new GADA with joint reference normalization (GADA-JRN), with the currently used approach of using the median to compute the reference hybridization (GADA with separate median normalization, GADA-SMN), and with the Affymetrix Genotyping Console GTC3.0.1. with GC correction. The presented results demonstrate that the detected CNV are significantly more consistent within the HapMap family trios. CNV detected by the new approach are also less variable if we change the set of samples used to create the reference. Finally, the computation time to fit the new model GADA-JRN is very competitive, since the resulting algorithm (as was the case for GADA-SMN) has complexity that grows linearly with the number of samples and probes.

2 MATERIALS AND METHODS

All the samples used to evaluate the methods described in this article are publicly available as the Affymetrix SNP 6.0 Dataset (Affymetrix, 2007). The implementation for the new GADA method is publicly available at http://biron.usc.edu/ piquereg/GADA. The algorithm has been implemented in C, and tested using Matlab. There is also an R package in preparation that will include the new model.

2.1 Observation model and simulation data

An artificial dataset in which the underlying model is known (i.e. copy number, array artifacts and noise) was created to assess the proposed approach under specific scenarios. The hypothesized model assumes that for a collection of microarray experiments the following holds true:

equation image
(1)

where ymn represents the log2 of the hybridization intensity observed by probe m on array n; xmn represents change in hybridization intensity due to altered copy number, rm is the reference probe hybridization intensity expected for a non-CNV state and epsilonmn is a zero-mean array noise.

Using this model, we created an artificial microarray dataset with N = 20 samples and one single chromosome with M = 10 000 probes. We generated a copy number profile xmn containing two CNVRs of different type (Fig. 4A). The first one is a region with aligned CNVs (breakpoints occurring at same position across samples). The second type is a region with non-aligned CNVs. The non-CNV portion of the genome, has an expected log2-ratio xmn = 0. The first variation is chosen to represent a loss xmn=−1 (copy number 1), and the second variation is chosen to have copy number 3 [xmn=log2(3)−1=0.58]. The noise introduced in the dataset is white i.i.d Gaussian epsilonmn~ [mathematical script N](0,1). The major difference of this simulation model as compared with others (Lai et al., 2005; Willenbrock and Fridlyand, 2005) is the inclusion of the probe hybridization bias effect rm=0.5sin(2π0.001m)+[mathematical script N](0,0.25) with two main components: (i) a sinusoidal wave with spatial correlation similar to what has been observed in some actual array experiments (Marioni et al., 2007), and (ii) a noise wave without spatial correlation simulating each probe specific affinity. Supplementary Materials includes additional probe hybridization bias effects including scenarios where rm changes in amplitude or across different batches.

Fig. 4.
Illustration of the observation model. Colors represent the observed hybridization intensities and the relative copy number change (blue = loss ‘−1’, red = gain ‘+1’, green = neutral ‘0’). (A) The ...

2.2 Affymetrix SNP 6.0 data description and normalization

The Affymetrix dataset contains 270 samples from the International HapMap Project consisting of 30 CEPH trios (CEU), 30 Yoruban trios (YRI), 45 unrelated Han Chinese samples (CHB) and 45 unrelated Japanese samples (JPT). The Affymetrix Genome-Wide Human SNP Array 6.0 integrates about 1.9 million probesets (931 946 SNPs, 946 000 CN).

The preprocessing software packages that are available for the Affymetrix SNP Array 6.0 platform include: the Affymetrix Genotyping Console 3.0 (GTC3) with a GC content normalization (see Supplementary Materials and Affymetrix (2008)), dChip with invariant set normalization (Schadt et al., 2001) and Aroma.Affymetrix which implements the CRMA (Bengtsson et al., 2008). The CRMA method was chosen, since it has been shown to be more robust and accurate than other methods (Bengtsson et al., 2008). The Aroma.Affymetrix package performs the following four correction steps: (1) allelic crosstalk calibration, ACC, for SNP probes; (2) probe-level modeling PLM, which gives a single signal for the SNP probes; (3) fragment length normalization, FLN, which corrects the differences in the PCR reaction due to the length and GC content of the fragment; and finally, (4) log ratio extraction, LRE, which calculates the log hybridization intensity relative to the expected diploid signal reference intensity using the median. In Figure 1A, the steps ACC, PLM and FLN are grouped together as global effects normalization, and LRE is the probe bias correction.

In this article, we argue that while Steps 1–3 may be performed safely during preprocessing, finding the non-CNV reference intensity is problematic in regions rich in copy number polymorphisms. In normal samples, most of the probes (>90%) will fall on regions without CNV which implies that the normalization model parameters which have a global effect on all probes can be safely estimated using robust strategies as those employed by CRMA (Bengtsson et al., 2008). On the other hand, in any given region of the genome containing a highly polymorhic CNV it is not known a priori which samples do not have a CNV; and this CNV effect should be removed before estimating the probe hybridization intensity associated with the non-CNV state. In this article, we will extract the corrected probe intensities after Steps 1–3, and use our new proposed model (GADA-JRN) to jointly estimate the reference and the copy number component (Fig. 1B). Results from Affymetrix GTC software with GC correction are also obtained for comparison.

2.3 GADA with separate median normalization

In our previous work (Pique-Regi et al., 2007, 2008b), we introduced GADA, a novel framework for detecting copy number altered segments of the genome. As in many other existing copy number calling approaches, it was assumed that there was no remaining bias after LRE; i.e.

equation image
(2)

In other words, comparing the models in (1) and (2), if the reference rm were known, we could move it to the left side and An external file that holds a picture, illustration, etc.
Object name is btp119i1.jpg would become the log-ratio intensities with only two remaining variations: the CNV effect xmn, and the zero-mean hybridization noise epsilonmn. The n in the notation can now be dropped because each sample n can be analyzed separately. Figure 2 shows an example to illustrate that xm is PWC.

Fig. 2.
Graphical representation of the observation model with the reference already corrected rm=0 on a chromosome section containing two variations as an example. The underlying mean hybridization intensity xm is piece-wise constant (PWC) and discrete valued ...

In GADA, a compact linear algebra representation for these PWC vectors x=(x1,…, xM)t was introduced:

equation image
(3)

which represents a linear combination of step vectors fi (columns of F) as in Figure 3. Using this representation, any PWC vector x with K breakpoints can be represented by w=(…,0,wi1,0,…,0,wiK,0,…)t with exactly K non-zero coefficients. More details on this maximally sparse representation can be found in our previous work (Pique-Regi et al., 2008b).

Fig. 3.
Step vector fi with a breakpoint between probe i and i+1.

In GADA, the PWC representation was combined with a two step detection approach. In the first step, SBL (Tipping, 2001; Wipf and Rao, 2004) is used to model that the number of breakpoints delimiting CNV K is very small compared with the number of probes, making it possible to identify the most likely regions of CNVs. The second step uses a backward elimination (BE) procedure to statistically rank the identified CNVs, allowing a flexible control of the false discovery rate (FDR). Compared with other state-of-the-art methods, using standard evaluation datasets and benchmarks (Willenbrock and Fridlyand, 2005), GADA obtained the highest accuracy and was 100 times faster.

2.4 GADA model with joint reference normalization

The new proposed model does not assume that the probe hybridization bias for each probe rm in (1) has been removed, it is instead estimated jointly with the copy number from a large number samples. In vector form, the observation model for the log-hybridization of each probe on sample n can be rewritten as:

equation image
(4)

There are two basic premises that allow the joint estimation of the reference r and the copy number xn component. First, the copy number component xn on each sample is PWC with a small number of breakpoints K and can be efficiently represented with as xn=Fwn. It should be noted that the number and position of those breakpoints is in general different for each sample. Second, the probe hybridization bias (or non-CNV reference intensity) r is not necessarily PWC, but is exactly the same across multiple arrays. Our model could also be extended for cases in which the amplitude of r may change, i.e. ymn=xmnnrm+epsilonmn (See Supplementary Materials).

The copy number component is modeled using and independent SBL hierarchical prior for each breakpoint m and sample n:

equation image
(5)

equation image
(6)

The properties of the SBL prior are detailed in Tipping (2001), Wipf and Rao (2004) and Pique-Regi et al. (2008b). Setting b=0 and a to be small encourages a sparse number of breakpoints but is uninformative with respect the position and magnitude of the corresponding CNVRs. Compared with the single sample model N = 1, this model includes an SBL prior independent for each sample n, which means that independent CNV locations can be chosen across samples. The only parameter that is shared is the hyperparameter a that controls the expected degree of sparseness (number of CNV) in each sample. The noise epsilon, as in previous work, is assumed normal p(epsilonn) ~ [mathematical script N] (0, σ2nI) using a different variance parameter (with a flat prior) for each array experiment. Finally, the new r component is modeled using an uninformative flat prior.

Typically, the normalization models, estimate r as a robust mean (e.g. the median, GADA-SMN) across a large cohort of normal reference samples, assuming that, at any given position m, the majority of samples are non-CNV. In GADA-JRN, r is considered unknown and will be jointly estimated with the copy number estimate.

2.5 Fitting the model with the EM algorithm

The model parameters are estimated by finding the maximum a posteriori (MAP) using a similar evidence maximization procedure as was used in our previous work. The EM algorithm starts by setting αm and rm to zero; then, it iterates the following E and M steps:

equation image
(7)

equation image
(8)

equation image
(9)

equation image
(10)

equation image
(11)

equation image
(12)

where P(wn|yn, αn, r, σ2n)= [mathematical script N] (wn|μn, Σn) exploiting the conjugacy properties between the gamma and normal distributions. The same notation as in our previous work is used and the super/sub-scripts n and m are added to identify the parameters that correspond to each sample and probe, respectively. For example, Σnmm refers to the diagonal terms of the covariance matrix for the breakpoint weights wn posterior distribution of sample n. Convergence of the model is reached with very few EM iterations; and all required operations in each iteration can be performed in a linear number of steps O(MN) exploiting the properties of our PWC representation (i.e. the matrix structure for F).

In relation to the previous GADA model (Pique-Regi et al., 2008b), if the new r component was modeled by a fixed point estimate (e.g. the median across samples), then the entire model will be completely equivalent to processing each sample independently using GADA (i.e. GADA-SMN). This can also be seen on the EM steps, if r is a fixed point estimate then (7-11) can be updated separately for each n. Therefore, in GADA-JRN the CNV are placed independently in each sample and the coupling is only through the estimation of r.

During the EM algorithm most of the weight parameters ŵmm are driven to 0 to fulfill the sparseness constraints imposed by the hierarchical prior. Upon convergence to zero, the corresponding weight parameter and hyperparameter can be eliminated from the model and the EM algorithm can continue with a model of reduced dimensions. This elimination makes the algorithm run faster since it has to update fewer parameters (i.e. only those that are non-zero).

2.6 Backward elimination

The sensitivity versus FDR tradeoff of our model is controlled by the hyper-parameter a. For higher values of a a more sparse solution with fewer CNVs is obtained, reducing both the FDR and the sensitivity. In order to efficiently explore solutions with different level of sparseness without having to run the algorithm all-over again, the same BE strategy described in our previous work is employed. Separately, for each of the samples the breakpoints with the lowest scores are recursively removed:

equation image
(13)

The sensitivity versus FDR tradeoff is controlled by stopping the procedure when all the remaining breakpoints have a score higher than a given critical value T. Therefore, as the algorithm continues to remove all the breakpoints and keeps track of which score T a particular breakpoint is eliminated, the breakpoints can be rapidly adjusted to any desired level based on their rank. The final result is reported as a set of segment breakpoints and amplitudes that represent the CNVs. The parameter T is physically more informative than the parameter a because it can be interpreted as the standard error the user is willing to tolerate to call a CNV significant. More information on how to set T and a can be found in our previous work (Pique-Regi et al., 2008b) and its supplementary material.

2.7 Performance metrics and evaluation methods

In order to compare the performance of the different approaches for combining normalization of the non-CNV probe reference intensities and copy number detection, the following methods and performance metrics are introduced. From the Affymetrix dataset 270 samples are randomly partitioned into reference sets of different sizes (10, 20, 30, 45, 70, 90 and 135 samples). In each partition, one given sample would have been grouped with a different set of samples. For example, if we create 10 random partitions into 9 groups of size 30, each sample will have been grouped randomly with 29 different samples of the remaining 269 samples.

The Affymetrix dataset used in this analysis were processed under ideal conditions (i.e. processed on the same day using three plates; personal communications). Thus, we expect copy number estimates An external file that holds a picture, illustration, etc.
Object name is btp119i2.jpg would be the same regardless of the subgroup chosen as the reference. Although, significant changes in An external file that holds a picture, illustration, etc.
Object name is btp119i3.jpg are not observed in this dataset, it is noteworthy that the An external file that holds a picture, illustration, etc.
Object name is btp119i3.jpg will likely change with various laboratory conditions or across different batches. Section 4 will address possible methods to analyze samples from different batches. Variance in non-copy number and copy number estimates (Vr and Vx) across different subgroups g can be used to assess the performance, with smaller variance indicating better performance.

equation image
(14)

equation image
(15)

Since this dataset contains 180 samples that are related in family trios, we also propose an additional measure of trio consistency. Identified CNVs in each HapMap trio are classified for each probe as in Table 1. Then, the failed trio consistency rate (FTCR) metric is defined as the ratio of inconsistent CNV probes in a trio among all identified CNV probes (except those considered uninformative):

equation image
(16)

A smaller FTCR value indicates that copy numbers within a family trio are more consistent. This measure ignores less frequent scenarios; e.g. if both mother and father have one chromosome with a CNV gain and the other chromosome without variation, it will still be possible (25% of the time) for the son to not inherit the CNV. In order to assess the validity of the FTCR measure, the FTCR scores of true trios are compared with those obtained from randomly grouping unrelated samples into trios.

Table 1.
Consistency on HapMap trios

3 RESULTS

3.1 Results with simulated data

The artificially generated data (see Section 2.1) illustrates a scenario in which there are two relatively high-frequency CNVs (Fig. 4A) with a bias on the hybridization measurement from the array experiment (Fig. 4B). If the probe hybridization bias r is not removed from the data, the results will be contaminated with a large number of spurious segments not associated with true CNVs (Fig. 4C). While other approaches (Bengtsson et al., 2008; Diskin et al., 2008; Marioni et al., 2007) can correct the ‘smooth-wave’ (GC correlated) part of the bias (see next section), GADA-JRN can also correct the non-smooth (uncorrelated) probe-specific bias. The currently used approach of separate preprocessing estimates rm as the median across a set of reference samples (here the simulated samples themselves) before extracting the CNVs. This can eliminate rm on the areas of the genome without CNVs (xmn = 0 regions); but it is problematic on CNVRs containing a large amount of CNV across samples (Fig. 4D and E). The new joint reference normalization approach (GADA-JRN in Fig. 4F) can correctly delimitate the CNV on the samples with CNV on the region CNVR-2, while the separate median normalization (GADA-SMN in Fig. 4E) incorrectly reports CNV on samples n = 1,…,10. Additionally, GADA-JRN can better detect the small CNV on CNVR-1, in which the separate median normalization tends to make the amplitude of the variation smaller and thus more difficult to detect.

3.2 Results with Affymetrix microarray data

The hybridization intensities are obtained after using ACC and applying FLN corrections from Aroma.Affymetrix, on the 270 HapMap samples analyzed with Affymetrix SNP 6.0 arrays. This preprocessing step appropriately scales and centers the data removing the spatially correlated part of the bias. We next compare GADA-JRN and GADA-SMN employing the evaluation methods introduced in Section 2.7.

Using the randomly chosen reference sets of different size, we evaluated the variability Vr and Vx in the reference intensities and the copy number estimates. Figure 5 shows that as the number of samples in the reference set increases the variability on CNV estimates Vx decreases. More importantly, we can see that using GADA-JRN achieves a considerably better performance when compared with GADA-SMN. In some cases GADA-JRN requires about half of the samples in order to obtain estimates of similar accuracy to those achieved with GADA-SMN. In terms of variance or the reference intensity estimates Vr, GADA-JRN also achieves a significantly smaller values (P < 1E− 7 Kolmogorov–Smirnov test, data not shown).

Fig. 5.
Variability on the copy number estimates if the set of reference samples changes.

This improvement in performance can also be observed using the trio consistency measure (FTCR) described in Section 2.7. In Figure 6, the trio consistency improves (i.e. FTCR decreases) with the size of the reference set, and GADA-JRN also achieves significantly better consistency. The results are also similar with change on the sparseness parameters a and T that set the tradeoff between sensitivity and the FDR. Figure 7 illustrates for one of the reference sets (90 CEU samples) the consistency that is obtained for different settings of the parameter T, which controls the BE step. The FTCR measure improves (decreases) with increasing T, since the number of detected false CNVs (i.e. FDR) decreases; but for higher values of T > 8 true CNVs may also fail to be detected on the offspring (i.e. lower sensitivity), and thus FTCR stops decreasing. The FTCR obtained on randomly formed trios of unrelated samples assures that this metric is actually capturing the increase on shared CNVs within a family (P < 0.01) and can be used to compare different normalization and copy number detection approaches. In Table 2, GADA-JRN obtains a better FTCR (16.7%) than GADA-SMN (19.5%) and GTC (19.45%) when using 90 reference samples. On a larger reference set of 180 samples, GADA-SMN (FTCR = 18.3%) and GTC (FTCR = 17.31%) improve but GADA-JRN still retains a better FTCR (16.5%). Overall, the new approach is especially indicated for small reference sets and for regions with highly polymorphic CNVs. Figure 8 shows the copy number estimates obtained on an already known highly polymorphic region of chromosome 17. The predicted gains and losses of GADA-JRN are retained when the reference set of 90 reference (CEU) samples is enlarged to include 180 (CEU+YRI) samples.

Fig. 6.
Consistency of the copy number estimates on HapMap trios if the set of reference samples changes.
Fig. 7.
Consistency within HapMap trios using a different sparseness setting T. The dashed and solid lines correspond to a 90 (CEU) and 180 (CEU+YRI) sample reference set, respectively. The cloud of points are the FTCR values obtained from 100 randomly formed ...
Fig. 8.
Section of the chromosome 17 that contains an already known CNV. Each row corresponds to one of the 90 CEU HapMap samples and are grouped in trios (father, son/daughter, mother) delimited by horizontal dotted lines. On the left of the thick vertical line ...
Table 2.
Comparison on HapMap trio consistency FTCR

Finally the computational time required for fitting the model is longer on the new approach, but still retains a very competitive linear complexity in the number of probes and samples (Fig. 9).

Fig. 9.
Computational time required to fit the model is linear on the number of samples for both approaches. Execution times required to process the models are measured on the same machine.

4 DISCUSSION

The application of the proposed GADA with joint reference estimation is not only limited to Affymetrix SNP arrays but it can also be applied to other platforms such as Illumina beadarrays or NimbleGen array comparative genomic hybridization (aCGH). In Illumina BeadStudio the probe hybridization intensities R (obtained after ACC) can be extracted instead of the LRR values. The extraction of the LRR values uses a cluster approach to compute the expected R values of non-CNVRs (Peiffer et al., 2006) which have similar drawbacks as separate median normalization. In aCGH, the reference DNA from a single sample or a pool of samples is used as a reference, but these log-ratio intensities may still contain a remaining ‘wavy’ artifact (Marioni et al., 2007) that our proposed approach could eliminate.

ITALICS (Rigaill et al., 2008) is an iterative approach that alternates two separate steps of copy number detection and normalization. The iterative concept is similar to the EM algorithm employed to fit the GADA-JRN model in this article, but there are two fundamental differences. First, ITALICS operates on a single sample and only includes a small set of parameters correcting for global array effects such as the fragment length and GC content; ITALICS considers that the reference non-CNV probe intensity is fixed and extracted from a separate reference set of samples. Second, the copy number extraction and the normalization model are iterated in practice only twice and not integrated under an unifying probabilistic model as in GADA-JRN. In contrast, GADA-JRN first proposes a multiple sample probability model which include parameters for the CNV component and the reference non-CNV probe hybridization intensity of every position of the genome; and then an iterative approach to fit the model is derived using the EM algorithm, which guarantees that the parameters converge.

GEMCA (Komura et al., 2006), only available for Affymetrix 500K platform, approaches the problem by finding the reference after copy number detection. First, the copy number is estimated on the difference between all the possible pairs. The CNVRs are defined as those identified in certain number of pairs. Then the largest subgroup of samples with no relative variations (found using a maximum clique algorithm) is used to establish the reference set for that particular region of the genome. The Canary algorithm in Birdsuite (Korn et al., 2008) uses a Gaussian mixture model (GMM) mixture model to identify a posteriori the copy number variation state of already delimited regions of CNV. In both GEMCA and Canary, the underlying assumption that CNVs are tightly aligned across samples and do not overlap with other possible CNVs represents a challenge in dealing with complex polymorphic CNVRs of the genome (e.g. Fig. 8).

GADA-JRN may also prove to be a flexible algorithm in variety of other applications. For example, samples processed by different labs or on different days may not necessarily have the same non-CNV reference hybridization intensity even if the platform is exactly the same. GADA-JRN could be applied to assess if this effect has changed between different experimental batches. Moreover, one or more control samples replicated across experimental batches could be used to better characterize and correct the probe hybridization noise.

In addition to the probe hybridization reference, other parameters could also be introduced to the GADA framework to model other known sources of variation. For example, allowing changes in scale of the bias (see Supplementary Materials), using an independent bias correction for each SNP allele, and including probe-specific noise variances (heteroscedastic model). The residuals of the model can be used to assess the impact of these other sources of variation on the results In particular, correlation on the model residuals can be used to discover hidden batch effects indicating the need for subgroup analysis (see Supplementary Materials). Finally, the GADA-JRN model could be used in combination with the N-GADA model (Pique-Regi et al., 2008a) that the authors proposed for modeling breakpoints of CNVs across multiple samples.

5 CONCLUSION

In this article, we introduce a new method in which the reference probe hybridization intensity is jointly estimated with the copy number component. This type of methods are essential for the characterization of CNV on normal population using the latest array technologies, in which the underlying genome CNVs are not known a priori. The currently used approach of separate preprocessing using the median to estimate the hybridization intensities may fail to accurately detect highly polymorphic regions of the genome. The new proposed method extends our previous GADA model by introducing a new vector of parameters that model a common hybridization bias shared across many arrays. Results demonstrate a significantly better performance with the new extended GADA model while maintaining the attractive linear computational complexity in number of probes and samples.

Supplementary Material

[Supplementary Data]

ACKNOWLEDGEMENTS

The authors would like to thank the two anonymous reviewers for valuable comments and suggestions.

Funding: NIH's Child Health Research Career Development Award Program (SA) (TK12-CA60104); Alex's Lemonade Stand Foundation (SA).

Conflict of Interest: none declared.

REFERENCES

  • Affymetrix Genome-wide human snp array 6.0 sample data set. 2007 Available at http://www.affymetrix.com/support/technical/sample_data/genomewide_snp6\data.affx(last accessed date February 17, 2009)
  • Affymetrix Genotyping Console 3.0.1 User Manual. 2008 Available at ftp://www.affymetrix.com/(last accessed date February 17, 2009)
  • Bengtsson H, et al. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008;24:759–767. [PubMed]
  • Diskin SJ, et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008;36:e126. [PMC free article] [PubMed]
  • Feuk L, et al. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97. [PubMed]
  • Fredman D, et al. Complex snp-related sequence variation in segmental genome duplications. Nat. Genet. 2004;36:861–866. [PubMed]
  • Freeman JL, et al. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–961. [PubMed]
  • Hap Map Consortium 2003 The International HapMap Project. Nature, 426, 789-796. Available at http://www.hapmap.org(last accessed date February 17, 2009).
  • Huang J, et al. Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum. Genomics. 2004;1:287–299. [PMC free article] [PubMed]
  • Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. [PubMed]
  • Komura D, et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006;16:1575–1584. [PMC free article] [PubMed]
  • Korn JM, et al. Integrated genotype calling and association analysis of snps, common copy number polymorphisms and rare cnvs. Nat. Genet. 2008;40:1253–1260. [PMC free article] [PubMed]
  • Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770. [PMC free article] [PubMed]
  • Marioni J, et al. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007;8:R228. [PMC free article] [PubMed]
  • McCarroll SA, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 2008;40:1166–1174. [PubMed]
  • Nannya Y, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65:6071–6079. [PubMed]
  • Peiffer DA, et al. High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 2006;16:1136–1148. [PMC free article] [PubMed]
  • Perry GH, et al. The fine-scale and complex architecture of human copy-number variation. Am. J. Hum. Genet. 2008;82:685–695. [PMC free article] [PubMed]
  • Pique-Regi R, et al. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. Honolulu (Hawaii): 2007. Wavelet footprints and sparse bayesian learning for DNA copy number change analysis.
  • Pique-Regi R, et al. IEEE International Workshop on Genomic Signal Processing and Statistics, 2008 (GENSiPS 2008). Phoenix (AZ): 2008a. Bayesian detection of recurrent copy number alterations across multiple array samples; pp. 1–4.
  • Pique-Regi R, et al. Sparse representation and bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008b;24:309–318. [PMC free article] [PubMed]
  • Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. [PMC free article] [PubMed]
  • Rigaill G, et al. ITALICS: an algorithm for normalization and DNA copy number calling for affymetrix SNP arrays. Bioinformatics. 2008;24:768–774. [PubMed]
  • Schadt EE, et al. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell Biochem. 2001;(Suppl. 37):120–125. [PubMed]
  • Tipping ME. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001;1:211–244.
  • Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005;21:4084–4091. [PubMed]
  • Wipf D, Rao B. Sparse Bayesian learning for basis selection. IEEE-Trans-SP. 2004;52:2153–2164.
  • Zhao X, et al. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004;64:3060–3071. [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...