Nucleic Acids Res. Jul 2011; 39(13): e89.
Published online May 14, 2011. doi:  10.1093/nar/gkr137
PMCID: PMC3141250

An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes

Abstract

Detection of copy number variation (CNV) in DNA has recently become an important method for understanding the pathogenesis of cancer. While existing algorithms for extracting CNV from microarray data have worked reasonably well, the trend towards ever larger sample sizes and higher resolution microarrays has vastly increased the challenges they face. Here, we present Segmentation analysis of DNA (SAD), a clustering algorithm constructed with a strategy in which all operational decisions are based on simple and rigorous applications of statistical principles, measurement theory and precise mathematical relations. Compared with existing packages, SAD is simpler in formulation, more user friendly, much faster and less thirsty for memory, offers higher accuracy and supplies quantitative statistics for its predictions. Unique among such algorithms, SAD's running time scales linearly with array size; on a typical modern notebook, it completes high-quality CNV analyses for a 250 thousand-probe array in ~1 s and a 1.8 million-probe array in ~8 s.

INTRODUCTION

Amplification or deletion of chromosomal segments can lead to abnormal mRNA transcript levels and result in the malfunctioning of cellular processes. Locating such chromosomal aberrations in comparative genomic DNA samples, or copy number variation (CNV) (1–4), is an important step in understanding the pathogenesis of many diseases, especially cancer. Array comparative genomic hybridization (CGH) is a high-throughput technique developed for measuring such changes (5–7). CGH arrays using Bacterial Artificial Chromosome (BAC) clones have resolutions of the order of 1 Mb (6). Those using cDNA and oligonucleotides as probes (1,8) are less robust than BACs for large segments, but offer much higher resolutions (of the order of 50–100 kb). In particular, oligonucleotide arrays allow design flexibility and greater coverage and provide good sensitivity (8). Tiling on custom arrays is also now available for even finer resolution of specific regions and allows the detection of micro-amplifications and deletions (9,10). The drastic improvement in resolution has led to a corresponding increase in the number of probes on an array; modern high-resolution arrays now easily exceed one million probes. Such arrays place severe demands on the speed and accuracy of the algorithms used to analyze them and have vastly reduced the usefulness of existing algorithms that are $\mathcal{O}(N^2)$, where N is the array size, in computation time or memory requirement. Here, we propose a novel algorithm, segmentation analysis of DNA (SAD), for studying CNV in high-resolution arrays.

For a probe, the log2-ratio of intensities from a pair of microarrays is termed a datum. Based on our observation that datum errors tend to be normally distributed, we designed SAD with three features, respectively involving the use of: (i) the Gaussian distribution function (Gaussian) as a probability density function (PDF) for evaluating the true value of a measured datum; (ii) a clustering procedure based on a technique we call pair-wise Gaussian merging (PGM); (iii) z-statistic for making clustering decisions. Details are given in Methods. The operational principles of PGM are schematically illustrated in Figure 1. In this case, the original 10 datums are predicted by SAD to have an underlying structure of two segments. SAD has one essential parameter, the threshold z-value z0, and an optional one, the sampling size Ns. z0 defines a significance level p0 for making clustering decisions and for calling CNVs. Ns is used for speeding up SAD.

Figure 1.
Schematic illustration of PGM applied to genome segmentation. Frames on the left, with the x-axis indicating relative probe position on the genome, display datums, as solid grey squares and clusters, as black crosses with errorbars; frames on the right ...

We show in the following sections that, compared with algorithms found in the literature, SAD has a simpler but more rigorous formulation, is easier to understand and simpler to use, provides clearer statistical interpretation for its results, requires less memory, offers better accuracy and is vastly faster in computation speed.

MATERIALS AND METHODS

Normal distribution of error

Data not having any CNV are best for demonstrating normal distribution of error. For this reason we contrasted pairs of replicate arrays among each of the four triplicate array sets, NA15510_Nsp, NA15510_Sty, NA10851_Nsp and NA10851_Sty (henceforth the Redon data set), that were produced on the Affymetrix 500K EA platform in a CNV study (3). Because each set has three contrasted pairs, the sets give a total of 12 error distributions. In Figure 2, the error distributions, after standardization and normalization, are compared to standard normal distributions in terms of the Kolmogorov–Smirnov statistic (KS). The small KS values confirm that Gaussian is an excellent approximation to the error distributions.
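
As an illustration of the normality check described above, the following Python sketch computes a one-sample Kolmogorov–Smirnov distance between standardized replicate differences and the standard normal distribution. It is a minimal sketch under stated assumptions: the data are simulated rather than taken from the Redon data set, and the helper names `normal_cdf` and `ks_statistic` are hypothetical, not part of SAD.

```python
import math
import random
import statistics


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(errors):
    """KS distance between the standardized errors and N(0, 1)."""
    mean = statistics.fmean(errors)
    sd = statistics.stdev(errors)
    z = sorted((e - mean) / sd for e in errors)
    n = len(z)
    d = 0.0
    for k, value in enumerate(z):
        cdf = normal_cdf(value)
        # Compare with the empirical CDF just before and just after its jump.
        d = max(d, abs(cdf - k / n), abs((k + 1) / n - cdf))
    return d


# Simulated stand-in for two replicate arrays without CNV (assumed data).
rng = random.Random(0)
replicate_a = [rng.gauss(0.0, 0.2) for _ in range(5000)]
replicate_b = [rng.gauss(0.0, 0.2) for _ in range(5000)]
errors = [a - b for a, b in zip(replicate_a, replicate_b)]
print(f"KS = {ks_statistic(errors):.4f}")
```

A small KS value indicates that the standardized error distribution is close to Gaussian; a strongly non-Gaussian sample gives a much larger value.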

Figure 2.
Normality test of datum error using the Redon data set. In terms of KS, the normalized standard error distributions, shown as grey histograms, are compared to standard normal distributions, shown as black lines.

We examined error properties in more detail using the Affymetrix 500K copy number sample data set (http://www.affymetrix.com). Figure 3a shows the log2-ratio profile of chromosome 2 from the (CRL-5868D, CRL-5957D) STY pair and our selection of two ~8000-datum sections of obviously distinct means. Figure 3b compares the log2-ratio distributions of the two sections with their respective Gaussian approximations, G(y;0.35,(0.22)²) and G(y;−0.13,(0.23)²), which have different means but similar variances. These two sections and an artificial 8000-datum section of randomly generated G(y;0,(0.22)²) noise were used to study the sample-size and spatial dependence of error. Each section is partitioned into subsections of width 4i, i = 3 to 8, plus a discarded remainder. The error of each subsection is measured using Equation (4). Each section at each i thus has an error distribution whose mean and standard deviation are plotted in Figure 3c. The two sections are shown to have spatial as well as statistical properties similar to those of the artificial data. In particular, this implies that, for the array data, statistical errors (excluding breakpoints) are more or less uniformly distributed.

Figure 3.
Sample-size and spatial independence of variation. Data are from the Affymetrix 500K copy number sample data set. (a) The 2 sections and a remainder of chromosome 2 from the (CRL-5868D,CRL-5957D) STY pair. (b) log2-ratio distributions of sections 1 and ...

Pair-wise Gaussian merging

Given a measured value $\nu$, the conditional probability for its true value being $y$ is $\Pr(y|\nu) = \Pr(y \cap \nu)/\Pr(\nu)$. Similarly, given a set of independently measured values $\Omega = \{\nu_i \mid i = 1, \ldots, w\}$, we have $\Pr(y|\Omega) = \Pr(y \cap \Omega)/\Pr(\Omega)$ and, from the independence of events, $\Pr(y \cap \Omega) = \prod_{i=1}^{w}\Pr(y \cap \nu_i)$, $\Pr(\Omega) = \prod_{i=1}^{w}\Pr(\nu_i)$. Therefore, $\Pr(y|\Omega) = \prod_{i=1}^{w}\Pr(y|\nu_i)$. In case of continuous variables, the probability that the true value lies in the interval $y$ to $y + dy$ is $\Pr(y;dy|\Omega) = dy\,D(y|\Omega)$, with $D(y|\Omega) \propto \prod_{i=1}^{w} D(y|\nu_i)$, where the $D$'s are PDFs. Given that errors are normally distributed with initial variance $\sigma_0^2$, we approximate $D(y|\nu_i)$ by a Gaussian $G(y;\nu_i,\sigma_0^2) = (2\pi\sigma_0^2)^{-1/2}\exp(-(y-\nu_i)^2/2\sigma_0^2)$. Repeatedly using the relation that a product of two Gaussians is another Gaussian we have

$$D(y|\Omega) \propto \prod_{i=1}^{w} G(y;\nu_i,\sigma_0^2) \propto G(y;\mu,\sigma^2),$$

$$\mu = \frac{1}{w}\sum_{i=1}^{w}\nu_i, \tag{1}$$

$$\sigma^2 = \frac{\sigma_0^2}{w}. \tag{2}$$

We call this method of merging Gaussians to obtain a PDF from a set of measurements Gaussian merging (GM). The formulations of both $\mu$ and $\sigma$ are intuitively understood: $\mu$ is the mean of the measured values and $\sigma^2$ is inversely proportional to sample size, as expected.
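
The Gaussian merging step can be illustrated with a short Python sketch. It relies only on the standard result that the product of two Gaussians in y is proportional to a Gaussian whose precision (inverse variance) is the sum of the two precisions and whose mean is the precision-weighted average. The `Gaussian` class, the `merge` function and the numerical values are hypothetical choices made for illustration; they are not taken from the SAD program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    mu: float   # estimate of the true value
    var: float  # variance of that estimate
    w: int = 1  # number of datums merged so far


def merge(g1, g2):
    """Merge two Gaussians by multiplying their PDFs and renormalizing."""
    precision = 1.0 / g1.var + 1.0 / g2.var
    mu = (g1.mu / g1.var + g2.mu / g2.var) / precision
    return Gaussian(mu, 1.0 / precision, g1.w + g2.w)


sigma0 = 0.2                       # assumed datum error
values = [0.31, 0.42, 0.28, 0.39]  # assumed log2-ratios
merged = Gaussian(values[0], sigma0 ** 2)
for v in values[1:]:
    merged = merge(merged, Gaussian(v, sigma0 ** 2))
print(merged)
```

With equal initial variances the merged mean equals the arithmetic mean of the four values (0.35) and the merged variance equals the initial variance divided by 4, in agreement with the statement that the variance is inversely proportional to sample size.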

To allow the possibility that Ω comprises multiple subsets each the manifest of a different true value, we conduct a two-sample z-test (for independent samples with equal variances), before merging two Gaussians using a z-value, here called the resolvability,

$$z_r(G_1,G_2) = \frac{\mu_1-\mu_2}{\sqrt{\sigma_1^2+\sigma_2^2}}, \tag{3}$$

where $G_k \equiv G(y;\mu_k,\sigma_k^2)$, k = 1 and 2. That $z_r$ follows a standard normal distribution is shown in Supplementary Data. The corresponding P-value of $z_r$ tests the null hypothesis that $G_1$ and $G_2$ have the same true value. Given threshold resolvability $z_0$, we say $G_1$ and $G_2$ are resolvable if $|z_r(G_1,G_2)| \geq z_0$, in which case the two Gaussians are kept separate; otherwise they are unresolvable and are merged. The following four-step procedure, which we call PGM, partitions $\Omega$ into resolvable subsets: (i) Estimate the variance of each datum. (ii) Select $z_0$. (iii) Identify the unresolvable pair of Gaussians with the smallest $|z_r|$ and use GM to merge the pair. (iv) Iterate step (iii) until all remaining pairs are resolvable. PGM is a type of agglomerative hierarchical clustering using $z_r$ as distance. In the present application, only spatially contiguous datums (except when separated by an outlier) are merged, and the partitioned subsets correspond to segments of different log2-ratios.
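
The four-step procedure can be sketched in Python as follows. This is a simplified illustration rather than the SAD implementation: it merges contiguous pairs only, omits the treatment of outliers, rescans all pairs in every iteration, and assumes the resolvability is the difference of the means divided by the square root of the summed variances. Clusters are represented as hypothetical `(mu, var, width)` tuples, and the ten input values are invented.

```python
import math


def resolvability(c1, c2):
    """Two-sample z-value between clusters given as (mu, var, width)."""
    return (c1[0] - c2[0]) / math.sqrt(c1[1] + c2[1])


def merge(c1, c2):
    """Gaussian merging: precisions add, means are precision-weighted."""
    precision = 1.0 / c1[1] + 1.0 / c2[1]
    mu = (c1[0] / c1[1] + c2[0] / c2[1]) / precision
    return (mu, 1.0 / precision, c1[2] + c2[2])


def pgm_contiguous(values, sigma0, z0):
    """Merge the least resolvable contiguous pair until all pairs resolve."""
    clusters = [(v, sigma0 ** 2, 1) for v in values]
    while len(clusters) > 1:
        z = [abs(resolvability(clusters[i], clusters[i + 1]))
             for i in range(len(clusters) - 1)]
        i = min(range(len(z)), key=z.__getitem__)
        if z[i] >= z0:
            break  # every remaining contiguous pair is resolvable
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters


values = [0.02, -0.03, 0.01, 0.00, -0.02, 0.61, 0.58, 0.63, 0.57, 0.60]
segments = pgm_contiguous(values, sigma0=0.05, z0=3.0)
for mu, var, width in segments:
    print(f"width={width} mean={mu:.3f} sd={math.sqrt(var):.4f}")
```

For these ten invented datums the procedure ends with two segments of five datums each, mirroring the two-segment structure of the schematic example in Figure 1.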

The SAD algorithm: clustering

SAD has two clustering modes: the linear mode (LM) for low-resolution arrays or when computation time is not a concern, and the parallel mode (PM) otherwise. LM has a single parameter $z_0$ while PM has an additional parameter $N_s$ whose default value of 100 is highly recommended. The steps in LM are: (i) Computation of $\sigma_0$. Let $\{\nu_i \mid i = 1, \ldots, N\}$ be the initial data of log2-ratio, $q_i = \nu_{i+1} - \nu_i$ and $SD_q$ be the standard deviation of the $q_i$'s, then

$$\sigma_0 = \frac{SD_q}{\sqrt{2}}. \tag{4}$$

$\sigma_0$ measures datum error and is sensitive only to the existence of breakpoints, which are assumed to be sparse. Treat each datum as a single-datum cluster and assign $G(y;\nu_i,\sigma_0^2)$ to the $i$-th datum-cluster. (ii) Selection of $z_0$. This stipulates when PGM iteration stops and addresses the statistical issues discussed in the following subsection. (iii) PGM Phase I. Perform chromosome-wide PGM iteratively on all contiguous cluster pairs. At the end of this phase each remaining single-datum cluster is a ‘loner’ whose existence prevents the merging of its two neighbouring clusters even if they are unresolvable. (iv) PGM Phase II. Along with contiguous pairs, continue step (iii) to merge loner-divided pairs. After a loner-divided pair is merged the dividing loner becomes an ‘outlier’ and is excluded from subsequent calculation. At the end of this stage each of the resultant clusters is a ‘segment’ with an associated Gaussian $G(y;\mu,\sigma^2)$ serving as a PDF for its true value. (v) Normalization. Perform genome-wide PGM on the entire set of segments to merge contiguous as well as unconnected segment pairs. Identify the largest resultant cluster and denote its mean by $\mu_b$, here called the ‘baseline’. The baseline will be taken as the reference for the CNV significance test.
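
Step (i) can be illustrated with a short Python sketch. It assumes that Equation (4) amounts to dividing the standard deviation of the neighbour differences by the square root of 2, which is what holds for independent errors of equal variance. The function name `estimate_sigma0` and the simulated profile are hypothetical.

```python
import math
import random
import statistics


def estimate_sigma0(values):
    """Datum error estimated from differences of neighbouring datums."""
    q = [b - a for a, b in zip(values, values[1:])]
    return statistics.stdev(q) / math.sqrt(2.0)


# Simulated profile (assumed data): noise of SD 0.22 plus one gained
# region of height 0.5 covering 20% of the probes.
rng = random.Random(1)
profile = [0.0] * 20000
for i in range(8000, 12000):
    profile[i] = 0.5
values = [m + rng.gauss(0.0, 0.22) for m in profile]

sigma0 = estimate_sigma0(values)
naive_sd = statistics.pstdev(values)
print(f"difference-based estimate: {sigma0:.3f}")
print(f"plain standard deviation:  {naive_sd:.3f}")
```

Because only the two differences that straddle a breakpoint carry the jump, the difference-based estimate stays close to the true noise level, whereas the plain standard deviation of the profile is inflated by the gained region.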

As PGM involves very little computation, LM is inherently a fast algorithm. On the other hand, owing to the iterative procedure, the problem size is $\mathcal{O}(N^2)$, implying long computation time when $N$ is large. PM reduces the problem size to $\mathcal{O}(N)$ with little sacrifice in accuracy. In that case, a sampling size $N_s$ is selected (by the user) and the various steps in LM are adjusted as follows. In (v), the baseline $\mu_b$ is computed using only the widest $N_s$ segments. This reduces problem size from $\mathcal{O}(N_{seg}^2)$, where $N_{seg}$ is the number of resultant segments, to $\mathcal{O}(N_s^2)$. In (iii) and (iv), prior to merging, the entire current cluster set is partitioned into subsets of $N_s$ contiguous clusters, plus a remainder. The subsets are processed in parallel and the most unresolvable pair in each subset, if there is any, is merged. Thereafter the subsets of clusters (some of which have been reduced in size through merging) are joined with the remainder circularly, with the beginning of the remainder taken as the starting point, and readied for a new round of partition and merging. This is a dynamical procedure resulting in a different partition in each iteration. The problem size for each of the $N/N_s$ subsets is $\mathcal{O}(N_s^2)$, making the total problem size $\mathcal{O}(N N_s)$.
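
The idea of merging within subsets under a changing partition can be sketched in Python as follows. This is a loose illustration under explicit assumptions, not the PM procedure itself: instead of joining the subsets circularly with the remainder, the sketch alternates the partition origin between 0 and Ns/2 so that every contiguous pair is interior to some subset within two rounds; the subsets are processed in a sequential loop rather than in parallel; and outliers are ignored. All names are hypothetical.

```python
import math


def z_abs(c1, c2):
    """Absolute resolvability between clusters given as (mu, var, width)."""
    return abs(c1[0] - c2[0]) / math.sqrt(c1[1] + c2[1])


def merge(c1, c2):
    precision = 1.0 / c1[1] + 1.0 / c2[1]
    mu = (c1[0] / c1[1] + c2[0] / c2[1]) / precision
    return (mu, 1.0 / precision, c1[2] + c2[2])


def merge_round(clusters, ns, offset, z0):
    """Merge the least resolvable pair inside each subset of <= ns clusters."""
    bounds = sorted({0, len(clusters), *range(offset, len(clusters), ns)})
    out, merged_any = [], False
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = clusters[lo:hi]
        if len(chunk) > 1:
            i = min(range(len(chunk) - 1),
                    key=lambda k: z_abs(chunk[k], chunk[k + 1]))
            if z_abs(chunk[i], chunk[i + 1]) < z0:
                chunk[i:i + 2] = [merge(chunk[i], chunk[i + 1])]
                merged_any = True
        out.extend(chunk)
    return out, merged_any


def pgm_chunked(values, sigma0, z0, ns=100):
    if ns < 2:
        raise ValueError("ns must be at least 2")
    clusters = [(v, sigma0 ** 2, 1) for v in values]
    offsets = (0, ns // 2)  # two different partition origins
    idle, rnd = 0, 0
    # Stop after two consecutive rounds (one per origin) without a merge:
    # at that point every contiguous pair has been found resolvable.
    while idle < 2:
        clusters, merged_any = merge_round(clusters, ns, offsets[rnd % 2], z0)
        idle = 0 if merged_any else idle + 1
        rnd += 1
    return clusters


values = [0.02, -0.03, 0.01, 0.00, -0.02, 0.61, 0.58, 0.63, 0.57, 0.60]
segments = pgm_chunked(values, sigma0=0.05, z0=3.0, ns=4)
print([(round(mu, 3), width) for mu, var, width in segments])
```

Each round touches every cluster once and performs at most one merge per subset, which is the property that keeps the cost of a round linear in the number of clusters. Because the merging order differs from that of a global search, results can differ when the subset size is very small.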

The SAD algorithm: CNV calling and selection of z0

After clustering, consider two contiguous segments: a narrow segment $s_1$ of $G_1 = G(y;\mu_1,\sigma_0^2/w_1)$ and a much wider non-CNV segment $s_2$ of $G_2 = G(y;\mu_2,\sigma_0^2/w_2)$. Let $H_a$ be the null hypothesis that $s_1$ is non-CNV (i.e. the true value of $s_1$ is the baseline $\mu_b$). An independent one-sample z-test using a z-value, here called the ‘aberrance’,

$$z_a(G_1) = \frac{\mu_1-\mu_b}{\sigma_0/\sqrt{w_1}} \tag{5}$$

yields a P-value for testing $H_a$, as is expected by the central limit theorem. From Equations (3) and (5), because $w_2 \gg w_1$, we have

$$z_r(G_1,G_2) \approx \frac{\sqrt{w_1}\,(\mu_1-\mu_2)}{\sigma_0} \approx z_a(G_1). \tag{6}$$

The lower bound for $|z_r(G_1,G_2)|$, $z_0$, is therefore also the approximate lower bound for $|z_a(G_1)|$. We therefore employ $p_0$, the corresponding P-value of $z_0$, as the significance level for testing $H_a$. We call $s_1$ a CNV if $|z_a(G_1)| \geq z_0$. More specifically, we call the segment a ‘gain’ if $z_a(G_1) \geq z_0$, or a ‘loss’ if $z_a(G_1) \leq -z_0$.

Because $|\mu_1-\mu_b|/\sigma_0$ is just the signal to noise ratio (SNR) of $s_1$, Equation (6) leads to

$$w_1 \gtrsim \left(\frac{z_0}{\mathrm{SNR}}\right)^2. \tag{7}$$

That is, if SNR is known, $z_0$ also sets an approximate lower bound for CNV width.
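
The calling rule can be sketched in Python as follows. The sketch assumes that the aberrance is the segment mean minus the baseline, divided by the standard deviation of the segment's Gaussian, and that the P-value attached to a z-value is two-sided; the text does not state which tail convention SAD uses. The function names and the example segments are hypothetical.

```python
import math


def aberrance(mu, width, baseline, sigma0):
    """z-value of a segment mean against the baseline."""
    return (mu - baseline) / (sigma0 / math.sqrt(width))


def p_value(z):
    """Two-sided P-value of a standard normal z-value (assumed convention)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def call_segment(mu, width, baseline, sigma0, z0):
    za = aberrance(mu, width, baseline, sigma0)
    if za >= z0:
        return "gain", za
    if za <= -z0:
        return "loss", za
    return "non-CNV", za


sigma0, baseline, z0 = 0.2, 0.0, 8.0
segments = [(0.58, 16), (-0.30, 4), (-0.90, 9)]  # (mean, width), invented
calls = [call_segment(mu, w, baseline, sigma0, z0) for mu, w in segments]
for (mu, w), (label, za) in zip(segments, calls):
    print(f"mean={mu:+.2f} width={w:2d} za={za:+6.1f} -> {label}")
```

The second invented segment has a clearly negative mean but is too narrow to reach the threshold, which illustrates how a fixed threshold ties the detectable CNV width to the signal to noise ratio.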

Software availability

The SAD program is available for download at: http://www.sybbi.ncu.edu.tw/software.htm or upon request by email at: pairwise.gaussian.merging@gmail.com.

RESULTS

In Lai et al. (11) (hereafter referred to as LJKP) the performances of 11 CNV algorithms were compared using simulated data for testing receiver operating characteristic (ROC) as well as real Glioblastoma Multiforme (GBM) data. The 11 comprised 3 smoothing-only (SO) algorithms, lowess, wavelet (12) and quantreg (13), and 8 estimation-performing (EP) algorithms, CGHseg (14), CBS (15), ChARM (16), ACE (17), HMM (18), GLAD (19), GA (20) and CLAC (21). LJKP found that the overall top three EP performers were CGHseg, CBS and GLAD. In Fiegler et al. (22) two more recently developed EP algorithms, CNVfinder (22) and SW-ARRAY (23), were compared in accuracy using real data. Among these algorithms only CLAC and ACE provide quantitative statistics.

We tested SAD against the 10 EP algorithms in ROC. The SO algorithms were excluded because they do not explicitly address breakpoints. The ones rated accurate, CGHseg, CBS and GLAD, were further compared to SAD in speed and memory. In addition we validated SAD on low- and high-resolution data sets. We designate a SAD run in LM by SAD(z0,−) and in PM by SAD(z0,Ns).

Accuracy

We calculated (details in Supplementary Data) the ROC curves of SAD the same way as in LJKP except that for better statistics we generated 10 000 instead of 100 simulated chromosomes (of 100 datums each) for each parameter set in each setting. The results (Supplementary Figure S1) indicate that a higher z0 is more suitable for easy settings (wide CNV and large SNR) while a lower z0 better facilitates CNV detection in difficult settings (narrow CNV or small SNR). Table 1 compares SAD(z0,100), z0 = 1.5, 2.0 and 4.0, in area-under-curve (AUC) value with the 10 EP algorithms for two easy settings, (SNR,width) = (4,20) and (3,40), and two difficult settings, (2,5) and (1,10). Numbers for the eight LJKP-tested algorithms were read from Figure 2 in LJKP. Numbers for SW-ARRAY and CNVfinder were calculated using their reportedly optimal parameter values. In the easy settings, SAD(1.5–4.0,100), CBS, CGHseg, GA, GLAD and HMM perform well. In the difficult settings, SAD(1.5–2.0,100) is the best performer and CGHseg is next. Although CNVfinder performs above average in the difficult settings, it is below average in the easy settings.

Table 1.
Comparison in AUC value of ROC, of SAD against existing algorithms for two easy settings, (SNR,width) = (4,20) and (3,40), and two difficult settings, (2,5) and (1,10)

In PM, higher computation speed is facilitated by using a smaller Ns. Because PM alters the clustering order relative to that in LM, this can induce error when Ns is too small. We tested SAD in this regard and found that the overall error is negligible when Ns ≳ 100 (Supplementary Figure S2).

Speed and memory

All calculations reported here were carried out on a computer with an Intel Core 2 Duo T7500 2.2 GHz (L2: 4 MB) CPU and 2 GB of DDR2 memory, running Windows XP. All programs ran as a single thread and used 50% of the CPU. Our SAD program is written in Visual C++. The other algorithms were tested with the provided programs at default parameter values. The simulated chromosomes were generated with SNR = 2. Each simulated chromosome had either one or two gains. For planting the gains each chromosome was divided into five same-width sections. The second section was amplified in one-gain cases, and the second and the fourth sections were amplified in two-gain cases. Computation time τ was measured for each case; the difference in τ between one and two gains reflects the dependence of speed on genomic profiles. Memory usage was read from the Processes tab of Windows Task Manager in two steps: data loading and data processing. The reading between the two steps, denoted by κd, is the memory used for program and data. The maximum reading during data processing was recorded as κo and the difference κp = κo − κd was taken to be the maximum memory needed for data processing. The power-law exponents γτ and γκ were derived from the N dependences of τ and κp, respectively.
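
A power-law exponent can be extracted from measurements of this kind by a least-squares fit of a straight line in log-log space. The text does not spell out the fitting procedure used, so the following Python sketch shows only one common way of doing it; the function name and the timing values are hypothetical and are not the measurements reported here.

```python
import math


def power_law_exponent(sizes, measurements):
    """Least-squares slope of log(measurement) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(m) for m in measurements]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


sizes = [10 ** 4, 10 ** 5, 10 ** 6]
linear_times = [8e-6 * n for n in sizes]         # synthetic, tau ~ N
quadratic_times = [2e-9 * n * n for n in sizes]  # synthetic, tau ~ N^2
gamma_linear = power_law_exponent(sizes, linear_times)
gamma_quadratic = power_law_exponent(sizes, quadratic_times)
print(f"{gamma_linear:.2f} {gamma_quadratic:.2f}")
```

An exponent near 1 corresponds to linear scaling with array size and an exponent near 2 to quadratic scaling.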

We compared SAD(10,100) to CGHseg, CBS and GLAD and show the results in Figure 4. We see that: (i) SAD is vastly faster than the others; at $N \approx 10^6$ it is already two orders of magnitude faster than CBS, its closest competitor. (ii) In computation time SAD is $\mathcal{O}(N)$ while GLAD and CGHseg are $\mathcal{O}(N^2)$. CBS, claimed to be $\mathcal{O}(N)$ at low resolution (24), becomes $\mathcal{O}(N^2)$ at $N \approx 5 \times 10^5$. (iii) Speed dependence on genomic profile, reflected by the difference between the 1-gain results and the 2-gain results, is significant for CBS, minor for GLAD and CGHseg, and negligible for SAD. (iv) SAD requires the least amount of memory, overall (κo) as well as for data processing (κp). (v) In memory requirement SAD and GLAD scale as $\mathcal{O}(N)$, CBS displays irregularity, and CGHseg scales as $\mathcal{O}(N^2)$. On a computer with 2 GB of memory, CGHseg ceases to function when N exceeds about 16 000. For this reason CGHseg is not considered for further comparison.

Figure 4.
Comparisons of SAD to CGHseg, CBS and GLAD in speed and memory requirement. (a) Computation time τ versus N. (b) Power-law exponent γτ for τ derived from (a). (c) Overall memory κo versus N. (d) Data-processing ...

Using real data, we ran SAD(10,100) on a 1.8-million-probeset Affymetrix Genome-Wide Human SNP Array 6.0 hybridized with a colorectal cancer sample, and measured τ = 8 seconds and κo = 323 MBs.

Validation on a low-resolution data set

We used a 2276-BAC public data set from the NIGMS Human Genetics Cell Repository (25) (henceforth the Snijders dataset) to perform low-resolution validation of SAD and to demonstrate the utility of z0 for limiting CNV width. The dataset corresponds to 15 human cell strains. As identified by spectral karyotyping, each cell strain has either one or two CNVs and eight of the CNVs on six strains were detected to be whole-chromosome. We set a value of z0 using Equation (7). For trisomic segments, the data set has SNR ≈ 0.58/0.09, where 0.58 ≈ log2(3/2) is approximately the log2-ratio of a trisomic segment and 0.09 is the value for σ0 obtained from Equation (4). To detect a minimum CNV width between one datum (because one-datum CNVs are likely to be outliers) and two, 6.4 < z0 < 9.1 is required. We therefore used SAD(8,100) for this calculation.
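
The quoted range for z0 can be checked with a few lines of Python, assuming that the width bound takes the form width ≳ (z0/SNR)², so that the threshold corresponding to a minimum width w is SNR·√w. The function names are hypothetical.

```python
import math


def z0_for_min_width(snr, width):
    """Threshold at which segments of the given width become just detectable."""
    return snr * math.sqrt(width)


def min_width(snr, z0):
    """Approximate lower bound on CNV width for a given threshold."""
    return (z0 / snr) ** 2


snr = 0.58 / 0.09  # trisomy log2-ratio over datum error, as quoted above
lower = z0_for_min_width(snr, 1)
upper = z0_for_min_width(snr, 2)
print(f"{lower:.1f} < z0 < {upper:.1f}")
print(f"minimum width at z0 = 8: {min_width(snr, 8.0):.2f} datums")
```

With z0 = 8 the bound falls between one and two datums, so single-datum excursions are rejected while two-datum trisomic segments remain detectable.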

Because the data set had previously been examined by GLAD (19) and CBS (15), we compare the three sets of results in full detail in Supplementary Table S1, and summarize the comparison as follows. (1) SAD(8,100) detects more CNVs than GLAD and CBS do. (2) SAD(8,100) gives far fewer false positives; the average numbers of false-positive breakpoints per cell strain are 2/15, 46/15, 26/15, 37/9 and 16/9 for SAD(8,100), GLAD(λ′ = 8), GLAD(λ′ = 10), CBS(α = 0.01) and CBS(α = 0.001), respectively. (3) SAD alone assigns a z-value to each CNV for assessing significance. (4) SAD(8,100) alone detects whole-chromosome CNVs, on whose detection GLAD and CBS are silent because they are based on breakpoint detection within chromosomes.

Validation on a high-resolution dataset

In Redon et al. (3), 43 genomic regions were examined by SYBR real-time PCR or MassSpec to validate the respective CNV calls for NA15510 vs NA10851 on the Affymetrix 500K EA platform. We used three of these regions, cnp8, cnp23 and cnp36, respectively determined in (3) to be gain, loss and gain, to validate SAD and to demonstrate the utility of z0 for characterizing CNV significance. In Figure 5, the results of three runs, SAD(10,100), SAD(8,100) and SAD(6,100), on the first Sty replicates of the Redon dataset are respectively shown in frame sets (a), (b) and (c). At z0 = 10 (Figure 5a) only cnp36 is detected with za = 10.6. When z0 is lowered to 8 (Figure 5b), cnp23 is detected with za = −8.2. When z0 is further lowered to 6 (Figure 5c), cnp8 is detected with za = 7.4.

Figure 5.
A high-resolution validation test for SAD on 3 genomic regions with known CNVs, whose positions are shown as thick black segments in the frames. The three sets of frames are for the three runs: (a) SAD(10,100); (b) SAD(8,100); and (c) SAD(6,100). Data ...

DISCUSSION

We have demonstrated that, by virtue of its accuracy, parsimony in memory use and speed, SAD can manage the challenges of analyzing modern high-resolution microarrays significantly better than existing algorithms. Algorithmically SAD is easy to understand because it employs fundamental principles of statistics and precise but very simple mathematics [as compared to the mathematics in the formulation of, say, GLAD (19)]. SAD makes all internal decisions based on statistics and provides an external quantitative statistic. With only two user-tunable parameters, z0 and Ns, the meanings of which are both intuitively accessible, SAD is also the easiest to use. Users can select z0, the primary parameter, based on their requirement for CNV significance or CNV width. We recommend setting the second parameter, Ns, to 100. This guarantees good accuracy and a computation time that is $\mathcal{O}(N)$.

Quantitative statistics provide the basis on which a level of confidence may be assigned to each inference and on which a priority for experimental confirmation of such inferences may be set. All measurements, especially those involving microarrays, carry inherent statistical error. SAD quantifies such errors as data uncertainty, tracks the latter throughout the clustering process using exact mathematical relations, and provides z-values for assessing CNV significance. The z-values, when used for downstream calculations such as the identification of recurrent aberrations using multiple arrays, allow the initial uncertainty to be passed on further.

SAD is an application built on PGM. The improvement of SAD computation time from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ is a consequence of the parallel processing made possible by the employment of agglomerative hierarchical clustering in PGM. The superior accuracy of SAD results from the exploitation by PGM of a common trait seen in most systems: that measurement errors are normally distributed. The operating principle of SAD is accessible to the user because in PGM the resolving power used for determining breakpoints is controlled via an intuitive statistical threshold. These properties of PGM promise its usefulness and wide application, beyond CNV, in the general analysis of microarray data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Council (ROC) (Grant 97-2112-M-008-013, in part); Cathay General Hospital-NCU Collaboration (Grant 97-CGH-NCU-A1, in part). Funding for open access charge: National Science Council and the Ministry of Education (Research grants to H.C.L.).

Conflict of interest statement. None declared.

REFERENCES

1. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet. 1999;23:41–46.
2. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528.
3. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454.
4. Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME. A robust statistical method for case-control association testing with copy number variation. Nat. Genet. 2008;40:1245–1252.
5. Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Gene Chromosome. Canc. 1997;20:399–407.
6. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet. 1998;20:207–211.
7. Pinkel D, Albertson DG. Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 2005;37(Suppl.):11–17.
8. Brennan C, Zhang Y, Leo C, Feng B, Cauwels C, Aguirre AJ, Kim M, Protopopov A, Chin L. High-resolution global profiling of genomic alterations with long oligonucleotide microarray. Cancer Res. 2004;64:4744–4748.
9. Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, et al. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res. 2003;13:2291–2305.
10. Ishkanian AS, Malloff CA, Watson SK, deLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA, et al. A tiling resolution DNA microarray with complete coverage of the human genome. Nat. Genet. 2004;36:299–303.
11. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770.
12. Hsu L, Self SG, Grove D, Randolph T, Want K, Delrow JJ, Loo L, Porter P. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005;6:211–226.
13. Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics. 2005;21:1146–1153.
14. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J. A statistical approach for array CGH data analysis. BMC Bioinforma. 2005;6:27.
15. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–572.
16. Myers CL, Dunham MJ, Kung SY, Troyanskaya OG. Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics. 2004;20:3533–3543.
17. Lingjærde OC, Baumbusch LO, Liestøl K, Glad IK, Børresen-Dale A. CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics. 2005;21:821–822.
18. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J. Multivariate Anal. 2004;90:132–153.
19. Hupé P, Stransky N, Thiery J, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–3422.
20. Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G. In Lecture Notes in Computer Science. Berlin: Springer; 2003. Chromosomal breakpoint detection in human cancer; pp. 54–65.
21. Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R. A method for calling gains and losses in array CGH data. Biostatistics. 2005;6:45–58.
22. Fiegler H, Redon R, Andrews D, Scott C, Andrews R, Carder C, Clark R, Dovey O, Ellis P, Feuk L, et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res. 2006;16:1566–1574.
23. Price TS, Regan R, Mott R, Hedman Å, Honey B, Daniels RJ, Smith L, Gerrnfield A, Tiganescu A, Buckle Vl, et al. SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res. 2005;33:3455–3464.
24. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–663.
25. Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet. 2001;29:263–264.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press