
# An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes

Hsing-Chung Lee,^{1,2} Qingdong Ling,^{1,3} Hsiao-Rong Chen,^{1,2} Yi-An Ko,^{1} Tsong-Shan Tsou,^{1} Sun-Chong Wang,^{1,2,4} Li-Ching Wu^{1} and H. C. Lee^{1,2,5,6,*}

^{1}Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan 32001,

^{2}Cathay Medical Research Institute,

^{3}Department of Surgery, Cathay General Hospital, Taipei, Taiwan 10630,

^{4}Graduate Institute of Statistics, National Central University,

^{5}Department of Physics, National Central University, Chungli, Taiwan 32001 and

^{6}National Center for Theoretical Sciences, Hsinchu, Taiwan 30043

## Abstract

Detection of copy number variation (CNV) in DNA has recently become an important method for understanding the pathogenesis of cancer. While existing algorithms for extracting CNV from microarray data have worked reasonably well, the trend towards ever larger sample sizes and higher resolution microarrays has vastly increased the challenges they face. Here, we present Segmentation Analysis of DNA (SAD), a clustering algorithm constructed with a strategy in which all operational decisions are based on simple and rigorous applications of statistical principles, measurement theory and precise mathematical relations. Compared with existing packages, SAD is simpler in formulation, more user-friendly, much faster and far less memory-hungry, offers higher accuracy and supplies quantitative statistics for its predictions. Uniquely among such algorithms, SAD's running time scales linearly with array size; on a typical modern notebook, it completes high-quality CNV analyses of a 250 thousand-probe array in ~1 s and a 1.8 million-probe array in ~8 s.

## INTRODUCTION

Amplification or deletion of chromosomal segments can lead to abnormal mRNA transcript levels and result in the malfunctioning of cellular processes. Locating such chromosomal aberrations in comparative genomic DNA samples, or copy number variation (CNV) (1–4), is an important step in understanding the pathogenesis of many diseases, especially cancer. Array comparative genomic hybridization (CGH) is a high-throughput technique developed for measuring such changes (5–7). CGH arrays using Bacterial Artificial Chromosome (BAC) clones have resolutions of the order of 1 Mb (6). Those using cDNA and oligonucleotides as probes (1,8) are less robust than BACs for large segments, but offer much higher resolutions (of the order of 50–100 kb). In particular, oligonucleotide arrays allow design flexibility and greater coverage and provide good sensitivity (8). Tiling on custom arrays is also available now for even finer resolution of specific regions and allows the detection of micro-amplifications and deletions (9,10). The drastic improvement in resolution has led to a corresponding increase in the number of probes on an array; modern high-resolution arrays now easily exceed one million probes. Such arrays exact a severe requirement on the speed and accuracy of the algorithms used to analyze them and have vastly reduced the usefulness of existing algorithms that are O(*N*^{2})—*N* being array size—in computation time or memory requirement. Here, we propose a novel algorithm, segmentation analysis of DNA (SAD), for studying CNV in high-resolution arrays.

For a probe, the log2-ratio of intensities from a pair of microarrays is termed a datum. Based on our observation that datum errors tend to be normally distributed, we designed SAD with three features, respectively involving the use of: (i) the Gaussian distribution function (Gaussian) as a probability density function (PDF) for evaluating the true value of a measured datum; (ii) a clustering procedure based on a technique we call pair-wise Gaussian merging (PGM); (iii) the *z*-statistic for making clustering decisions. Details are given in Methods. The operational principles of PGM are schematically illustrated in Figure 1. In the example shown, the original 10 datums are predicted by SAD to have an underlying structure of two segments. SAD has one essential parameter, the threshold *z*-value *z*_{0}, and an optional one, the sampling size *N*_{s}. *z*_{0} defines a significance level *p*_{0} for making clustering decisions and for calling CNVs. *N*_{s} is used for speeding up SAD.

Figure 1. Schematic illustration of PGM: frames display datums (solid grey squares) and clusters (black crosses with error bars), with the *x*-axis indicating relative probe position on the genome.

We show in the following sections that, compared with algorithms found in the literature, SAD has a simpler but more rigorous formulation, is easier to understand and simpler to use, provides clearer statistical interpretation of its results, requires less memory, offers better accuracy and is vastly faster.

## MATERIALS AND METHODS

### Normal distribution of error

Data not having any CNV are best for demonstrating normal distribution of error. For this reason we contrasted pairs of replicate arrays among each of the four triplicate array sets, NA15510_Nsp, NA15510_Sty, NA10851_Nsp and NA10851_Sty (henceforth the Redon data set), that were produced on the Affymetrix 500K EA platform in a CNV study (3). Because each set has three contrasted pairs, the sets give a total of 12 error distributions. In Figure 2, the error distributions, after standardization and normalization, are compared to standard normal distributions in terms of the Kolmogorov–Smirnov statistic (KS). The small KS values confirm that Gaussian is an excellent approximation to the error distributions.

We examined error properties in more detail using the Affymetrix 500K copy number sample data set (http://www.affymetrix.com). Figure 3a shows the log2-ratio profile of chromosome 2 from the (CRL-5868D, CRL-5957D) STY pair and our selection of two ~8000-datum sections of obviously distinct means. Figure 3b compares the log2-ratio distributions of the two sections with their respective Gaussian approximations, *G*(*y*;0.35,(0.22)^{2}) and *G*(*y*;−0.13,(0.23)^{2}), which have different means but similar variances. These two sections and an artificial 8000-datum section of randomly generated *G*(*y*;0,(0.22)^{2}) noise were used to study the sample-size and spatial dependence of error. Each section is partitioned into subsections of width 4^{i}, *i*=3 to 8, plus a discarded remainder. The error of each subsection is measured using Equation (4). Each section at each *i* thus has an error distribution whose mean and standard deviation are plotted in Figure 3c. The two sections are shown to have spatial as well as statistical properties similar to those of the artificial data. In particular, this implies that, for the array data, statistical errors (excluding breakpoints) are more or less uniformly distributed.

### Pair-wise Gaussian merging

Given a measured value ν, the conditional probability for its true value being *y* is *Pr*(*y*|ν)=*Pr*(*y*∩ν)/*Pr*(ν). Similarly, given a set of independently measured values Ω={ν_{i}|*i*=1,…,*w*}, we have *Pr*(*y*|Ω)=*Pr*(*y*∩Ω)/*Pr*(Ω) and, from the independence of events, *Pr*(*y*∩Ω)=∏_{i}*Pr*(*y*∩ν_{i}) and *Pr*(Ω)=∏_{i}*Pr*(ν_{i}). Therefore, *Pr*(*y*|Ω)=∏_{i}*Pr*(*y*|ν_{i}). In the case of continuous variables, the probability that the true value lies in the interval *y* to *y*+*dy* is *Pr*(*y*;*dy*|Ω)=*dy* *D*(*y*|Ω), with *D*(*y*|Ω)∝∏_{i}*D*(*y*|ν_{i}), where the *D*'s are PDFs. Given that errors are normally distributed with initial variance σ_{0}^{2}, we approximate *D*(*y*|ν_{i}) by a Gaussian *G*(*y*;ν_{i},σ_{0}^{2})=(2πσ_{0}^{2})^{−1/2}exp(−(*y*−ν_{i})^{2}/2σ_{0}^{2}). Repeatedly using the relation that a product of two Gaussians is another Gaussian, we have *D*(*y*|Ω)=*G*(*y*;μ,σ^{2}), where

μ = (1/*w*)∑_{i=1}^{*w*}ν_{i},  (1)

σ^{2} = σ_{0}^{2}/*w*.  (2)

We call this method of merging Gaussians to obtain a PDF from a set of measurements Gaussian merging (GM). The formulations of both μ and σ^{2} are intuitively understood: μ is the mean of the measured values and σ^{2} is inversely proportional to sample size, as expected.
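Under the stated equal-variance assumption, GM reduces to two lines. The function below is our own sketch (names are ours, not from the SAD package):

```python
def gaussian_merge(values, sigma0):
    """Gaussian merging (GM): combine w independent measurements,
    each with error SD sigma0, into the Gaussian PDF G(y; mu, var)
    for the underlying true value."""
    w = len(values)
    mu = sum(values) / w        # mean of the measured values
    var = sigma0 ** 2 / w       # variance shrinks as 1/w
    return mu, var
```

For example, merging the three log2-ratios 0.30, 0.40 and 0.35 with sigma0 = 0.2 gives mu = 0.35 and var = 0.04/3.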

To allow for the possibility that Ω comprises multiple subsets, each the manifestation of a different true value, before merging two Gaussians we conduct a two-sample *z*-test (for independent samples with equal variances) using a *z*-value, here called the resolvability,

*z*_{r}(*G*_{1},*G*_{2}) = (μ_{1}−μ_{2})/(σ_{1}^{2}+σ_{2}^{2})^{1/2},  (3)

where *G*_{k}=*G*(*y*;μ_{k},σ_{k}^{2}), *k*=1 and 2. That *z*_{r} follows a standard normal distribution is shown in the Supplementary Data. The corresponding *P*-value of *z*_{r} tests the null hypothesis that *G*_{1} and *G*_{2} have the same true value. Given a threshold resolvability *z*_{0}, we say *G*_{1} and *G*_{2} are resolvable if |*z*_{r}(*G*_{1},*G*_{2})|≥*z*_{0}, in which case the two Gaussians are kept separate, and are unresolvable, and merged, otherwise. The following four-step procedure, which we call PGM, partitions Ω into resolvable subsets: (i) Estimate the variance of each datum. (ii) Select *z*_{0}. (iii) Identify the unresolvable pair of Gaussians with the smallest |*z*_{r}| and use GM to merge the pair. (iv) Iterate step (iii) until all remaining pairs are resolvable. PGM is a type of agglomerative hierarchical clustering using *z*_{r} as the distance. In the present application, only spatially contiguous datums (except when separated by an outlier) are merged, and the partitioned subsets correspond to segments of different log2-ratios.
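The four-step PGM procedure can be sketched as follows. This is a minimal linear-mode illustration under the equal-variance assumption; loner and outlier handling are omitted, and all names are ours:

```python
import math

def z_r(c1, c2):
    """Resolvability between two clusters; a cluster is (mu, var, width)."""
    return (c1[0] - c2[0]) / math.sqrt(c1[1] + c2[1])

def pgm_linear(data, sigma0, z0):
    """Minimal PGM sketch: start from single-datum clusters and
    repeatedly GM-merge the contiguous pair with the smallest |z_r|
    until every remaining contiguous pair is resolvable (|z_r| >= z0)."""
    clusters = [(v, sigma0 ** 2, 1) for v in data]
    while len(clusters) > 1:
        zs = [abs(z_r(a, b)) for a, b in zip(clusters, clusters[1:])]
        i = min(range(len(zs)), key=zs.__getitem__)
        if zs[i] >= z0:
            break  # all contiguous pairs are resolvable
        (m1, v1, w1), (m2, v2, w2) = clusters[i], clusters[i + 1]
        w = w1 + w2
        # GM merge: weighted mean; merged variance is sigma0^2 / w
        clusters[i:i + 2] = [((w1 * m1 + w2 * m2) / w, sigma0 ** 2 / w, w)]
    return clusters
```

Running this on ten datums forming two flat blocks of five returns the two underlying segments, mirroring the two-segment prediction illustrated in Figure 1.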

### The SAD algorithm: clustering

SAD has two clustering modes: the linear mode (LM) for low-resolution arrays or when computation time is not a concern, and the parallel mode (PM) otherwise. LM has a single parameter *z*_{0}, while PM has an additional parameter *N*_{s}, whose default value of 100 is highly recommended. The steps in LM are: (i) Computation of σ_{0}. Let {ν_{i}|*i*=1,…,*N*} be the initial data of log2-ratios, *q*_{i}=ν_{i+1}−ν_{i}, and *SD*_{q} be the standard deviation of the *q*_{i}'s; then

σ_{0} = *SD*_{q}/√2  (4)

measures datum error and is sensitive only to the existence of breakpoints, which are assumed to be sparse. Treat each datum as a single-datum cluster and assign σ_{0} to the *i*-th datum-cluster. (ii) Selection of *z*_{0}. This stipulates when the PGM iteration stops and addresses the statistical issues discussed in the following subsection. (iii) PGM Phase I. Apply chromosome-wide PGM iteratively to all contiguous cluster pairs. At the end of this phase each remaining single-datum cluster is a ‘loner’ whose existence prevents the merging of its two neighbouring clusters even if they are unresolvable. (iv) PGM Phase II. Along with contiguous pairs, continue step (iii) to merge loner-divided pairs. After a loner-divided pair is merged, the dividing loner becomes an ‘outlier’ and is excluded from subsequent calculation. At the end of this stage each of the resultant clusters is a ‘segment’ with an associated Gaussian *G*(*y*;μ,σ^{2}) serving as a PDF for its true value. (v) Normalization. Perform genome-wide PGM on the entire set of segments to merge contiguous as well as unconnected segment pairs. Identify the largest resultant cluster and denote its mean by μ_{b}, here called the ‘baseline’. The baseline is taken as the reference for the CNV significance test.

As PGM involves very little computation, LM is inherently a fast algorithm. On the other hand, owing to the iterative procedure, the problem size is O(*N*^{2}), implying long computation times when *N* is large. PM reduces the problem size to O(*N*) with little sacrifice in accuracy. In that case, a sampling size *N*_{s} is selected (by the user) and the various steps in LM are adjusted as follows. In (v), μ_{b} is computed using only the widest *N*_{s} segments. This reduces the problem size from O(*N*_{seg}^{2}), where *N*_{seg} is the number of resultant segments, to O(*N*_{s}^{2}). In (iii) and (iv), prior to merging, the entire current cluster set is partitioned into subsets of *N*_{s} contiguous clusters, plus a remainder. The subsets are processed in parallel and the most unresolvable pair in each subset, if there is any, is merged. Thereafter the subsets of clusters (some of which have been reduced in size through merging) are joined with the remainder circularly, with the beginning of the remainder taken as the starting point, and readied for a new round of partitioning and merging. This is a dynamical procedure resulting in a different partition in each iteration. The problem size for each of the *N*/*N*_{s} subsets is O(*N*_{s}^{2}), making the total problem size O(*NN*_{s}).
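A single PM round might be sketched as follows. This is our own minimal illustration, not the SAD implementation; the dynamical circular re-joining between rounds is only hinted at by the `start` offset:

```python
import math

def pm_round(clusters, z0, ns=100, start=0):
    """One parallel-mode round (sketch): rotate the cluster list by
    `start`, partition it into contiguous subsets of ns clusters plus
    a remainder, and GM-merge the most unresolvable contiguous pair
    (if any) inside each subset.  A cluster is (mu, var, width)."""
    rotated = clusters[start:] + clusters[:start]
    out = []
    for j in range(0, len(rotated), ns):
        sub = list(rotated[j:j + ns])
        if len(sub) > 1:
            zs = [abs(a[0] - b[0]) / math.sqrt(a[1] + b[1])
                  for a, b in zip(sub, sub[1:])]
            i = min(range(len(zs)), key=zs.__getitem__)
            if zs[i] < z0:  # unresolvable: merge the pair
                (m1, v1, w1), (m2, v2, w2) = sub[i], sub[i + 1]
                w = w1 + w2
                # GM merge; for var = sigma0^2/w_k this gives sigma0^2/w
                sub[i:i + 2] = [((w1 * m1 + w2 * m2) / w,
                                 (v1 * w1 ** 2 + v2 * w2 ** 2) / w ** 2, w)]
        out.extend(sub)
    return out
```

Because each subset does at most one merge per round independently of the others, a round costs O(*N*_{s}) per subset across O(*N*/*N*_{s}) subsets, which is the source of the O(*NN*_{s}) total stated above.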

### The SAD algorithm: CNV calling and selection of *z*_{0}

After clustering, consider two contiguous segments: a narrow segment *s*_{1} with *G*_{1}=*G*(*y*;μ_{1},σ_{1}^{2}) and a much wider non-CNV segment *s*_{2} with *G*_{2}=*G*(*y*;μ_{2},σ_{2}^{2}). Let *H*_{a} be the null hypothesis that *s*_{1} is non-CNV (i.e. that the true value of *s*_{1} is μ_{2}). A one-sample *z*-test using a *z*-value, here called the ‘aberrance’,

*z*_{a}(*G*_{1}) = (μ_{1}−μ_{2})/σ_{1},  (5)

yields a *P*-value for testing *H*_{a}, as is expected from the central limit theorem. From Equations (3) and (5), because *w*_{2}≫*w*_{1} (so that σ_{2}≪σ_{1}), we have

|*z*_{a}(*G*_{1})| ≈ |*z*_{r}(*G*_{1},*G*_{2})|.  (6)

The lower bound for |*z*_{r}(*G*_{1},*G*_{2})|, *z*_{0}, is therefore also the approximate lower bound for |*z*_{a}(*G*_{1})|. We therefore employ *p*_{0}, the corresponding *P*-value of *z*_{0}, as the significance level for testing *H*_{a}. We call *s*_{1} a CNV if |*z*_{a}(*G*_{1})|≥*z*_{0}. More specifically, we call the segment a ‘gain’ if *z*_{a}(*G*_{1})≥*z*_{0}, or a ‘loss’ if *z*_{a}(*G*_{1})≤−*z*_{0}.

Because (μ_{1}−μ_{2})/σ_{0} is just the signal-to-noise ratio (SNR) of *s*_{1}, and σ_{1}=σ_{0}/√*w*_{1}, Equation (6) leads to

|*z*_{a}(*G*_{1})| ≈ SNR·√*w*_{1} ≥ *z*_{0}, or *w*_{1} ≥ (*z*_{0}/SNR)^{2}.  (7)

That is, if the SNR is known, *z*_{0} also sets an approximate lower bound on CNV width.
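The width bound just described is easy to apply in both directions; the helpers below are our own illustration, not part of the SAD package:

```python
import math

def min_cnv_width(z0, snr):
    """Approximate lower bound on detectable CNV width, in datums,
    implied by threshold z0 at a given signal-to-noise ratio."""
    return (z0 / snr) ** 2

def z0_for_width(w, snr):
    """Threshold z0 at which a CNV of width w datums becomes callable."""
    return snr * math.sqrt(w)
```

For instance, at the trisomy SNR of roughly 0.58/0.09 used in the low-resolution validation below, `z0_for_width` gives about 6.4 for one datum and 9.1 for two, bracketing the choice *z*_{0}=8 made there.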

### Software availability

The SAD program is available for download at http://www.sybbi.ncu.edu.tw/software.htm or upon request by email at pairwise.gaussian.merging@gmail.com.

## RESULTS

In Lai *et al.* (11) (hereafter referred to as LJKP), the performances of 11 CNV algorithms—3 smoothing-only (SO) algorithms: lowess, wavelet (12) and quantreg (13); and 8 estimation-performing (EP) algorithms: CGHseg (14), CBS (15), ChARM (16), ACE (17), HMM (18), GLAD (19), GA (20) and CLAC (21)—were compared using simulated data for testing receiver operating characteristic (ROC) performance as well as real Glioblastoma Multiforme (GBM) data. LJKP found that the overall top three EP performers were CGHseg, CBS and GLAD. In Fiegler *et al.* (22) two more recently developed EP algorithms, CNVfinder (22) and SW-ARRAY (23), were compared in accuracy using real data. Among these algorithms only CLAC and ACE provide quantitative statistics.

We tested SAD against the 10 EP algorithms on ROC performance. The SO algorithms were excluded because they do not explicitly address breakpoints. The algorithms rated accurate, CGHseg, CBS and GLAD, were further compared with SAD in speed and memory. In addition, we validated SAD on low- and high-resolution data sets. We designate a SAD run in LM by SAD(*z*_{0},−) and in PM by SAD(*z*_{0},*N*_{s}).

### Accuracy

We calculated (details in Supplementary Data) the ROC curves of SAD the same way as in LJKP except that for better statistics we generated 10000 instead of 100 simulated chromosomes (of 100 datums each) for each parameter set in each setting. The results (Supplementary Figure S1) indicate that a higher *z*_{0} is more suitable for easy settings (wide CNV and large SNR) while a lower *z*_{0} better facilitates CNV detection in difficult settings (narrow CNV or small SNR). Table 1 compares SAD(*z*_{0},100), *z*_{0}=1.5, 2.0 and 4.0, in area-under-curve (AUC) value with the 10 EP algorithms for two easy settings, (SNR,width)=(4,20) and (3,40), and two difficult settings, (2,5) and (1,10). Numbers for the eight LJKP-tested algorithms were read from Figure 2 in LJKP. Numbers for SW-ARRAY and CNVfinder were calculated using their reportedly optimal parameter values. In the easy settings, SAD(1.5–4.0,100), CBS, CGHseg, GA, GLAD and HMM perform well. In the difficult settings, SAD(1.5–2.0,100) is the best performer and CGHseg is next. Although CNVfinder performs above average in the difficult settings, it is below average in the easy settings.

In PM, higher computation speed is facilitated by using a smaller *N*_{s}. Because PM alters the clustering order relative to that in LM, this can induce error when *N*_{s} is too small. We tested SAD in this regard and found that overall error is negligible when *N*_{s}≥100 (Supplementary Figure S2).

### Speed and memory

All calculations reported here were carried out on a computer with an Intel Core 2 Duo T7500 2.2 GHz (L2: 4 MB) CPU and 2 GB of DDRII memory, running Windows XP. All programs ran as a single thread and used 50% of the CPU. Our SAD program is written in Visual C++. The other algorithms were tested with the provided programs at default parameter values. The simulated chromosomes were generated with SNR=2. Each simulated chromosome had either one or two gains. For planting the gains, each chromosome was divided into five same-width sections; the second section was amplified in one-gain cases, and the second and fourth sections were amplified in two-gain cases. Computation time τ was measured for each case; the difference in τ between one and two gains reflects the dependence of speed on genomic profiles. Memory usage was read from the Processes tab of the Windows Task Manager and involves two steps: data loading and data processing. The reading between the two steps, denoted by κ_{d}, is the memory used for program and data. The maximum reading during data processing was recorded as κ_{o}, and the difference κ_{p}=κ_{o}−κ_{d} was taken to be the maximum memory needed for data processing. The power-law exponents γ_{τ} and γ_{κ} were derived from the *N* dependences of τ and κ_{p}, respectively.

We compared SAD(10,100) to CGHseg, CBS and GLAD and show the results in Figure 4. We see that: (i) SAD is vastly faster than the others; at *N*≈10^{6} it is already two orders of magnitude faster than CBS, its closest competitor. (ii) In computation time SAD is O(*N*), while GLAD and CGHseg are O(*N*^{2}). CBS, claimed to be O(*N*) at low resolution (24), becomes O(*N*^{2}) at *N*≈5×10^{5}. (iii) Speed dependence on genomic profile, reflected by the difference between the 1-gain results and the 2-gain results, is significant for CBS, minor for GLAD and CGHseg, and negligible for SAD. (iv) SAD requires the least amount of memory, overall (κ_{o}) as well as for data processing (κ_{p}). (v) In memory requirement SAD and GLAD scale as O(*N*), CBS displays irregularity, and CGHseg scales as O(*N*^{2}). On a computer with 2 GB of memory, CGHseg ceases to function when *N* exceeds about 16000. For this reason CGHseg is not considered for further comparison.

Figure 4. (**a**) Computation time τ versus *N*. (**b**) Power-law exponent γ_{τ} for τ, derived from (a). (**c**) Overall memory κ_{o} versus *N*. (**d**) Data-processing memory κ_{p} versus *N*.

Using real data, we ran SAD(10,100) on a 1.8-million-probeset Affymetrix Genome-Wide Human SNP Array 6.0 hybridized with a colorectal cancer sample, and measured τ=8 s and κ_{o}=323 MB.

### Validation on a low-resolution data set

We used a 2276-BAC public data set from the NIGMS Human Genetics Cell Repository (25) (henceforth the Snijders dataset) to perform low-resolution validation of SAD and to demonstrate the utility of *z*_{0} for limiting CNV width. The dataset corresponds to 15 human cell strains. As identified by spectral karyotyping, each cell strain has either one or two CNVs, and eight of the CNVs on six strains are whole-chromosome. We set a value of *z*_{0} using Equation (7). For trisomic segments, the data set has SNR≈0.58/0.09, where 0.58≈log_{2}(3/2) is approximately the log2-ratio of a trisomic segment and 0.09 is the value of σ_{0} obtained from Equation (4). To detect a minimum CNV width of between one datum (because one-datum CNVs are likely to be outliers) and two, 6.4<*z*_{0}<9.1 is required. We therefore used SAD(8,100) for this calculation.

Because the data set had previously been examined by GLAD (19) and CBS (15), we compared the three sets of results in full detail in Supplementary Table S1, and summarize the comparison as follows. (1) SAD(8,100) detects more CNVs than GLAD and CBS do. (2) SAD(8,100) gives far fewer false positives; the average numbers of false-positive breakpoints per cell strain are 2/15, 46/15, 26/15, 37/9 and 16/9 for SAD(8,100), GLAD(λ′=8), GLAD(λ′=10), CBS(α=0.01) and CBS(α=0.001), respectively. (3) SAD alone assigns a *z*-value to each CNV for assessing significance. (4) SAD(8,100) alone detects whole-chromosome CNVs, on whose detection GLAD and CBS are silent because they are based on breakpoint detection within chromosomes.

### Validation on a high-resolution dataset

In Redon *et al.* (3), 43 genomic regions were examined by SYBR real-time PCR or MassSpec to validate the respective CNV calls for NA15510 vs NA10851 on the Affymetrix 500K EA platform. We used three of these regions, cnp8, cnp23 and cnp36, respectively determined in (3) to be gain, loss and gain, to validate SAD and to demonstrate the utility of *z*_{0} for characterizing CNV significance. In Figure 5, the results of three runs, SAD(10,100), SAD(8,100) and SAD(6,100), on the first Sty replicates of the Redon dataset are respectively shown in frame sets (a), (b) and (c). At *z*_{0}=10 (Figure 5a) only cnp36 is detected with *z*_{a}=10.6. When *z*_{0} is lowered to 8 (Figure 5b), cnp23 is detected with *z*_{a}=−8.2. When *z*_{0} is further lowered to 6 (Figure 5c), cnp8 is detected with *z*_{a}=7.4.

## DISCUSSION

We have demonstrated that, by virtue of its accuracy, parsimony in memory use and speed, SAD can meet the challenges of analyzing modern high-resolution microarrays significantly better than existing algorithms. Algorithmically, SAD is easy to understand because it employs fundamental principles of statistics and precise but very simple mathematics [as compared with the mathematics in the formulation of, say, GLAD (19)]. SAD makes all internal decisions based on statistics and provides an external quantitative statistic. With only two user-tunable parameters, *z*_{0} and *N*_{s}, the meanings of which are both intuitively accessible, SAD is also the easiest to use. Users can select *z*_{0}, the primary parameter, based on their requirements for CNV significance or CNV width. We recommend setting the second parameter, *N*_{s}, to 100. This guarantees good accuracy and a computation time that is O(*N*).

Quantitative statistics provide the basis on which a level of confidence may be assigned to each inference and on which a priority for experimental confirmation of such inferences may be set. All measurements, especially those involving microarrays, carry inherent statistical error. SAD quantifies such errors as data uncertainty, tracks the latter throughout the clustering process using exact mathematical relations, and provides *z*-values for assessing CNV significance. The *z*-values, when used in downstream calculations such as the identification of recurrent aberrations across multiple arrays, allow the initial uncertainty to be propagated further.

SAD is an application built on PGM. The reduction of SAD's computation time from O(*N*^{2}) to O(*N*) is a consequence of the parallel processing made possible by the use of agglomerative hierarchical clustering in PGM. The superior accuracy of SAD results from PGM's exploitation of a trait common to most measurement systems: that errors are normally distributed. The operating principle of SAD is accessible to the user because in PGM the resolving power used for determining breakpoints is controlled via an intuitive statistical threshold. These properties promise PGM wide applicability, beyond CNV, in the general analysis of microarray data.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## FUNDING

National Science Council (ROC) (Grant 97-2112-M-008-013; in part); Cathay General Hospital-NCU Collaboration (Grant 97-CGH-NCU-A1; in part). Funding for open access charge: National Science Council and the Ministry of Education (research grants to H.C.L.).

*Conflict of interest statement*. None declared.

## REFERENCES

*Nucleic Acids Research*. 2011 Jul; 39(13):e89. Oxford University Press.
