# Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm

^{1,}

^{2,}

^{3,}

^{*}Matteo Benelli,

^{1,}

^{2,}

^{4}Seungtai Yoon,

^{5}Franco Roviello,

^{6,}

^{7}and Francesca Torricelli

^{1,}

^{2}

^{1}Laboratory Department, Diagnostic Genetic Unit, Careggi Hospital, Florence 5014,

^{2}Center for the Study of Complex Dynamics (CSDC), University of Florence, Florence 50019,

^{3}Department of Medical and Surgical Critical Care, University of Florence, Florence 50141,

^{4}I.N.F.N, Sezione di Firenze, Florence 50100, Italy,

^{5}Seaver Autism Center and Department of Psychiatry, Mount Sinai School of Medicine, New York 10029, USA,

^{6}Surgical Oncology and Department of Human Pathology and Oncology, University of Siena, Siena and

^{7}Translational Research Laboratory Istituto Toscano Tumori, Siena, Italy

## Abstract

The discovery of genomic structural variants (SVs), such as copy number variants (CNVs), is essential to understand genetic variation of human populations and complex diseases. Over recent years, the advent of new high-throughput sequencing (HTS) platforms has opened many opportunities for SVs discovery, and a very promising approach consists in measuring the depth of coverage (DOC) of reads aligned to the human reference genome. At present, few computational methods have been developed for the analysis of DOC data and all of these methods allow to analyse only one sample at time. For these reasons, we developed a novel algorithm (JointSLM) that allows to detect common CNVs among individuals by analysing DOC data from multiple samples simultaneously. We test JointSLM performance on synthetic and real data and we show its unprecedented resolution that enables the detection of recurrent CNV regions as small as 500 bp in size. When we apply JointSLM to analyse chromosome one of eight genomes with different ancestry, we identify 3000 regions with recurrent CNVs of different frequency and size: hierarchical clustering on these regions segregates the eight individuals in two groups that reflect their ancestry, demonstrating the potential utility of JointSLM for population genetics studies.

## INTRODUCTION

The discovery of structural variants (SVs), including copy number variants (CNVs) and balanced rearrangements such as inversions and translocations, is deeply changing our understanding of the human genotype (2,1). Recently, multiple studies have discovered an abundance of structural variations of DNA segments that range from kilobases (kb) to megabases (Mb) in size (3). SVs have been found among normal individuals (4–6) while others participate in causing various disease states (7). For instance, genetic variants associated with cancer often result from rearrangements and alterations in proto-oncogenes or tumour suppressor genes (8–10), and Alzheimer and Parkinson's diseases have been associated with changes in gene dosage resulting from alterations in copy number (11,12).

In the last decade SVs detection has been performed with microarray technologies. The high-density CGH arrays (aCGH) and SNP genotyping arrays afford a level of resolution that allows CNV boundaries to be called with relatively high precision at a genome-wide level. However, although microarray platforms have been successfully used to identify CNVs, their resolution is limited by either the density of the array itself (for aCGH) or by the density of known SNP loci (for SNP arrays). For instance, currently available array platforms that consist of more than 1 million probes have a lower limit of detection of ∼5 to 10 kb (6,13).

Over recent years, the advent of new high-throughput sequencing (HTS) platforms, such as Illumina's Genome Analyzer and ABI's SOLiD, have opened many opportunities for SV discovery and has enabled initiatives such as the 1000 Genomes project (http://www.1000genomes.org) that aims to sequence the genomes of more than 1000 individuals to extend our knowledge on human genetic variation. HTS technologies are able to sequence a full human genome per week generating milions of short nucleotide sequences in a single instrument run.

The first HTS-based approach to detect SVs were based on paired-end read mapping (PEM), which identifies insertions and deletions by comparing the distance between mapped read pairs to the average insert size of the genomic library. Although this method is able to identify deletions smaller than 1 kb with high sensitivity, it does not allow the discovery of insertions larger than average insert size of the library and the exact borders of SVs in complex genomic regions rich in segmental duplication (14,15).

In this scenario, a very promising approach for the identification of SVs using HTS technologies consists in measuring the depth of coverage (DOC) of reads aligned to the human reference genome (15). At present, few computational methods have been developed for the analysis of DOC data: Campbell *et al.* (16) use the Circular Binary Segmentation algorithm (17) originally developed for genomic hybridization microarray data, Chiang *et al.* (18) use a local change-point analysis technique, while Yoon *et al.* (19) developed a new method based on the significance testing that works on intervals of data points. Although these algorithms are very sensitive and specific in discovering SVs from DOC data, they allow to analyse only one sample at time. The simultaneous analysis of multiple samples can improve statistical strength in the identification of signals shared by the data, increasing the resolution of SVs detection. Moreover, the identification of signals shared by multiple samples can lead to the detection of regions of interest since disease-critical genes are more likely to be found in regions that are common or recurrent among samples.

For these reasons, we have developed a novel algorithm, named JointSLM, that allows to analyse DOC signals from multiple samples simultaneously for the identification of common DNA events (recurrent CNVs) across individuals. By means of simulated data, we show that our algorithm is able to sensitively and accurately detect common structural variants as small as 500 bp in size. The comparison with other three state of the art methods show that our joint model allows one to obtain an unprecedented resolution in the detection of recurrent CNVs. We applied JointSLM to the DOC data of eight genomes and we demonstrate its unique advantage in population-based studies.

## MATERIALS AND METHODS

### DOC

HTS technologies, such as Illumina's Genome Analyzer and ABI's SOLiD, are able to generate milions of short nucleotide sequences in a single instrument run. Assuming the sequencing process is uniform, the number of reads mapping to a region follows a Poisson distribution and is expected to be proportional to the number of times the region appears in the DNA sample: a genomic region that has been deleted (duplicated) will have less (more) reads mapping to it than a region not deleted (duplicated).

Following this assumption, the copy number of any genomic region can be estimated by counting the number of aligned reads to the reference genome. Campbell *et al.* (16) and Chiang *et al.* (18) were the firsts to use this approach to detect copy-number alterations between tumour and healthy samples of the same individuals, while more recently Yoon *et al.* (19) proposed to use the read count in sequence data to look for genomic regions that differ in copy number between normal individuals of the 1000 genomes project.

The strategy to obtain DOC data consists in counting the number of mapped reads in non-overlapping windows of fixed length and then correcting each window by GC content: Campbell *et al.* (16) and Chiang *et al.* (18) used the logarithm of the ratio between the number of aligned reads from a tumour sample and the match normal sample, while Yoon *et al.* (19) used the number of aligned reads every 100-bp, corrected for GC content and median normalized to copy number 2 [median normalization is defined as 2 × (read count)/(mean read count over the genome) for each sample]. The DOC data obtained with this approach is mathematically very similar to the signal obtained from aCGH log_{2} ratios. Deletions or duplications are identified as a decrease or increase in coverage across multiple consecutive windows. Moreover, like aCGH log_{2}-ratios, DOC sequences have noise caused by mapping errors and random fluctuations in genome coverage. For these reasons, the events in DOC can be detected using the same types of segmentation algorithms that are used for aCGH data.

In a recent paper (20), we developed a fast and powerful algorithm to segment aCGH data in which the log_{2} ratios were modelled as a sum of two independent stochastic processes by means of Shifting Level Model (SLM). Due to their similarity with aCGH genomic profiles, also DOC genomic profiles can be considered to be generated by the sum of two processes: a biological process due to a real variation of the number of DNA copies and a white noise process that mimics experimental error.

Here, we introduce a novel method that extend the SLM algorithm from the classical univariate form to a multivariate form to segment multiple DOC signals simultaneously for the identification of common alterations. For each sample studied in this work, we take into consideration the logarithm in base 2 of the median-normalized DOC data obtained as in Yoon *et al.* (19): DOC was measured by counting the number of mapped reads in 100-bp windows, correcting for GC content and then median-normalizing to copy number 2. The 100-bp windows were chosen because at 30x coverage, the distribution of read counts is well approximated by a normal distribution, thus permitting us to assume normality in our mathematical modelling. DOC data from multiple samples were modelled as a sequential processes made of *N* observation each.

### The multivariate form of the SLM

We consider *M* sequential processes (samples) with *N* observations each (100-bp windows) and we denote with *t* (*t* = 1, … , *M*) and *i* (*i* = 1, … , *N*) the respective indexes. We model the sequential process *x* = *x*_{1}, … , *x*_{i}, … , *x*_{N}, where *x*_{i} = (*x*_{i1}, … , *x*_{iM})′, as the sum of two independent stochastic processes:

where *m*_{i} = (*m*_{i1}, … , *m*_{iM}) is the vector of the unobserved mean level and ϵ_{i} is the vector of white noises. The white noise vector ϵ_{i} follows a multivariate normal distribution with mean μ_{ϵ} = [0] and covariance matrix Σ_{ϵ}; *z*_{i} are random variables taking the values in [0, 1] with probabilities η = *Pr*(*z*_{i} = 1), 1 − η = *Pr*(*z*_{i} = 0); δ_{i} are random vectors that follow a multivariate normal distribution and μ_{i} is the vector of the means.

The process *m*_{i} is controlled by the process *z*_{i} : when *z*_{i−1} = 0, *m*_{i} is the same as *m*_{i−1} and when *z*_{i−1} = 1, *m*_{i} takes its new value according to a multivariate Gaussian law with mean μ and covariance matrix Σ_{ϵ} independently of *m*_{i−1}:

Combining the definitions given above, the joint distribution of the observations and latent variables, given the parameters, has the following form:

where Θ = (μ, Σ_{μ}, Σ_{ϵ}, η).

Since the DOC data are modelled as the sum of two independent stochastic processes, the expected value of *x*_{i} is equal to μ and its covariance matrix is given by the sum of the covariances of the two processes:

Using (7) we can introduce a different parametrization of the SLM by defining the parameter ω such that Σ_{μ} = ω · Σ and Σ_{ϵ} = (1–ω) · Σ.

When we deal with multiple DOC signals (profiles) simultaneously, we have to take into account some fundamental considerations: (i) each DOC profile is characterized by its technical noise caused by mapping errors and random fluctuations in genome coverage and (ii) in each DOC profile, a CNV can be present at variable copy number in comparison with a reference genome. For these reasons, the white noise distributions and mean level distributions can be considered independent across samples and we can write:

where σ_{ϵ,t} and σ_{μ,t} are the standard deviations of the normal distribution of the *t*-th sequential process for the white noise and the mean level distributions, respectively.

With these assumptions, from a mathematical point of view, the random process *z*_{i} is the only variable that correlates the samples. When *z*_{i} changes its value, the mean level of each sample have a shift. In this way, our joint model is able to detect common shift in the mean of multiple samples.

### The JointSLM algorithm

The joint distribution of Equation (5) defines an Hidden Markov Model (HMM) of order one with state variable *q*_{i} = (*m*_{i}, *z*_{i}) and multivariate emission probability (see Supplementary Data). The fact that the multivariate SLM is an HMM allows us to make use of the several algorithms developed for these kinds of models.

To estimate the parameters of the Multivariate SLM, we develop a two step algorithm that follows the idea of (20), based on dynamic programming. The inputs to the algorithm are the sequences *x* = {*x*_{1}, … , *x*_{M}} to be jointly segmented, the initial estimate of the number of states *K*_{0} and the parameters ω and η. In the first step, we estimate the parameters *m*_{i} by means of the Baum and Welch re-estimation strategy (21), while in the second step we estimate the best state sequence *s* (the *z*_{i} variables, i.e. the points of mean shift) by means of the Viterbi algorithm. Finally, we convert the data from log space to copy number space and we calculate the median of the data that belong to each segment. A detailed description of the algorithm and the study of the effect of the parameters *K*_{0}, ω and η on the performance of JointSLM are reported in Supplementary Data.

## RESULTS

### Synthetic data analysis

To assess the performance of JointSLM algorithm in identifying common DNA events of different size, we made an intensive simulation based on synthetic data generated from the GC-adjusted DOC data of chromosomes 1 and X of the male individual NA18507. To estimate specificity, we generated synthetic chromosomes by sampling 10 000 100-bp windows from chromosome 1 to simulate normal copy number. To estimate sensitivity, we added to the normal copy number chromosomes nine deletions of size 200 bp, 300 bp, 400 bp, 500 bp, 700 bp, 1 kb, 2.5 kb, 5 kb and 10 kb sampled from chromosome X. To minimize the sampling of random false positives we removed all gaps, segmental duplications, telomeres/centromeres and regions with known CNVs from the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/) and the Genome Structural Variation Consortium (GSV, http://www.sanger.ac.uk/humgen/cnv/42mio/).

We applied JointSLM with different parameter settings on data sets made of 10, 30 and 50 normal copy number chromosomes, and we evaluated false positive rate (FPR) by counting the number of detected alterations. The specificity of the algorithm can be controlled by the parameters η and ω (see Supplementary Data): the higher are the values of η and ω and the larger is the number of detected false positive (FP) events. For instance, in the 10 samples analysis (Figure 1b), when we set η = 10^{−3} and ω = 0.3 we detect an average of 9.4 FP events that range between 100 and 500 bp in size (6.0 events of 100 bp, 2.6 of 200 bp, 0.62 of 300 bp, 0.12 of 400 bp and 0.08 of 500 bp), while using a more conservative set of parameters (η = 10^{−6} and ω = 0.1), we detect an average of 0.91 FP events (0.66 of 100 bp, 0.22 of 200 bp, 0.02 of 300 bp and 0.01 of 400 bp). In the analysis of the 30 samples data set (Supplementary Figure), we detected an average of 21.6 FP events (FPR = 0.2%) with ω = 0.3 and η = 10^{−3} and an average of 9.6 (0.1%) events with ω = 0.1 and η = 10^{−6}, while for the 50 sample analysis the FPR grows to 0.3% with ω = 0.3 and η = 10^{−3} and to 0.2% with ω = 0.1 and η = 10^{−6}. These results show that the use of the most conservative set of parameters allows us to obtain a global FPR smaller than 0.01%.

**a**) Each curve represents the TPR estimate

**...**

Moreover, the great majority of the FP detected by our method in the 30 and 50 samples data set is made of 100 bp events, and in all the cases JointSLM does not identify FP events larger than 400 bp in size.

To quantify the detection power of our algorithm, we applied JointSLM with different parameter settings on simulated data sets made of 10, 30 and 50 synthetic chromosomes with common deletions inserted in a fraction of samples *f* (with *f* that ranges between 0.1 to 1) and we estimated TPR as the fraction of correctly detected alterations. The results of these simulations (Figure 1a and Supplementary Figures S1 and S2) show that the resolution of the algorithm (i.e. the ability of identifying regions of different size) does not depend on the number of samples analysed simultaneously but is strongly dependent on the fraction of altered samples *f*.

When *f* is small (i.e. only 10 or 20% of the samples are altered), JointSLM is able to correctly locate only regions larger than 1 kb in size, while for higher values of *f* (larger than 50%) the resolution of the algorithm drastically increases, allowing the identification of very small alterations (smaller than 1 kb). By setting ω = 0.1 and η = 10^{−6} we were able to correctly detect regions greater than 1 kb and shared in more than 20% of the samples, while when we set the parameters to less conservative values (ω to 0.2/0.3 and η set to 10^{−4} / 10^{−3}), we observed a dramatic improvement in detecting small alterations: in these cases, JointSLM is able to correctly detect common alterations smaller than 500 bp and shared among the 20% of the samples.

In order to evaluate the ability of our algorithm in correctly detecting the boundaries of common DNA events (breakpoints problem), we generated synthetic chromosomes in which common deletions are not perfectly aligned but randomly shifted of *n* 100 bp windows (with *n* that ranges from 1 to 5). The resolution of the algorithm is not affected by these perturbations: also in this case we were able to detect genomic events larger than 1 kb in size with ω = 0.1 and η = 10^{−6} and smaller than 1 kb with ω=0.3 and η = 10^{−3} (see Supplementary Figures S3–S5).

The extensive simulation study we performed show that the parameter ω allows to control both sensitivity and specificity, while η is able to control only specificity and has weak effect on sensitivity: these results suggest to use very conservative values of η (10^{−5} / 10^{−6}) in order to contain FP detection and tuning ω to obtain the desired level of sensitivity.

### Comparison with other algorithms

To demonstrate the advantages of analysing multiple samples at once by means of our joint model instead of using single sample models, we compared the performance of JointSLM with other three algorithms: the CBS (17) and EWT (19) methods that have been already used in the analyses of DOC data and the GLAD method (22) previously used for the analysis of array CGH data. To this end, we applied the three methods with default parameter settings to the synthetic data sets of the previous paragraph and we calculated the TPR as the fraction of correctly detected alterations and the FPR as the average number of FP detected in each chromosome. To call gain and losses with CBS algorithm, we used the same thresholds used for the JointSLM algorithm (see Supplementary Data). The results of these analyses and the comparison with JointSLM performance are detailed in Figure 2.

**...**

All algorithms perform well in terms of specificity: they detect a very small number of FB events and all the FP identified are smaller than 500 bp in size. In terms of sensitivity, our joint model outperforms the other single sample algorithms: JointSLM is able to detect very small alterations (200 bp) with a TPR larger than 0.8 showing an unprecedented sensitivity in detecting CNVs, while the other methods allow to detect only events larger than 500 bp. A more detailed study of Figure 2 shows that the number of FP events detected by JointSLM decreases when the number of samples analysed at once increases. This is probably the most interesting feature of our algorithm: analysing a large number of samples improves specificity without affecting sensitivity.

These results clearly suggests that the simultaneous analysis of multiple samples with our joint model improves the statistical strength in the identification of small CNVs and the use of JointSLM algorithm allows to extend the detection power of CNVs.

### Real data analysis

In order to identify common CNVs among multiple individuals, we applied JointSLM to the DOC data of eight genomes. These included a CEU trio of European ancestry (NA12878, NA12891 and NA12892), a YRI trio of Yoruba Nigerian ethnicity (NA19238, NA19239 and NA19240) that belong to 1000 Genomes project and two additional published genomes, including a Yoruban individual NA18507 (23) and a Chinese individual YH (24).

To minimize type-I error and obtain a very robust set of CNVs, we ran JointSLM using a conservative set of parameters (*K*_{0} = 20, η = 10^{−6} ω = 0.1), and we identified a total of 3000 CNV regions in chromosome 1 (for some examples of the JointSLM segmentation see Supplementary Figures S6–S9): 820 (27%) are smaller than 500 bp, 1131 (38%) ranges between 500 and 1000 bp, 760 (25%) ranges between 1 and 5 kb and 289 (10%) are larger than 5 kb (Table 1). All the CNVs detected in this analysis are listed in Supplementary Table S1.

Of all these regions, 958 (32%) are shared by only one sample, 457 (15%) by two samples, 330 (11%) by three samples, 233 (8%) by four samples, 165 (5%) by five samples, 155 (5%) by six samples, 168 (6%) by seven samples and 534 (18%) are present in all of the eight samples. According to Nguyen *et al.* (25), we found that the CNV regions identified by our algorithm are significantly overrepresented close to telomeres and centromeres (Supplementary Figure S10). Additionally, 799 of the 2180 RefSeq genes of chromosome 1 are contained or overlap with our set of called regions.

In order to validate the genomic events detected by our algorithm, we compared our calls with the known CNVs in DGV version 10 and each call was considered validated if there is any overlap of 1 bp or greater. The global validation rate is about 58%, and around 43% (3255/7666) of known CNVs were found in our call set. When we consider called regions that ranges between 1 and 5 kb the validation rate is around 70–80%, and goes up to 95–100% for genomic events greater than 5 kb. On the other hands, when we take into consideration CNVs smaller than 1 kb, the validation rate ranges between 40% and 60% (see Supplementary Data for more details).

As a further test, we compared our set of calls with a set of common CNVs recently assessed by GSV Consortium using high resolution array-CGH platforms. The common CNVs were detected in 40 individuals (20 CEU Caucasian and 20 Yoruban samples) by means of a NimbleGen tiling array set of 42 million probes and include 748 CNV regions for chromosome 1. We found that 25% of the CNVs identified by JointSLM overlap with the GSV calls (Table 1) and around 50% of the GSV calls were present in our callset: for regions larger than 5 kb we found that the overlap with GSV regions is around 70% (70% for both regions that ranges between 5 and 10 kb and regions larger than 10 kb), while for CNVs smaller than 5 kb it reduces to 10–30% (35% for regions that ranges between 1 and 5 kb and 10% for regions smaller than 1 kb).

Lastly, we compared our set of calls with SVs detected by PEM-based approach. The SVs of two of the individuals considered in this study (YH and NA18507) were previously analysed by means of PEM-based approach (23,24). To understand the differences between JointSLM and PEM-based methods in detecting known CNVs, we took the set of copy number variants of GSV as a set of true positive (TP), and we determined the proportion of TP identified by the two approaches. In the samples NA18507 and YH, JointSLM was able to identify 290 (39%) and 256 (34%) of the 748 CNV regions of the validation set, while PEM-based methods detected 125 (17%) and 79 (7%) (see Figure 3).

These results show that our algorithm has good sensitivity with respect to PEM methods in identifying CNVs previously detected by array-CGH. However, there is a little overlap between our call set and the call sets obtained with PEM approaches: for NA18507 we found an overlap of 18% (184/996) and for individual YH an overlap of 23% (46/194).

To understand if the discrepancy between PEM and our calls is due to detection limits of our algorithm, we calculated the median value of the DOC data for each non-overlapping region identified by PEM-based methods for both YH and NA18507. We found that 94% (811/856) and 82% (122/148) of the non-overlapping regions identified by PEM methods in NA18507 and YH, respectively, have a median value that ranges between 1.2 and 2.8 copies. These results suggest that the differences between PEM and our calls are not due to detection limits of the JointSLM algorithm but to the fact that the PEM- and DOC-based approaches allow to detect different classes of SVs: the discrepancy between PEM- and DOC-specific events has been previously reported (19) and it is explained by the fact that DOC-specific events show a large overlap with annotated segmental duplications (SDs), while PEM-specific events show an enrichment with simple repeat (SR) events.

### JointSLM and clustering

To demonstrate the utility of our algorithm for population genetic analysis, we applied clustering analysis to the matrix of the CNV regions identified by JointSLM in chromosome 1. We performed Ward's hierarchical clustering with the aim to group both CNV regions and individuals. We used Pearson correlation coefficient for clustering individuals and the euclidean distance for clustering genomic events.

Table 2 and Figure 4 report the results of the hierarchical clustering. Although no information on the identity of the individuals was used in the analysis, the algorithm was able to segregate the ancestry of the eight individuals in two main clusters: the first cluster include the european ancestry family and the chinese individual, while the second cluster include the nigerian ancestry family and the Yoruban individual NA18507. The clustering on the genomic events identified seven groups of regions with complex patterns of CNVs. In particular, we were able to detect three clusters (A, C and E) that contain regions with common amplifications, three clusters (B, D and F) that contain regions with common deletions and a cluster (G) that is made of deletions present only in the individual NA18507.

^{−6}, ω = 0.1 and

*K*

_{0}= 20. Each row represents

**...**

The genomic regions grouped in cluster A, E and F contain CNVs with high population frequency (shared among almost all of the eight individuals), while clusters B, C and D contain subgroups of CNVs that are primarily shared among Yoruban ancestry or european ancestry. As expected, the three clusters with common amplifications showed a greater enrichment of annotated SDs compared with the clusters that contain deletions. SDs accounted for 73, 63 and 54% of the total base pairs of clusters A, C and E, while for clusters B, D and F we found an overlap of 29, 31 and 23%, respectively.

Conversely, we observed that the clusters that contain common deletions (B, D and F) showed a greater enrichment of annotated genes compared with the amplification clusters. The number of RefSeq genes that overlap with clusters A, B and C are 128, 151 and 43, while for the deletion clusters D, E and F we found an overlap with 236, 360 and 117 RefSeq genes, respectively.

We also studied the overlap between the CNV regions that belong to each of the seven clusters and annotated regions with SRs: we found that all the clusters have an average overlap of the order of 6–8% with SR events, with the exception of cluster A that have more than 15%. Clusters A contain 434 amplification regions with very complex pattern of CNVs: around 65% of the amplifications that belong to cluster A have an estimated number of copies greater than 4.

## DISCUSSION

We developed a novel algorithm that extend the univariate SLM to the multivariate case in order to detect recurrent shifts in the mean of multiple sequential processes. We applied JointSLM to DOC signals obtained from high coverage sequencing data in order to infer common CNVs among multiple individuals.

The results obtained in simulated chromosomes show that JointSLM correctly detects recurrent CNV regions as small as 500 bp in size with sensitivity larger than 90%. The comparison with other state of the art methods demonstrated the our joint model is able to obtain an unprecedented resolution in the analysis of DOC data (see Supplementary Data). We applied our algorithm on chromosome 1 of eight genomes and we identified 3000 regions with recurrent CNVs of different frequency and size. We validated in silico the 3000 CNVs regions by studying the overlap with the annotated CNVs of the Database of Genomic Variants. We found that more than 50% of the inferred regions overlap with annotated CNVs and the validation rate grows up to 70–100% for regions larger than 1 kb. These results clearly show that the use of DOC data combined with our algorithm allows to obtain an unprecedented resolution in the identification of genomic structural variants.

We also demonstrate the utility of JointSLM algorithm for population genetics analysis by applying cluster analysis to the inferred regions. Hierarchical clustering applied to the samples is able to separate the eight individuals in two main clusters that reflects their ancestry, while cluster analysis on the regions allows to identify groups of CNVs that share structural features such as enrichment in segmental duplications, enrichment in simple repeats and gene content.

JointSLM is also able to analyse DOC data from one sample at time obtaining very good results in terms of sensitivity and specificity (see Supplementary Data).

The resolution of JointSLM strictly depends on the signal to noise ratio (SNR) of the data (see ‘Materials and Methods’ section): increasing the SNR of DOC data by reducing the sequencing error rate or augmenting the coverage of the sequencing experiments, will improve the performance of JointSLM in detecting small shifts in the signals.

The JointSLM algorithm can be also used to analyse multiple tumour samples data for the discovery of recurrent copy number alterations. In this case, to estimate DNA copy numbers it is necessary to take into account cellularity and tumoural heterogeneity and for this reason we would need a more sophisticated approach similar to CGHCall (26) or FastCall (27) instead of using the simple rounding to the closest integer. Although all the analyses performed in this article were made on sequence data from the Illumina Genome Analyzer, the algorithm we developed is generic and can be used to analyse DOC data produced by different HTS platforms.

## AVAILABILITY

The JointSLM algorithm is implemented as an R package. The source code includes both R and Fortran codes. The JointSLM package and a brief manual is freely available as Supplementary Data.

## ACKNOWLEDGEMENT

We thank the 1000 Genomes Consortium which provided part of the data used in this study (http://www.1000genomes.org). We thank Prof. Jonathan Sebat for helpful discussions.

## FUNDING

This study was supported by the project “Gene expression profile and therapeutic implication in gastric cancer. From the clinical overview to the translational research” of the Istituto Toscano Tumori (ITT).

*Conflict of interest statement*. None declared.

## REFERENCES

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (514K) |
- Citation

- Detecting copy number variation with mated short reads.[Genome Res. 2010]
*Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M.**Genome Res. 2010 Nov; 20(11):1613-22. Epub 2010 Aug 30.* - Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity.[PLoS Comput Biol. 2010]
*Waszak SM, Hasin Y, Zichner T, Olender T, Keydar I, Khen M, Stütz AM, Schlattl A, Lancet D, Korbel JO.**PLoS Comput Biol. 2010 Nov 11; 6(11):e1000988. Epub 2010 Nov 11.* - Read count approach for DNA copy number variants detection.[Bioinformatics. 2012]
*Magi A, Tattini L, Pippucci T, Torricelli F, Benelli M.**Bioinformatics. 2012 Feb 15; 28(4):470-8. Epub 2011 Dec 23.* - Statistical challenges associated with detecting copy number variations with next-generation sequencing.[Bioinformatics. 2012]
*Teo SM, Pawitan Y, Ku CS, Chia KS, Salim A.**Bioinformatics. 2012 Nov 1; 28(21):2711-8. Epub 2012 Aug 31.* - Current analysis platforms and methods for detecting copy number variation.[Physiol Genomics. 2013]
*Li W, Olivier M.**Physiol Genomics. 2013 Jan 7; 45(1):1-16. Epub 2012 Nov 6.*

- GROM-RD: resolving genomic biases to improve read depth detection of copy number variants[PeerJ. ]
*Smith SD, Kawash JK, Grigoriev A.**PeerJ. 3e836* - Identification of copy number variants in whole-genome data using Reference Coverage Profiles[Frontiers in Genetics. ]
*Glusman G, Severson A, Dhankani V, Robinson M, Farrah T, Mauldin DE, Stittrich AB, Ament SA, Roach JC, Brunkow ME, Bodian DL, Vockley JG, Shmulevich I, Niederhuber JE, Hood L.**Frontiers in Genetics. 645* - Common Copy Number Variation Detection From Multiple Sequenced Samples[IEEE transactions on bio-medical engineerin...]
*Duan J, Deng HW, Wang YP.**IEEE transactions on bio-medical engineering. 2014 Mar; 61(3)928-937* - Detection of common copy number variation with application to population clustering from next generation sequencing data[Conference proceedings : ... Annual Interna...]
*Duan J, Zhang JG, Deng HW, Wang YP.**Conference proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference. 2012; 20121246-1249* - Identification of structural variation in mouse genomes[Frontiers in Genetics. ]
*Keane TM, Wong K, Adams DJ, Flint J, Reymond A, Yalcin B.**Frontiers in Genetics. 5192*

- PubMedPubMedPubMed citations for these articles

- Detecting common copy number variants in high-throughput sequencing data by usin...Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithmNucleic Acids Research. 2011 May; 39(10)e65

Your browsing activity is empty.

Activity recording is turned off.

See more...