- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# A versatile statistical analysis algorithm to detect genome copy number variation

^{†}

^{‡}Archisman Rudra,

^{†}

^{‡}Harry Ostrer,

^{§}Robert Lucito,

^{¶}Michael Wigler,

^{¶}and Bud Mishra

^{†}

^{¶}

^{}

^{†}Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012;

^{¶}Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724; and

^{§}Human Genetics Program, New York University School of Medicine, New York, NY 10012

^{}To whom correspondence should be addressed. E-mail: mishra/at/nyu.edu.

^{‡}R.-S.D. and A.R. contributed equally to this work.

## Abstract

We have developed a versatile statistical analysis algorithm for the detection of genomic aberrations in human cancer cell lines. The algorithm analyzes genomic data obtained from a variety of array technologies, such as oligonucleotide array, bacterial artificial chromosome array, or array-based comparative genomic hybridization, that operate by hybridizing with genomic material obtained from cancer and normal cells and allow detection of regions of the genome with altered copy number. The number of probes (i.e., resolution), the amount of uncharacterized noise per probe, and the severity of chromosomal aberrations per chromosomal region may vary with the underlying technology, biological sample, and sample preparation. Constrained by these uncertainties, our algorithm aims at robustness by using a priorless maximum *a posteriori* estimator and at efficiency by a dynamic programming implementation. We illustrate these characteristics of our algorithm by applying it to data obtained from representational oligonucleotide microarray analysis and array-based comparative genomic hybridization technology as well as to synthetic data obtained from an artificial model whose properties can be varied computationally. The algorithm can combine data from multiple sources and thus facilitate the discovery of genes and markers important in cancer, as well as the discovery of loci important in inherited genetic disease.

**Keywords:**array-based comparative genomic hybridization, copy-number fluctuations, maximum

*a posteriori*estimator

**G**enomes in a population are polymorphic, giving rise to diversity and variation. In cancer, even somatic cell genomes can rearrange themselves, often resulting in genomic deletion (hemi- or homozygous) and amplifications. Means for assessing these chromosomal aberrations quickly, inexpensively, and accurately have many potential scientific, clinical, and therapeutic implications (1, 2), particularly in the genomics of cancer and inherited diseases. Genome-based methods for studying cancer, in contrast to the gene expression-based methods, can exploit the stability of DNA (as a component of the cancerous cell, which does not vary as a function of the cell's physiological state). Karyotyping, determination of ploidy, and comparative genomic hybridization have been useful tools for this purpose even though they are crude and produce data that must be processed by sophisticated statistical algorithms to serve as useful guides to diagnosis and treatment.

Microarray methods are an important new technology that can be used to study variations between regular and cancer genomes. Imagine that one can sample the genome uniformly (independently and identically distributed) and reproducibly to create a large number of oligonucleotides (on the order of 100,000 probes) located every 30 kb or so. These oligonucleotides almost always come from regions of the genome that do not share homologous sequences elsewhere in the genome. These sequences (typically less than a few hundred base pairs long) occupy unique positions in the normal genome and have exactly two copies.

If one such oligonucleotide belongs to a region in a cancer genome that has an altered copy number, say, *c* (0 ≤ *c* ≠ 2), then when the cancer genome is sampled, this oligonucleotide will occur with a probability that is *c*/2 times that in the regular genome. The copy number can be computed by a ratiometric measurement of the abundance of an oligonucleotide in a cancer sample measured against that in the regular genome. This technique can be generalized to measure the copy number variations for many probes simultaneously with high-throughput microarray experiments. Even though the ratiometric measurements used and the associated regularizations tame the multiplicative noises in the system to some extent, there remains a large amount of uncharacterized noise (generally additive) that can render the data worthless unless a proper data-analysis algorithm is applied. Because the data may come from multiple sources collected with varying protocols, such an algorithm must be general and be based on a minimal set of prior assumptions about the methods. The algorithm we describe below reflects these desiderata.

Our Bayesian approach constructs a most plausible hypothesis concerning regional changes and the corresponding associated copy number. It can be viewed as an optimization process minimizing a score function that assigns penalties of different type for each kind of deviation from genomic normality (break-points, unexplainable probe values, noise, etc.); we discuss how these penalties are derived. We describe various algorithmic alternatives, their implementations, and the empirical results derived using real data (where the underlying facts are not directly verifiable) and simulated data (where the true facts are known).

## Statistical Model

We start by describing a probabilistic generative model for observed copy number data. The model is Bayesian in spirit, in that we use parameterized prior distributions and use the posterior distribution function to estimate the underlying model. We use a maximum *a posteriori* (MAP) technique to estimate the underlying model. This idealized statistical model takes into account some major sources of copy number variation in an irregular genome and is described by two scalar parameters 0 ≤ *p*_{r}, *p*_{b} ≤ 1.

We assume that there is a copy-number distribution for probes at locations that have not been affected by the chromosomal aberrations associated with cancer. We call these probes regular probes. We also assume that the probability for a particular probe being regular is *p*_{r} and that the associated regular copy-number distribution, after log transformation, is Gaussian, with mean μ_{r} and standard deviation σ_{r}. For the other probes, which we call deviated, the log-transformed copy-number distributions also are assumed to be Gaussian, with unknown mean and standard deviation, distinct from the regular distribution. There are usually many sets of probes drawn from different deviated distributions.

We also assume that there are locations in the genome that are particularly susceptible to amplification (also known as duplication) and deletion events. These aberrations change the copy numbers of probes locally. We model the number of such mutations as a Poisson process with parameter *p*_{b}*N*, where *N* is the length of the genome (i.e., total number of probes).

We subdivide the probes along the genome into *k* nonoverlapping intervals. Probes belonging to a particular interval are assumed to have a similar evolutionary history of duplication and deletion events, and therefore have similar copy-number distributions. The number of intervals into which the probes can be separated represents the progressive degeneration of a cancer cell line. We do not model single nucleotide polymorphisms and other point-mutation events, and this undermodeling reappears as localized noise in our analyzed data.

In our picture, each interval in this subdivision has a “true” copy number. Our goal is to estimate the correct subdivision and the copy numbers associated with each subinterval. Despite its simplicity, our model can serve as the basis of a statistical algorithm to infer the aberrations without overfitting the data.

More formally, given a set of *N* probe copy-number values arranged on the genome, we assume that there is an unknown partition of this set into nonoverlapping subintervals. The probe copy number values in the *j*th interval are assumed to arise as independent samples from a Gaussian distribution *N*(μ_{j}, σ_{j}). The parameters relating to the *j*th interval can be represented as the tuple *I*_{j} = (μ_{j}, *i*_{j}, σ_{j}), where μ_{j} and σ_{j} are the mean and standard deviation of the appropriate Gaussian distribution and *i*_{j} is the position of the last probe in the interval. We call such a set of intervals I = {*I*_{j}|*j* = 1,..., *k*} an interval structure. When a particular interval in I is regular, its mean is the regular mean μ_{r}. If an interval *I*_{j} is deviated, then its population mean μ_{j} is unknown and is estimated by using the sample mean over the interval. In this work, we assume that all of the σ_{j} terms are equal to some common value σ, and we therefore omit them from the notation. We denote an interval structure I_{N} with *k* intervals and whose intervals have associated means μ_{1},..., μ_{k} and endpoints *i*_{1},..., *i*_{k} (necessarily *i*_{k} = *N*) as *i*_{1}, μ_{1}, *i*_{2}, μ_{2},..., *i*_{k}, μ_{k}.

Our goal is to estimate the unknown interval structure I_{N} from an input sequence {*v*_{i}, *i* = 1... *N*} of copy numbers of *N* successive probes.

The statistical model described thus far fits naturally into a Bayesian setting. We can start with a prior distribution on the set of interval structures depending only on the number of intervals and the number of regular probes with two scalar parameters *p*_{r} and *p*_{b} whose significance is described above.

This prior has two components, the first a Poisson distribution to model the number of intervals with Poisson parameter *p*_{b}*N*. The second component is a sequence of Bernoulli trials, one for each probe with probability *p*_{r} that a given probe is regular. Combining these factors, the prior distribution becomes

where #*regular* is the number of regular probes with the “regular” copy-number distribution and #*deviated* is the number of remaining probes in the interval structure I_{N}. In each interval *I*_{j}, the data points are modeled by adding independent Gaussian noise to this prior structure and are drawn from the Gaussian distribution *N*(μ_{j}, σ).

The data likelihood function for the first *n* probes is given by the product of Gaussians:

where the *i*th probe is covered by the *j*th interval of the interval structure I_{N} and μ_{j} is the mean of the corresponding Gaussian distribution. denotes the density function of the Gaussian distribution. By multiplication, we obtain the posterior likelihood function:

In the above expression for *L*, only the μ values of nonregular processes are unknown, and we estimate these values by using the sample mean for the interval. The MAP solution to the segmentation problem is obtained by finding the interval structure I^{*} that maximizes this likelihood function or, equivalently, minimizes the negative log likelihood of *L*.

## Algorithm and Implementation

A dynamic programming algorithm efficiently minimizes the negative posterior log likelihood function obtained above. Starting with an interval structure *I* = *i*_{1}, μ_{1},..., *i*_{k}, μ_{k}, we can extend it to the interval structure *I*′ = *i*_{1}, μ_{1},..., *i*_{k+1}, μ_{k+1}, where *i*_{k+1} > *i*_{k}. The following formula computes the log likelihood for such an extension

where the last term on the right side is chosen according to whether the last added interval (i.e., the one extending from *i*_{k} + 1 to *i*_{k+1}) is regular or not. if the Boolean formula *e* is true, and 0 otherwise. We also point out that the MAP approach permits the estimation of the μ terms in a uniform manner. When the last added interval is regular, the value μ_{k+1} is fixed at the global mean μ. When the last added interval is deviated, however, the MAP criterion automatically forces the choice of the sample mean of the data points covered by the last interval as the value for μ_{k+1}. One could build hierarchical models for the mean and use these “shrinkage-like” estimators as well (3), although we do not explore that approach here.

The negative log likelihood function satisfies an optimality condition that allows one to use a standard dynamic programming algorithm [of time-complexity *O*(*N*^{2})] in this setting.

## Results

We evaluate the performance of this simple Bayesian scheme on three kinds of data. For each of these data sets, we will see that proper choice of the parameter values *p*_{r} and *p*_{b} leads to good segmentation. Indeed, coefficients chosen from within a fairly large region of the “*p*_{r}-*p*_{b} space” lead to a good segmentation because our procedure is stable over a large domain. The parameters *p*_{b}, *p*_{r}, μ, and σ play different roles: an increase in *p*_{b} yields more intervals in the segmentation, and, as *p*_{r} is increased, more probes come to be classified as regular, and therefore the number of different segments diminish.

The choice of μ is critical because it controls the bias in the resulting segmentation. The choice of σ is also important because increasing σ weakens the influence of the data on the segmentation obtained.

**Representational Oligonucleotide Microarray Analysis (ROMA) Data from Breast Cancer Cell Lines.** ROMA is a comparative genomic hybridization (CGH) technique developed by Wigler and colleagues (1) at Cold Spring Harbor Laboratory. It evolved from an earlier method, representational differential analysis, which was adapted for greatly increased volumes of data obtained by using an oligonucleotide microarray. ROMA uses a comparative “two-color” scheme to compare multiple genomes, each represented with reduced complexity by using a PCR-based method (4, 5). As in other array-based methods, ROMA performs simultaneous array hybridization to compare a normal genome at one fluorescent wavelength and a tumor genome at another. The DNA representations used by ROMA are based on amplification of short restriction endonuclease fragments and hence are predictable from the nucleotide sequence of the genome. We have tested our algorithm on the data sets from the Wigler laboratory obtained by ROMA from the genomes of breast cancer cell lines. The data set is based on 85,000 well characterized probes, each of length 70 bp, providing a resolution of a probe every 15-30 kb.

Figs. Figs.11 and and22 show subsampled ROMA breast cancer data from chromosomes 2 and 8, respectively, overlaid with the segmentation found by our algorithm. The low-complexity DNA representation used in ROMA, together with a careful choice of probes, provides low-noise data that can be characterized accurately by the algorithm.

**Array-Based CGH (arrayCGH) Data from Prostate Cancer Cell Lines.** arrayCGH is a recently developed technique that maps duplicated or deleted chromosomal segments onto high-density arrays of well characterized bacterial artificial chromosomes (BACs), rather than onto metaphase chromosomes. This method has been used for precise mapping of duplications and deletions occurring in cancers and other human diseases, including birth defects and mental retardation (see ref. 6 for review and applications of this technique). Tumors that have been studied by using this method are breast, head and neck, Wilms, esophageal, pulmonary artery intimal, adrenocortical, renal, and prostate cancers and lymphomas. We have tested our algorithm on a data set obtained by high-resolution arrayCGH analysis of prostate cancer tissue. The data were supplied by a group at Nijmegen University Medical Center and obtained by hybridization on their custom array composed of ≈3,500 fluorescence *in situ* hybridization-verified clones selected to cover the genome with an average of one clone per megabase (7).

Fig. 3 shows the performance of our segmentation algorithm on data from prostate cancer cell lines obtained through arrayCGH experiments. We note that these data are noisier than the ROMA data considered previously. But, despite the increased noise, the segmentation algorithm is robust and yields reasonable segmentations.

*p*

_{r}= 0.55,

*p*

_{b}= 0.01, sampling rate 1 in 10.

**Simulated Data.** To further test our algorithm, we can use an artificial but biologically inspired model to generate synthetic data. To generate simulated copy-number data, we choose loci uniformly over a genome such that the probability of a duplication or deletion event taking place at that location on the genome is given by *p*_{b}. At each of these points, we assign a new copy-number value that represents the mean for the new interval. The mean values are drawn from a power-transformed γ distribution to mimic the observed distribution of means in experimental data. The lengths of the intervals follow a geometric distribution such that the ratio of expected fragment length and the expected distance between the beginning of each interval is *p*_{r}. Once the segmentation and the mean values are chosen, we generate the simulated data by adding random Gaussian noise. A typical simulated genome is shown in Fig. 4.

**Effect of Noise on Performance.** We investigate the effect of increasing the σ of the underlying model on the performance of the segmenter. Assuming the parameters of the model are correctly estimated, the segmenter can output the estimated mean value at every probe position. Using the known mean values, we can compare these two sequences of means. In our setting, a good measure of error is the number of misclassified probes, i.e., the number of probes that are known to be regular but were classified as amplified or deleted and vice versa. Fig. 5 shows the increase in the rate of misclassification as σ increases.

## Prior Selection

Proper selection of a prior distribution has received extensive attention in the literature. Approaches include noninformative priors [Jeffreys (8)], reference priors [Bernardo (9); see also Berger and Bernardo (10) and Kass and Wasserman (11)], and conjugate priors [Raiffa and Schlaifer (12)] among others. Conjugate prior methods frequently arise in connection with exponential families of distributions [see Brown (13)]. Other approaches include using invariance properties to posit prior distributions with good performance. More recent and somewhat more data-dependent techniques include hierarchical and empirical Bayes techniques. Textbooks such as those by Bernardo and Smith (14), Berger (15), Carlin and Louis (16), Gelman *et al*. (17), and Robert (18) cover model selection as a part of Bayesian learning.

For the problem of estimating probe copy numbers, the prior distribution is specified by the two probability parameters, *p*_{r} and *p*_{b}. The other parameters (μ, the regular mean, and σ^{2}, the regular variance) can be estimated by experiment. The problem of prior selection reduces to the problem of optimally selecting the values of *p*_{r} and *p*_{b} to prevent overfitting of the data.

Minimax approaches choose prior distributions that minimize the maximum value of the likelihood function (Eq. **4**). This criterion is pessimistic, in that it chooses the prior that generates the worst likelihood value. See, for example, Berger (15), Brown (19-21), and Strawderman (22-24). In the nonparametric setting of function estimation, multiscale methods have been proven to be asymptotically minimax by Donoho and Johnstone (25-27).

We adapt an approach, based on statistical decision theory, that directly controls the level of overfitting without explicitly depending on the asymptotic performance guarantees of minimax approaches. We rely on the fact that, in any segmentation, each jump separates the probes in the two adjoining intervals. If a segmentation is overfitted, at least one of its jumps must be overfitted, too. We use Hotelling's *t*^{2} statistic [see Anderson (28) or Wilks (29)] at each jump to compute a measure of this overfitting.

We apply an *F* test to Hotelling's *t*^{2} statistic to test whether two sets of independent samples come from populations with the same mean. This *F* test is possible because we assume that the two sets of samples have the same (but still unknown) variance. Let *x*_{1}, *x*_{2},..., *x*_{N1} and *y*_{1}, *y*_{2},..., *y*_{N2} be the two sets of independent samples taken from successive intervals of size *N*_{1} and *N*_{2}, respectively. Then, we define the statistic:

where and are the respective sample means, and *df*_{1} and *df*_{2} refer to the respective degrees of freedom of the two samples. Under the null hypothesis (of equal means), *t*^{2} follows an *F* distribution with 1, (*df*_{1} + *df*_{2}) degrees of freedom. This leads to a one-tailed *F* test.

Intuitively, *t*^{2} needs to be large to avoid overfitting. The cumulative probability for the appropriate *F* distribution yields a score that determines the quality of the break. We also compute a score for the goodness of fit for the whole segmentation. This procedure yields a set of scores: one for each break and one for the goodness of fit. The minimum of these scores is used to evaluate the whole segmentation. We select the parameters *p*_{r} and *p*_{b} to maximize this score by searching at regular intervals over the parameter space. We can continue to refine the search in the neighborhood of the optimal values obtained. However, the algorithm is already extremely stable in a large region of the *p*_{r}-*p*_{b} space and yields, in practice, very good segmentations.

## Discussion

The problem of detecting copy-number variations has assumed biological importance in recent years. Most extant algorithms use a global thresholding approach for this problem. This is the case, for example, in Vissers *et al*. (7) as well as many commercially available packages. These algorithms have the advantage of simplicity but perform poorly in the presence of noise and correlations. Other published approaches have used smoothing (30), hidden Markov models (HMMs) (31, 32), and mixtures of Gaussians, as well as approaches that try to estimate the correlations between probes (33, 34). Although smoothing certainly improves the performance of threshold-based approaches, the specifics remain somewhat ad hoc, and the method requires tuning dependent on the source and resolution of the data.

HMMs have the advantage of having a general (although slow) learning algorithm; however, their performance is very sensitive to the topology of the HMM. For this reason, researchers tend to analyze very narrow classes of data with a particular HMM, e.g., a prostate cancer cell line. Very rarely are normal-normal data so analyzed, because this analysis usually necessitates the construction of an HMM with a different topology, leading to questions about the comparative power of such analyses. The main problem with both this approach and others based on assuming a distributional form for cancerous data is that the cancerous insertion-deletion polymorphisms are characterized by being nonnormal, rather than belonging to a specific distributional form. Therefore, fitting cancer data leads to the construction of a specific HMM topology that might depend on the specific cancer as well as the goodness of fit desired by the statistician.

Olshen and Venkatraman (35) have advocated another approach based on recursive change-point detection in the copy number data. The existence of a large literature on change-point analysis makes this approach attractive. Conversely, an efficient implementation of this algorithm is difficult. The specific statistic chosen for change-point detection in this and other work of the group perform poorly on normal-normal data due to overly pessimistic criteria. In some sense, our approach of putting a Bayesian prior on the number of change points enables us to be aggressive about detecting change points.

We have devised a versatile MAP estimator algorithm to analyze arrayCGH data. This algorithm uses a model that captures the genomic amplification-deletion processes but is relatively insensitive to additive noise in the data. When the algorithm was tested on a wide variety of data from ROMA- and arrayCGH-based methods, this particular feature of the algorithm provided strength and robustness. We note that the correct choice of *p*_{r} and *p*_{b} is critical in the segmentation algorithm. High values of *p*_{b} tend to yield overfitted solutions, whereas high values of *p*_{r} drive us toward biased solutions that mark all segments as regular. The advantage of having an algorithm with only two numerical parameters is that a simple and natural statistical criterion enables the proper choice of these parameters in all cases.

We parenthetically note that our approach extends to multidimensional data *mutatis mutandis*. The relevant likelihood function needs to be changed to the following

The *t*^{2} statistic can be modified similarly.

Prior work by Donoho and colleagues (36-38) on detecting geometrical features in point clouds by using multiresolution methods relates to the ideas presented here. These papers focus on the use of multiresolution approaches for efficiency and statistical stability. There is also prior work by Kolaczyk (see ref. 39, for example) that gives a unified Bayesian treatment to multiresolution analysis and covers large classes of both continuous as well as discrete processes. Our approach leads to an efficient algorithm for sequence-like data, which can be used in a multiscale setting if desired. Furthermore, in our approach, the probabilistic generative model directly leads to the cost function; thus, other generative models, e.g., poisson models, can be easily considered in this setting. It should be noted that the Hotelling's *t*^{2} statistic cannot be easily generalized to this setting.

## Acknowledgments

We thank Yi Zhou (New York University Bioinformatics Group) and two anonymous reviewers for many helpful discussions, suggestions, and relevant references to statistical literature. We also thank Lakshmi Muthuswami (Cold Spring Harbor Laboratory), and Eric Schoenmakers and Joris Veltman (Nijmegen University Medical Center, Nijmegen, The Netherlands) for providing the data used here and for explaining their biological significance. The work reported in this paper was supported by grants from the National Science Foundation (NSF) Qubic Program, the NSF Information Technology Research Program, the Defense Advanced Research Projects Agency, a Howard Hughes Medical Institute Biomedical Support Research Grant, the U.S. Department of Energy, the U.S. Air Force (Air Force Research Laboratory), the National Institutes of Health, and the New York State Office of Science, Technology and Academic Research.

## Notes

Author contributions: B.M. designed research; R.-S.D., A.R., and B.M. performed research; R.-S.D., A.R., and B.M. contributed new reagents/analytical tools; R.-S.D., A.R., H.O., M.H.W., and B.M. analyzed data; R.-S.D., A.R., and B.M. wrote the paper; and R.L. and M.H.W. contributed data.

Abbreviations: MAP, maximum *a posteriori*; ROMA, representational oligonucleotide microarray analysis; CGH, comparative genomic hybridization; arrayCGH, array-based CGH; HMM, hidden Markov model.

## References

**,**1726-1736. [PMC free article] [PubMed]

**,**42-49.

**,**1254-1263.

**,**946-951. [PubMed]

**,**4487-4492. [PMC free article] [PubMed]

**,**Suppl. 2, R145-R152. [PubMed]

*et al*. (2003) Am. J. Hum. Genet. 73

**,**1261-1270. [PMC free article] [PubMed]

**,**453-461. [PubMed]

**,**113-147.

**,**1343-1370.

**,**855-903.

**,**1277-1282.

**,**385-388.

**,**255-263.

**,**1364-1368.

**,**879-921.

**,**1-32.

**,**859-897.

*et al*. (2004) Science 305

**,**525-528. [PubMed]

**,**132-153.

**,**74. [PMC free article] [PubMed]

**,**540-549. [PubMed]

**,**920-933.

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (336K)

- Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation.[Genome Res. 2003]
*Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, et al.**Genome Res. 2003 Oct; 13(10):2291-305. Epub 2003 Sep 15.* - Assessing the significance of conserved genomic aberrations using high resolution genomic microarrays.[PLoS Genet. 2007]
*Guttman M, Mies C, Dudycz-Sulicz K, Diskin SJ, Baldwin DA, Stoeckert CJ Jr, Grant GR.**PLoS Genet. 2007 Aug; 3(8):e143.* - Challenges in array comparative genomic hybridization for the analysis of cancer samples.[Genet Med. 2007]
*Nowak NJ, Miecznikowski J, Moore SR, Gaile D, Bobadilla D, Smith DD, Kernstine K, Forman SJ, Mhawech-Fauceglia P, Reid M, et al.**Genet Med. 2007 Sep; 9(9):585-95.* - Detection of DNA copy number alterations in cancer by array comparative genomic hybridization.[Genet Med. 2007]
*Michels E, De Preter K, Van Roy N, Speleman F.**Genet Med. 2007 Sep; 9(9):574-84.* - Comparative genomic hybridization.[Annu Rev Genomics Hum Genet. 2005]
*Pinkel D, Albertson DG.**Annu Rev Genomics Hum Genet. 2005; 6:331-54.*

- Copy Number Variant Analysis of Human Embryonic Stem Cells[Stem cells (Dayton, Ohio). 2008]
*Wu H, Kim KJ, Mehta K, Paxia S, Sundstrom A, Anantharaman T, Kuraishy AI, Doan T, Ghosh J, Pyle AD, Clark A, Lowry W, Fan G, Baxter T, Mishra B, Sun Y, Teitell MA.**Stem cells (Dayton, Ohio). 2008 Jun; 26(6)1484-1489* - Bayesian Frequentist hybrid Model wth Application to the Analysis of Gene Copy Number Changes[Journal of applied statistics. 2011]
*Yuan A, Chen G, Xiong J, He W, Rotimi C.**Journal of applied statistics. 2011; 38(5)987-1005* - Copy number variation signature to predict human ancestry[BMC Bioinformatics. ]
*Pronold M, Vali M, Pique-Regi R, Asgharzadeh S.**BMC Bioinformatics. 13336* - Detecting simultaneous changepoints in multiple sequences[Biometrika. 2010]
*Zhang NR, Siegmund DO, Ji H, Li JZ.**Biometrika. 2010 Sep; 97(3)631-645* - Parsimonious Higher-Order Hidden Markov Models for Improved Array-CGH Analysis with Applications to Arabidopsis thaliana[PLoS Computational Biology. 2012]
*Seifert M, Gohr A, Strickert M, Grosse I.**PLoS Computational Biology. 2012 Jan; 8(1)e1002286*

- A versatile statistical analysis algorithm to detect genome copy number variatio...A versatile statistical analysis algorithm to detect genome copy number variationProceedings of the National Academy of Sciences of the United States of America. Nov 16, 2004; 101(46)16292PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...