

# A statistical framework for Illumina DNA methylation arrays

^{1}Department of Biostatistics,

^{2}Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599,

^{3}Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792 and

^{4}Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA

## Abstract

**Motivation:** The Illumina BeadArray is a popular platform for profiling DNA methylation, an important epigenetic event associated with gene silencing and chromosomal instability. However, current approaches rely on an arbitrary detection *P*-value cutoff for excluding probes and samples from subsequent analysis as a quality control step, which results in missing observations and information loss. It is desirable to have an approach that incorporates the whole data, but accounts for the different quality of individual observations.

**Results:** We first investigate and propose a statistical framework for removing the source of biases in Illumina Methylation BeadArray based on several positive control samples. We then introduce a weighted model-based clustering called LumiWCluster for Illumina BeadArray that weights each observation according to the detection *P*-values systematically and avoids discarding subsets of the data. LumiWCluster allows for discovery of distinct methylation patterns and automatic selection of informative CpG loci. We demonstrate the advantages of LumiWCluster on two publicly available Illumina GoldenGate Methylation datasets (ovarian cancer and hepatocellular carcinoma).

**Availability:** `R` package `LumiWCluster` can be downloaded from http://www.unc.edu/~pfkuan/LumiWCluster

**Contact:** pfkuan@bios.unc.edu

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

DNA methylation is an important epigenetic modification and plays critical roles in transcriptional regulation, chromosomal stability, genomic imprinting and X-inactivation (Rakyan *et al.*, 2008). Numerous studies have established the influence of DNA methylation on transcriptional aberrations in human diseases, including various types of cancer (Esteller, 2007; Irizarry *et al.*, 2009; Koga *et al.*, 2009). In humans, DNA methylation occurs at cytosines of CpG dinucleotides in a non-random fashion across the genome. In particular, CpG-rich regions (known as CpG islands) are usually hypomethylated, whereas repetitive genomic sequences are hypermethylated in normal cells. Over the past decade, there has been extensive research on alterations of DNA methylation in cancer. A generally observed trend of perturbed DNA methylation includes hypomethylation of oncogenes and hypermethylation of tumor suppressor genes, leading to genomic instability and tumorigenesis (Esteller, 2007; Irizarry *et al.*, 2009).

Several platforms are available for DNA methylation profiling, including high-throughput arrays and, more recently, next-generation sequencing instruments. The experimental approaches in high-throughput array-based methylation include bisulfite conversion-based methods, restriction enzyme-based methods and immunoprecipitation-based methods (Down *et al.*, 2008). Popular and robust methylation profiling platforms based on bisulfite conversion are the Illumina GoldenGate and Infinium Methylation Assays, which use the BeadArray technology. This technology utilizes 3 μm silica beads which are replicated ~30 times on the array and has emerged as an attractive platform for genotyping, expression and methylation analysis (Dunning *et al.*, 2008b; Lynch *et al.*, 2009; Xie *et al.*, 2009). This technology requires less sample input but produces high-quality data (Dunning *et al.*, 2008b; Xie *et al.*, 2009), thereby reducing the cost of array experiments. The increasing popularity of Illumina BeadArray technology is apparent given the numerous scientific publications since 2008. However, there are only a handful of statistical frameworks for analyzing Illumina BeadArray gene expression (Dunning *et al.*, 2008b; Wong *et al.*, 2008), and limited work is available for the methylation array counterpart (Lynch *et al.*, 2009). Existing work for gene expression BeadArray falls into the categories of data preprocessing and differential expression detection. This includes background correction methods (Dunning *et al.*, 2008b; Xie *et al.*, 2009), variance-stabilizing techniques (Dunning *et al.*, 2008a) and modified test statistics for differential gene expression in Illumina BeadArray (Wong *et al.*, 2008).

Most frameworks for gene expression analysis are based on the assumption that the majority of the genes are not differentially expressed. In contrast, many sites are expected to be methylated (Irizarry *et al.*, 2008), and therefore the assumptions in gene expression are not applicable to methylation experiments. The goal of our article is to provide a statistical framework for array-based methylation profiling on Illumina BeadArray technology, by studying the sources of bias and the data-generating mechanism of Illumina methylation assays. Specifically, we propose a model for correcting the sources of bias in Illumina Methylation BeadArray and introduce a weighted model-based approach for clustering the methylation profiles. The framework of our weighted model-based clustering can also be directly applied to other Illumina BeadArray platforms, e.g. gene expression BeadChip, because it does not rely on the assumption that most beads are from the null distribution of no differential expression. In the next section, we describe the methylation data structure and introduce our proposed statistical framework.

## 2 MOTIVATION

Methylation levels in Illumina methylation assays are quantified by the *beta* value using the ratio of intensities between methylated (*M*) and unmethylated (*U*) alleles. Specifically,

$$
\beta = \frac{\max(M, 0)}{|U| + |M| + 100},
$$

where *M* and *U* are the red and green dyes, respectively, for the GoldenGate and VeraCode Methylation assays, whereas for the Infinium assay, *M* and *U* are signals A and B (produced by two different bead types and reported in the same color), respectively. The constant 100 is to regularize *beta* when both *M* and *U* values are small (Bibikova *et al.*, 2006). The *beta* values are continuous and range from 0 (unmethylated) to 1 (completely methylated). Each locus reports an average *beta* value obtained from the average of *M*'s and *U*'s across approximately 30 bead replicates, and individual bead-level measurements are not readily available (Dunning *et al.*, 2008b; Wong *et al.*, 2008). A standard summary output from BeadStudio (Illumina software to process raw intensities) includes four columns for each sample, i.e. (1) average *beta*, (2) average *M*, (3) average *U* and (4) detection *P*-values, for each locus. Therefore, our proposed framework will be based on the average *beta* values for convenience.
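The regularizing role of the constant 100 is easiest to see numerically. The following is a minimal sketch, assuming the standard Illumina definition β = max(*M*, 0)/(|*U*| + |*M*| + 100); the function name `beta_value` and the intensities are hypothetical and chosen only for illustration.

```python
def beta_value(m: float, u: float, offset: float = 100.0) -> float:
    """Beta value for one locus from average methylated (m) and
    unmethylated (u) intensities; offset is the regularizing constant."""
    return max(m, 0.0) / (abs(u) + abs(m) + offset)


# Hypothetical intensities, for illustration only.
strong = beta_value(9000.0, 900.0)  # bright, mostly methylated locus
weak = beta_value(50.0, 30.0)       # dim locus: the offset dominates
negative = beta_value(-5.0, 10.0)   # negative M is truncated at 0
print(strong, weak, negative)
```

For the dim locus the raw ratio *M*/(*M* + *U*) would be 0.625, whereas the regularized value is pulled toward 0, which is the intended effect when both signals are close to background.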

The detection *P*-value reported by BeadStudio can be used as a quality control measure of probe performance. The detection *P*-value is defined as 1 − *P*-value computed from the background model characterizing the chance that the signal was distinguishable from negative controls (Supplementary Materials). Standard protocol by Illumina recommends excluding probes that have a detection *P*-value greater than an *arbitrary* cutoff of 0.05. On the other hand, Marsit *et al.* (2009) excluded samples that consist of ≥25% observations with detection *P*-values ≥ 1 × 10^{−5}, as well as probes (CpG loci) with median detection *P*-values >0.05, whereas Hernandez-Vargas *et al.* (2010) excluded probes with detection *P*-values >0.01 in >10% of the samples. In Section 3.2, we will introduce a modeling framework that avoids arbitrary choice of detection *P*-value threshold.
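To make the differences between these hard-threshold rules concrete, the sketch below applies each of them to a small matrix of detection *P*-values. The function names and the toy matrix are hypothetical, and the Marsit-style probe medians are computed over all samples, which is an assumption since the order of the two filtering steps is not stated here.

```python
from statistics import median


def illumina_mask(pvals, cut=0.05):
    """Per-observation mask: True where the detection P-value passes."""
    return [[p <= cut for p in row] for row in pvals]


def marsit_filter(pvals, sample_cut=1e-5, sample_frac=0.25, probe_cut=0.05):
    """pvals[i][j] is sample i, locus j. Returns (kept_samples, kept_loci)."""
    n, p = len(pvals), len(pvals[0])
    # Drop samples with >= 25% of observations at or above sample_cut.
    kept_samples = [i for i in range(n)
                    if sum(v >= sample_cut for v in pvals[i]) / p < sample_frac]
    # Drop loci whose median detection P-value exceeds probe_cut.
    kept_loci = [j for j in range(p)
                 if median(pvals[i][j] for i in range(n)) <= probe_cut]
    return kept_samples, kept_loci


def hernandez_vargas_filter(pvals, cut=0.01, frac=0.10):
    """Keep loci unless P > cut in more than frac of the samples."""
    n, p = len(pvals), len(pvals[0])
    return [j for j in range(p)
            if sum(pvals[i][j] > cut for i in range(n)) / n <= frac]


# Toy matrix: 4 samples x 5 loci (made-up values).
pvals = [
    [1e-8, 1e-7, 1e-9, 1e-6, 0.20],
    [1e-9, 1e-6, 1e-8, 1e-7, 0.30],
    [1e-8, 0.02, 1e-7, 1e-6, 0.08],
    [0.50, 0.60, 0.70, 0.40, 0.90],
]
mask = illumina_mask(pvals)
kept_samples, kept_loci = marsit_filter(pvals)
kept_hv = hernandez_vargas_filter(pvals)
print(kept_samples, kept_loci, kept_hv)
```

On this toy matrix the three rules disagree about which observations survive, which illustrates how strongly the downstream data depend on the chosen cutoff.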

We will now explore and illustrate the sources of bias present in Illumina Methylation BeadArray based on the data generated by the Thomas-Conway Lab at UNC-Chapel Hill. The methylation experiment is performed on the GoldenGate Cancer Panel I methylation panel, which interrogates 1505 CpG loci/probes associated with 803 cancer-related genes (tumor suppressor genes, oncogenes, genes involved in DNA repair, cell-cycle control, differentiation, apoptosis, X-linked and imprinted genes) where 28.6% contain one CpG site per gene, 57.3% contain two CpG sites and 14.1% have three or more sites (Illumina, 2006). The probe length varies between 41 bp and 59 bp with median 50 bp, whereas the number of CG dinucleotides for each probe varies between 1 and 7. A pair of allele-specific oligonucleotide (ASO) and locus-specific oligonucleotide (LSO) measures the methylation level of a specific CG dinucleotide for a probe under the assumption that flanking CG dinucleotides within the same probe exhibit similar methylation status. In other words, the observed methylation level for the measured CG dinucleotide should not be affected by the number of flanking CG dinucleotides. To investigate the potential sources of bias in Illumina Methylation assays, we utilize six positive control samples from our dataset, in which all the cytosines in CG dinucleotides are expected to be methylated and any deviation from methylated status indicates the presence of technical biases. We apply quantile normalization to these six positive control samples. Figures 1 and 2 compare the pairwise correlations among these positive controls before and after normalization, respectively. Both plots show high correlations among the positive controls, with improvement after normalization.

In addition, array-based technologies are known to be affected by sequence and thermodynamic properties, e.g. melting temperature and GC content in protein–DNA binding and gene expression experiments (Dunning *et al.*, 2008b; Wei *et al.*, 2008). In Figure 3, we plot the quantile normalized average *beta* values pooled from all positive controls against probe length, number of CG dinucleotides within each probe, melting temperature and GC content. Individual plots for each positive control sample are given in Supplementary Materials. Melting temperature is computed according to Wei *et al.* (2008), whereas GC content is the percentage of C and G nucleotides for a given probe. As evident from Figure 3, the observed methylation level is influenced by the sequence and thermodynamic properties of the probes. Sequence length bias in methylation arrays is also observed in Lynch *et al.* (2009). Since melting temperature is a function of GC content and sequence length, and GC content is highly correlated with the number of CG dinucleotides, our modeling framework will incorporate sequence length and GC content. In gene expression and protein–DNA binding arrays, intensities exhibit an increasing trend with GC content because GC pairs form three hydrogen bonds compared with two in AT pairs. However, the decreasing trend observed in methylation arrays can be attributed to the loss of binding efficiency for a probe with more CG dinucleotides, because the CG dinucleotides within a probe are expected to have similar methylation status. Although the number of CG dinucleotides yields a more straightforward interpretation, we choose GC content, which has more distinct values, for better function approximation.
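The two sequence attributes that do not require a thermodynamic model can be computed directly from the probe sequence. The sketch below is illustrative only: the function names and the probe sequence are hypothetical, and GC content is returned as a fraction rather than a percentage.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def cg_dinucleotides(seq: str) -> int:
    """Number of CG dinucleotides in seq (5' to 3')."""
    return seq.upper().count("CG")


# Hypothetical 8-mer, far shorter than a real 41-59 bp probe.
probe = "ACGTCGGA"
print(len(probe), gc_content(probe), cg_dinucleotides(probe))
```

GC content takes many more distinct values across probes than the CG dinucleotide count, which is restricted to a handful of integers; this is the property the text relies on when preferring GC content for function approximation.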

## 3 METHODS

### 3.1 Estimating *L*_{j} and *GC*_{j} biases

For notational brevity, we denote the average *beta* value and detection *P*-value for each locus *j*, *j* = 1,…, *p* and sample *i*, *i* = 1,…, *n* as β_{ij} and *p*_{ij}, respectively. Let *L*_{j} and *GC*_{j} denote the probe length and GC content for locus *j*, respectively. Since β_{ij} ∈ (0, 1), the beta distribution arises as a natural distribution for modeling the observed β_{ij} (Houseman *et al.*, 2008; Siegmund *et al.*, 2004). However, maximum likelihood estimation of the unknown parameters (α, β) in a beta distribution does not have a closed form and relies on numerical methods (Ji *et al.*, 2005). In this article, we consider an alternative for modeling β_{ij} via a logit transformation,

$$
y_{ij} = \log\left(\frac{\beta_{ij}}{1 - \beta_{ij}}\right).
$$

To avoid a logit transformation of β = 0, we add an ϵ = 10^{−4} to β as

where M̄ and Ū are the average *M* and *U* values across ~30 replicates as mentioned in Section 2. We model

$$
y_{ij} = s_{ij} + h_1(L_j) + h_2(GC_j),
$$

where *s*_{ij} is the true methylation level and *h*_{1}, *h*_{2} characterize the bias arising from sequence length and GC content. We estimate *h*_{1}, *h*_{2} from the positive control samples rather than from the treatment samples themselves. This is to safeguard against removing actual methylation signals due to the potential confounding effect with GC content, i.e. hypomethylation in CpG islands (CG-rich regions) of normal cells (Esteller, 2007; Irizarry *et al.*, 2008). Let

$$
h_1(L_j) = \alpha_1 I(L_j < 44) + \alpha_2 L_j\, I(44 \le L_j < 57) + \alpha_3 I(L_j \ge 57),
$$

$$
h_2(GC_j) = \gamma_1 I(GC_j < 0.4) + \gamma_2 GC_j\, I(0.4 \le GC_j < 0.8) + \gamma_3 I(GC_j \ge 0.8),
$$

under the constraints (1) α_{1} = 44α_{2}, (2) α_{3} = 57α_{2}, (3) γ_{1} = 0.4γ_{2} and (4) γ_{3} = 0.8γ_{2} for continuity at the knots. That is, we model *h*_{1} and *h*_{2} as piecewise constant + linear + constant. The knots are chosen so that the number of observations in {*L*_{j} < 44} and {*L*_{j} ≥ 57} are comparable with the number of observations in {*L*_{j} = *n*} for *n* = 44,…, 56 as well as to avoid over (under)-estimation for large (small) *L*_{j}, and vice versa for *GC*_{j}.
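As a worked check of the continuity constraints, assume each bias function has the form the constraints imply: a constant below the first knot, a term proportional to the covariate between the knots, and a constant above the second knot. Substituting α_{1} = 44α_{2}, α_{3} = 57α_{2}, γ_{1} = 0.4γ_{2} and γ_{3} = 0.8γ_{2} then collapses each function to a single free slope applied to a clipped covariate:

$$
h_1(L_j) = \alpha_2 \min\{\max(L_j, 44), 57\}, \qquad
h_2(GC_j) = \gamma_2 \min\{\max(GC_j, 0.4), 0.8\}.
$$

Under this reading, *h*_{1} takes the value 44α_{2} at and below *L*_{j} = 44 and 57α_{2} at and above *L*_{j} = 57, so there are no jumps at the knots, and only α_{2} and γ_{2} need to be estimated from the positive controls. The constraint values 0.4 and 0.8 also indicate that GC content enters as a fraction here.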

In Figure 4, we plot the corrected *ỹ*_{ij} against the attributes, where

$$
\tilde{y}_{ij} = y_{ij} - \hat{h}_1(L_j) - \hat{h}_2(GC_j)
$$

for the positive control. As evident from this figure, the biases associated with these four attributes are reduced substantially.
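The correction step of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: ϵ is assumed to be added directly to the beta value before the logit, the bias functions are written in the clipped single-slope form implied by the continuity constraints, and the slope values `ALPHA2` and `GAMMA2` are made-up numbers standing in for estimates from positive controls.

```python
import math

EPS = 1e-4  # offset from the text; assumed here to be added to beta directly


def logit_beta(beta: float, eps: float = EPS) -> float:
    """Logit of (beta + eps); requires beta + eps < 1."""
    b = beta + eps
    return math.log(b / (1.0 - b))


def clip(x: float, lo: float, hi: float) -> float:
    return min(max(x, lo), hi)


def h1(length: float, alpha2: float) -> float:
    """Length bias: constant below 44 bp, linear on [44, 57), constant above."""
    return alpha2 * clip(length, 44, 57)


def h2(gc: float, gamma2: float) -> float:
    """GC bias: constant below 0.4, linear on [0.4, 0.8), constant above."""
    return gamma2 * clip(gc, 0.4, 0.8)


def corrected(beta, length, gc, alpha2, gamma2):
    """Bias-corrected logit methylation level for one observation."""
    return logit_beta(beta) - h1(length, alpha2) - h2(gc, gamma2)


# Hypothetical slopes (negative: signal decreases with length and GC content).
ALPHA2, GAMMA2 = -0.02, -1.5
y_raw = logit_beta(0.85)
y_corr = corrected(0.85, length=52, gc=0.65, alpha2=ALPHA2, gamma2=GAMMA2)
print(y_raw, y_corr)
```

Because the slopes are estimated once from positive controls and then held fixed, the same `corrected` call can be applied to any treatment sample measured on the same panel.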

### 3.2 A weighted model-based approach

One of the most common applications of DNA methylation is in identifying subgroups with distinct methylation patterns (Christensen *et al.*, 2009; Houseman *et al.*, 2008; Marsit *et al.*, 2009; Shen *et al.*, 2009; Siegmund *et al.*, 2004) via unsupervised clustering techniques. Numerous clustering methods have been developed, including non-parametric (e.g. agglomerative hierarchical clustering) and model-based approaches. Model-based clustering assumes that the data are generated from a finite mixture model, in which each mixture component corresponds to a cluster. It has emerged as a popular technique and allows for statistical inference (e.g. selecting the number of clusters and estimating membership probabilities) to be carried out (Fraley and Raftery, 2002; Siegmund *et al.*, 2004). Let *y*_{i} = (*y*_{i1},…, *y*_{ip})^{T} be the vector of logit-transformed *beta* values for sample *i*. We assume that *y*_{i} is generated from a mixture of *K* multivariate normal distributions. Specifically,

$$
y_i - h_1(\mathbf{L}) - h_2(\mathbf{GC}) \sim \sum_{k=1}^{K} \pi_k N_p(\mu_k, \Sigma_k),
$$

where **L** = (*L*_{1},…, *L*_{p})^{T}, **GC** = (*GC*_{1},…, *GC*_{p})^{T} and *h*_{1}, *h*_{2} are pre-estimated from positive control samples as shown in Section 3.1. Define *ỹ*_{i} = *y*_{i} − *h*_{1}(**L**) − *h*_{2}(**GC**). Let **θ**_{k} = (π_{k}, μ_{k}, Σ_{k}) be the unknown parameters. The mixture model log-likelihood function for the whole data is given by

$$
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k N_p(\tilde{y}_i; \mu_k, \Sigma_k) \right\}.
$$

As pointed out in Section 2, standard preprocessing steps for Illumina methylation assays include a quality control of probe measurements by excluding probes that have detection *P*-values (*p*_{ij}) larger than an arbitrary cutoff. This step results in missing observations and information loss by discarding a subset of probes. To avoid using a hard threshold for quality assessment, we would like to assign a weight to each sample which reflects its quality rather than discarding a sample completely from the analysis. We introduce an alternative model-based clustering based on a weighted likelihood approach. The objective of a weighted likelihood function is to assign a different weight to each sample, in which samples with higher weights have more influence in estimating the mixture parameters for cluster structure inference. The weighted mixture model log-likelihood function is given by

$$
\ell_w(\boldsymbol{\theta}) = \sum_{i=1}^{n} w_i \log \left\{ \sum_{k=1}^{K} \pi_k N_p(\tilde{y}_i; \mu_k, \Sigma_k) \right\},
$$

where *w*_{i} is the weight of sample *i*. Without loss of generality, we assume ∑_{i=1}^{n} *w*_{i} = 1. The detection *P*-values arise as a natural weight function, since samples with large detection *P*-values are less reliable. A possible choice of weight function is *w*_{i} = median_{j}(log *p*_{ij})/∑_{i=1}^{n} median_{j}(log *p*_{ij}) ∈ (0, 1). Note that the criterion in Marsit *et al.* (2009), which excluded samples that consist of ≥25% observations with detection *P*-values ≥ 1 × 10^{−5}, is a special case obtained by defining *w*_{i} = *I*[*Q*3_{j}(*p*_{ij}) < 1 × 10^{−5}]/∑_{i=1}^{n}*I*[*Q*3_{j}(*p*_{ij}) < 1 × 10^{−5}], where *Q*3 is the third quartile.
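Both weight functions are simple to compute from the matrix of detection *P*-values. The sketch below is illustrative: the function names and the toy matrix are hypothetical, the `floor` argument guarding against log(0) is an added assumption, and the third quartile uses the inclusive interpolation rule of Python's `statistics.quantiles`, which is one of several common quartile conventions.

```python
import math
from statistics import median, quantiles


def sample_weights(pvals, floor=1e-300):
    """w_i = median_j(log p_ij) / sum_i median_j(log p_ij).

    pvals[i][j] is sample i, locus j. P-values are floored to avoid log(0).
    """
    med = [median(math.log(max(p, floor)) for p in row) for row in pvals]
    total = sum(med)
    if total == 0.0:
        raise ValueError("all median log P-values are zero")
    return [m / total for m in med]


def marsit_weights(pvals, cut=1e-5):
    """Indicator weights: equal weight for samples whose third quartile of
    detection P-values is below cut, zero weight otherwise."""
    keep = [1.0 if quantiles(row, n=4, method="inclusive")[2] < cut else 0.0
            for row in pvals]
    total = sum(keep)
    if total == 0.0:
        raise ValueError("no sample passes the cutoff")
    return [k / total for k in keep]


# Toy matrix: 2 samples x 3 loci (made-up values).
pvals = [
    [1e-10, 1e-8, 1e-6],  # reliable sample
    [1e-4, 1e-2, 0.5],    # noisy sample
]
w = sample_weights(pvals)
w_hard = marsit_weights(pvals)
print(w, w_hard)
```

The log-median weights down-weight the noisy sample without removing it, whereas the indicator weights reproduce the hard exclusion rule by assigning it a weight of zero.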

Weighted model-based clustering has been shown to outperform the non-weighted method in both simulations and real datasets from remote sensing images in geology (Richards *et al.*, 2009). In addition, Seo *et al.* (2004) showed that detection *P*-value weighting in computing Pearson's correlation coefficient improved the performance of expression profiling in Affymetrix microarrays. The mixture modeling framework can be recast in an expectation–maximization (EM) framework for estimating the unknown parameters **θ**_{k}. We introduce *z*_{ik} to be the unobserved latent indicator variable taking value 1 if sample *i* belongs to cluster *k* and 0 otherwise. The complete weighted log-likelihood function is

$$
\ell_c(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_i z_{ik} \left\{ \log \pi_k + \log N_p(\tilde{y}_i; \mu_k, \Sigma_k) \right\}.
$$

In clustering DNA methylation profiles for identifying subgroups among the *n* samples, we are faced with the well-known ‘large p, small n’ problem. One strategy is to apply dimension reduction, e.g. principal component analysis (PCA) to the *p* loci, followed by clustering on the reduced space. However, treating dimension reduction and clustering as two separate steps may destroy the cluster structure in the data (Raftery, 2003; Wang and Zhu, 2008). Moreover, each principal component is a linear combination of all CpG loci, and does not allow for automatic selection of important CpG loci. Therefore, our goal is to incorporate variable selection into the model-based clustering approach, so that important CpG loci (variable selection) and subgroups among the *n* samples (clustering) are identified simultaneously based on a penalized criterion (Pan and Shen, 2007; Wang and Zhu, 2008). We consider a penalized complete weighted log-likelihood to achieve the goal:

$$
\ell_p(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_i z_{ik} \left\{ \log \pi_k + \log N_p(\tilde{y}_i; \mu_k, \Sigma_k) \right\} - \lambda J(\boldsymbol{\Omega}), \tag{1}
$$

where **Ω** = {μ_{kj}, *k* = 1,…, *K*; *j* = 1,…, *p*} and *J*(**Ω**) is a penalty function. Several choices of penalty functions are available, e.g. Pan and Shen (2007) proposed an *L*_{1}-norm penalty function which takes the form *J*(**Ω**) = ∑_{k=1}^{K} ∑_{j=1}^{p}|μ_{kj}|. As pointed out by Wang and Zhu (2008), however, there is a natural group structure among μ_{kj}'s, i.e. for each *j*, we can treat μ_{kj}, *k* = 1,…, *K* as a group since they are associated with the same CpG locus. The *L*_{1}-norm penalty function ignores this group structure and treats μ_{kj} individually. As a result, it tends to keep many unimportant loci in the model.
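The tendency of an entry-wise *L*_{1} penalty to retain loci can be illustrated with soft-thresholding, the standard closed-form effect of an *L*_{1} penalty on a quadratic objective. The sketch below is a toy illustration, not the update rule of any cited method: the threshold `t` is a hypothetical stand-in for the λ-dependent shrinkage amount, the cluster means are made up, and the data are assumed to be centered so that a locus is uninformative only when its mean is 0 in every cluster.

```python
def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward 0 by t; return 0 when |x| <= t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def l1_shrink(means, t):
    """means[k][j] is cluster k, locus j; each entry is shrunk on its own."""
    return [[soft_threshold(m, t) for m in row] for row in means]


def dropped_loci(means):
    """Loci whose mean is exactly 0 in every cluster."""
    n_clusters, n_loci = len(means), len(means[0])
    return [j for j in range(n_loci)
            if all(means[k][j] == 0.0 for k in range(n_clusters))]


# Hypothetical unpenalized cluster means: K = 2 clusters, p = 3 loci.
means = [
    [0.90, 0.05, 0.30],
    [-0.80, -0.02, 0.05],
]
shrunk = l1_shrink(means, t=0.1)
print(shrunk, dropped_loci(shrunk))
```

In this example the third locus has one of its two means shrunk to 0 but stays in the model because the other survives. A penalty acting on the whole group of means for a locus decides on all *K* entries jointly, which is the motivation for the group structure discussed above.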

To circumvent this problem, Wang and Zhu (2008) proposed a penalty function that incorporates group information and shrinks μ_{kj}'s more effectively. In addition, some loci have large detection *P*-values across all samples and are unreliable. Therefore, we would like to impose a heavier penalty on these loci. We achieve this by introducing *g*_{j} to be the weight of locus *j*, where larger *g*_{j} values indicate more reliable probes. One possible choice is *g*_{j} = median_{i}(log *p*_{ij})/∑_{j=1}^{p} median_{i}(log *p*_{ij}) ∈ (0, 1) (here we take the median across samples, cf. *w*_{i}: median across loci). We generalize the penalty function proposed by Wang and Zhu (2008) by including the detection *P*-values as follows:

where μ_{kj} = γ_{j}θ_{kj}, 's are the unpenalized estimates of cluster means and α is a non-negative tuning parameter. Under this penalty function, loci with large detection *P*-values will be assigned a higher penalty, and are more likely to be excluded in the variable selection. λ is a tuning parameter that controls the sparsity, i.e. small (large) λ results in the selection of more (fewer) CpG loci. Additional details on the proposed penalty function are given in Supplementary Materials. We further assume that Σ_{k} = Σ = diag(σ_{1}^{2},…, σ_{p}^{2}) as in Wang and Zhu (2008). That is, the covariance matrices are the same across different clusters and are diagonal, a common assumption for a high dimension and small sample size problem. Further theoretical justifications for adopting the diagonal structure of the covariance matrix are provided by Bickel and Levina (2004).

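At the *t*-th iteration of the EM algorithm, the *E*-step computes, at the current parameter estimates,

$$
\hat{z}_{ik} = \frac{\hat{\pi}_k N_p(\tilde{y}_i; \hat{\mu}_k, \hat{\Sigma})}{\sum_{l=1}^{K} \hat{\pi}_l N_p(\tilde{y}_i; \hat{\mu}_l, \hat{\Sigma})}
$$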

for *i* = 1,…, *n* and *k* = 1,…, *K*.

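The *M*-step involves maximizing Equation (1) with respect to (π_{k}, μ_{kj}, σ_{j}^{2}). Following the derivations in Wang and Zhu (2008) with modifications to incorporate the weights *w*_{i}, *g*_{j} and that ∑_{i=1}^{n} *w*_{i} = 1,

$$
\hat{\pi}_k = \sum_{i=1}^{n} w_i \hat{z}_{ik}, \qquad
\hat{\sigma}_j^2 = \sum_{i=1}^{n} \sum_{k=1}^{K} w_i \hat{z}_{ik} \left( \tilde{y}_{ij} - \hat{\mu}_{kj} \right)^2.
$$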

The estimates for γ_{j} and θ_{kj} are not trivial, since the penalty function is singular at the origin. However, similar to the derivation in Wang and Zhu (2008), we can update the estimates of γ_{j} and θ_{kj} iteratively by the following explicit formulas, which make our method easy to implement in practice:

where and . The *E*-step and *M*-step are iterated till convergence.
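The interplay of the sample weights with the *E*- and *M*-steps can be sketched for the unpenalized case. The code below is a simplified illustration, not the LumiWCluster implementation: it sets λ = 0 (so the cluster means have a plain weighted-average update and no loci are shrunk), assumes a common diagonal covariance, and uses made-up data, weights and starting values. It also assumes that no cluster becomes empty during the iterations.

```python
import math


def log_normal_diag(y, mu, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(y, mu, var))


def e_step(Y, pi, mu, var):
    """Posterior membership probabilities z[i][k] (weights do not enter)."""
    Z = []
    for y in Y:
        logs = [math.log(pi[k]) + log_normal_diag(y, mu[k], var)
                for k in range(len(pi))]
        top = max(logs)  # subtract the maximum for numerical stability
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        Z.append([u / total for u in unnorm])
    return Z


def m_step(Y, Z, w):
    """Weighted updates with lambda = 0; w must sum to 1."""
    n, p, K = len(Y), len(Y[0]), len(Z[0])
    pi = [sum(w[i] * Z[i][k] for i in range(n)) for k in range(K)]
    mu = [[sum(w[i] * Z[i][k] * Y[i][j] for i in range(n)) / pi[k]
           for j in range(p)] for k in range(K)]
    var = [sum(w[i] * Z[i][k] * (Y[i][j] - mu[k][j]) ** 2
               for i in range(n) for k in range(K)) for j in range(p)]
    return pi, mu, var


def weighted_em(Y, w, mu, n_iter=50, var_floor=1e-6):
    """Run a fixed number of EM iterations from initial means mu."""
    K, p = len(mu), len(Y[0])
    pi, var = [1.0 / K] * K, [1.0] * p
    for _ in range(n_iter):
        Z = e_step(Y, pi, mu, var)
        pi, mu, var = m_step(Y, Z, w)
        var = [max(v, var_floor) for v in var]  # guard against zero variance
    return pi, mu, var, Z


# Toy data: 6 samples, 2 loci, two well-separated groups (made-up values).
Y = [[0.1, 0.0], [-0.1, 0.2], [0.0, -0.1],
     [3.0, 3.1], [2.9, 3.0], [3.1, 2.8]]
w = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]  # hypothetical sample weights, sum to 1
pi_hat, mu_hat, var_hat, Z = weighted_em(Y, w, mu=[[0.0, 0.0], [3.0, 3.0]])
print(pi_hat, mu_hat, var_hat)
```

With unequal weights the estimated mixing proportions follow the total weight in each group rather than the sample counts, and down-weighted samples contribute less to the cluster means and variances. Setting all weights to 1/*n* recovers ordinary EM for a Gaussian mixture with a common diagonal covariance.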

As in Pan and Shen (2007) and Wang and Zhu (2008), we choose the tuning parameter λ and the number of clusters *K* by minimizing the Bayesian information criterion (BIC). To account for the weights *w*_{i} in the likelihood functions in which ∑_{i=1}^{n} *w*_{i} = 1, we define a modified BIC as follows:

where *P* is the total number of non-zero estimates in , and . The first term on the right-hand side is exactly the regular −2∑_{i=1}^{n} loglik when *w*_{i} = 1/*n*, ∀*i*. We name our method LumiWCluster (Il*Lumi*na *W*eighted model-based *Cluster*ing).

## 4 RESULTS

### 4.1 DNA methylation studies in ovarian cancer

We illustrate our proposed method on the ovarian epithelial carcinoma tumors and cell lines methylation dataset from Houshdaran *et al.* (2009). In Figure 5, we show the presence of sequence bias in the positive control sample in this dataset. The observed pattern is consistent with our six positive control samples (Fig. 3). Using the estimated *h*_{1} and *h*_{2} from our six positive controls, we adjust for the observed bias in the ovarian methylation dataset. Although *h*_{1} and *h*_{2} are estimated from an independent source of data, we show that the effect of sequence bias is reduced significantly for the ovarian methylation dataset (Fig. 6).

This ovarian cancer dataset consists of 27 primary tumors (15 serous, 9 endometrioid and 3 clear cell) and 15 cell lines. By applying our proposed weighted clustering approach (LumiWCluster) to this dataset, the optimal number of clusters chosen is *K* = 4. Details are provided in Supplementary Materials. We also compare the clustering results from a Gaussian mixture model without penalty and weights, i.e. *J*(**Ω**) = 0, *w*_{i} = 1/*n* ∀*i*, *g*_{j} = 1/*p* ∀*j*. The optimal number of clusters chosen is *K* = 3. We refer to this model as GMM-nopenalty. This model is a special case of Mclust by Fraley and Raftery (2002). Mclust is a non-penalized standard model-based clustering which allows for different functional forms of the covariance matrices, where all CpG loci are retained in the resulting clustering. We also include the clustering results from Mclust, *k*-means and PAM (partitioning around medoids, a more robust version of *k*-means) (Kaufman and Rousseeuw, 1990), which identify *K* = 4, 2 and 2 as the optimal number of clusters, respectively. The optimal number of clusters chosen by *k*-means and PAM is based on the ‘silhouette’ criterion (Rousseeuw, 1987). As shown in Table 1, *k*-means and PAM are unable to separate cell lines from primary tumors, whereas LumiWCluster, GMM-nopenalty and Mclust yield comparable cluster memberships in which cell lines are separated from primary tumors, although there is some mixing among the three tumor subtypes. However, an advantage of LumiWCluster is that it automatically shrinks 554 CpG loci to 0, which implies that these loci do not contribute to the clustering. This refines the set of CpG loci which are important in the resulting cluster structure. These 554 CpG loci include the eight sites which have median detection *P* > 0.05 and are excluded using the filtering criterion in Marsit *et al.* (2009). In Supplementary Materials, we also demonstrate that the ability of LumiWCluster to select important CpGs yields tighter clusters. In addition, LumiWCluster results in a smaller BIC compared with GMM-nopenalty and Mclust, indicating a better model fit. We provide additional information on the advantages of incorporating the detection *P*-values in our proposed clustering approach in Supplementary Materials.

### 4.2 DNA methylation studies in HCV-cirrhosis

Our next example is a dataset measuring DNA methylation of hepatocellular carcinoma (HCC) (Archer *et al.*, 2010) from the Illumina GoldenGate Methylation BeadArray. This dataset consists of 20 samples from HCC with cirrhosis (C) and 20 normal liver tissues (N). Similar to Section 4.1, we compare the clustering results from LumiWCluster, GMM-nopenalty, Mclust, *k*-means and PAM on the normalized data (corrected for sequence length and GC content biases). LumiWCluster selects *K* = 2, whereas the rest of the methods select *K* = 3 as the optimal number of clusters. LumiWCluster automatically shrinks 639 CpG loci to zero and results in perfect separation between the C and N samples. In contrast, the other methods misclassify some of these samples (Table 2). In Supplementary Materials, we also provide the clustering results for the other methods when fixing *K* = 2. Unlike LumiWCluster, these methods still misclassify the two different types of samples.

Next, we run GMM-nopenalty on the subset of CpG loci selected by LumiWCluster [referred to as ‘GMM-nopenalty (with subset of CpGs selected by LumiWCluster)’ in Table 2]. Interestingly, the optimal number of clusters is chosen to be 2 and results in perfect classification. This highlights that LumiWCluster is able to select informative CpG loci that differentiate cirrhosis from normal tissues.

To further illustrate the advantage of LumiWCluster in selecting important CpG loci, we carry out the non-parametric Wilcoxon rank-sum test for comparing cirrhosis and normal group samples on each of the 1505 CpG loci. The *P*-values are adjusted using the false discovery rate (FDR) control (Benjamini and Hochberg, 1995). At an FDR of 0.05, 597 loci are significant, of which 578 overlap with the 866 loci selected by LumiWCluster. This shows that LumiWCluster is able to retain statistically significant loci. As a comparison, 19 CpG loci have median detection *P* > 0.05 and would be excluded using the filtering criterion in Marsit *et al.* (2009). Among these 19 loci, 15 were shrunk to 0 by LumiWCluster. Figure 7A and B shows the distribution of the β values for two CpG loci that were not shrunk to 0, which appear to be informative in differentiating HCC with cirrhosis from normal liver tissues. We also include two CpG loci that have median detection *P* > 0.05 and were shrunk to 0 by LumiWCluster (Fig. 7C and D). This again demonstrates the ability of LumiWCluster to select informative CpGs which can differentiate the two groups despite being completely unsupervised.
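The FDR adjustment applied to the per-locus test *P*-values can be sketched as follows. This is a generic implementation of the Benjamini and Hochberg (1995) step-up adjustment, with a hypothetical function name and made-up *P*-values; it does not reproduce the per-locus Wilcoxon tests themselves.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P-values for five loci.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
significant = [j for j, q in enumerate(adj) if q <= 0.05]
print(adj, significant)
```

Loci whose adjusted value is at most 0.05 are declared significant at an FDR of 0.05; the resulting index set is what would be compared with the loci retained by the clustering.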

## 5 DISCUSSION

The delineation of DNA methylation patterns is important in understanding how these epigenetic changes might lead to aberrant expression patterns and disease (Laird, 2010). Advancements in biotechnology have enabled high-throughput profiling of DNA methylation, including the Illumina GoldenGate and Infinium BeadArray via bisulfite conversion. These platforms are robust, highly reproducible and require less starting material. In the first part of this study, we illustrated the sources of bias present in Illumina Methylation arrays and proposed a model for correcting these biases.

A common approach in analyzing Illumina Methylation data includes omitting CpG loci and samples that exhibit detection *P*-values larger than an arbitrary cutoff. This hard thresholding step often results in missing observations and information loss by discarding a subset of probes. We proposed a weighted model-based approach called LumiWCluster that weights each CpG locus/sample by its detection *P*-values for clustering DNA methylation profiles. In this article, we set the weights using the median log detection *P*-values across samples (or CpG loci), a choice that appears to perform well in the two case studies. Optimal selection of weight functions is beyond the scope of this article and will be an interesting future research direction.

## ACKNOWLEDGEMENTS

We thank Drs Nancy Thomas, Kathleen Conway and Sharon Edmiston for providing the control samples and the reviewers for valuable comments.

*Funding*: NCI (grants 5-P30-CA16086-34 and 1R21CA134368-01A1 to P.K. and H.C., in part); Susan G. Komen Foundation (grant KG081397 to P.K. and H.C., in part).

*Conflict of Interest*: none declared.

## REFERENCES

- Archer K, et al. High-throughput assessment of CpG site methylation for distinguishing between HCV-cirrhosis and HCV-associated hepatocellular carcinoma. Mol. Genet. Genomics. 2010;283:341–349. [PMC free article] [PubMed]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
- Bibikova M, et al. High-throughput DNA methylation profiling using universal bead arrays. Genome Res. 2006;16:383–393. [PMC free article] [PubMed]
- Bickel P, Levina E. Some theory for Fisher's linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010.
- Christensen B, et al. Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genet. 2009;5:e1000602. [PMC free article] [PubMed]
- Down T, et al. A Bayesian deconvolution strategy for immunoprecipitation based DNA methylome analysis. Nat. Biotechnol. 2008;26 [PMC free article] [PubMed]
- Dunning M, et al. Spike-in validation of an Illumina-specific variance-stabilizing transformation. BMC Res. Notes. 2008a;1:18. [PMC free article] [PubMed]
- Dunning M, et al. Statistical issues in the analysis of Illumina data. BMC Bioinformatics. 2008b;9:85. [PMC free article] [PubMed]
- Esteller M. Cancer epigenomics: DNA methylomes and histone-modifications maps. Nat. Rev. Genet. 2007;8:286–298. [PubMed]
- Fraley C, Raftery A. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002;97:611–631.
- Hernandez-Vargas H, et al. Hepatocellular carcinoma displays distinct DNA methylation signatures with potential as clinical predictors. PLoS One. 2010;5:e9749. [PMC free article] [PubMed]
- Houseman A, et al. Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distribution. BMC Bioinformatics. 2008;9:365. [PMC free article] [PubMed]
- Houshdaran S, et al. DNA methylation profiles of ovarian epithelial carcinoma tumors and cell lines. PLoS ONE. 2009;5:e9359. [PMC free article] [PubMed]
- Illumina. GoldenGate methylation cancer panel I. 2006 Available at http://www.illumina.com/technology/goldengate_methylation_assay.ilmn.
- Irizarry R, et al. Comprehensive high-throughput arrays for relative methylation (CHARM) Genome Res. 2008;18:780–790. [PMC free article] [PubMed]
- Irizarry R, et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 2009;41:178–186. [PMC free article] [PubMed]
- Ji Y, et al. Applications of beta-mixture models in bioinformatics. Bioinformatics. 2005;21:2118–2122. [PubMed]
- Kaufman L, Rousseeuw P. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience; 1990.
- Koga Y, et al. Genome -wide screen of promoter methylation identifies novel markers in melanoma. Genome Res. 2009;19:1462–1470. [PMC free article] [PubMed]
- Laird P. Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet. 2010;11:191–203. [PubMed]
- Lynch A, et al. Considerations for processing and analysis of GoldenGate-based two-colour Illumina platforms. Stat. Methods Med. Res. 2009;18:437–452. [PubMed]
- Marsit C, et al. Epigenetic profiling reveals etiologically distinct patterns of DNA methylation in head and neck squamous cell carcinoma. Carcinogenesis. 2009;30:416–422. [PMC free article] [PubMed]
- Pan W, Shen X. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 2007;8:1145–1164.
- Raftery A. Discussion of “Bayesian clustering with variable selection and transformation selection” by Liu et al. Bayesian Stat. 2003;7:266–271.
- Rakyan V, et al. An integrated resource for genome-wide identification and analysis of human tissue-specific differential methylated regions (tDMRs). Genome Res. 2008;18:1518–1529. [PMC free article] [PubMed]
- Richards J, et al. Weighted model-based clustering for remote sensing image analysis. Comput. Geosci. 2009;14:125–136.
- Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20:53–65.
- Seo J, et al. Interactively optimizing signal-to-noise ratios in expression profiling, project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays. Bioinformatics. 2004;20:2534–2544. [PubMed]
- Shen R, et al. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25:2906–2912. [PMC free article] [PubMed]
- Siegmund K, et al. A comparison of cluster analysis methods using DNA methylation data. Bioinformatics. 2004;20:1896–1904. [PubMed]
- Wang S, Zhu J. Variable selection for model-based high dimensional clustering and its application to microarray data. Biometrics. 2008;64:440–448. [PubMed]
- Wei H, et al. A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res. 2008;36:2926–2938. [PMC free article] [PubMed]
- Wong W, et al. On the necessity of different statistical treatment for Illumina BeadChip and Affymetrix GeneChip data and its significance for biological interpretation. Biol. Direct. 2008;3 [PMC free article] [PubMed]
- Xie Y, et al. Statistical methods of background correction for Illumina BeadArray data. Bioinformatics. 2009;25:751–757. [PMC free article] [PubMed]

**Oxford University Press**

A statistical framework for Illumina DNA methylation arrays. Bioinformatics. Nov 15, 2010; 26(22):2849.