- Journal List
- Int J Biostat
- PMC2818740

# A Nearly Exhaustive Search for CpG Islands on Whole Chromosomes^{*}

^{*}Correspondence: Fushing Hsieh, Department of Statistics, Mathematical Sciences Building, University of California, Davis, CA 95616; email: ude.sivadcu.dlaw@gnihsuf. Fushing Hsieh is supported in part by the NFS under Grant No: BCS 0527766 and HSD 0826844, and by the NIH under Grant No: 1RO1AG025218-01A2.

## Abstract

CpG islands are genome subsequences with an unexpectedly high number of CG di-nucleotides. They are typically identified using filtering criteria (e.g., G+C% expected vs. observed CpG ratio and length) and are computed using sliding window methods. Most such studies illusively assume an exhaustive search of CpG islands are achieved on the genome sequence of interest. We devise a Lexis diagram and explicitly show that filtering criteria-based definitions of CpG islands are mathematically incomplete and non-operational. These facts imply that the sliding window methods frequently fail to identify a large percentage of subsequences that meet the filtering criteria. We also demonstrate that an exhaustive search is computationally expensive. We develop the Hierarchical Factor Segmentation (HFS) algorithm, a pattern recognition technique with an adaptive model selection device to overcome the incompleteness and non-operational drawbacks, and to achieve effective computations for identifying CpG-islands. The concept of a CpG island “core” is introduced and computed using the HFS algorithm, which is independent from any specific filtering criteria. Upon such a CpG island “core,” a CpG-island is constructed using a Lexis diagram. This two-step computational approach provides a nearly exhaustive search for CpG islands that can be practically implemented on whole chromosomes. In a simulation study realistically mimicking CpG-island dynamics through a Hidden Markov Model we demonstrate that this approach retains very high sensitivity and specificity, that is, very low rates of false positives and false negatives. Finally, we apply the HFS algorithm to identify CpG island cores on human chromosome 21.

## 1. Introduction

In the human genome, the CpG di-nucleotide (a cytosine followed by a guanine) is distinctive in the sense that its observed frequency is less than one fifth of its expected overall frequency. Furthermore, CpGs are characterized by unusual distributional patterns of sparsity and aggregation along the chromosomes. Importantly, CpGs are associated with methylation status (Bird, 1986). In particular, segments of the genome with aggregated CpG di-nucleotides, socalled CpG islands, receive very intensive attention from biological, biomedical and evolutionary researchers because of their functional properties in relation to genomics and cancer development. Though the underlying generating mechanism for CpG islands by and large still remains mysterious, the spatial patterns of CpGs in the genome are currently attributed to methylation at the 5′-end of the cytosine, followed by conversion of cytosine (C) to thymine (T) via deamination. Thus the majority of CpG islands are clusters of non-methylated CpGs that are protected from this process.

Researchers have established in the last decade that CpG islands indeed play many important functional roles. They are confirmed to be frequently located close to transcription start sites, and are involved in the regulation of gene expression (Saxonov et al, 2006). Based on these associations, many researchers have used CpG islands to predict gene promoters (Bajic et al, 2004), and have linked changes in the methylation status of CpG islands to cancer development (Ushijima, 2005; Opavsky et al, 2007).

Theoretically speaking, all association studies of CpG islands rely on a complete set of well-defined CpG islands on the genome sequence under study. This is because statistical inference about associations and promoter predictions may be seriously biased when based on an incomplete set of CpG islands. This requirement in fact has promoted the development of many computational algorithms that attempt to achieve an exhaustive search for CpG islands by using machine learning and other statistical methodologies (Bock et al, 2006, 2007). Many of the CpG island finder algorithms in the literature are available online.

Typically these algorithms employ a sliding window computation to scan through a target genome sequence for segments that satisfy some subjective choice of filtering criteria based on information about G+C%, expected vs. observed CpG ratio, and length. Among the many versions, the most popular one was proposed by Gardiner-Garden and Frommer (1987), and later modified by Takai and Jones (2002). The simplicity of these filtering criteria seems to mislead many researchers explicitly or implicitly to assume or anticipate the exhaustive search capability of their CpG island finder algorithms.

However this exhaustive search issue has not yet been rigorously addressed in the literature. Unfortunately, counter to their assumptions and anticipations, most CpG island finder algorithms are in fact not equipped with such a capability. This fact can be perceived from the following incoherence: Different CpG finders, based on the same set of filtering criteria and genome sequences, very commonly produce very different CpG island predictions. These incoherent identifications with significant non-overlapping proportions indicate that these finders may miss a significant proportion of true CpG islands. For instance, we empirically estimated that the missing proportion is likely to be as high as 50% for the algorithm used in Bock et al (2006, 2007), when applied to human chromosome 21. Similarly, the algorithm used in Yamada et al (2004), can have very low sensitivity as well.

On the other hand, some CpG island finders have reasonably good sensitivity, such as the CpGIE algorithm proposed in Wong and Leung (2004). But algorithms in this class are computationally practical only for relatively short genome sequences. This shows another essential aspect of the exhaustive search issue.

In this paper we point out that the incoherence among CpG island finders and their high percentages of missed islands are all caused by the fundamental issue: no operational definition of a CpG island is available, or equivalently, CpG island definitions are mathematically incomplete. By mathematically incomplete definition of a CpG island we mean that it is not possible to ascertain precisely whether a genome location is either inside or outside of a CpG island. This incompleteness crucially undermines the computational feasibility of any exhaustive search directly based on filtering criteria. We explicitly demonstrate the mathematical incompleteness through a technique we developed here, called a Lexis diagram. Further, since a lengthier genome sequence is subject to more severe effects of mathematical incompleteness, the Lexis diagram reveals that an exhaustive search using sliding window computations for all subsequences satisfying a particular set of filtering criteria is computationally infeasible.

To avoid the effects of mathematical incompleteness of CpG island definitions, and at the same time to resolve the computational infeasibility issue, we propose a CpG island identification approach in two steps. In the first step, a pattern recognition technique with an adaptive model selection device, called Hierarchical Factor Segmentation (HFS), is developed to identify a collection of genome segments with aggregating CpG di-nucleotides, called “cores” of CpG islands. In the second step, by centering at each identified CpG island core, we then compute and identify a CpG island using the Lexis diagram pertaining to any specific choice of filtering criteria.

The reasons why our approach is free from the issue of mathematical in-completeness are the follows. Within Step 1, no filtering criteria are involved in the computations of the HFS algorithm when it is applied on the whole genome sequence. After the HFS algorithm identifies CpG island cores along the genome sequence, the Lexis diagram is used in Step 2 to overcome potential “local” mathematical incompleteness issues within the vicinity of each CpG island core. Furthermore, the reason why our approach can achieve a nearly exhaustive search for CpG islands is that the HFS algorithm can efficiently compute and identify segments which have a higher density of CpG di-nucleotides in contrast to its neighboring segments. The adaptive model selection device built into the HFS algorithm is designed to balance the total number of cores (more cores being closer to an exhaustive search) against the computational complexity of the algorithm.

This paper is organized as follows. The mathematical incompleteness of the filtering criteria and construction of the Lexis diagram are discussed in Section 2. The HFS algorithm and its adaptive model selection device are developed in Section 3. In Section 4, our approach is demonstrated and evaluated through real data analysis of an illustrative genome sequence and human chromosome 21. In Section 5, our approach is shown to have very high sensitivity and specificity in identifying CpG islands under simulated genome dynamics based on the Hidden Markov Model. Some related issues and problems are collected and addressed in the discussion section. The resultant collection of CpG island cores on the whole chromosome 21 is provided in a supplementary file.

## 2. Incompleteness of filtering criteria via Lexis diagram

In this section, we first show how to construct a Lexis diagram to accommodate a set of filtering criteria. Then, based on the constructed Lexis diagram, we explicitly point out the mathematical incompleteness pertaining to this filtering criteria and discuss its implications.

Denote the genome sequence of interest
${X}_{n}$ = {*X** _{i}*}, with

*X*

*{*

_{i}*A, G, T, C}*at the

*i*th base pair (bp),

*i*= 0,...,

*n.*A CpG dinucleotide on ${X}_{n}$ is an occurrence of a

*C*followed by

*G.*The filtering criteria is given as follows (see Gardiner-Garden and Frommer (1987) and Takai and Jones (2002)):

**[CpG filtering criteria]**- [
**GF-1**] GC content ≥ 50%; - [
**GF-2**] Ratio*O/E*of observed*(O)*to expected*(E)*number of CpG dinucleotides in ${X}_{n}$ is ≥ 0.6; - [
**GF-3**] Length ≥ 200 bp (or 500 bp).

The threshold values 50%, 0.6 and 200 (or 500) used in the criteria are subjective choices. Different choices are used in different studies. The expected number *(E)* of CpGs is computed as *n* times the product of the frequencies of *G* and *C* in
${X}_{n}$ *(i.e.,* assuming independence). Heuristically, identifying CpG islands means to screen through a genome sequence for all segments satisfying all three conditions.

The Lexis diagram is a technique originally used to represent lifetime data with staggered entries. In the algorithm below we adapt this diagram to represent runs of genomic sites satisfying the filtering criteria.

**[Left-to-Right] version of the Lexis diagram****[Initialize**] Let*i*= 0.**[LR-1**] Sequentially compute the two characteristic values [GF-1] (GC content) and [GF-2]*(O/E*ratio) on segments*{X*_{j}*: j*[*i, k*]}for*k =*200,...,*n;***[LR-2**] Consider the 45° line crossing the horizontal axis at*i.*Mark the point*(i*+*k, k*) along this line whenever both [GF-1] and [GF-2] are satisfied on the subsegment [*i*,*i*+*k*] of ${X}_{n}$;**[LR-3**] Repeat steps [LR-1] and [LR-2] starting from*X*with_{i}*i*= 1,...,*n –*200.

A [Right-to-Left] Lexis diagram can be similarly constructed with *X** _{n}* as the starting point and performing all computations in the opposite direction.

Figure 1(a) shows a genome sequence marked with CpG dinucleotides. [Left-to-Right] Lexis diagrams for the two subsegments marked with green brackets are given in Figure 1(b) (right-most subsegment) and Figure 1(c) (left-most subsegment). These diagrams illustrate the sparsity and clustering of CpG di-nucleotides. Specifically, a marked point with coordinate *(i* + *k, k*), (*k* > 200), in the *n×n* lattice coordinate system indicates that the subsegment *[i, i + k]* of
${X}_{n}$ satisfies [GF-1] and [GF-2] with length *k.* For example, marked points *b*_{1} = (407,406), *b*_{2} = (440, 423) and *b*_{3} = (699, 547) in Figure 1(b) correspond to the subsegments [1, 407], [17, 440] and [152, 699], all satisfying the first two filtering conditions with lengths being equal to 406, 423 and 547, respectively. Here we note that both GC content and *O/E* are sensitive to starting location and length. For example, marked points *b*_{4} = (500, 347) in Figure 1(b) and *c*_{1} = (358, 305) in Figure 1(c) indicate that subsegment [153, 500] of genome segment #1 and segment [53, 358] of genome subsegment #2 do not satisfy [GF-1] and [GF-2]. However if we either extend both subsegments a little bit in length or move the starting point slightly to the right then the two conditions are satisfied.

**Lexis diagram,**(a) Actogram representation of a genome sequence with each CpG occurrence marked by a “bar”; (b) Lexis diagram for first segment marked by green square brackets; (c) Lexis diagram for second marked segment.

In addition, we observe wiggle patterns on both the lower-left and upper-right ends of the Lexis diagram, as shown in Figure 1(b,c). We see many subsegments satisfying the filtering criteria. Some are contained by the others, such as the marked subsegment corresponding to *c*_{3} = (699, 613) which contains the one marked by *c*_{4} = (690, 577). In contrast, the two subsegments marked by *b*_{2} and *b*_{3} are overlapping, but neither contains the other. We remark that there will be many similar examples in the [Right-to-Left] Lexis diagram, though here we focus on the [Left-to-Right] diagram.

These observations together point out that many genome locations have an undetermined status with regard to being inside or outside a CpG island. This is an important observation of the filtering criteria. Here we call it mathematical incompleteness. An immediate implication of this incompleteness is that the above filtering criteria is not operational for CpG island identification in the sense that there does not exist an algorithmic prescription that can be used to unambiguously define CpG islands using the filtering criteria. Therefore *ad hoc* human judgement is needed in order to define CpG islands. Alternatively, one can define a site as being in a CpG island if it is in any subsegment satisfying the criteria. In this case, an exhaustive search like that illustrated in the Lexis diagram is necessary.

The exhaustive search requires *O(n*^{2}*)* calculations. For a relatively short genome sequence with *n* ≈ 10^{5}, the computation is of order 10^{10}. Typical mammalian chromosomes are on the order of a million bp long. Thus, an exhaustive search is infeasible on a single processor. We note that the calculations can be done in parallel, with steps [[LR-1]] and [[LR-2]] computed for each *i* on a separate processor. With order *n* processors, the search could theoretically be reduced to *O(n).* However, most currently available computer clusters typically have only tens or hundreds of nodes.

The Lexis diagram also reveals the reason why CpG island finder algorithms based on sliding window computations have low sensitivity. We illustrate two cases in Figure 1(b). If 500 bp is used as the threshold value in [GF-3], then the subsegment marked by *b*_{2} would not be identified by a sliding window technique with window size 500. If the window size is 200 instead, then the moving window technique would pick up the subsegment [1, 407] as marked by *b*_{1}. Then if 408 bp is taken as the starting point, the second cluster on genome segment #1 is likely to be missed. If the two clusters were a little bit further apart, then it is likely that the sliding window technique would miss both of them. This is not an uncommon case.

Hence, from the perspectives of incompleteness and implied computability, an indirect method for finding CpG islands is clearly needed. One indirect approach proposed in this paper consists of two steps. The first step is based on a pattern recognition technique, called the Hierarchical Factor Segmentation (HFS) algorithm, as discussed in the next section. This HFS algorithm identifies a collection of CpG island “cores,” where the CpG dinucleotide is aggregated, without involving any filtering criteria. In the second step, upon each computed CpG island core, a CpG island is computed using a chosen filtering criteria on the target genome sequence.

## 3. HFS approach: identifying CpG cores

The Hierarchical Factor Segmentation (HFS) algorithm was originally proposed in Fushing et al (2006), for analyzing animal behavioral time series. It was used there as a pattern recognition platform generically called “non-parametric decoding.” The version of the HFS algorithm employed here consists of three coding levels, so it is denoted by ^{3}*HFS* throughout this paper. A schematic illustration of this algorithm is given in Figure 2. Intuitively, the HFS algorithm performs a CpG di-nucleotide intensity change-point analysis without prior knowledge of the number of changes along the genome sequence. The algorithm is described as follows.

- [
^{3}*HFS***algorithm:**] - [
^{3}*HFS***-1.**] The genome sequence ${X}_{n}$ is transformed into a base 2 digital string using the following coding scheme: code 1 replaces the value of*X*when a CpG ends at time_{i}*i*, otherwise code 0 replaces*X*. This first level 0-1 digital string is denoted by the code sequence $\mathcal{\text{C}}$_{i}^{0}. - [
^{3}*HFS***-2.**] Construct a histogram, say $\mathcal{\text{H}}$*, of the event recurrence time distribution from $\mathcal{\text{C}}$^{0}, and denote the sequence of recurrence times (inter-event spacing) by ${\stackrel{\sim}{R}}_{m}$. - [
^{3}*HFS***-3.**] Choose the first threshold value ħ_{0}on $\mathcal{\text{H}}$* to transform the ${\tilde{R}}_{m}$ sequence into a 0*-l* digital string via the second level coding scheme: (1) a recurrence time less than ħ_{0}is coded by a 0* and (2) a recurrence time greater than or equal to ħ_{0}is coded by a 1*. The resultant base 2 digital string is denoted by the code sequence $\mathcal{\text{C}}$*. - [
^{3}*HFS***-4.**] Upon code sequence $\mathcal{\text{C}}$*, we take code word 1* as another “new” event and construct its corresponding recurrence time histogram $\mathcal{\text{H}}$^{@}and the sequence of inter-l*-event spacing as ${\tilde{R}}^{@}$. - [
^{3}*HFS***-5.**] Choose the second threshold value ħ_{*}on $\mathcal{\text{H}}$^{@}to transform ${\tilde{R}}^{@}$ into another digital string using the top-level coding scheme: (1) an inter-l*-event spacing less than ħ_{*}is coded by 0^{@}; (2) an inter-l*-event spacing greater than ħ_{*}is coded by a 1^{@}. The third base 2 digital string is denoted by $\mathcal{\text{C}}$^{@}. - [
^{3}*HFS***-6.**] The resultant code sequence $\mathcal{\text{C}}$^{@}is mapped back onto the time series $\mathcal{\text{C}}$^{0}or ${X}_{n}$ as a partition of | $\mathcal{\text{R}}$^{@}|(=*m*^{(3)}) segments on [1,*n*]. Denote this partition by ${N}^{(3)}({X}_{n})={\{[{N}_{Li}^{(3)},\hspace{0.17em}\hspace{0.17em}{N}_{Ri}^{(3)})\}}_{i=1}^{m}$.

The ^{3}*HFS* segmentation
$\mathcal{\text{N}}$^{(3)}(
${X}_{n}$) will separate the high CpG-intensity segments from the low CpG-intensity ones. A segment, say
$[{N}_{Li}^{(3)},\hspace{0.17em}{N}_{Ri}^{(3)})$, corresponding to the l^{@}-code, is a period of time points falling in between two widely separated l*-codes. The wide separation of two successive l*-codes implies that there are many 1-codes on the segment of code sequence
$\mathcal{\text{C}}$^{0}, or equivalently, that many CpG events are observed on the segment of
${X}_{n}$. Thus this segment is a cluster of CpGs. We call a segment corresponding to 1^{@}-odes a “core” of a potential CpG island. In contrast, a neighboring segment, say
$[{N}_{Ri}^{(3)},\hspace{0.17em}{N}_{Li+1}^{(3)})$, corresponding to 0^{@}-codes (including segments corresponding to the two l*-codes on both of its ends) would present an extended period consisting of many 0-codes, but very few 1-codes. Such a segment contains only sparse CpGs.

Thus, given (ħ_{0}, ħ_{*}), the ^{3}*HFS* algorithm can very efficiently locate all cores of CpG islands on
${X}_{n}$, even for whole chromosomes. Since the total length of this collection of cores is much smaller than the length of
${X}_{n}$, the construction of a collections of Lexis diagrams on short segments centered at each identified core become computationally feasible. In the next two subsections we discuss an adaptive model selection device for choosing thresholds values (ħ_{0}, ħ_{*}*).*

## 3.1. Extracting information of CpG dynamics via mixture analysis

¿From the point of view of pattern recognition, the segmentation
$\mathcal{\text{N}}$^{(3)}(
${X}_{n}$) of
${X}_{n}$ can also be viewed as a solution to the CpG intensity change-point analysis problem that makes no assumptions about prior knowledge of the number of changes. Since this solution critically depends on the choice of threshold values (ħ_{0}, ħ_{*}), we propose to choose such threshold values by employing an adaptive model selection technique, which will be developed in the next subsection. By adaptive here we mean to choose a suitable penalty function in a likelihood-based model selection procedure such that the resultant thresholds can coherently reflect information about the CpG dynamics embedded in the original genome sequence
${X}_{n}$. An apparent fact which is very relevant to the unknown CpG dynamics is the CpG-intensities inside and outside the CpG islands contained in
${X}_{n}$. For simplicity, we assume that all CpG islands share one common intensity *λ*_{1}, while non-CpG island segments share another common intensity *λ*_{0}. We expect that *λ*_{1} > *λ*_{0}. We extract this intensity information using the maximum entropy principle as follows.

With a given
${X}_{n}$, we know the number *m* of CpG di-nucleotides it contains. We have the average recurrence time *n/m* as an estimate of the mean recurrence time of CpG di-nucleotides. Given this average recurrence time and under the maximum entropy principle, the recurrence time of CpG di-nucleotides along
${X}_{n}$ are independently distributed and follow a geometric distribution *G(·; λ*_{1}*)* inside CpG islands and a geometric distribution *G(·; λ*_{0}) for the remaining bulk of DNA.

Based on these geometric distributions and the independence among recurrence times, we derive the likelihood for geometric mixture analysis of the collection of all recurrence times by collapsing their order. Denote the sequence of recurrence times of CpGs observed in
${X}_{n}$ by
${\tilde{R}}_{m}=\{{R}_{1},\dots ,{R}_{m}\}$. The state-space sequence pertaining to
${\tilde{R}}_{m}$ is denoted by
${\tilde{S}}_{m}=\{{S}_{1}^{R},\dots ,{S}_{m}^{R}\}$. That is, if *R** _{i}* is from a CpG island
${S}_{i}^{R}=1$ then

*R*

_{i}*–*1 is distributed according to

*G*(·;

*λ*

_{1}). Otherwise, ${S}_{i}^{R}=0$

*(R*

*is not in a CpG island), and*

_{i}*R*

_{i}*–*1 is distributed according to

*G*(·;

*λ*

_{0}). The mixture likelihood of

*ϕ*=

*(π*

_{1}

*, λ*

_{1}

*, λ*

_{0}

*)*is computed as:

Denote
${M}_{1}^{*}=\Sigma {S}_{i}^{R}=m-{M}_{0}^{*}$. Then the log-likelihood of *ϕ* is:

To find the MLE for *ϕ*, we apply the EM-algorithm with
${\tilde{S}}_{m}$ as the missing data. The expected log-likelihood in the E-step at the *j*th iteration is:

where we have the so-called responsibility of the state *s* = 1 for the *i*th datum

The M-step at the *j*th iteration involves the following computations:

- Estimate the mixture proportion$${\pi}_{1}^{(j)}=\sum {\gamma}_{i}^{(j)}/m;$$
- Estimate the geometric distribution parameters, for
*s*= 0, 1,where ${\mu}_{s}^{(j)}$ is the mean recurrence time in the$$\begin{array}{c}{\lambda}_{s}^{(j)}=\text{log}\frac{{\mu}_{s}^{(j)}}{{\mu}_{s}^{(j)}-1}\\ {\mu}_{1}^{(j)}=\frac{1}{m}\sum \frac{{\gamma}_{i}^{(j)}}{{\pi}_{1}^{(j)}}({R}_{i}-1),\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}{\mu}_{0}^{(j)}=\frac{1}{m}\sum \frac{1-{\gamma}_{i}^{(j)}}{1-{\pi}_{1}^{(j)}}({R}_{i}-1)\end{array}$$*s*-state.

Let
$({\widehat{\pi}}_{s}^{(M)},\hspace{0.17em}\hspace{0.17em}{\widehat{\lambda}}_{s}^{(M)},\hspace{0.17em}\hspace{0.17em}{\widehat{\mu}}_{s}^{(M)})$ denote the MLE estimates of the mixture proportion, geometric intensity and mean recurrence time parameters, respectively. We can compute the sample Fisher information matrix *Î*_{0} of *ψ* = *(π*_{1}*, λ*_{1}*, λ*_{0}*)** ^{′}* using a variant of the missing information matrix proposed in Louis (1982). See McLachlan and Krishnan (1997) and Meilijson (1989) specifically for the geometric mixture case.

## 3.2. Finding the optimal threshold parameters via adaptive model selection criterion

Next we develop our adaptive model selection methodology as follows. Assuming that the segmentation
$\mathcal{\text{N}}$^{(3)}(
${X}_{n}$) using the HFS algorithm achieves the separation of the CpG island and non-CpG island segments. Then the recurrence time within all
$[{N}_{Li}^{(3)},\hspace{0.17em}{N}_{Ri}^{(3)})$ segments of 1^{@} is G(· : *λ*_{l}) distributed, while recurrence time in the remaining
$[{N}_{Ri}^{(3)},\hspace{0.17em}{N}_{Li+1}^{(3)})$ segments is distributed as G(· : *λ*_{0}). Therefore, any point of
$\left\{[{N}_{Li}^{(3)}\right\}$ and
$\left\{{N}_{Ri}^{(3)})\right\}$ separating 0^{@} and 1^{@} segments can be defined as an intensity change-point. Denote the number of change-points by *ω*(*ħ*_{0}, *ħ*_{*}). Let
${\left\{{R}_{1j\text{'}}\right\}}_{j\text{'}=1}^{{m}_{1}}$ and
${\left\{{R}_{0j}\right\}}_{j=1}^{{m}_{0}}$ denote the observed recurrence times within identified core and non-core segments, respectively, where *m*_{0} + *m*_{1} *= m* is the length of
${\tilde{R}}_{m}$ in step [^{3}*HFS-*2*.]* of the algorithm.

Next, we compute the likelihood function of (*λ*_{0}, *λ*_{l}) based on the observed recurrence times within the two classes of segments as:

where *P*_{0}*(λ*_{s}*)* = (1 – *e ^{−λs}*),

*s*= 1, 0. Note that this estimation of (

*λ*

_{0},

*λ*

_{1}) is independent of the estimation in the mixture geometric analysis in Section 3.1. With

*L*(

*λ*

_{0},

*λ*

_{1}; ${X}_{n}$), the maximum likelihood estimates of (

*λ*

_{0},

*λ*

_{1}) are derived as follows:

where ${\overline{c}}_{1}=\frac{\Sigma {R}_{1j\text{'}}}{{m}_{1}}$ and ${\overline{c}}_{0}=\frac{\Sigma {R}_{0j}}{{m}_{0}}$ are the average recurrence times on core and non-core segments, respectively.

In the null case when no segmentation is imposed on
${X}_{n}$, the maximum entropy principle implies that the the CpG recurrence time distribution is again geometric. Denote this geometric distribution G(· : *λ*) with maximum likelihood estimator
$\text{G}(\cdot \hspace{0.17em}:\hspace{0.17em}\widehat{\lambda})=L(\widehat{\lambda}|\sigma )={P}_{0}^{m}(\widehat{\lambda}){e}^{-\widehat{\lambda}(n-m)}$ with
$\widehat{\lambda}=\text{log}\hspace{0.17em}\frac{\overline{c}}{\overline{c}-1}$, * = n/m,* and *m* = *m*_{0} + *m*_{1}. Since the number of events on
${X}_{n}$ is a fixed constant, the estimator *L(*|*σ)* is the maximum entropy distribution.

Consider the null hypothesis that there is no aggregation of CpGs in ${X}_{n}$. A reasonable and effective test statistic is the log-likelihood ratio

where *l*_{0} + *l*_{1}, = *n*, and *l*_{0} *=* ∑*k*_{0}* _{j}* and

*l*

_{1}

*=*∑

*k*

_{1}

*are the total lengths of the two kinds of segments.*

_{j′}To make use of Δ^{0}*(ħ*_{0}*, ħ*_{*}*)* for selecting an optimal threshold parameter *(ħ*_{0}*, ħ*_{*}*)* in the ^{3}*HFS* algorithm, we perform the following optimization:

where *k* is the penalty for adding one extra parameter. For instance, *k* = 1 is the AIC model selection criterion, and *k* = log *m* corresponds to the BIC criterion. Empirically, we know that *k* = 1 would be likely to give rise to too many cores (0^{@}), while *k* = log *m* tends to produce too few cores. Hence, we propose the adaptation step as the choice of *k* for which the ^{3}*HFS* segmentation based on threshold values (
${\tilde{\u0127}}_{0}^{o(k)}$,
${\tilde{\u0127}}_{*}^{o(k)}$) provides a pair of intensity estimates, say (
${\widehat{\lambda}}_{1}^{(k)}$,
${\widehat{\lambda}}_{0}^{(k)}$), that is closest to (
${\widehat{\lambda}}_{1}^{(M)}$,
${\widehat{\lambda}}_{0}^{(M)}$) derived in the geometric mixture analysis. Specifically, we choose the integer *k* minimizing the following sum of signal-to-noise (S-N) ratios:

Denote the corresponding optimal threshold parameters by (
${\tilde{\u0127}}_{0}^{o({k}^{o})}$,
${\tilde{\u0127}}_{*}^{o({k}^{o})}$) and let the segmentation
${N}^{(3)}({X}_{n}|({\tilde{\u0127}}_{0}^{o({k}^{o})},\hspace{0.17em}\hspace{0.17em}{\tilde{\u0127}}_{*}^{o({k}^{o})}))$ refer to the output of the ^{3}*HFS* algorithm performed using these thresholds.

### 3.3. Local exhaustive search: identifying CpG islands

We suggest the following heuristic for identifying CpG islands based on a set of identified cores. We first define an extended core as a union of all cores which are < 250 bp apart, where 250 bp is an empirically chosen *ad hoc* threshold. Consider a genome subsegment centered at an extended core and extending 250 bp to the left from the left endpoint of the extended core. Construct the Left-to-Right Lexis diagram on this genome subsegment, using a chosen filtering criteriona. The Lexis diagram produces a set of predicted CpG islands.

## 4. Applications of HFS algorithm on real genome sequences

In this section, we demonstrate how the ^{3}*HFS* algorithm (with optimal penalty and threshold parameter values) can be applied to identify CpG island cores in genome sequences {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1 and human chromosome 21.

### 4.1. An illustrative example

We first apply the ^{3}*HFS* algorithm to the human genome sequence {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1, shown in Figure 3. This sequence was analyzed by Wang and Leung (2004) using their CpG island finder CpGIE with filtering criteria: (i) GC content ≥ 50%, (ii) *O/E* ≥ 0.6, and (iii) length ≥ 500 bp. Although the CpGIE algorithm does not quite resemble the Lexis diagram, it provides an intensive search for CpG islands in relatively short genome sequences. Hence, CpGIE is a good benchmark for the performance of our ^{3}*HFS* algorithm. To quantify the relative performance of ^{3}*HFS,* we use the percentage of CpG islands identified by CpGIE that overlap with cores identified by the ^{3}*HFS* algorithm.

**...**

Geometric mixture analysis of {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1 results in MLE estimates
${\widehat{\lambda}}_{0}^{(M)}=0.02178$,
${\widehat{\lambda}}_{1}^{(M)}=0.07839$, and _{1} = 0.6802. The optimal penalty is *k** = 4 and the corresponding optimal threshold values are
$({\tilde{\u0127}}_{0}^{o({k}^{o})},\hspace{0.17em}\hspace{0.17em}{\tilde{\u0127}}_{*}^{o({k}^{o})})=(56,\hspace{0.17em}\hspace{0.17em}15)$. Heuristically, this pair of threshold values indicate that the optimal CpG island “core” in this genome sequence contains at least 15 CpG di-nucleotides separated by no more than 56 bp.

In Figure 4(a), the collection of CpG island cores identified by ^{3}*HFS* are marked on the sequence {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1 with green brackets, and the CpGIE CpG island predictions are shown in red curly brackets. The proportion of ^{3}*HFS* cores overlapping CpGIE hits is 11/12. The Lexis diagram of the one CpGIE-predicted CpG island that does not contain a ^{3}*HFS* core is a single point diagram. This can also be seen through the rather uniformly distributed CpGs in this CpG island. Figure 4(b) is a plot of all identified cores on the two-dimensional coordinate system “GC-content × O/E ratio”. All cores satisfy the [GF-1] and [GF-2] conditions, except one. This core is contained in a long genome subsegment that does satisfy the conditions.

**...**

In Figure 5, we illustrate how CpG islands can be identified from ^{3}*HFS* cores. We found a genome subsegment of length 385 bp satisfying [GF-1] and [GF-2]. Similarly, all 4 cores that do not overlap a CpGIE CpG island produce subsegments of varying length less than 500 bp that satisfy [GF-1] and [GF-2].

In summary, this example has shown that CpG islands identified by the CpGIE algorithm are very likely to overlap ^{3}*HFS* cores. Hence, we can empirically conclude that our ^{3}*HFS* algorithm should provide a nearly exhaustive search for CpG islands on whole chromosomes. We confirm this performance through a computer experiment in the next section. It is also noted that CpGIE is among the most computationally expensive CpG island finders, and hence can only be applied to short genome sequences. ^{3}*HFS* is a competitive alternative that can be run on much longer sequences. Furthermore, the set of CpG islands predicted by CpGIE is somewhat different from the set identified by Lexis diagrams in regions near ^{3}*HFS* cores (with the same filtering criteria). This indicates that our approach may provide complementary information, making it useful even on short sequences.

### 4.2. CpG island cores on human chromosome 21

Next we apply the ^{3}*HFS* algorithm to 15 sequences that make up human chromosome 21. The resulting cores are reported in a supplementary file containing the genome locations, GC content, and *O/E* ratio for each core. The descriptive statistics for the latter two are summarized in Figure 6(a). Figure 6(b) shows that the rate of CpGs varies considerably among the 15 sequences, rising from left to right on the right arm of the chromosome in both predicted CpG islands (high intensity regions) and the non-islands (low intensity regions). This variation suggests that using one constant triplet of threshold values for the filtering criteria [GF-l]-[GF-3] throughout the whole chromosome may not be ideal biologically.

## 5. Numerical evaluations of HFS algorithm

In this section, we examine the performance of the ^{3}*HFS* algorithm on simulated data. It is not clear what model should be used to generate CpG island dynamics along genome sequences. Here we tentatively employ a hidden Markov model (HMM) with two states: CpG island and non-CpG island. The emission variables are the recurrence times of CpGs in the two states, which are assumed to be geometrically distributed as
$G(\cdot ;{\widehat{\lambda}}_{0}^{(M)})$ in the non-CpG island state and
$G(\cdot ;{\widehat{\lambda}}_{1}^{(M)})$ in the CpG island state. We use the values
${\widehat{\lambda}}_{0}^{(M)}=0.022$ and
${\widehat{\lambda}}_{1}^{(M)}=0.078$ from the illustrative example in the previous section.

The unobservable Markov random variable is specified by the 2 × 2 transition matrix: *p*_{00} *= p*_{0} = 1 – *p*_{01} and *p*_{11} *= p*_{1} = 1 – *P*_{10} We further specify *p*_{0} and *p*_{1} through the following equations. Let *K*_{0} be equal to the threshold value of CpG island length in [GF-3], and set

where *υ*_{1} is the expected number of consecutive sites in the CpG island state and _{1} is the expected length of a random variable distributed as
$G(\cdot ;{\widehat{\lambda}}_{1}^{(M)})$. Hence *K*_{0} is close to the expected length (in bp) of a CpG island. This specification ensures that this HMM can give rise to CpG islands with expected length *K*_{0}. In our computer experiment, we take *K*_{0} = 500 and calculate
${\widehat{\mu}}_{1}=\frac{1}{1-{e}^{-{\widehat{\lambda}}_{1}^{(M)}}}=13.26$, *υ*_{1} *= K*_{0}*/*_{1} = 37.70.

The estimated mixture proportion vector (_{0}, _{1}), with _{0} + _{1} = 1, is the stationary probability distribution corresponding to the HMM transition matrix. We define

and set _{1} = 0.6802, based again on the illustrative example. Subsequently, we solve for *p*_{1} = 0.9735 from the equation
${\upsilon}_{1}=\frac{2{p}_{1}-{p}_{1}^{2}}{(1-{p}_{1})}$. Then *p*_{0} = 0.9435 is computed using the equation
$\frac{{\widehat{\pi}}_{0}}{{\widehat{\pi}}_{1}}=\frac{1-{p}_{1}}{1-{p}_{0}}$.

For each replication, this HMM is employed to generate a 0-1 state-space vector of length 2000, which in turn is used to generate an event-time series around 45,000 bp long using the emission variables. The ^{3}*HFS* algorithm is applied to each event-time series to decode the state-space vector. Performance is summarized in terms of the rate of CpG islands (of varying length) overlapping with computed cores (true positives) and the rate of computed cores falling in non-CpG island segments (false positives).

We performed 100 replications in our computer experiment. Performance of ^{3}*HFS* across the 100 data sets is summarized in Figure 7. The overall sensitivity is quite high (Fig. 7(a)), even for relatively short CpG islands: conditional sensitivity is 100% for islands over 350 bp and 95% for islands 150 to 350 bp. The specificity of the ^{3}*HFS* algorithm is also very high, as shown in Figure 7(b). Out of 4,046 cores computed in the 100 replications, only 315 cores (< 7%) fall in non-CpG island segments. On the average, in one replication there are only 3 cores falling in non-CpG island segments (constituting about 35,000 bp of the 45,000 bp simulated sequence). The conditional false positive rate is as low as 1% for cores longer than 400 bp.

An illustrative simulation is shown in Figure 8. We see that almost all simulated CpG islands (in red) overlap cores computed using ^{3}*HFS,* except for two rather short islands. There are 3 cores falling within non-CpG island genome segments about 35,000 bp long. It should be noted that these are not false positive CpG island predictions, as one would conduct an exhaustive search in each extended core to make CpG island predictions. The optimal penalty is 5 times the AICs penalty and the threshold values are (59,10).

In summary, the ^{3}*HFS* algorithm has very high sensitivity and specificity, especially for CpG islands longer than 350 bp. This analysis confirms that our approach has the capacity to produce a nearly exhaustive search for CpG islands.

## 6. Discussion

In this paper we propose a nearly exhaustive search for CpG islands on genome sequences. This two-step methodology is computationally feasible even for whole chromosomes. Globally, the steps of the ^{3}*HFS* algorithm pinpoint nearly all CpG di-nucleotide-rich segments, or CpG island cores, along the genome sequence under study. This pattern recognition technique primarily computes the contrasting sparsity vs. aggregation of CpG di-nucleotides, and is independent of any filtering criteria. Its coding threshold parameter values are chosen to match the mixture information of high and low CpG di-nucleotide intensities by means of an adaptive model selection device. Locally, the step of the Lexis diagram is implemented on every extended genome segment centered around one or several CpG island cores to exhaustively extract all potential CpG islands satisfying a chosen filtering criteria. Hence, from the perspectives of computational feasibility and capability of a nearly exhaustive search, our CpG island search algorithm provides a promising and realistic foundation for unbiased inferences. These inferences may then better facilitate more effective promoter predictions and differential methylation studies.

It is worth emphasizing again that filtering criteria based on a CpG island definition is neither operational, nor mathematically complete. We further illustrate this on the first 7000 bp of the genome segment of {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1, shown in Figure 4(a). One CpG island located at [2109,4220] is identified by the CpGIE algorithm within the segment. Interestingly, the Lexis diagram identifies five different subsegments satisfying the same filtering criteria. They are located at [168, 4771], [599, 4922], [1296, 5151], [1299, 6013] and [1487, 6211], and all contain the CpG island [2109, 4220] identified by CpGIE. Hence the question becomes, which one of these is the CpG island to be used? It is very likely that different researchers might choose differently among the six genome subsegments for distinct functional purposes. This example not only spotlights the problems facing filtering criteria, but more importantly it reveals the merit of Lexis diagrams for a CpG island search.

Next we remark that our computational approach for finding CpG islands, like any filtering criteria-based algorithm, does not bear information about any functional status of CpG islands. Therefore it cannot be used for prediction purposes until proper information is incorporated into the algorithmic computations. For instance, with the increasing availability of methylation data in the literature, a study of the association between genome sequence and methylation can be performed on a collection of CpG island cores extracted from one particular genome sequence. And then, by incorporating these experimental and empirical patterns into our two-step approach, especially the Lexis diagram construction, we may potentially attempt to predict methylation status on other genome sequences. This kind of learning approach could be a promising area for future interdisciplinary research.

Finally we conclude with some comments on modeling CpG island dynamics. A comparison of the genome sequence {"type":"entrez-nucleotide","attrs":{"text":"NT_000874.1","term_id":"5867303","term_text":"NT_000874.1"}}NT_000874.1 in Figure 4(a) with the HMM-generated sequence in Figure 6, suggests that the HMM seems to be a realistic model for CpG island recurrence. However, our analyses on these two sequences results in two different sets of optimal threshold values. Thus we speculate that some dynamic characteristics underlying the real genome sequence still have not been completely captured by the HMM mechanism. More research effort is needed for better understanding of the CpG island dynamics.

## Supplementary Material

#### Supplementary-CpG-core_chr21.xls

A supplemental file of our CpG island cores on Chromosome 21

## References:

- Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nature Biotechnology. 2004;22:1467–1473. doi: 10.1038/nbt1032. [PubMed] [Cross Ref]
- Bird AP. CpG island and the function of DNA methylation. Nature. 1986;321:209–213. doi: 10.1038/321209a0. [PubMed] [Cross Ref]
- Bock C, Paulsen M, Tierling S, Mikeska T, Lengauer T, Walter J. 2006. CpG island methylation in human lymphocytes is highly correlated with DNA sequence, Repeats, and predicted DNA structure PLoS Genet 23e26.http://www.plosgenetics.Org/article/info:doi/10.1371/journal.pgen.0020026Accessed 20 January 2006.10.1371/journal.pgen.0020026 [PMC free article] [PubMed] [Cross Ref]
- Bock C, Walter J, Paulsen M, Lengauer T. CpG island mapping by Epifenome prediction. PLoS Computational Biology. 2007;6:1055–1069. [PMC free article] [PubMed]
- Fushing H, Hwang CR, Lee HC, Lan YC, Horng SB. Testing and mapping non-stationarity in animal behavioural processes: ace case study on an individual female bean weevil. Jour Theor Biol. 2006;238:805–816. doi: 10.1016/j.jtbi.2005.06.031. [PubMed] [Cross Ref]
- Gardiner-Garden M, Frommer M. CpG island in vertebrate genomes. J Mol Biol. 1987;196:261–282. doi: 10.1016/0022-2836(87)90689-9. [PubMed] [Cross Ref]
- Laird PW. Cancer epigenetics. Hum Mol Genet. 2005;14:R65–R76. doi: 10.1093/hmg/ddi113. [PubMed] [Cross Ref]
- Saxonov S, Berg P, Brutlag D. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci U S A. 2006;103:1412–1417. doi: 10.1073/pnas.0510310103. [PMC free article] [PubMed] [Cross Ref]
- Takai D, Jones PA. Comprehensive analysis of CpG island in human chromosomes 21 and 22. Proc Natl Acad Sci U S A. 2002;99:3740–3745. doi: 10.1073/pnas.052410099. [PMC free article] [PubMed] [Cross Ref]
- Ushijima T. Detection and interpretation of altered methylation patterns in cancer cells. Nature Review:Cancer. 2005;5:223–231. doi: 10.1038/nrc1571. [PubMed] [Cross Ref]
- Wong Y, Leung FCC. An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics. 2004;20:1170–1177. doi: 10.1093/bioinformatics/bth059. [PubMed] [Cross Ref]
- Yamada Y, Watanabe H, Miura F, Soejima H, Uchiyama M, Iwasaka T, Mukai T, Sakaki Y, Ito T. A comprehensive analysis of allelic methylation status of CpG islands in human chromosome 21q. Genome Res. 2004;14:247–266. doi: 10.1101/gr.1351604. [PMC free article] [PubMed] [Cross Ref]

**Berkeley Electronic Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (644K) |
- Citation

- CpG island mapping by epigenome prediction.[PLoS Comput Biol. 2007]
*Bock C, Walter J, Paulsen M, Lengauer T.**PLoS Comput Biol. 2007 Jun; 3(6):e110. Epub 2007 May 2.* - Particle swarm optimization with reinforcement learning for the prediction of CpG islands in the human genome.[PLoS One. 2011]
*Chuang LY, Huang HC, Lin MC, Yang CH.**PLoS One. 2011; 6(6):e21036. Epub 2011 Jun 28.* - Comprehensive analysis of CpG islands in human chromosomes 21 and 22.[Proc Natl Acad Sci U S A. 2002]
*Takai D, Jones PA.**Proc Natl Acad Sci U S A. 2002 Mar 19; 99(6):3740-5. Epub 2002 Mar 12.* - Prediction of CpG-island function: CpG clustering vs. sliding-window methods.[BMC Genomics. 2010]
*Hackenberg M, Barturen G, Carpena P, Luque-Escamilla PL, Previti C, Oliver JL.**BMC Genomics. 2010 May 26; 11:327. Epub 2010 May 26.* - Evaluation of single CpG sites as proxies of CpG island methylation states at the genome scale.[Nucleic Acids Res. 2012]
*Barrera V, Peinado MA.**Nucleic Acids Res. 2012 Dec; 40(22):11490-8. Epub 2012 Oct 12.*

- A Nearly Exhaustive Search for CpG Islands on Whole ChromosomesA Nearly Exhaustive Search for CpG Islands on Whole ChromosomesThe International Journal of Biostatistics. Jan 1, 2009; 5(1)

Your browsing activity is empty.

Activity recording is turned off.

See more...