Comput Stat. Author manuscript; available in PMC May 5, 2011.
Published in final edited form as:
Comput Stat. Mar 2010; 25(1): 17–38.
doi:  10.1007/s00180-009-0159-7
PMCID: PMC3087980
NIHMSID: NIHMS238171

Bayesian model-based tight clustering for time course data

Abstract

Cluster analysis has been widely used to explore thousands of gene expressions from microarray analysis and identify a small number of similar genes (objects) for further detailed biological investigation. However, most clustering algorithms tend to identify loose clusters with too many genes. In this paper, we propose a Bayesian tight clustering method for time course gene expression data, which selects a small number of closely-related genes and constructs tight clusters only with these closely-related genes.

Keywords: Bayesian cluster analysis, Tight clustering, Time course gene expression, Microarray

1 Introduction

Clustering methods can be categorized into heuristic and model-based frameworks. Methods in heuristic frameworks identify clusters based on non-probabilistic measures. The K-means (Hartigan and Wong 1979) and hierarchical (Johnson and Wichern 2002) algorithms belong to this framework. Methods in model-based frameworks cluster objects based on probabilistic measures. One of the most popular methods in this framework is the mixture model (Basford and McLachlan 1985; Basford et al. 1997)

f(Y|θ) = ∏_{i=1}^{n} ∑_{k=1}^{K} ξ_k f(Y_i|θ_k),
(1)

where Y_i is the response variable of the ith object, K is the number of components, ξ_k is the mixing probability of component k, and f(Y_i|θ_k) is the probability density function of component k with parameter θ_k. This model has been studied for microarray data analyses in numerous papers (Ghosh and Chinnaiyan 2002; McLachlan et al. 2002; Datta and Datta 2003; Ouyang et al. 2004). More recently, the product partition model (Crowley 1997) has been proposed as an alternative model-based approach using the cluster likelihood:

f(Y|θ, ω) = ∏_{k=1}^{c(ω)} f(Y_{C_k}|θ_k)
(2)

where ω is a fixed unknown partition of the n objects, c(ω) is the number of clusters within ω, C_k is the set of object indices for cluster k, and Y_{C_k} is the data of the objects in cluster k. Note that the partition ω is a free parameter, which should be estimated, and ω determines c(ω). Then, ⋃_{k=1}^{c(ω)} C_k = {1, 2, …, n} and C_i ∩ C_j = ∅ when i ≠ j. These methods assume that the data vectors are partitioned into c(ω) clusters according to ω and that the clusters are independent of each other. While the mixture model (1) is constructed with a known number of clusters, the cluster likelihood (2) contains the partition ω as a parameter to be estimated. In other words, pre-specification of the number of clusters is not needed in the cluster likelihood approach.
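To fix ideas, the cluster likelihood (2) is easy to evaluate numerically once a partition is encoded as one cluster label per object. The following minimal Python sketch (our illustration; the notation and function names are not from the paper) sums per-cluster log-densities over the blocks of a partition; the toy density is only a stand-in for log f(Y_{C_k}|θ_k).

    import numpy as np

    def log_cluster_likelihood(Y, labels, log_f_cluster):
        # Cluster (product partition) log-likelihood of (2): the partition is a
        # vector of integer labels, and log f(Y_Ck | theta_k) is summed over clusters.
        return sum(log_f_cluster(Y[labels == k]) for k in np.unique(labels))

    # Toy stand-in density: i.i.d. N(cluster mean, 1) errors around the cluster mean.
    def toy_log_f(Yk):
        resid = Yk - Yk.mean()
        return -0.5 * resid.size * np.log(2 * np.pi) - 0.5 * np.sum(resid ** 2)

    Y = np.random.default_rng(0).normal(size=(6, 4))   # 6 profiles, 4 time points
    labels = np.array([0, 0, 1, 1, 1, 2])              # a partition with c(omega) = 3
    print(log_cluster_likelihood(Y, labels, toy_log_f))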

As microarray technology has become more widely available, biologists can measure gene expressions repeatedly over time and examine the temporal changes of these expressions. Naturally, many new statistical methods have been developed to cluster genes based on these temporal changes (profiles) in both heuristic (Peddada et al. 2003; Hakamada et al. 2006; Lukashin and Fuchs 2001) and model-based (Schliep et al. 2003; Luan and Li 2003; James and Sugar 2003; Tseng and Wong 2005; Ma et al. 2006; Leng and Muller 2006; Ng et al. 2006) frameworks, as follows.

  1. Heuristic framework: Peddada et al. (2003) grouped profiles into so-called clusters of inequality profiles. For example, suppose a cluster contains profiles with monotonically increasing temporal trends, which can be characterized by inequalities among the means at each time point. Similarly, various types of clusters are pre-specified with inequalities, and profiles are then clustered based on a bootstrap-based criterion. Hakamada et al. (2006) developed a method that clusters profiles based on Euclidean distances. Lukashin and Fuchs (2001) applied the K-means algorithm to temporal profiles.
  2. Model-based framework: Leng and Muller (2006) applied functional discriminant analysis, considering profiles as independent realizations of a smooth stochastic process. Luan and Li (2003), James and Sugar (2003), Ng et al. (2006) and Ma et al. (2006) developed mixtures of mixed-effects models to cluster temporal profiles. In their models, f(Y_i|θ_k) in (1) is specified with a mixed-effects model. In particular, James and Sugar (2003) emphasized the application of their model to sparsely and irregularly measured time course data. The Bayesian objective function approach in Booth et al. (2008) and the hidden Markov model in Schliep et al. (2003) were developed based on the cluster likelihood function (2).
  3. Costa et al. (2004), Thalamuthu et al. (2006) and Ma et al. (2006) compared clustering methods for time course data using simulation studies.

As the main example of this paper, we analyze the corneal wound healing data, in which the expressions of 646 genes are measured twice at 12 irregularly spaced time points, using 24 rats (2 replicates × 12 time points). The goal of this paper is to identify clusters that contain a small number of genes with very similar temporal patterns. We consider that two types of genes, closely-related and weakly-related, are included in a microarray experiment either intentionally or unintentionally. If a gene has a close relationship with any other gene(s), we will call it a closely-related gene. Otherwise, we will call it a weakly-related gene. Usually, among the thousands of genes on a microarray slide, a large portion of genes are weakly-related. These weakly-related genes tend to increase the noise in the search for the optimal partition (or clusters) without providing a significant amount of information (Tseng and Wong 2005). Therefore, direct application of conventional methods will provide large and loose clusters that consist of both closely- and weakly-related genes. However, this is not a desirable result, because biologists typically want to conduct further biological research on a small number of closely-related genes after the gene expressions are explored with microarray analyses. To overcome this problem, Tseng and Wong (2005) proposed the so-called tight clustering method for cross-sectional data. The main idea of Tseng and Wong (2005) is to construct tight clusters only with closely-related genes, which form a small portion of the whole data set. Their algorithm can be summarized as follows (a brief illustrative sketch in code follows the list):

  • Step 1. With the given number of clusters κ, apply the K-means algorithm to resampled subsets (e.g. 70% of the genes) of the original data.
  • Step 2. Construct candidate tight clusters from genes that tend to be grouped together across these resampled subsets.
  • Step 3. Repeat Steps 1 and 2 with different values of κ in a certain range.
  • Step 4. Identify as final tight clusters those that remain stable as κ changes.
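The resampling idea of Steps 1 and 2 can be illustrated with the short sketch below. It is not Tseng and Wong's implementation: it assumes scikit-learn's KMeans and simply records, for every pair of genes, how often the pair lands in the same K-means cluster across resampled 70% subsets; Steps 3 and 4 would repeat this over a range of κ and keep the clusters that remain stable.

    import numpy as np
    from sklearn.cluster import KMeans

    def comembership_frequency(Y, kappa, n_resamples=50, frac=0.7, seed=0):
        # Steps 1-2: run K-means on random subsets of the genes and record how
        # often each pair of genes is assigned to the same cluster.
        rng = np.random.default_rng(seed)
        n = Y.shape[0]
        together = np.zeros((n, n))
        counted = np.zeros((n, n))
        for _ in range(n_resamples):
            idx = rng.choice(n, size=int(frac * n), replace=False)
            labels = KMeans(n_clusters=kappa, n_init=10, random_state=0).fit_predict(Y[idx])
            same = labels[:, None] == labels[None, :]
            counted[np.ix_(idx, idx)] += 1
            together[np.ix_(idx, idx)] += same
        return np.divide(together, counted, out=np.zeros_like(together), where=counted > 0)

Gene pairs whose co-membership frequency stays high across the resamples form the candidate tight clusters of Step 2.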

Performance of this tight clustering method (Tseng and Wong 2005) was demonstrated by Thalamuthu et al. (2006). In this paper, we propose a new tight clustering algorithm for time course gene expression data. Our tight clustering algorithm selects closely-related genes that have high relevance probabilities and then identifies clusters of only closely-related genes using a Bayesian objective function approach (Booth et al. 2008).

In Sect. 2, the Bayesian model and the objective function (Booth et al. 2008) are described in detail with an example from the corneal wound experiment. In Sects. 3 and 4, we discuss a stochastic search algorithm that maximizes the Bayesian objective function and provide simulation studies of this search algorithm. In Sect. 5, we explain how to calculate relevance probabilities and propose the tight clustering algorithm for time course data. To distinguish them from tight clustering, we will call algorithms that do not employ the idea of tight clustering plain clustering methods. In Sect. 6, our tight clustering method is applied to the corneal wound data.

2 Bayesian modelling and objective function

We look at an example from the corneal wound healing data to explain the Bayesian model and the objective function of Booth et al. (2008). In the experiment, 2 replicates of 646 gene expressions were measured at each of 12 time points (days 0, 1, 2, 3, 4, 5, 6, 7, 14, 21, 42 and 98) after a corneal wound. In our cluster analysis, the time variable is relabeled with the orders 1, 2, …, 12. Justification of this relabeling is discussed in Sect. 6. Because we are interested in the temporal changes of gene expressions rather than the magnitudes of the expressions, gene profiles are centered at the mean of each profile for our analysis.

Averages of the two replicates are calculated for each gene at each time point. The average expression profiles of the 646 genes are drawn in Fig. 1. Although the patterns of the profiles are not easily distinguishable, it seems that there are at least two clusters, with increasing and decreasing gene expression patterns toward the right end of the profiles. Also, the temporal patterns are not simple enough to use a parametric regression approach, and there does not seem to be any periodic pattern. Therefore, within the clustering model, we use a penalized regression spline to explain the mean temporal trend of the gene profiles within each cluster.

Fig. 1
Average gene-expressional profiles

Given ω, let θ = (θ_k)_{k=1}^{c(ω)} denote the set of cluster-specific parameter vectors θ_k. Then, the marginal posterior distribution of ω is

π(ω|Y) ∝ ∫ f(Y|θ, ω) π(θ|ω) π(ω) dθ = ∫ ∏_{k=1}^{c(ω)} { f(Y_{C_k}|θ_k) π(θ_k|ω) } π(ω) dθ ≡ Obj(ω),
(3)

where f(Y|θ, ω) is the sampling distribution of the whole data set, f(Y_{C_k}|θ_k) is the cluster-specific sampling distribution, and π(·) denotes a prior distribution. In the analysis of the corneal wound data, f(Y_{C_k}|θ_k) is set to be the normal probability density function whose mean is the cluster-specific penalized regression spline. Booth et al. (2008) propose obtaining the optimal partition ω* that maximizes the marginal posterior probability π(ω|Y). Because the normalizing constant of π(ω|Y) is difficult to calculate, Obj(ω) is used as the actual objective function in optimizing π(ω|Y). In Sect. 2.1, f(Y_{C_k}|θ_k) is specified in detail with the penalized regression spline. In Sect. 2.2, π(θ_k|ω) and π(ω) are specified. Finally, in Sect. 2.3, the Bayesian objective function is calculated for our clustering model.

2.1 Penalized regression spline for profiles within a cluster: f(Y_{C_k}|θ_k)

In our clustering model, all profiles in cluster k are assumed to have a common smooth underlying trend and independent and identically distributed normal errors with a common variance. A detailed explanation of modeling the profiles within a cluster is given as follows. Denote the gene expressions by

Y = (Y_1^T, …, Y_i^T, …, Y_n^T)^T,

where Y_i = (Y_{i1}^T, …, Y_{ij}^T, …, Y_{ir}^T), Y_{ij}^T = (Y_{ij1}, …, Y_{ijt}, …, Y_{ijp}), and Y_{ijt} is the expression of gene i in the jth replicate at time point t. Similarly, define the time variables x_i = (x_{i1}^T, …, x_{ij}^T, …, x_{ir}^T), x_{ij}^T = (x_{ij1}, …, x_{ijt}, …, x_{ijp}) with x_{ijt} = t, and the error terms ε_i = (ε_{i1}^T, …, ε_{ij}^T, …, ε_{ir}^T), ε_{ij}^T = (ε_{ij1}, …, ε_{ijt}, …, ε_{ijp}) with ε_{ijt} ~ N(0, σ_k^2). To explain a temporal profile in cluster k, we use the penalized regression spline:

Y_{ij} = β_{0k} + β_{1k} x_{ij} + ⋯ + β_{qk} x_{ij}^q + ∑_{l=1}^{L} u_{lk} (x_{ij} − τ_l)_+^q + ε_{ij} = Xβ_k + ZU_k + ε_{ij},
(4)

where X = (1, x_{ij}, …, x_{ij}^q), β_k = (β_{0k}, β_{1k}, …, β_{qk})^T, Z = (z_{ij1}, …, z_{ijl}, …, z_{ijL}), z_{ijl} = (x_{ij} − τ_l)_+^q with knots τ_l, m_+ = max(0, m), U_k = (u_{1k}, …, u_{Lk})^T, 0_p is the column vector of p zeros, I_p is the p × p identity matrix, and ε_{ij} ~ MVN(0_p, σ_k^2 I_p). If the profiles form a short time course, such as 3 time points, the regression spline is not recommended; a simple linear or quadratic regression should be used instead. The time variable x_{ij} is the same for every gene i and replicate j in the corneal wound data. For notational simplicity, assume that C_k = {1, …, n_k}. Also, let Y_{C_k} = (Y_1^T, …, Y_{n_k}^T)^T, X_{C_k} = 1_{n_k r} ⊗ X and Z_{C_k} = 1_{n_k r} ⊗ Z, where 1_{n_k r} is a column vector of ones with n_k r elements. Then, within cluster k, the penalized regression spline method estimates the parameters by minimizing

‖Y_{C_k} − X_{C_k}β_k − Z_{C_k}U_k‖^2 + (1/λ^2) ‖U_k‖^2.
(5)

To implement this penalization in the Bayesian framework, two approaches have been widely used. First, the Bayesian lasso approach of Tibshirani (1996) uses double exponential priors for U_k, which makes finding the mode of the posterior distribution equivalent to minimizing (5). Second, Ruppert et al. (2003) suggest using the mixed model, whose BLUP (best linear unbiased prediction) is equivalent to the penalized regression spline estimates. We use the second approach in this paper by changing the fixed effect U_k in (4) to a cluster-specific random effect U_k ~ N(0, λ^2 σ_k^2 I_L), where I_L is the L × L identity matrix.

For data analysis and simulation studies in this paper, we employ a flexible quadratic regression spline by setting q = 2 and L = p − 2 (one knot at each interior time point). Because the cluster memberships are unknown, it is difficult to obtain good prior information on the mean temporal trend of the profiles in each cluster. Therefore, the clustering model should contain a flexible spline function with a large number of knots so that it can explain any temporal trend. Also, to prevent an unnecessarily wiggly fit, we constrain the influence of the knots with the penalty function (Ruppert et al. 2003).
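For concreteness, the design matrices of this quadratic truncated power basis (q = 2, one knot at each interior point of the relabeled time scale 1, …, p) can be built as in the following sketch; the function name and layout are ours, not the paper's.

    import numpy as np

    def spline_design(p=12, q=2):
        # Truncated power basis of (4): X holds (1, x, ..., x^q) and Z holds the
        # terms (x - tau_l)_+^q with one knot tau_l at each interior time point.
        x = np.arange(1, p + 1, dtype=float)        # relabeled time points 1..p
        knots = x[1:-1]                             # L = p - 2 interior knots
        X = np.column_stack([x ** d for d in range(q + 1)])
        Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** q
        return X, Z

    X, Z = spline_design()
    print(X.shape, Z.shape)    # (12, 3) and (12, 10) for p = 12, q = 2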

Let J_r = 1_r 1_r^T. Then, the linear mixed model for cluster k can be expressed as:

Y_{C_k} = X_{C_k}β_k + Z_{C_k}U_k + ε_{C_k} = X_{C_k}β_k + ε*_{C_k},
(6)

where ε_{C_k} = (ε_1^T, …, ε_{n_k}^T)^T ~ N(0, σ_k^2 I_{n_k r p}), ε*_{C_k} ~ N(0, Σ_{C_k}), Σ_{C_k} = σ_k^2 Ω_{C_k} and Ω_{C_k} = I_{n_k} ⊗ I_r ⊗ I_p + λ^2 J_{n_k} ⊗ J_r ⊗ Z Z^T.

Then, within cluster k, the likelihood function is

f(Y_{C_k}|β_k, σ_k^2) = ∫ f(Y_{C_k}, U_k | β_k, σ_k^2) dU_k = (2π)^{−n_k p r/2} |Σ_{C_k}|^{−1/2} exp{ −(1/2) (Y_{C_k} − X_{C_k}β_k)^T Σ_{C_k}^{−1} (Y_{C_k} − X_{C_k}β_k) }.

2.2 Priors: π(ω) and π(β, σ^2|ω)

As for the partition parameter ω, we use Crowley’s prior (Crowley 1997):

π(ω) ∝ ϱ^{c(ω)} ∏_{k=1}^{c(ω)} (n_k − 1)!,
(7)

where n_k is the number of genes in cluster k and ϱ (> 0) is a tuning parameter for the size of the clusters. Large values of ϱ make the prior give high probabilities to partitions with a large number of clusters.

For β = (β_k)_{k=1}^{c(ω)} and σ^2 = (σ_k^2)_{k=1}^{c(ω)}, we use a non-informative prior,

π(β, σ^2|ω) = ∏_{k=1}^{c(ω)} π(β_k, σ_k^2|ω) ∝ ∏_{k=1}^{c(ω)} (1/σ_k^2)^{α+1}.
(8)

2.3 Bayesian objective function: Obj(ω)

The marginal posterior distribution of ω is

π(ω|Y) ∝ ∫ f(Y|β, σ^2, ω) π(β, σ^2|ω) π(ω) dβ dσ^2 = π(ω) ∫ ∏_{k=1}^{c(ω)} f(Y_{C_k}|β_k, σ_k^2) π(β_k, σ_k^2|ω) dβ dσ^2 = π(ω) ∏_{k=1}^{c(ω)} ∫ f(Y_{C_k}|β_k, σ_k^2) π(β_k, σ_k^2|ω) dβ_k dσ_k^2.

Also,

∫ f(Y_{C_k}|β_k, σ_k^2) π(β_k, σ_k^2|ω) dβ_k dσ_k^2 = ∫ (2π)^{−n_k p r/2} |Σ_{C_k}|^{−1/2} exp{ −(1/2)(Y_{C_k} − X_{C_k}β_k)^T Σ_{C_k}^{−1} (Y_{C_k} − X_{C_k}β_k) } × (1/σ_k^2)^{α+1} dβ_k dσ_k^2 = 2^α π^{(q+1−n_k p r)/2} Γ(ν_k/2) S_k^{−ν_k/2} |Ω_{C_k}|^{−1/2} |X_{C_k}^T Ω_{C_k}^{−1} X_{C_k}|^{−1/2},

where ν_k = n_k p r − q − 1 + 2α, S_k = (Y_{C_k} − X_{C_k}β̂_k)^T Ω_{C_k}^{−1} (Y_{C_k} − X_{C_k}β̂_k), and β̂_k = (X_{C_k}^T Ω_{C_k}^{−1} X_{C_k})^{−1} X_{C_k}^T Ω_{C_k}^{−1} Y_{C_k}. Thus, absorbing the factors π^{−n_k p r/2}, whose product over clusters does not depend on ω, into the proportionality constant,

π(ω|Y) ∝ ϱ^{c(ω)} ∏_{k=1}^{c(ω)} (n_k − 1)! 2^α π^{(q+1)/2} Γ(ν_k/2) S_k^{−ν_k/2} |Ω_{C_k}|^{−1/2} |X_{C_k}^T Ω_{C_k}^{−1} X_{C_k}|^{−1/2} ≡ Obj(ω).
(9)

Note that the marginal posterior distribution (9) varies with the scale of Y. For example, when the scale of Y changes to aY

π(ω|aY) / π(ω|Y) ∝ a^{c(ω)((q+1)−2α)}.

Therefore, to make (9) invariant to the scale of Y, we set α = (q + 1)/2.
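The per-cluster factor of (9) can be evaluated on the log scale with dense linear algebra. The sketch below is illustrative only (it is not the authors' C implementation): it assumes the cluster data arrive as an array of shape (genes, replicates, time points), uses the X and Z matrices from the spline sketch in Sect. 2.1, and drops additive constants that do not depend on ω; the defaults lam = 1.3, alpha = (q+1)/2 = 1.5 and log_varrho = 8 merely echo the choices used later for the corneal wound data.

    import numpy as np
    from scipy.special import gammaln

    def log_cluster_term(Yk, X, Z, lam, alpha, log_varrho):
        # Log of one cluster's factor in (9). Yk: shape (n_k, r, p).
        n_k = Yk.shape[0]
        q = X.shape[1] - 1
        y = Yk.reshape(-1)                                   # stacked cluster data Y_Ck
        Xc = np.tile(X, (Yk.shape[0] * Yk.shape[1], 1))      # X_Ck = 1_{n_k r} (x) X
        Zc = np.tile(Z, (Yk.shape[0] * Yk.shape[1], 1))      # Z_Ck = 1_{n_k r} (x) Z
        Omega = np.eye(y.size) + lam ** 2 * (Zc @ Zc.T)      # Omega_Ck
        Oinv = np.linalg.inv(Omega)
        XtOiX = Xc.T @ Oinv @ Xc
        beta_hat = np.linalg.solve(XtOiX, Xc.T @ Oinv @ y)
        resid = y - Xc @ beta_hat
        S_k = resid @ Oinv @ resid
        nu_k = y.size - q - 1 + 2 * alpha
        return (log_varrho + gammaln(n_k)                    # log(varrho) + log((n_k - 1)!)
                + alpha * np.log(2) + 0.5 * (q + 1) * np.log(np.pi)
                + gammaln(nu_k / 2) - (nu_k / 2) * np.log(S_k)
                - 0.5 * np.linalg.slogdet(Omega)[1]
                - 0.5 * np.linalg.slogdet(XtOiX)[1])

    def log_obj(Y, labels, X, Z, lam=1.3, alpha=1.5, log_varrho=8.0):
        # log Obj(omega): sum of per-cluster terms over the blocks of the partition.
        return sum(log_cluster_term(Y[labels == k], X, Z, lam, alpha, log_varrho)
                   for k in np.unique(labels))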

3 Stochastic search for the optimal partition

Because ω is not a continuous variable, many popular numerical search algorithms, such as the Newton–Raphson method, cannot be applied. Also, the space of ω is very large, because the Bell number (B_n, the number of all possible partitions) grows super-exponentially with n. For example, when n is 10, the Bell number is approximately 10^5. When n is 75, the Bell number becomes approximately 10^78, which is a rough estimate of the number of atoms in the observable universe. The corneal wound data set has n = 646 genes, for which the number of possible partitions is, in a practical sense, infinite. Therefore, searching for the optimal partition ω* is a challenging problem.
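The growth of the Bell number is easy to check exactly with the Bell triangle; the short sketch below reproduces B_10 ≈ 10^5 and confirms the order of magnitude of B_100 quoted in Sect. 4.2.

    from math import log10

    def bell_numbers(n):
        # Return [B_1, ..., B_n], computed exactly with the Bell triangle.
        bells, row = [1], [1]
        for _ in range(n - 1):
            new_row = [row[-1]]                  # each row starts with the previous row's last entry
            for v in row:
                new_row.append(new_row[-1] + v)  # and accumulates running sums along the row
            row = new_row
            bells.append(row[-1])                # the last entry of a row is the next Bell number
        return bells

    B = bell_numbers(100)
    print(B[9])                        # B_10 = 115975, roughly 10^5
    print(round(log10(B[99]), 1))      # log10 of B_100, about 115.7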

To search for the optimal partition, Booth et al. (2008) proposed generating random partitions from the posterior distribution π(ω|Y), which is proportional to our objective function Obj(ω), and selecting the partition that has the highest Obj(ω). This algorithm tends to generate many partitions that are close to the mode of the posterior distribution, which makes the search efficient. This is the so-called Markov chain Monte Carlo (MCMC) optimization (Jerrum and Sinclair 1996). If the number of objects is not large, or only a small number of partitions have high posterior probability π(ω|Y), the optimal partition can be found easily. Otherwise, the algorithm may work slowly or may find only a suboptimal partition within a reasonable time. However, in our experience, these suboptimal partitions are close enough to the optimal partition in practical applications.

Here is a detailed description of the MCMC optimization algorithm (Booth et al. 2008). Within the optimization algorithm, a biased random walk (Metropolis–Hastings algorithm) generates a Markov chain of random partitions targeting π(ω|Y). Suppose that the biased random walk is iterated R times. Let ω_i be the partition in the ith iteration, c(ω_i) be the number of clusters in ω_i, ω̃*_i be the partition with the highest value of the objective function during the first i iterations, and ω̃* (= ω̃*_R) be the best partition found by one application of the optimization algorithm. Then, consider the following algorithm (a brief code sketch follows the steps):

Algorithm 1

  • Step 1. Choose an initial partition ω_1 and set ω̃*_1 = ω_1 and i = 1.
  • Step 2. Generate a candidate partition ω′:
    • – If c(ω_i) = 1, select one object uniformly at random out of n and make it a new singleton cluster (one profile in one cluster). This yields two clusters with 1 and n − 1 objects.
    • – If c(ω_i) ≥ 2, select one object uniformly at random.
    • – If the selected object is a singleton cluster, move it to one of the other c(ω_i) − 1 clusters with probability 1/(c(ω_i) − 1).
    • – Otherwise, move it to one of the other clusters with probability 1/c(ω_i) or make it its own singleton cluster with probability 1/c(ω_i).
  • Step 3. Accept ω′ with probability min{1, π(ω′|Y)/π(ω_i|Y)}. If accepted, ω_{i+1} = ω′. Otherwise, ω_{i+1} = ω_i.
  • Step 4. If Obj(ω_{i+1}) > Obj(ω̃*_i), set ω̃*_{i+1} = ω_{i+1}. Otherwise, ω̃*_{i+1} = ω̃*_i. Set i = i + 1 and repeat Steps 2, 3 and 4.
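A compact sketch of Algorithm 1 follows. It is an illustration rather than the authors' implementation: a partition is stored as a vector of integer cluster labels, log_post stands for any function returning log π(ω|Y) up to an additive constant (for example the log Obj(ω) sketch in Sect. 2.3), and the small asymmetry of the Step 2 proposal is not corrected for in the acceptance ratio, which is harmless when the chain is used for optimization.

    import numpy as np

    def propose_move(labels, rng):
        # Step 2: pick one object uniformly and either move it to another existing
        # cluster or split it off as a new singleton cluster.
        new = labels.copy()
        i = rng.integers(len(labels))
        own = labels[i]
        others = np.unique(labels[labels != own])
        if np.sum(labels == own) == 1:               # the object is a singleton
            new[i] = rng.choice(others)
        else:
            new[i] = rng.choice(list(others) + [labels.max() + 1])
        return new

    def stochastic_search(log_post, labels0, n_iter=10_000, seed=0):
        # Algorithm 1: Metropolis random walk on partitions targeting pi(omega|Y),
        # keeping the best partition visited (MCMC optimization).
        rng = np.random.default_rng(seed)
        cur, cur_lp = labels0.copy(), log_post(labels0)
        best, best_lp = cur.copy(), cur_lp
        for _ in range(n_iter):
            cand = propose_move(cur, rng)
            cand_lp = log_post(cand)
            if np.log(rng.uniform()) < cand_lp - cur_lp:   # Step 3: accept or reject
                cur, cur_lp = cand, cand_lp
            if cur_lp > best_lp:                           # Step 4: track the best partition
                best, best_lp = cur.copy(), cur_lp
        return best, best_lp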

Note that, if R is not large enough, ω̃* may not be the same as the optimal partition ω*. In this paper, Algorithm 1 is applied three times for each data set of interest. If all chains provide the same ω̃*, then ω̃* is considered to be ω*.

4 Simulation studies on stochastic search algorithm

Simulation studies are conducted with the corneal wound data to check the convergence in distribution of the biased random walk in Algorithm 1 and to examine the number of iterations needed before Algorithm 1 finds the optimal partition.

4.1 Convergence of the biased random walk

To examine the convergence or performance of the biased random walk algorithm, it is necessary to start the chains from fair and non-informative initial partitions. As one of the most reasonable initial partitions, we may consider a uniform random partition, drawn with probability 1/B_n. However, generating a uniform random partition is a challenging problem when the partition space is very large. In this subsection, we describe an algorithm to generate uniform random partitions with large n, and demonstrate the convergence of the chains using the corneal wound data.

Generation of uniform random partitions with a large n

Let P_n be the set of all possible partitions of N_n := {1, …, n}. Also, let Π denote a uniform random partition such that P(Π = π) = 1/B_n for all π ∈ P_n, so Π has the uniform distribution on P_n. Here we describe a method of simulating a uniform random partition Π.

Let M be a random variable on the set {1, 2,…, n} with probabilities given by

P(M = m) = (m^n / B_n) [ (1/m!) ∑_{s=0}^{n−m} (−1)^s / s! ]
(10)

for m = 1, 2,…, n. Pitman (1997) gave an algorithm for drawing a value of Π, which goes as follows. First, draw a value of M = m from (10); then randomly distribute n balls labelled with N_n into m different urns. Note that some of the urns may end up empty. After excluding the empty urn(s), the resulting partition of the balls has the uniform distribution on P_n.

Unfortunately, this method does not work well with large n because computation of the distribution (10) requires the evaluation of large factorials, which causes numerical difficulties. However, it is possible to circumvent this problem by approximating (10) to a high degree of accuracy and drawing M from this approximation. Based on this idea, we propose a new algorithm as follows (see the Appendix for the detailed calculations behind this algorithm; a code sketch follows the steps).

Algorithm 2
  • Step 1. Choose ε > 0, which controls the degree of approximation. Here we use ε = 10^{−30}.
  • Step 2. Calculate l_m = n log m − log Γ(m + 1) for m ∈ {1, 2,…, n} and set
    l* = max{l_1, l_2, …, l_n}.
  • Step 3. Find
    N = inf{m > n : l* − l_m > log(2^{n+1} n!/ε)}.
  • Step 4. Draw M* with probabilities given by
    P(M* = m) = exp{l_m} / ∑_{j=1}^{N} exp{l_j} = exp{l_m − l*} / ∑_{j=1}^{N} exp{l_j − l*}
    (11)
    for m ∈ {1, 2, …, N}.

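A direct transcription of Algorithm 2 into Python is sketched below (the function name and defaults are ours). The weights in (11) are formed on the log scale and shifted by l*, so neither B_n nor any large factorial is ever evaluated explicitly.

    import numpy as np
    from math import lgamma, log

    def draw_uniform_partition(n, eps=1e-30, seed=None):
        # Draw a (near-)uniform random partition of {1,...,n}: sample the number of
        # urns M* from the truncated weights (11), drop n labelled balls into the
        # urns at random, and read off the non-empty urns as clusters.
        rng = np.random.default_rng(seed)
        lm = [n * log(m) - lgamma(m + 1) for m in range(1, n + 1)]   # l_m for m = 1..n
        l_star = max(lm)
        bound = (n + 1) * log(2) + lgamma(n + 1) - log(eps)          # log(2^{n+1} n!/eps)
        m = n
        while True:                                                  # Step 3: find N
            m += 1
            lm.append(n * log(m) - lgamma(m + 1))
            if l_star - lm[-1] > bound:
                break
        w = np.exp(np.array(lm) - l_star)                            # stable weights exp{l_m - l*}
        m_star = rng.choice(np.arange(1, len(lm) + 1), p=w / w.sum())
        urns = rng.integers(m_star, size=n)                          # one urn index per ball
        _, labels = np.unique(urns, return_inverse=True)             # relabel the non-empty urns
        return labels

    labels = draw_uniform_partition(646, seed=1)
    print(labels.max() + 1)    # number of clusters, typically around 130 for n = 646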
Uniform random partitions are generated with n = 646, as in the corneal wound data. Then, the number of clusters in each partition is counted to draw the histogram in Fig. 2. Most uniform random partitions have between 120 and 143 clusters, with mean 131.3 and variance 4.6^2. The gray line in Fig. 2 shows the distribution of the number of bins, m in (11), which has mean 132.3 and variance 4.7^2.

Fig. 2
Histogram of uniform random partitions. Distribution of m in (11) is drawn with a gray line

Convergence-in-distribution with the corneal wound data

When multiple chains are simulated with a good MCMC algorithm, they mix well regardless of the initial partitions. Using the biased random walk, five chains are started: four from non-informative partitions (two uniform random partitions, one partition with n = 646 singleton clusters, and one partition that puts all objects in a single cluster) and one from an informative partition (with 200 clusters found using the K-means algorithm).

The Bayesian objective function (9) has two parameters, λ and ϱ, that should be set before Algorithm 1 starts. To estimate λ, which is the smoothing parameter, or the ratio of the two variances in the linear mixed model (6), we clustered the profiles into an arbitrary number of groups, K = 40, with the K-means algorithm, and fitted the linear mixed model to the grouped data assuming a homogeneous error variance across clusters. From this procedure, we obtained an estimate λ̂ = 1.31. Even though λ̂ depends on K, λ̂ is robust to K as long as K is not close to 1 or n. For example, λ̂ = 1.27 when K = 20, and λ̂ = 1.33 when K = 200. Also, the resulting optimal partition is usually not sensitive to a small variation of λ̂, such as ±0.2. For the tuning parameter in Crowley’s prior, log(ϱ) = 8 is chosen. A detailed discussion of the selection of log(ϱ) is given in Sect. 6.

Simulation histories of the five chains are shown in terms of the number of clusters in Fig. 3a, and in terms of the objective function Obj(ω) in Fig. 3b. Because most uniform random partitions have a similar number of clusters, around 131, and similar values of Obj(ω_1) ≈ −2,300, the two corresponding initial partitions overlap in Fig. 3(a, b). Note that uniform random partitions have very low initial values of the objective function, which indicates that they are very non-informative partitions. On the contrary, the initial partition from the K-means algorithm has the highest objective function value, Obj(ω_1) = 4,597. Even though the five chains have very different initial partitions, Fig. 3 shows that the biased random walk results in excellent agreement at convergence and good mixing over the stationary distribution after 10^6 iterations. Therefore, we consider 2 × 10^6 iterations to be enough for a burn-in period in analyzing the corneal wound data.

Fig. 3
Convergence of chains with different initial partitions. a, b demonstrate the convergence with the number of clusters and the objective function Obj(ω). The simulation history in the last 2 × 10^6 iterations is magnified in an inset of each ...

The convergence in distribution does not guarantee that Algorithm 1 can find the optimal partition ω* within a reasonable time. For example, in Table 1, the algorithm (R = 3 × 10^6) is applied to the corneal wound data twice with log(ϱ) = 0, 3, 5, 8, 10, 13, 15 and 20. Regardless of the value of log(ϱ), the two chains provide different ω̃*'s, which have different Obj(ω̃*)'s and different numbers of clusters. However, the two Obj(ω̃*) values are very close to each other, considering that this search is conducted in an almost infinite space of a discrete parameter. More details of Table 1 are discussed in Sect. 6.

Table 1
Convergence of the algorithm with different log(ϱ)'s and stable genes in the last 10^6 iterations

4.2 Number of iterations before the optimal partition is found

To examine the relationship between the number (n) of gene profiles to be clustered and the number (R*) of iterations before the optimal partition ω* is found by Algorithm 1, data sets are simulated by randomly selecting n gene profiles from the corneal wound data without replacement. Then, three long chains with 5 × 10^7 iterations are simulated for each data set using the biased random walk. If these three chains have the same ω̃*, we consider that the optimal partition ω* = ω̃* has been found for the simulated data set. The average of the three R*'s, Avg(R*), is reported in Table 2. For example, for the first simulated data set with n = 20 randomly selected profiles, the same ω̃* is found by the three chains after 6,220 iterations on average. In total, 5 data sets are simulated with n = 20. The other four data sets have Avg(R*) = 1,637, 663, 4,439 and 934. All Avg(R*)'s are less than 10^4 when n = 20. When n increases to 50, Avg(R*) has magnitude 10^6 or 10^7 for the first four simulated data sets. With the fifth simulated data set, the same optimal partition is not found by the three chains, which indicates that more than 5 × 10^7 iterations are needed. When n = 100, the optimal partition is not found for any data set within 5 × 10^7 iterations. It seems that Avg(R*) increases super-exponentially, as the Bell number does (B_100 ≈ 10^115). Our C program spent about 8 h to iterate 5 × 10^7 times for each data set with n = 100. Therefore, it is practically impossible to find the optimal partition of 646 gene profiles within a reasonable time. There are two major reasons that make the optimization algorithm slow. First, the discrete space of partitions is too large when n ≥ 100. Second, there seem to be many suboptimal partitions whose Obj(ω)'s are close to Obj(ω*). This may happen when many weakly-related genes are included in the data set and cause loose clusters. Also, weakly-related genes tend to change cluster memberships easily during the MCMC iterations, because they do not make significant contributions to the objective function. In the next section, we propose a tight clustering algorithm that selects a small number of closely-related genes and applies Algorithm 1 to only these genes.

Table 2
Average number of iterations before the optimal partition is found (Avg(R*))

5 Tight clustering algorithm

Recall that, during the biased random walk of Algorithm 1, a candidate partition ω′ is generated by moving one object (gene) in the current partition ω_i. If a gene is closely related to other genes in its current cluster and builds a tight cluster, then this gene has a low chance of moving during the iterations. We will consider a gene to be stable when it does not move at all, or moves only a small number of times, over the MCMC iterations of Algorithm 1.

Let the relevance probability (RP) of a pair of genes be the probability of the event that the two genes in the pair belong to one cluster together in a random partition. In the Bayesian objective function framework, the RP can be estimated easily by counting how often the two genes are together in a chain of the biased random walk. Because closely-related genes tend to stay together within a cluster over the course of the MCMC iterations, they have high RP's. Our definition of relevance probability is slightly different from the original concept of Hartigan (1991), where the relevance probability is the probability of having a set S of genes (objects) as a cluster in a random partition. It is inappropriate to use Hartigan's definition of RP for finding closely-related genes because there can be too many possible clusters S. For example, when Hartigan's definition of RP is used to find closely-related genes in the corneal wound data, RPs of 2^646 − 2 ≈ 2.9 × 10^194 possible clusters (the number of non-empty proper subsets) must be calculated from a simulation chain. When our definition of RP is used, RPs of only (646 choose 2) = 208,335 pairs must be calculated.
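Estimating the RP's from a stored chain amounts to counting co-memberships. A minimal sketch, assuming the chain is stored as one label vector per saved iteration (as in the search sketch of Sect. 3); for n = 646 the n × n matrix of pairwise RP's fits comfortably in memory.

    import numpy as np

    def pairwise_relevance(chain):
        # RP estimate for every gene pair: the fraction of stored partitions in
        # which the two genes share a cluster.
        chain = np.asarray(chain)                      # shape (n_saved, n_genes)
        n = chain.shape[1]
        rp = np.zeros((n, n))
        for labels in chain:
            rp += labels[:, None] == labels[None, :]   # co-membership indicator
        return rp / len(chain)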

Here, we propose the following tight clustering algorithm:

Algorithm 3

  • Step 1. Select log(ϱ*) that makes a small number of genes stable in a chain of the biased random walk.
  • Step 2. Apply Algorithm 1 with log(ϱ*) and estimate the RP's of all possible pairs.
  • Step 3. Apply Algorithm 1 to only the closely-related genes (RP ≥ η) to construct tight clusters.

In Step 1, we suggest selecting a value of log(ϱ*) that makes a small number of genes stable, because this makes a small number of genes have high RP's in Step 2. However, an analytic optimization for log(ϱ*) is very difficult. Therefore, we suggest running Algorithm 1 with different log(ϱ) values and calculating the number of stable genes. The cutoff value η of RP in Step 3 can be considered a tuning parameter that determines the tightness of the clusters. If η is close to 1, Algorithm 3 will select a small number of genes that construct very tight clusters. If η is close to 0, Algorithm 3 will select most genes without constructing tight clusters. Based on our experience with real data and simulation studies, we suggest setting η = 0.80 for reasonably tight clusters.
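Step 3 then operates on the genes that appear in at least one high-RP pair. A short sketch, reusing the hypothetical function names introduced in the earlier sketches:

    import numpy as np

    def select_closely_related(rp, eta=0.80):
        # Genes appearing in at least one pair with RP >= eta.
        hits = np.argwhere(np.triu(rp >= eta, k=1))    # index pairs above the cutoff
        return np.unique(hits)

    # Sketch of the full pipeline of Algorithm 3:
    #   rp      = pairwise_relevance(chain_from_step_2)                       # Step 2
    #   genes   = select_closely_related(rp, eta=0.80)
    #   best, _ = stochastic_search(lambda lab: log_obj(Y[genes], lab, X, Z),
    #                               initial_labels)                           # Step 3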

To check the performance of the tight clustering algorithm (Algorithm 3), we conducted simulation studies with the following six true clusters:

C1: y_i = sin(x_i/2) × log(x_i)^2 + ε_{1i}
C2: y_i = 1 + exp(x_i/5) × cos((x_i − 2)/2)^2 + ε_{2i}
C3: y_i = 2x_i + exp(x_i/5) × cos(x_i/2)^2 + ε_{3i}
C4: y_i = 1 + 1.5 × sin(x_i/2) + ε_{4i}
C5: y_i = exp(x_i/5) + ε_{5i}
C6: y_i = 3x_i + cos(x_i/2)^2 + ε_{6i},

where ε_{1i}, ε_{2i}, ε_{3i}, ε_{4i} and ε_{5i} are independent and identically distributed normal random variables with mean 0 and variance σ_A^2, ε_{6i} is independent and identically distributed normal with mean 0 and variance σ_B^2, and x_i = 1, 2,…, 6. Four sets of simulations are considered with different σ_A and σ_B. In each set, we simulated 60 profiles (10 profiles from each cluster) 100 times. Then, our tight clustering and the plain Bayesian objective function approach (Booth et al. 2008) with various values of the tuning parameter log(ϱ) are compared in Table 3. In Step 1 of the tight clustering algorithm, we chose log(ϱ*) for each simulated data set by examining the number of stable genes when log(ϱ) = −10, −8, −6, −4, −2 and 0. In general, log(ϱ) is a very important parameter in the clustering algorithm, controlling the number of clusters. However, in our simulation experience, when two values of log(ϱ) differed by less than 2, the same or very similar clusters were found. In this paper, we consider a pair of profiles to be correctly clustered if the pair is in the same cluster of the simulation model and is grouped together by the clustering algorithm, or if the two profiles in the pair are in different clusters of the simulation model and are not grouped together. Otherwise, we consider the pair to be incorrectly clustered. There are N_pairs = n(n − 1)/2 possible pairs, where n is the number of profiles to be clustered. In the plain clustering, n = 60 and N_pairs = 1,770 for every simulated data set. However, in the tight clustering, n and N_pairs vary with the simulated data, because only closely-related profiles are of interest for clustering and the number of closely-related profiles varies with the data. We define the percentage classification error rate of pairs as

ER = 100 × ∑_{i=1}^{N_pairs} I_i / N_pairs,

where i = 1, …, N_pairs and

I_i = 1 if the ith pair is incorrectly clustered, and I_i = 0 otherwise.
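The error rate ER can be computed directly from the true and estimated label vectors by comparing pairwise co-memberships; the all-singletons example below reproduces the 15.3% figure discussed later in this section.

    import numpy as np

    def pairwise_error_rate(true_labels, est_labels):
        # Percentage of the n(n-1)/2 pairs whose co-membership disagrees between
        # the true simulation clusters and the estimated partition.
        true_labels, est_labels = np.asarray(true_labels), np.asarray(est_labels)
        same_true = true_labels[:, None] == true_labels[None, :]
        same_est = est_labels[:, None] == est_labels[None, :]
        iu = np.triu_indices(len(true_labels), k=1)
        return 100.0 * np.mean(same_true[iu] != same_est[iu])

    true = np.repeat(np.arange(6), 10)                # 60 profiles, 10 per true cluster
    print(pairwise_error_rate(true, np.arange(60)))   # all singletons: about 15.3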
Table 3
Comparisons of the tight and plain Bayesian clustering algorithms based on ER; the standard deviation of ER is given in parentheses

See Table 3. For the first set of simulations, we set σ_A = σ_B = 0.5. These are relatively small standard deviations that make it easy for clustering algorithms to find the correct clusters. The tight clustering method has the lowest error rate, 6.0%, with a standard error of 0.8%. Among the plain clusterings, the lowest error rate, 8.8%, is attained when log(ϱ) = −10. Recall that a higher ϱ makes the algorithm choose a large number of small clusters. When log(ϱ) = 5, the algorithm provides the same optimal partition of n singleton clusters for all 100 simulated data sets. A simulated data set has 270 (= 45 × 6) pairs in which both profiles come from the same true cluster. Therefore, when all profiles are separated as singleton clusters, the error rate becomes 15.3% (= 100 × 270/1,770). When we increase the standard deviations to σ_A = σ_B = 0.7, the tight clustering algorithm again achieves the lowest error rate, 13.7%. When the standard deviations are as large as σ_A = σ_B = 1.0, it is practically impossible to get good clustering results because most parts of the true clusters overlap each other. Therefore, the error rates are high for both the tight and plain clustering methods. In this case, the plain clustering method with log(ϱ) = 5 works best, but only because it separates all profiles as singletons and thereby attains an error rate of 15.3%. However, this is not a desirable result, because it does not provide any non-singleton cluster. As the last set of simulations, we considered σ_A = 0.7 and σ_B = 3.0, which makes cluster C6 relatively diffuse. The tight clustering method worked best in this case. Compared to the simulations with σ_A = σ_B = 0.7, it seems clustering was easier because C6 with σ_B = 3.0 is not similar to the other clusters with σ_A = 0.7.

Overall, we found that the tight clustering method has lower error rates than plain clustering when the true clusters are reasonably separated. This result is to be expected, because tight clustering uses only closely-related profiles (genes) and excludes weakly-related profiles that may cause erroneous clusters.

6 Analysis of the corneal wound data

In the corneal wound experiment, gene expressions are measured at days 0, 1, 2, 3, 4, 5, 6, 7, 14, 21, 42 and 98 because the expressions are expected to change more intensely in the first week and then stabilize in the later part of the experiment. In other words, a less smooth pattern is expected in the first week. When spline knots are located at equal intervals on the real time scale (days), this change of smoothness will not be reflected. In this paper, we suggest relabeling the time points with 1,…,12 before the cluster analysis. This may not be the optimal solution, but it is an easy and reasonable solution to the unequal smoothness problem. Here, this argument is illustrated with an example of a gene profile in our data. Using the same smoothing parameter (λ = 1.3) and knots at all interior time points, the gene profile is fitted on the real time scale in Fig. 4a and on the relabeled equal-interval scale in Fig. 4c. Also, to make comparison of these fits easier, Fig. 4a is redrawn with the equal-interval time scale in Fig. 4b, and Fig. 4c is redrawn with the real time scale in Fig. 4d. When Fig. 4a and d (or, equivalently, Fig. 4b and c) are compared, the differences between the two spline fits are negligible from the 1st (day 0) to the 7th time point (day 6). However, the later part of the profile (i.e. from the 7th to the 12th time point) is overfitted in Fig. 4a, while it is fitted with a properly smooth line in Fig. 4d. Similar patterns are found with other gene profiles. If an analyst wants to use the real time scale, this overfitting problem may be fixed by carefully choosing the number and locations of the knots.

Fig. 4
Comparisons of spline fits of a randomly selected gene profile when the real time scale or the relabeled equal interval time scale is used: (a) gene profile is fitted on the real time scale (days), (b) plot a is redrawn with the relabeled time scale, ...

From now on, we provide a detailed discussion of the application of the tight clustering algorithm to the corneal wound data.

Step 1

Simulations are conducted with log(ϱ) = 0, 3, 5, 8, 10, 13, 15 and 20. For consistency of comparison, the same initial partition with 200 clusters from the K-means algorithm is used for every simulation. Algorithm 1 is applied twice with each log(ϱ), and the movements of genes are monitored in the last 10^6 iterations out of 3 × 10^6 total iterations. Table 1 reports Obj(ω̃*), c(ω̃*), and the number of stable genes, which do not move at all during the last 10^6 iterations. For example, from the first run with log(ϱ) = 0, we obtained Obj(ω̃*) = 5,280, c(ω̃*) = 27 and 58 stable genes out of 646. When log(ϱ) gets larger, the number of clusters c(ω̃*) increases, because a larger value of log(ϱ) makes the objective function support a larger number of clusters, and hence fewer genes per cluster. In Fig. 5, the number of stable genes is plotted against log(ϱ). It shows that a relatively small number of genes are stable (do not switch clusters in the last 10^6 iterations) when log(ϱ) is between 3 and 13. When log(ϱ) is as large as 20, most genes have a strong tendency to stay as singleton clusters during the iterations, which makes many genes appear stable. When log(ϱ) is as small as 0, even weakly-related genes may stay together easily within large clusters, which also causes many genes to appear stable. Figure 5 shows that the number of stable genes is minimized around log(ϱ) = 8. Therefore, log(ϱ*) = 8 is selected for the corneal wound data.
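Counting stable genes from a stored chain is straightforward. A minimal sketch, valid under the label bookkeeping of the search sketch in Sect. 3, where an accepted move rewrites only the moved gene's label entry:

    import numpy as np

    def count_stable_genes(chain):
        # A gene is stable if its stored label never changes over the chain,
        # i.e. the gene itself is never moved during the monitored iterations.
        chain = np.asarray(chain)                 # shape (n_saved, n_genes)
        return int(np.all(chain == chain[0], axis=0).sum())

    # Step 1 then picks the log(varrho) minimizing this count over a grid,
    # e.g. log(varrho) in {0, 3, 5, 8, 10, 13, 15, 20} as in Table 1.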

Fig. 5
The effect of log(ϱ) in Crowley's prior on the number of stable genes. A quadratic regression fit is provided as a reference line

Step 2 and Step 3

After applying Algorithm 1 to the whole data set with 2 × 10^7 iterations, 1,169 pairs are found to have RP ≥ 0.80. Then, Algorithm 1 is applied 3 times to only the 139 closely-related genes that compose these 1,169 pairs. The same ω̃* is found easily in all three applications within fewer than 10^6 iterations. Recall that, when a data set of 100 profiles is simulated in Sect. 4.2, the optimal partition ω* is not found even after 5 × 10^7 iterations. The convergence is fast with closely-related genes because only a small number of genes are considered in tight clustering, and the close relationships among the selected genes make Obj(ω*) much higher than any Obj(ω) with ω ≠ ω*. See Fig. 6 for the final result of our analysis. Each cluster has a distinctive pattern, and the gene profiles build tight clusters. Clusters 3 and 9 seem similar, but their scales distinguish them.

Fig. 6
Clustered gene profiles of closely related genes. The number of genes in each cluster is described inside parentheses. The light-colored thick lines are from the BLUP of the mixed model (6)

7 Concluding remark

A typical purpose of cluster analysis with microarray data is to identify a small number of closely-related genes that biologists can investigate further in follow-up studies. Moreover, weakly-related genes make the clustering algorithm work slowly and build large, loose clusters. Therefore, we propose selecting closely-related genes using relevance probabilities and obtaining tight clusters with only these closely-related genes.

In our tight clustering, the stochastic search algorithm (Algorithm 1) is used three times for different purposes. In the first two applications, with all gene profiles, the stochastic algorithm is used for the selection of log(ϱ*) and the calculation of the relevance probabilities of all pairs, rather than for the search for the optimal partition. Because the stochastic algorithm is implemented by MCMC simulation, the calculation of relevance probabilities can be done easily. If the mixture model were used in tight clustering instead of the Bayesian objective function approach, it would be difficult to calculate relevance probabilities, because the mixture model requires prior knowledge of a fixed number of components, K. Even though K can be determined with a model selection criterion (e.g. BIC), it is difficult to measure the uncertainty in determining K and to take it into account when calculating relevance probabilities.

In addition to closely-related genes, if a small number of weakly-related genes are interesting for biological reasons, they can also be included in Step 3 of the tight clustering algorithm. We expect that the inclusion of these weakly-related genes will not have a significant impact on the optimal partition of the tight clusters, because the closely-related genes build a stable structure of clusters.

In this paper, we clustered the expressions of 646 genes that were preselected by biologists. In general, however, most microarray experiments generate 10,000–40,000 gene expressions at once. When the number of genes is this high, computational time can be considered a limitation of our approach. For example, if genes are measured twice at 12 time points, as in the corneal wound data set, it takes about a week on a 3.0 GHz PC for our tight clustering method to handle 1,000 or fewer genes (or objects). Therefore, we recommend preselecting gene profiles that vary substantially over time, using a one-way ANOVA model with time as the factor. For example, we may preselect the genes that have the 1,000 highest F-values. A similar approach was also proposed in Peddada et al. (2003).
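As an illustration of this preselection step (again a sketch in Python, not the authors' code), per-gene one-way ANOVA F statistics can be computed with time point as the factor and the replicates as the within-group observations; the array layout is an assumption of the sketch.

    import numpy as np
    from scipy.stats import f_oneway

    def preselect_by_anova(Y, n_keep=1000):
        # Keep the genes with the largest one-way ANOVA F statistics across time.
        # Y: shape (n_genes, n_time_points, n_replicates).
        f_values = np.array([f_oneway(*gene)[0] for gene in Y])   # one F per gene
        return np.sort(np.argsort(f_values)[::-1][:n_keep])

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(5000, 12, 2))      # e.g. 5000 genes, 12 time points, 2 replicates
    print(preselect_by_anova(Y).shape)      # (1000,)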

Appendix

It turns out that Pitman's algorithm (Pitman 1997) still works if we take M to be a random variable on the set ℕ := {1, 2, 3,…} with probabilities given by

P(M = m) = m^n e^{−1} / (B_n m!)

for m = 1, 2, 3,…. We now describe a method for sampling from (an arbitrarily good approximation of) M without having to calculate B_n. Specifically, given ε > 0, we find a positive integer N = N(ε) such that P(M ≥ N + 1) < ε, and we approximate M with a random variable M* with probabilities given by P(M* = m) ∝ m^n/m! for m ∈ {1, 2,…, N}. If we choose ε small enough, there will be no discernible difference between draws from M and draws from M*.

Fix N > 1 and note that

∑_{j=N+1}^{∞} P(M = j) = ∑_{j=1}^{∞} P(M = N + j),

and

P(M = N + j) / P(M = N) = [(N + j)^n / (N + j)!] × [N! / N^n] = (1 + j/N)^n N! / (N + j)! = (1 + j/N)^n × 1/[(N + j)(N + j − 1) ⋯ (N + 1)] ≤ (1 + j)^n × (1/N^j) ≤ (2j)^n / N^j.

Therefore,

P(M = N + j) ≤ (2j)^n N^{−j} P(M = N),

so

∑_{j=N+1}^{∞} P(M = j) = ∑_{j=1}^{∞} P(M = N + j) ≤ 2^n P(M = N) ∑_{j=1}^{∞} j^n N^{−j} ≤ 2^n P(M = N) × 2 ∫_0^{∞} x^n e^{−x log N} dx = 2^{n+1} P(M = N) n!/(log N)^{n+1} ≤ 2^{n+1} n! P(M = N).

Thus, if we can find N such that P(M = N) < ε/(2^{n+1} n!), then

P(M ≥ N + 1) = ∑_{j=N+1}^{∞} P(M = j) < ε.

It is easy to show that P(M = m) is decreasing for m > n. Define l_m = n log m − log Γ(m + 1), so that log P(M = m) = l_m − c, where c = 1 + log B_n. Now define l* = max{l_1, l_2,…, l_n} and note that, for any m ∈ ℕ,

P(M = m) = exp{l_m} / ∑_{j=1}^{∞} exp{l_j} ≤ exp{l_m} / exp{l*} = exp{l_m − l*}.

Therefore, if l* − l_m > log(2^{n+1} n!/ε), we have

P(M = m) ≤ exp{l_m − l*} < ε/(2^{n+1} n!).

Define

N = inf{m > n : l* − l_m > log(2^{n+1} n!/ε)}.

We then calculate the probabilities of M* as

P(M* = m) = exp{l_m} / ∑_{j=1}^{N} exp{l_j} = exp{l_m − l*} / ∑_{j=1}^{N} exp{l_j − l*}

for m ∈ {1, 2,…, N}.

Contributor Information

Yongsung Joo, Department of Statistics, Dongguk University, Seoul 100-715, Korea, yongsungjoo@dongguk.edu.

G. Casella, Department of Statistics, University of Florida, Gainesville, FL 32611, USA.

J. Hobert, Department of Statistics, University of Florida, Gainesville, FL 32611, USA.

References

  • Basford KE, McLachlan GJ. Likelihood estimation with normal mixture models. Appl Stat. 1985;34:282–289.
  • Basford KE, Greenway DR, McLachlan GJ, Peel D. Standard errors of fitted means under normal mixture models. Comput Stat. 1997;12:1–17.
  • Booth J, Casella G, Hobert J. Clustering using objective functions and stochastic search. J R Stat Soc B. 2008;70(1):119–140.
  • Costa IG, Carvalho FAT, Souto MCP. Comparative analysis of clustering methods for gene expression time course data. Genet Mol Biol. 2004;27:623–631.
  • Crowley EM. Product partition models for normal means. J Am Stat Assoc. 1997;92:192–198.
  • Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics. 2003;19:459–466.
  • Ghosh D, Chinnaiyan AM. Mixture modelling of gene expression data from microarray experiments. Bioinformatics. 2002;18:275–286.
  • Hakamada K, Okamoto M, Hanai T. Novel technique for preprocessing high dimensional time-course data from DNA microarray: mathematical model-based clustering. Bioinformatics. 2006;22:843–848.
  • Hartigan JA. Partition models. Commun Stat Theory Methods. 1991;19:2745–2756.
  • Hartigan JA, Wong MA. Algorithm AS 136: a K-means clustering algorithm. Appl Stat. 1979;28:100–108.
  • James GM, Sugar CA. Clustering for sparsely sampled functional data. J Am Stat Assoc. 2003;98:397–408.
  • Jerrum M, Sinclair A. The Markov chain Monte Carlo method: an approach to approximate counting and integration. In: Approximation algorithms for NP-hard problems. Boston: PWS Publishing; 1996.
  • Johnson RA, Wichern DW. Applied multivariate statistical analysis. 5th edn. Upper Saddle River: Prentice Hall; 2002.
  • Leng X, Muller H. Classification using functional data analysis for temporal gene expression data. Bioinformatics. 2006;22:68–76.
  • Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003;19:474–482.
  • Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics. 2001;17:405–414.
  • Ma P, Castillo-Davis CI, Zhong W, Liu JS. A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 2006;34:1261–1269.
  • McLachlan GJ, Basford KE. Mixture models: inference and applications to clustering. New York: Marcel Dekker; 1988.
  • McLachlan GJ, Bean RW, Peel D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002;18:413–422.
  • Ng SK, McLachlan GJ, Wang K, Jones LB, Ng SW. A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics. 2006;22:1745–1752.
  • Ouyang M, Welsh WJ, Georgopoulos P. Gaussian mixture clustering and imputation of microarray data. Bioinformatics. 2004;20:917–923.
  • Park T, Yi S, Lee S, Lee SY, Yoo D, Ahn J, Lee Y. Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics. 2003;19:694–703.
  • Peddada SD, Lobenhofer EK, Li L, Afshari CA, Weinberg CR, Umbach DM. Gene selection and clustering for time-course and dose response microarray experiments using order-restricted inference. Bioinformatics. 2003;19:834–841.
  • Pitman J. Some probabilistic aspects of set partitions. Am Math Mon. 1997;104:201–209.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. New York: Cambridge University Press; 2003.
  • Schliep A, Schonhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data. Bioinformatics. 2003;19:i255–i263.
  • Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58:267–288.
  • Tseng GC, Wong WH. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics. 2005;61:10–16.