- Journal List
- Cancer Inform
- v.7; 2009
- PMC2730180

# The Bimodality Index: A Criterion for Discovering and Ranking Bimodal Signatures from Cancer Gene Expression Profiling Data

^{1}Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX 77030-4009

^{2}Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX 77030-4009

^{3}Department of Pathology, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX 77030-4009

^{4}Department of Breast Medical Oncology, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX 77030-4009. Email: gro.nosrednadm@sebmoock or ; Email: gro.nosrednadm@gnawgnij

## Abstract

### Motivation

Identifying genes with bimodal expression patterns from large-scale expression profiling data is an important analytical task. Model-based clustering is popular for this purpose. That technique commonly uses the Bayesian information criterion (BIC) for model selection. In practice, however, BIC appears to be overly sensitive and may lead to the identification of bimodally expressed genes that are unreliable or not clinically useful. We propose using a novel criterion, the bimodality index, not only to identify but also to rank meaningful and reliable bimodal patterns. The bimodality index can be computed using either a mixture model-based algorithm or Markov chain Monte Carlo techniques.

### Results

We carried out simulation studies and applied the method to real data from a cancer gene expression profiling study. Our findings suggest that BIC behaves like a lax cutoff based on the bimodality index, and that the bimodality index provides an objective measure to identify and rank meaningful and reliable bimodal patterns from large-scale gene expression datasets. R code to compute the bimodality index is included in the ClassDiscovery package of the Object-Oriented Microarray and Proteomic Analysis (OOMPA) suite available at the web site http;//bioinformatics.mdanderson.org/Software/OOMPA.

## Introduction

Identifying genes with bimodal expression patterns from large-scale expression profiling data is an important task. Bimodal expression patterns can result naturally from differential expression, with the two modes centered on the mean expression of a gene in two distinct subgroups of samples. In the context of cancer, bimodal expression patterns can result from genomic lesions that occur in some patients but not others. For example, Tomlins and colleagues^{1} noticed that the ETV1 gene was overexpressed in 10%–20% of prostate cancer samples in multiple data sets and that the ERG gene was overexpressed in about 40% of prostate cancer samples in the same datasets. They further noticed that overexpression of these two genes was mutually exclusive. These findings led them to discover that the overexpression was driven by recurrent translocations that fused the androgen-responsive gene TMPRSS2 either with ETV1 (in some cases) or with ERG (in other cases).

The definition of a bimodal distribution can be vague: the term typically refers to a mixture of two populations with distinct means. In density estimation, a bimodal distribution can be recognized by the presence of two modes, each with a characteristic peak. Determining the factor that characterizes the samples that belong to each of the two distributions can be difficult. In the realm of large-scale gene expression profiling, finding bimodal expression patterns is an important analytical process. In oncology, this process may be part of the search for clinically important therapeutic targets within tumors. The process can also reveal molecular signatures that distinguish tumor subtypes, which contributes to our understanding of the clinical and biologic characteristics of cancer.

A two-component normal mixture-model-based clustering algorithm is commonly used to discover bimodal expression patterns. In contrast to other clustering methods, this approach is based on a mixture of statistical distributions in which each component represents a cluster. The method converts a classification problem into a statistical estimation of the mixture density. The approach has proven to be useful in a wide range of applications, including microarray gene expression analysis.^{2}^{–}^{6} The technique is sensitive and allows for rapid computation. A major benefit of the mixture model-based clustering technique is that the algorithm characterizes each cluster and provides probabilities of cluster membership. In addition, the algorithm provides point estimates of the statistical parameters including the means, standard deviations, and sample proportions in each group.

One difficulty with its application is finding an appropriate test statistic to estimate p-values and choosing a suitable cutoff to minimize the false discovery rate. The log likelihood ratio test (LRT) statistic could be adopted to test the hypothesis H_{1} that gene expression distribution is a bimodal, against the null hypothesis H_{0} that the distribution is unimodal. Under the null hypothesis, the LRT asymptotically has a chi-squared distribution (−2 log λ ~ χ^{2}). However, in practice, the chi-squared distribution, with two degrees of freedom, seems to converge slowly. Consequently, the critical value from chi-square table is too small, so it over-rejects, or inflates the false positive rate.^{7} Recently Ertel and Tozeren applied a two-component normal mixture model to identify bimodal genes and their potential roles in cell signaling and disease progression.^{8} They used LRT, with p-values estimated by evaluating the chi-square distribution with six degree of freedom, in order to get more conservative p-values. To identify significant bimodal genes, an ad-hoc p-value was selected.

Researchers combine mixture-model-based clustering with either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). Both criteria impose arbitrary penalties based on the number of parameters to determine whether a unimodal or bimodal model is a better fit to the observed data. A dataset is identified as bimodal if the parameter penalties are outweighed by the increased likelihood of the bimodal model. Although useful, according to our experience, the reliance on AIC or BIC appears to be overly sensitive in its application to gene expression profiling data and may lead to the identification of numerous genes whose bimodal patterns cannot be confirmed in follow-up studies. In order to be clinically useful in practice, a bimodal pattern should exhibit significant separation between the means of the two groups and should have adequate sample sizes in each group. In applications of the mixture-model-based clustering technique, researchers frequently find it necessary to use subjective ad hoc cutoffs in addition to the AIC or BIC in order to reveal reliable and meaningful bimodal patterns.

For identifying bimodal expressed genes, a practical method was introduced previously by Andrew and colleagues,^{9} called Profile Analysis using Clustering and Kurtosis (PACK). PACK has two steps; (1) using the expectation-maximization (EM) algorithm, a common method for finding maximum likelihood estimates of parameters in probabilistic models, and BIC for model selection to determine the number of clusters within the expression data. This step is the same as the model-based-clustering technique. (2) Using kurtosis to characterize features and find relevant classifiers. Depending on the sign of kurtosis, major bimodal pattern (negative kurtosis) or outlier bimodal pattern (positive kurtosis) can be identified. PACK has been applied to number of cancer gene expression datasets in breast successfully.^{9}^{,}^{10} The use of kurtosis for identifying bimodally expressed genes, however, has a blind spot. If the proportions of samples in the two groups are close to a 20%–80% split, then the kurtosis is close to zero. Negative kurtosis strongly favors a perfect 50–50 split, which we think has a chance to miss biologically interesting genes. Positive kurtosis, on the other hand, shows a strong preference for extremely unbalanced splits on the order of 95–5 or even 99–1.

In this investigation, we propose an approach that replaces (or supplements) the use of AIC or BIC with a criterion that provides for finer distinctions between the bimodally expressed genes. By assigning a continuous value (the bimodality index) to each gene instead of a simple yes-or-no answer to the question of bimodality, we can provide researchers with a tool to rank genes and thus focus their interest on those with the strongest evidence of “useful” bimodality. We evaluate the estimation of the bimodality index of a set of measurements using both mixture model-based clustering and Markov chain Monte Carlo (MCMC) techniques to estimate the statistical parameters (means, standard deviation, and sample proportions) of the mixture. MCMC sampling techniques are common in the field of Bayesian analysis, and have emerged as popular tools for the analysis of complex statistical problems, including the analysis of microarray gene expression profiling data.^{11}^{,}^{12} We apply the proposed algorithm to simulated gene expression data and to a dataset from a microarray gene expression profiling study of breast cancer.

## The Algorithm in Brief

We assume that, for a gene with bimodal expression, the distribution can be expressed as a mixture of two normal distributions:

where *y* is the expression measurement; *π* is the proportion of samples in one group; *μ*_{1} and *μ*_{2} are the means of the expression level of the two modes; and *σ*_{1} and *σ*_{2} are the standard deviations. In analyzing gene expression data, equal variance between groups is frequently assumed. If we assume that *σ*_{1} = *σ*_{2}, the equation becomes

where *σ* is the common standard deviation. We define the standardized distance, *δ*, between the two populations as

For identifying genes with bimodal expression, the null hypothesis is *δ* = 0 and the alternative hypothesis is *δ* > 0.

To illustrate how the shape of a bimodal density changes as *π* and *δ* vary (with *σ* = 1), we plotted a set of theoretical distributions. Figure 1a arranges the density plots in the *δ – π* plane. Because of symmetry in *π* from 0.0 to 1.0, we only illustrate the plots using *π* from 0.50–0.95. The plots indicate that bimodality is visually obvious when *δ* is sufficiently large or when the proportion *π* in each group is adequate. When *π* and *δ* reach certain critical values, bimodality is no longer visually distinguishable. The plots also suggest that the distinguishable bimodal patterns, as indicated by the plots in red in Figure 1a, are approximately bounded by a curve in the *δ – π* plane. Although this observation about distinguishability is subjective, it can be made objective by reference to a standard sample size computation.

*π*and

*δ*.

**(A)**Density plots of bimodal data as

*π*and

*δ*vary. These plots indicate that bimodality is obvious when both

*π*and

*δ*are sufficiently large, but difficult to

**...**

### Defining the bimodality index

Consider an experiment that involves two normally distributed populations with means *μ*_{1} and *μ*_{2} and common standard deviation *σ*. The usual sample size computation tells us that the formula

where *z*_{α/}_{2} and *z** _{β}* are the percentiles of the standard normal distribution that yield the desired significance and power, gives the total number

*N*of samples needed to detect the standardized difference

*δ*in an experiment with equal-sized groups. To achieve the same power with unequal sizes

*Mπ*and

*M*(1 –

*π*), we should choose

*M*to make the variance of the estimated standardized difference with unequal group sizes the same as the variance with equal sizes; that is,

So, the total number of samples required when the groups are unequally sized is:

Selecting reasonable values for *α* and *β* for microarray experiments has been addressed by Simon et al,^{13} so we will not provide a detailed description here. Rearranging equation (6), we obtain

Motivated by equation (7), we define the bimodality index *BI* as

In practice, we can estimate *δ* and *π* for a given dataset, then use these estimated values to compute *BI*.

Combinations of *δ* and *π* that give the same value of *BI* describe bimodal distributions that are “equally separable” in the sense that experiments to distinguish the two subgroups at a given significance and power would require the same total number of samples. The right hand side of equation (7) shows that larger values of *BI* correspond to smaller sample sizes and thus represent bimodal distributions that are easier to distinguish. Constant *BI* values in equation (8) define curves in the *δ – π* space (Fig. 1b). Because of symmetry in *π* (0.0–1.0), we only show the plot for *π* from 0.5–1.0. The curves with constant bimodality index take on their minimum value at *π* = 0.5 (when the sizes of the two subgroups are the same), which gives the most power to distinguish a bimodal pattern for a given total sample size. When the group sizes are very unequal, for example, when *π* is close to 0.1 or 0.9, the power is much weaker for the same total sample size. Because the curve defined by *BI* = 1.1 perfectly separates the red “visually bimodal” distribution curves from the black distribution curves in Figure 1a, we recommend this cutoff to select bimodally expressed genes. Other cutoffs can be chosen by using equations (7) and (8) to compute the sample size that would be needed to validate this degree of bimodality in an independent data set.

The key issue that remains, then, is how to accurately estimate the parameters *δ* and *π* (and thus *BI*) from a given dataset. One approach is to use the expectation-maximization (EM) method for maximum-likelihood estimation in parameterized normal mixture models. We can then combine the EM method for mixture estimation with BIC for identifying bimodal distributions. We can use the R package MCLUST^{©} (University of Washington, Seattle, Washington), which follows this approach. We refer to this approach as a normal mixture-model-based clustering algorithm. Alternatively, we can use MCMC sampling techniques for the estimation in a parameterized normal mixture model. We applied both algorithms to estimate the statistical parameters of the measurements. Then, based on the estimated parameters obtained from both algorithms, we computed *BI* in order to identify and rank more reliable bimodal patterns of expression. The difference between the two algorithms is their output: the mixture model-based clustering technique provides point estimates, whereas the MCMC technique provides distributions of the estimated parameters. Knowing the distributions of the estimated parameters allows us to estimate the posterior probability of the classification of each sample.

## Simulation Studies

To evaluate the performance of the bimodality index for identifying genes with meaningful bimodal expression, we first performed simulation studies. We adopted the R package MCLUST^{©} (University of Washington, Seattle, Washington), to perform mixture-model-based clustering and to obtain the estimated statistical parameters. From these estimated values, we computed the bimodality index *BI* for each measurement, and then selected bimodal measurements by setting a cutoff on the *BI* value. For comparison, we applied a hybrid MCMC technique to the same simulated dataset to estimate the measurement parameters and compute *BI*. To compute *BI*, we used point-estimated parameters from MCLUST and the posterior mean estimates from MCMC.

### Unimodally distributed measurements

To obtain unimodal distributions, we generated expression datasets at four sample sizes: *n* = 50, 100, 200, and 300. Each simulated dataset consisted of 500 randomly generated samples from a unimodal normal distribution, *N*(*μ*= 5, *σ* = 1). Such a study is used to evaluate the rate of false positives or to evaluate the specificity of the proposed method.

The results of our null simulation are presented in Table 1. When the sample size was sufficiently large (≥200), our proposed method using estimated parameters from either MCLUST or MCMC performed equally well, with almost no false discoveries. Similarly, the rate of false positives was low when using BIC. When the sample size was relatively small (*n* < 100), the MCMC algorithm provided slightly better results. All three methods performed with high specificity in the simulated null datasets, which indicates very low rates of false positives.

### Bimodally distributed measurements

Next, we evaluated the proposed method on simulated measurements that were truly bimodally distributed. Here the simulations are more elaborate, since *δ* and *π* will affect the “strength” or “reliability” of the bimodal expression. To fully evaluate the performance of the proposed method, we simulated expression datasets with different parameter settings.

#### Various *δ* with sufficient *π* in each group (*π* = 0.3–0.7)

To evaluate how the proposed method detected bimodal measurements for different values of *δ*, we simulated datasets as *δ* varied from 2 to 5. Here, *δ* ≥ 4 indicates a strong bimodal pattern; *δ* = 3 indicates a weak bimodal pattern; and *δ* = 2 corresponds to a very weak bimodal pattern. We assumed equal variance in each group. For simplicity, we set *σ*_{1} = *σ*_{2} = 1, in which case *δ* is equivalent to Δ*μ*. We again used four sample sizes, *n* = 50, 100, 200, and 300. For this simulation, we let *π* range from 0.3 to 0.7 by steps of 0.1. For each *π*, we simulated 100 bimodal measurements. Therefore, we had 500 bimodal variables associated with different values of *π* for each sample size. We applied both MCLUST and MCMC to estimate the statistical parameters *μ*_{1}, *μ*_{2}, *σ*, *π*, and *δ*. Based on these estimated parameters, we computed *BI* for each measurement.

The results from this simulation study indicate that for a large *δ* (*δ* ≥ 4), the percentage of identified bimodal measurements using MCLUST was close to the percentage obtained using MCMC. Both techniques demonstrated high accuracy (Table 2a). When *δ* = 5, the MCLUST algorithm facilitated the identification of the bimodal measurements with no false negatives (Fig. 2). This was also true when BIC was used. Even with a small sample size (*n* = 50) we were able to identify nearly all of the samples as bimodal. As expected, when the separation of the means between the two groups was large and the sample population in each group was sufficient (30% ≤ *π* ≤ 70%), any of the three methods allowed us to easily identify the bimodal measurements. For *δ* = 3, the identification of bimodal measurements was the same using any of the three methods. When the sample size was small (*n* ≤ 50), all three methods identified about 50% of the bimodal measurements. As the sample size increased, all three methods performed equally well and identified nearly 100% of the bimodal measurements.

*δ*= 5 and

*π*= 0.3–0.7. The dataset contains 50 samples and 500 measurements. The black spots represent individual

**...**

*π*= 0.3 to 0.7, in 0.1 intervals).

When *δ* = 2, which indicates a very weak bimodal pattern and is slightly below our defined detection limit (bimodality index, *BI* = 1.1), we expect that most of the measurements will be identified as unimodal. Our study showed that only a few measurements were identified as bimodal under this setting (Fig. 3 and Table 2a). The results suggest that the MCMC method is slightly less sensitive than the MCLUST algorithm, but that the difference is almost negligible. Table 1 also shows that the BIC is more sensitive, especially with a larger sample size. This is not surprising. Because *δ* = 2, these simulations do arise from bimodal distributions. In effect, the BIC behaves like a cutoff on the bimodality index at a much smaller value than our limit of *BI* = 1.1 and thus will identify more samples as bimodal.

#### Various *δ* with small proportions in one group (*π* = 0.1, 0.2, 0.25, or 0.8, 0.9)

We then focused on the issue of highly unbalanced group sizes (summarized in Table 2b). The findings indicate that the bimodality index performed similarly when using either MCLUST or MCMC in settings with a strong bimodal pattern, i.e. when *δ* was large (*δ* ≥ 4) and the sample size was sufficient (*n* ≥ 100). Using MCMC, the performance was consistently more conservative. The BIC method was associated with a higher percentage of identified bimodal measurements, regardless of the sample size. The study also indicated that, when *π* was less than 0.1 or greater than 0.9 in one group, some bimodal measurements were detected as unimodal (Fig. 4). In other words, when *π* ≤ 0.1 or *π* ≥ 0.9 and the sample size is small (*n* ≤ 100), even when *δ* is sufficiently large, the false-negative rate will be high.

*δ*= 4 and

*π*= 0.1, 0.2, 0.25, 0.8 and 0.9. The simulated dataset contains 50 samples and 500 measurements. The set of quadratic curves

**...**

## Application to Cancer Microarray Gene Expression Measurements

Microarray gene expression measurements are more complicated than simulated data. To evaluate the usefulness of our proposed method for identifying genes with bimodal expression, we applied the algorithm to a dataset of microarray gene expression profiles from a study of breast cancer.

### Briefs of expression profiling and data processing

The gene expression profiling array dataset was produced by the Breast Cancer Pharmacogenomic Program at The University of Texas M. D. Anderson Cancer Center using Affemetrix U133A GeneChip. The dataset contains 133 human breast cancer samples, with each array containing 22,283 probe sets. (The expression profiles and clinical information are available at http://bioinformatics.mdanderson.org/pubdata.html.) The original purpose of the investigation was to develop multi-gene predictors of pathologic complete response (pCR) to preoperative therapy.^{14} The clinical variables associated with the dataset include disease stage, histologic grade, and routine clinical markers such as the estrogen receptor (ER), the human epidermal growth factor receptor 2 (HER-2), and the progesterone receptor (PR) status. Response to preoperative chemotherapy was also available.

For gene expression measurements, the signal intensities at the level of the probe sets were quantified by dChip 1.3 (http://dchip.org) using the PM-model only. Normalization was performed by dChip using the array with median brightness. The normalized expression measurement was logarithm transformed (base 2) for analysis. Because the expression data were produced on different dates, we considered the possibility of a “batch effect,” which occurs as a result of differences in the hybridization environment on different days. Although common in microarray gene expression profiling investigations, the batch effect necessitates correction of the data. In order to obtain meaningful results, we performed a gene-by-gene adjustment of the means to put the batches created on two different dates on the same scale. Briefly, we adjusted the means on each gene from one batch to match the means of the same genes in the other batch.

### Identifying genes with bimodal expression

Once data processing was complete, we applied the proposed method with the MCLUST and MCMC algorithms to identify the genes with bimodal expression. In both approaches, we estimated *μ*_{1}, *μ*_{2}, *σ*, *π*, and *δ* from the expression measurement of each gene across all samples, and computed *BI* for each gene. Figure 5 shows the genes from the breast cancer dataset identified as having bimodal expression.

*π*represent the population sizes;

*δ*is the difference between the means divided by standard deviation

**...**

Microarray gene expression profiling data contain significant levels of technical and biologic variation. Many measurements are just “noise,” containing no useful information. In order to identify genes with interesting expression patterns, it is common to apply a filtering criterion to eliminate some “noise.” The selection of a filtering condition involves some subjectivity. In this investigation, we filtered at several different levels. Specifically, we first computed the expression levels at several percentiles (0.25, 0.30, 0.35, 0.40, 0.45, and 0.50) using all expression measurements across all samples. We then computed the maximum expression value of each gene across all samples. If the maximum was less than the expression level at a selected filtering percentile, we considered that expression measurement to be “noise” and eliminated it from the analysis. For example, the overall 25th percentile of expression was 6.771. If a gene had a maximum expression of less than 6.771, we removed that gene from the analysis. Once the “noise” measurements were filtered out, we performed the analysis for identifying bimodal measurements using the MCLUST and MCMC techniques at each of the defined filtering percentiles and at each of the *BI* values. The results are summarized in Table 3.

The results show that using our method combined with MCLUST identified more genes with bimodal expression than using our method combined with MCMC at the same filtering conditions and the same bimodality index cutoff. This suggests again that MCLUST is more sensitive than MCMC. However, at every level of the bimodality index and at every filtration level based on a more stringent definition of noise, a larger percentage of the genes identified by MCMC remain above the noise, as compared to MCLUST. We also compared the genes with bimodal expression that were identified from our method combined with the MCMC versus the MCLUST algorithms. The results from that comparison show extensive overlap in the genes identified with bimodal expression by both algorithms (see Table 4). This suggests that using our method with either technique yields similar results when estimating the statistical parameters and computing the bimodality index for detecting genes with bimodal expression.

The results also show that vastly more genes (often by an order of magnitude) were identified as bimodal using BIC, but a smaller percentage of these genes remain above the filtration noise cutoffs. Because the bimodality index for these genes is small, this finding strongly suggests that the majority of the genes identified by BIC either have small separations between the modes or have highly imbalanced proportions of samples in the two groups. Figure 6 illustrates that 70% of genes called bimodal by BIC that have *BI* < 1.1 also have *π* < 10%, and 97% of these genes have *π* < 20%. Thus, most of these genes are driven by a relatively small number of samples in one of the two groups.

### Example of genes identified as having bimodal expression in the human breast cancer data

With known clinical information from the dataset, we checked three probesets related to three breast cancer genes, ERS1, PGR and HER2, which we expect to be expressed bimodally in our dataset. Previous investigation suggested that the three selected probesets correspond to the three genes strongly.^{15}^{–}^{17} In addition to the three known genes in breast cancer, we also present here two bimodal genes (CKB and BST2) discovered by our method.

The first example is the estrogen receptor (ERS1) with probe set ID 205225_at. From a previous analysis, we know that this probe set correlates highly with the clinical estrogen receptor status.^{16} Figure 7 illustrates the density plot for the ERS1 gene (top left). The estimated parameters are provided in Table 5. The point estimate is *BI* = 1.955 and the posterior probability of being bimodal is > 99.9% (based on *BI* = 1.1). This value of *BI* was the 16th largest among the 22,283 probe sets on the array. The clinical information associated with this experiment indicates that there are 51 patient samples with negative ER status and 82 patient samples with positive ER status in this dataset. The estimated proportions are close to the clinical ER status obtained from both the MCMC and MCLUST techniques. In comparison, MCLUST provided slightly better results.

**Left panel**: histograms of three identified bimodally expressed genes from breast cancer dataset; estrogen receptor (top), progesterone receptor (middle), and HER-2 (bottom).

**Right panel**: posterior distributions of bimodality index on the three genes,

**...**

The second example of a gene with bimodal expression involves the progesterone receptor (PGR), for which the probe set ID is 208305_at. PGR is an intercellular steroid receptor that specifically binds progesterone and is located at 11q22. The point estimate is *BI* = 1.733, and the posterior probability of it being bimodal is 100%. The density plot of PGR is illustrated in Figure 7 (middle left). The estimated measurement parameters are provided in Table 4. The clinical information indicates that the dataset contains information on 55 patient samples with positive PGR status, 75 patient samples with negative PGR status, and 3 patient samples with unknown PGR status. The estimated proportions in each group are close to those of the true sub-population (see Table 5).

The third example of bimodal gene expression involves the human epidermal growth factor receptor 2 (ERBB2 or HER2), for which the probe set ID is 216836_s_at.^{15}^{,}^{17} HER2 is important for its role in the pathogenesis of breast cancer and as a current target of treatment. The density plot of HER2 is illustrated in Figure 7 (bottom left). The estimated measurement parameters are provided in Table 5. The point estimate is *BI* = 1.634 and the posterior probability of HER2 being bimodal is >99.9% (at *BI* = 1.1). The clinical information indicates that the dataset includes information on 33 patient samples with positive HER2 tumor status, 99 patient samples with negative HER2 status, and one patient sample with unknown HER2 status. The estimated proportions in each group approximate those of the true sub-population. The results from these three examples strongly indicate that the proposed method works accurately with the MCMC and MCLUST algorithms. Importantly, all three of these bimodally expressed genes represent therapeutic targets for anti-estrogen and anti-HER2 therapies that are currently used in the clinic. We hope that the other bimodally expressed genes contain similarly important but novel drug targets.

In addition to the three “standard” breast cancer therapeutic target genes, we present two bimodal genes identified by our method, but previously unreported. The two genes are (1) Creatine kinase, brain (CKB; with probe set ID 200884_at), and (2) Bone marrow stromal cell antigen 2 (BST2; with probe set ID 201641_at). Both genes exhibit strong bimodal patters, with bimodality index 1.619 for CKB and 1.602 for BST2 (Fig. 8). Their roles in breast cancer are poorly understood and further investigation will be carried out.

### Identifying genes with strong bimodal expression

We also applied more stringent conditions to identify genes with bimodal expression. Using *BI* ≥ 1.5 as a cutoff and without filtering, we identified 181 and 213 genes with bimodal expression using the MCMC and MCLUST algorithms, respectively. There were 151 genes with bimodal expression that were identified using both algorithms. To test the hypothesis that bimodally expressed genes could be used as outcome or disease phenotype markers, we performed two-way hierarchical cluster analysis using these 151 genes (Fig. 9). The results of the analysis suggest that the genes with the strongest bimodal expression are closely related to two clinical types of breast cancers: ER-positive cancers and ER-negative cancers. As these two neoplastic diseases of the breast may originate from different cell types (luminal and basal epithelial cells, respectively) this association is not surprising and supports the validity of our method.^{18} We also correlated the clusters with HER2 status and with patient response to treatment, recorded as either pCR or residual disease (RD) (see Fig. 10). As noted, the major split in the dendrogram correlates with hormone receptor status. The second split in the dendrogram correlates with HER-2 status. These associations have already been noted in the clinical literature.^{17} However, correlations between these routine clinical markers and the other genes on our list will warrant further biological exploration.

**...**

## Discussion and Conclusions

In this paper, we have introduced a new approach to identifying and ranking meaningful and reliable bimodal measurements from large-scale gene expression measurements. The key to this approach is to define a bimodality index that provides an objective measure of the “strength” of the bimodal separation. Comparing with AIC or BIC, the major difference is that our approach is not just for selecting bimodal genes, but also ranks the genes. The bimodality index can be evaluated empirically, and its interpretation is justified by standard sample size and power calculations. To apply this method to large-scale gene expression data, we need to estimate the model parameters *μ*_{1}, *μ*_{2}, *σ*, and *π* in order to compute the bimodality index *BI*. Suitable techniques for performing such an analysis include a mixture-model-based clustering technique and the MCMC technique.

When applying a mixture-model-based clustering technique, the number of components (clusters) needs to be predefined. For bimodal measurements, a two-component mixture model is applied. The model fits the data with a mixture of two normal distributions, and provides point estimations of the measurement parameters *μ*_{1}, *μ*_{2}, *σ*, and *π*. The estimated parameters are then used to compute the bimodality index. The major advantage of the mixture-model-based clustering technique is computational efficiency. For large-scale gene expression data, that process can be completed in a very short time with high accuracy in the estimated parameters.

The advantage of the MCMC technique is that it provides distributions of the parameters *μ*_{1}, *μ*_{2}, *π*, and *σ*. Applying the MCMC technique with our method, we can use either the posterior mean or the full posterior distribution of the bimodality index. Because the process involves estimating the distributions, the MCMC technique takes longer and is more computationally intensive. This increased time can be especially problematic for large-scale gene expression data, which makes MCMC less popular in the analysis of high-throughput genomic data. That problem, however, can be overcome by using a parallel processor system, which involves breaking the expression dataset into several small subsets and performing the analysis in parallel for each subset. In this way, the MCMC algorithm can be carried out in a relatively short time. Although a number of MCMC algorithms are available, we only applied one MCMC technique in this investigation. We chose the hybrid MCMC technique because it is not difficult to implement and was readily available. (We did not intend to evaluate which of the MCMC algorithms performed better).

One potential objection to our method is that it assumes both components are normally distributed with the same variance. Both model-based clustering via the EM algorithm and MCMC techniques can be extended to use mixtures of t-distributions^{19} or mixtures of other distributions. As long as the distributions being used have a “central parameter” that plays the role of the mean for the normal distribution, the definition of the bimodality index proposed here carries over directly to these more general mixture distributions. One can also accommodate different standard deviations by defining the standardized distance between means to be *δ* = *|μ*_{1}/*σ*_{1} *– μ*_{2}/*σ*_{2}*|*The performances of these kinds of extensions to the method deserve further study.

We evaluated the sensitivity and specificity of the proposed method in identifying meaningful bimodal measurements through simulation studies. We compared the results of computing the bimodality index by two different algorithms, MCLUST and MCMC. We also compared the results obtained from our method with those obtained from the commonly used BIC. Due to the similarity in application between AIC and BIC, we did not include AIC in the comparison. In the null distribution, the results were quite similar; all methods had few false positives. In addition to the null distribution, we evaluated bimodal patterns with various parameter settings. Those simulations indicate that, although BIC is more sensitive for detecting bimodal distributions, using the bimodality index provides fine-grained control over the discoveries, allowing the researcher to rank the genes by their degree of bimodality and thus focus attention on the genes whose bimodal patterns are most believable and most likely to be clinically useful.

We further applied our proposed method to a dataset from a breast cancer gene expression profiling study. Unlike simulated data, actual gene expression data is far more complex and contains a high level of noise; therefore we expected a much higher rate of false discovery in that analysis, particularly when the size of one of the groups was small. When we applied the mixture-model-based clustering technique with BIC, over 35% of the genes were identified as bimodal. We tried to remove some noise prior to the analysis by setting various filtering conditions. After filtering, we still identified a substantial number of bimodal measurements when using BIC. We then applied our method to the filtered datasets. In contrast, our method produced much more reasonable results, which suggests that it is a useful approach for the analysis of real datasets. Moreover, the filtering process helped to remove some unreliable measurements for small values of *BI* (*BI* ≤ 1.3). For a strong bimodal pattern (*BI* ≥ 1.4), the results indicate that the bimodal measurements identified in this process were almost unchanged, regardless of whether or not the filtering conditions were applied. Further, the genes identified as most strongly bimodal appeared to be related to the hormone receptor (ERS1 and PGR) status and HER2 status of the breast cancer patients. Since these characteristics are known to be strong determinants of both gene expression and response to specific treatments in breast cancer, this finding provides evidence that using our method has the potential to focus on biologically important subsets of gene expression profiling data.

In addition to finding the expected genes, we described two additional genes (CKB and BST2) that exhibit strong bimodal expression patterns within breast cancer samples. Creatinine kinase, brain (CKB) is a cytoplasmic enzyme involved in energy homeostasis. It acts as a homodimer (CK-BB) in brain as well as in other tissues, and as a heterodimer with a similar muscle isozyme (CKM) in heart. Rubrey et al reported that 34% of breast cancer patients showed elevated serum levels of the CK-BB homodimer, and that an increased incidence of elevated serum CK-BB levels was associated with advanced stage disease.^{20} Using a more sensitive assay, Zarghami et al measured CK-BB levels in breast tumor cytosols and found that CK-BB was associated with more aggressive tumors but concluded that its value as a prognostic indicator was limited.^{21}

Bone marrow stromal cell antigen 2 (BST2 or CD317) was originally recognized as a surface antigen on bone marrow stromal cell lines; however, it is predominantly expressed in liver, lung, heart, and placenta, and not typically expressed in normal breast. Becker et al using Affymetrix HuGeneFL and Hu95Av2 microarray experiments, identified BST2 as significantly up-regulated in a tamoxifen-resistant cell line derived from the mammary tumor cell line MaCa 3366.^{22} Recently, Cai et al reported that BST2 is up-regulated in breast cancer with bone metastasis, and concluded that BST2 may be a potential biomarker in breast cancer with bone metastasis.^{23}

Our method performed reasonably well in revealing meaningful bimodal patterns of gene expression in comparison with the commonly-used BIC approach. The results from the analyses suggest that our proposed method is a sensible approach for the analysis of large-scale gene expression data, and can be extended for broad application.

## Acknowledgements

*Funding*: This research was supported in part by the National Cancer Institute [P50 CA116199]; and the United States Department of Defense [W81XWH-07-1-0306 and W81XWH-04-1-0142].

## Footnotes

**Author’s Contributions**

**KRC and JW** derived background method, designed the investigation, performed the analyses, and prepared the manuscript. **LP and WFS** participated in conducting breast cancer pharmacogenomic clinical trial and collected all biopsies for the study. **SJW** technical programming support and performed analyses.

**Disclosure**

The authors report no conflicts of interest.

## References

*t*distribution. Statistics and Computing. 2000;10:335–44.

**Libertas Academica**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (3.4M) |
- Citation

- SIBER: systematic identification of bimodally expressed genes using RNAseq data.[Bioinformatics. 2013]
*Tong P, Chen Y, Su X, Coombes KR.**Bioinformatics. 2013 Mar 1; 29(5):605-13. Epub 2013 Jan 9.* - Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes.[BMC Bioinformatics. 2010]
*Hellwig B, Hengstler JG, Schmidt M, Gehrmann MC, Schormann W, Rahnenführer J.**BMC Bioinformatics. 2010 May 25; 11:276. Epub 2010 May 25.* - Bimodal gene expression and biomarker discovery.[Cancer Inform. 2010]
*Ertel A.**Cancer Inform. 2010 Feb 4; 9:11-4. Epub 2010 Feb 4.* - Bioinformatics analysis of microarray data.[Methods Mol Biol. 2009]
*Zhang Y, Szustakowski J, Schinke M.**Methods Mol Biol. 2009; 573:259-84.* - Functional genomics and proteomics in the clinical neurosciences: data mining and bioinformatics.[Prog Brain Res. 2006]
*Phan JH, Quo CF, Wang MD.**Prog Brain Res. 2006; 158:83-108.*

- DNA methylation profiling of medulloblastoma allows robust sub-classification and improved outcome prediction using formalin-fixed biopsies[Acta neuropathologica. 2013]
*Schwalbe EC, Williamson D, Lindsey JC, Hamilton D, Ryan SL, Megahed H, Garami M, Hauser P, Dembowska-Baginska B, Perek D, Northcott PA, Taylor MD, Taylor RE, Ellison DW, Bailey S, Clifford SC.**Acta neuropathologica. 2013 Mar; 125(3)359-371* - Systems Approaches to Predict the Functions of Glycoside Hydrolases during the Life Cycle of Aspergillus niger Using Developmental Mutants ∆brlA and ∆flbA[PLoS ONE. ]
*van Munster JM, Nitsche BM, Akeroyd M, Dijkhuizen L, van der Maarel MJ, Ram AF.**PLoS ONE. 10(1)e0116269* - A Peculiar Mutation Spectrum Emerging from Young Peruvian Patients with Hepatocellular Carcinoma[PLoS ONE. ]
*Marchio A, Bertani S, Rojas Rojas T, Doimi F, Terris B, Deharo E, Dejean A, Ruiz E, Pineau P.**PLoS ONE. 9(12)e114912* - Recurrent expression signatures of cytokines and chemokines are present and are independently prognostic in acute myelogenous leukemia and myelodysplasia[Blood. 2010]
*Kornblau SM, McCue D, Singh N, Chen W, Estrov Z, Coombes KR.**Blood. 2010 Nov 18; 116(20)4251-4261* - The Italian primary school-size distribution and the city-size: a complex nexus[Scientific Reports. ]
*Belmonte A, Di Clemente R, Buldyrev SV.**Scientific Reports. 45301*

- The Bimodality Index: A Criterion for Discovering and Ranking Bimodal Signatures...The Bimodality Index: A Criterion for Discovering and Ranking Bimodal Signatures from Cancer Gene Expression Profiling DataCancer Informatics. 2009; 7()199

Your browsing activity is empty.

Activity recording is turned off.

See more...