# QUBIC: a qualitative biclustering algorithm for analyses of gene expression data

Guojun Li^{1,2}, Qin Ma^{1,2}, Haibao Tang^{3}, Andrew H. Paterson^{3} and Ying Xu^{1,4,*}

^{1}Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA,

^{2}School of Mathematics, Shandong University, Jinan 250100, China,

^{3}Department of Plant Biology, University of Georgia, USA and

^{4}College of Computer Science and Technology, Jilin University, Changchun, China

## Abstract

Biclustering extends traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still, the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form than existing algorithms by combining qualitative (or semi-quantitative) measures of gene expression data with a combinatorial optimization technique. One key feature of the QUBIC algorithm is that it can identify all statistically significant biclusters, including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key feature is that the algorithm solves such general biclustering problems very efficiently, handling problems with tens of thousands of genes under up to thousands of conditions in a few minutes of CPU time on a desktop computer. We have demonstrated considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/∼maqin/bicluster. A server version of QUBIC is also available upon request.

## INTRODUCTION

DNA microarrays provide a powerful means for probing the functional states of a cell population by allowing simultaneous observation of mRNA expression patterns of all their genes collected over time and/or under different experimental conditions. By comparing the gene expression patterns under different conditions such as cancerous versus healthy tissues, one can possibly derive information about genes associated with a particular cellular condition (e.g. cancerous cells at a specific developmental stage) or even specific biochemical pathways. To analyze the complex microarray data, numerous computational tools have been developed. Among them, clustering of genes based on the similarities of their expression patterns (co-expressed genes) using (traditional) clustering strategies (1–3) represents one of the most popular approaches to microarray data analyses.

The traditional clustering techniques attempt, in the context of microarray data analyses, to partition a set of genes into ‘clusters’ with similar expression patterns under specified conditions (3), or to identify such clusters from an otherwise unstructured microarray data set (4). While useful, such clustering algorithms are known to be inadequate for the general gene-expression analysis problem, which often requires identifying genes that are co-expressed under some (to-be-identified) conditions rather than under all given conditions. The difficulty is that for any m given conditions, there are 2^{m} combinations of conditions to consider, making this general clustering problem much more difficult to solve.

A popular way to visualize microarray data for gene expression analyses is to represent the data set as a matrix with rows representing the genes and columns representing the conditions (or the other way around) with each element of the matrix representing the relative mRNA abundance of a gene under a specific condition. So identifying groups of genes in a microarray data set that share similar expression patterns under to-be-identified conditions is equivalent to finding submatrices with similar properties. Partitioning a matrix into submatrices with approximately the same values was first studied by Morgan and Sonquist (5) and Hartigan (6). In 2000, Getz *et al.* (7) presented a coupled two-way clustering approach that applies hierarchical clustering to each dimension separately, and then combines the clustering results along each dimension in a somewhat problem-specific manner. It was Cheng and Church (8) who first introduced the concept of ‘direct clustering’, originally proposed by Hartigan (6), to the field of gene expression data analyses, and referred to it as ‘biclustering’, that is, finding subsets of conditions under which some (to-be-identified) subsets of genes have similar expression patterns. Each such submatrix is called a ‘bicluster’.

Cheng and Church (8) proposed a quantitative measure, ‘mean squared residue’, essentially a variability measure, as a guide to search for biclusters in a gene expression data set, which has been adopted by numerous biclustering algorithms (9–11). Recent studies suggest that this measure is useful only for identifying certain classes of co-expressed genes, but not adequate to detect other transcriptionally co-regulated genes (12–14). Another measure was proposed lately by Aguilar-Ruiz (15) to deal with co-regulated genes with ‘scaling patterns’, which, while more general than the previous measure, was found to be rather challenging to solve algorithmically. Various algorithms have been developed, attempting to solve the biclustering problem as defined either by Cheng and Church (8) or by Aguilar-Ruiz (15) or variations, including the work by Kung *et al.* (16), Li *et al.* (17), Reiss *et al.* (11), Pedro *et al.* (18) and Bryan *et al.* (9,10,19), to name a few, which has led to a number of publicly available computer servers for biclustering analysis of microarray data. Among the published biclustering servers, some have employed combinatorial optimization techniques, such as SAMBA (14), ISA (20), Bimax (13) and NNN (21). A common issue with most of the combinatorial techniques is their high computational complexity, even for the highly simplified cases like using a 0/1 matrix to represent down/up regulations in the observed microarray data.

The state of the art is that the existing biclustering algorithms are generally effective in identifying genes of similar expression values under to-be-identified conditions, but not effective in identifying gene clusters with similar expression patterns in general. Here we report a new biclustering algorithm QUBIC that can effectively and efficiently identify all statistically significant biclusters (allowing overlaps) that cannot be identified by the existing biclustering algorithms and beyond, including both definitions for a biclustering problem given by (8) and (15), as well as finding both positively and negatively correlated expression patterns. We have demonstrated the effectiveness of the QUBIC program and its computational efficiency on a number of benchmark data sets, by comparing it with several salient programs.

## METHODS

In our biclustering scheme, we represent the expression values in a qualitative or semi-quantitative manner so that we get a new matrix representation of a gene expression data set under multiple conditions, called a representing matrix, in which the expression level of a gene under each condition is represented as an integer value (see ‘Qualitative representation of gene expression data’ section for details). We consider that two genes have correlated expression patterns under a subset of conditions if the corresponding integers along the two corresponding rows of the matrix are identical. More generally, we define the similarity level between two genes under a specified set of conditions to be the number of conditions under each of which the two genes have the same (signed) nonzero integer. For applications where identification of negatively correlated genes is desired, we generalize the definition in the same way as the above except that we consider genes with the same corresponding (nonzero) integers but all with opposite signs. We call a submatrix of the above matrix ‘feasible’ if each pair of rows of the submatrix is either (approximately) the same or the opposite (i.e. the same but with opposite signs across the entire rows). Now our definition of a biclustering problem is to find all the optimal feasible submatrices in a given matrix according to some specified optimization criteria. It is not hard to see that both definitions of a biclustering problem given in (8) and (15) are special cases of our definition. Actually, our definition covers more than just these two cases as we can see from Figures 1A and 2A in the Supplementary Data, where we show two biclustering problems that are more general than both definitions of (8) and (15). Figure 3A in the Supplementary Data shows another biclustering problem in which four biclusters with different expression patterns are implanted in a background matrix. To the best of our knowledge, none of the existing biclustering programs are capable of finding these biclusters.

[Figure caption: (**A** and **B**) models and varying degrees of overlapping for ‘constant’ **...**]

[Figure caption: (**A**) Proportions of *E. coli* biclusters that have significant overlap (*P* < 0.01) with GO biological processes, KEGG pathways and experimentally verified regulons. (**B**) Proportions of yeast biclusters that are statistically enriched (*P* < 0.01) **...**]

The key algorithmic idea of our biclustering program is outlined as follows. For a given representing matrix of a microarray data set, we construct a weighted graph G with genes represented as vertices, edges connecting every pair of genes, and the weight of each edge being the similarity level between the two corresponding (entire) rows. Clearly, the higher the weight, the more similar the two corresponding rows are. Intuitively, genes in a bicluster should induce a heavier subgraph of G because under a subset of the conditions, these genes have highly similar expression patterns that should make the weight of each involved edge heavier, compared to the edges in the background. But it should be noted that a heavy subgraph may not necessarily correspond to a bicluster, i.e. genes from a heavy subgraph may not necessarily have similar expression patterns because different edges in a subgraph may have heavier weights under completely different subsets of conditions (see Figure 5 in the Supplementary Data for an example). It should also be noted that recognizing all heavy subgraphs in a weighted graph is itself computationally intractable because identification of maximum cliques in a graph is a special case of this, and the maximum clique problem is a well-known intractable problem (NP-hard). So in our solution, we do not directly solve the problem of finding heavy subgraphs in a graph. Instead, we build our biclustering algorithm on this graph representation of a microarray gene expression data set, and tackle the biclustering problem as follows. We find all feasible biclusters (I,J) in the given data set such that min{|I|, |J|} is as large as possible, where I and J are subsets of genes and conditions, respectively.
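The graph construction described above can be illustrated with a minimal Python sketch. It assumes the representing matrix is a list of integer rows; the function names `edge_weight` and `build_weighted_edges` and the toy matrix are hypothetical and are not taken from the QUBIC source code.

```python
from itertools import combinations


def edge_weight(row_a, row_b):
    """Similarity level: number of conditions where both genes
    carry the same (signed) nonzero integer."""
    return sum(1 for a, b in zip(row_a, row_b) if a != 0 and a == b)


def build_weighted_edges(matrix):
    """Return (weight, i, j) for every pair of genes, heaviest edge first."""
    edges = [
        (edge_weight(matrix[i], matrix[j]), i, j)
        for i, j in combinations(range(len(matrix)), 2)
    ]
    # Sort by decreasing weight; ties are broken by gene indices.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))
    return edges


# Toy representing matrix: 4 genes (rows) under 4 conditions (columns).
representing_matrix = [
    [1, 1, 0, -1],   # gene 0
    [1, 1, 0, -1],   # gene 1
    [1, 0, 0, -1],   # gene 2
    [0, -1, 1, 0],   # gene 3
]
edges = build_weighted_edges(representing_matrix)
print(edges[0])  # heaviest edge: (3, 0, 1)
```

Note that the all-pairs construction is quadratic in the number of genes, which is acceptable for a sketch but says nothing about how the actual program stores its edges.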

Our algorithm consists of two key steps: (i) representing a microarray data set using a qualitative matrix as outlined earlier, and (ii) identifying all biclusters in this matrix by finding biclusters one-by-one, where for each bicluster, it starts with the heaviest (unused) edge as a seed to build an initial bicluster and then iteratively recruits additional genes into the current bicluster without violating a pre-specified consistency level (see below).

### Qualitative representation of gene expression data

The representing matrix is composed of signed integers and 0's, which will be filled based on (i) the decision regarding if each gene has its expression value changed or not, i.e. up- or downregulated, or unchanged under each experimental condition, and (ii) the ranking of all the upregulating conditions for each gene, based on the expression values of the gene under these conditions (a user does not need to preprocess their data, e.g. to determine the fold-change or compute the log values of the raw data); and a similar ranking among all downregulating conditions for each gene. Details follow.

#### Recognition of unaffected expression values

We use the following method to distinguish affected expression values from the background data. For each (gene) row *i* of the original expression data matrix with *n* rows and *m* columns, we sort its expression values in increasing order as follows:

$$v_{i1} \le v_{i2} \le \cdots \le v_{is} \le \cdots \le v_{ic} \le \cdots \le v_{i,m-s+1} \le \cdots \le v_{im},$$

where *c* = *m*/2 and *s* – 1 = *m* × *q*, where *q* is a parameter that can be selected by the user, and its default value in our program is 0.06. A gene *i* is deemed to be unchanged under condition *j* if and only if its expression value $w_{ij}$ belongs to the interval $(v_{ic} - d_i,\ v_{ic} + d_i)$, where

$$d_i = \min\{v_{ic} - v_{is},\ v_{i,m-s+1} - v_{ic}\}.$$

The reason that we define the unaffected expression values in this way is given in the Supplementary Data.

#### Ranking of regulating conditions

We consider a condition as a downregulating condition for gene *i* in the above list if its value is ≤ $v_{ic} - d_i$, and as an upregulating condition if its value is ≥ $v_{ic} + d_i$. We now sort all the upregulating conditions for gene *i* into decreasing order of their corresponding expression values, and use this order as the rank of each upregulating condition for gene *i*; we rank the downregulating conditions in a similar manner except that we sort the relevant gene-expression values into increasing order, and we use this order as the rank of each downregulating condition for gene *i*. To distinguish between up- and downregulating conditions, we give each upregulating condition a ‘+’ sign and each downregulating condition a ‘–’ sign. We consider two genes as oppositely regulated under a subset of conditions if they have identical nonzero integers column-wise except with opposite signs.

For practical applications (considering the noisy and stochastic nature of the real gene-expression data), we typically use a predetermined range of ranks, say, rank 1, …, 10, which is much smaller than the number of conditions, and then assign multiple conditions with similar expression values for the same gene *i* into the same rank. The specific range of ranks for a particular application has to be determined using a trial-and-error approach. The QUBIC program provides the flexibility to allow the user to select the levels, *r*, of ranks for both up- and downregulating conditions with *r*'s default value set to be 1. A basic requirement that needs to be met is that for upregulating conditions, the expression values of rank *i* should be higher than those of rank *i* + 1 for all *i* < *r*. A similar requirement needs to hold for the downregulating conditions for each gene. It should be noted that the parameter *r* allows QUBIC to distinguish up to *r*! biclusters with different expression patterns in a provided matrix. We omit further discussion about this.
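To make the qualitative representation concrete, the following Python sketch discretizes one gene row for the simplest case *r* = 1, so that every condition becomes +1, –1 or 0. It is a sketch under stated assumptions rather than the QUBIC implementation: the function name `discretize_row` is hypothetical, *c* = *m*/2 is floored, *m* × *q* is rounded to the nearest integer, and *s* is clamped so that it never exceeds *c*.

```python
def discretize_row(values, q=0.06):
    """Map one gene's expression values to +1 (up), -1 (down) or 0 (unchanged).

    Follows the interval rule in the text for r = 1. The flooring, rounding
    and clamping below are assumptions made for this sketch.
    """
    m = len(values)
    if m == 0:
        return []
    v = sorted(values)
    c = max(m // 2, 1)                      # c = m/2, floored (assumption)
    s = min(int(round(m * q)) + 1, c)       # from s - 1 = m * q (rounded, clamped)
    v_c = v[c - 1]                          # v_{ic}
    v_s = v[s - 1]                          # v_{is}
    v_top = v[m - s]                        # v_{i,m-s+1} in 1-based indexing
    d = min(v_c - v_s, v_top - v_c)         # d_i
    labels = []
    for w in values:
        if w >= v_c + d and w > v_c:
            labels.append(1)                # upregulating condition
        elif w <= v_c - d and w < v_c:
            labels.append(-1)               # downregulating condition
        else:
            labels.append(0)                # unchanged (includes the d = 0 tie)
    return labels


row = [5, 0, 9, 3, 1, 7, 4, 8, 2, 6]
print(discretize_row(row, q=0.1))  # [0, -1, 1, 0, -1, 1, 0, 1, 0, 0]
```

In this toy row, *m* = 10 and *q* = 0.1 give *s* = 2 and *c* = 5, so the unchanged interval is (1, 7); the values 0 and 1 are marked as downregulated and 7, 8 and 9 as upregulated. Extending the sketch to *r* > 1 would additionally require grouping the regulated conditions into ranks, which is not shown here.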

### Biclustering through finding a heavy subgraph

Consider a representing matrix M with *n* rows and *m* columns as discussed above, representing expression levels of *n* genes collected under *m* conditions, and a corresponding weighted graph *G* with the vertex set *V* and the edge set *E* as introduced earlier. Each edge has a weight defined as the number of columns under each of which the two rows (genes) have the same nonzero integer. The basic biclustering problem is to find a submatrix (*I*, *J*) of M, with *I* being a subset of rows (genes) and *J* a subset of columns (conditions) so that **min**{|*I*|, |*J*|} is maximal and the consistency level of (*I*, *J*) is higher than a prespecified value *c*, 0 < *c* ≤ 1.0, which can be set by the user. In our current program, *c* is set to be 0.95. The ‘consistency level’ of a submatrix is defined as the minimum, over its columns, of the ratio between the number of identical nonzero integers in a column and the total number of rows in the submatrix.
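The consistency level defined above can be computed directly. The sketch below is a minimal Python illustration; the function name `consistency_level` is hypothetical, and ‘the number of identical nonzero integers in a column’ is interpreted as the count of the most frequent nonzero value in that column, which is an assumption.

```python
from collections import Counter


def consistency_level(matrix, rows, cols):
    """Minimum, over the selected columns, of
    (count of the most frequent nonzero value) / (number of selected rows)."""
    if not rows:
        raise ValueError("a submatrix needs at least one row")
    level = 1.0
    for j in cols:
        counts = Counter(matrix[i][j] for i in rows if matrix[i][j] != 0)
        top = max(counts.values()) if counts else 0
        level = min(level, top / len(rows))
    return level


M = [
    [1, 1, 0, -1],
    [1, 1, 0, -1],
    [1, 0, 0, -1],
    [0, -1, 1, 0],
]
print(consistency_level(M, rows=[0, 1, 2], cols=[0, 3]))     # 1.0
print(consistency_level(M, rows=[0, 1, 2], cols=[0, 1, 3]))  # 2/3, limited by column 1
```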

Intuitively, a bicluster should correspond to a maximal and connected subgraph of G consisting of heavier edges, on average, than edges of an arbitrary subgraph not overlapping such bicluster subgraphs, whose total edge-weight is stochastic. Specifically, two genes from the same bicluster should have a heavy edge by nature while two arbitrary genes may have a heavy edge only by chance. Our biclustering algorithm is built on this observation. The algorithm iterates on a set *S* of seeds (edges). Initially, *S* is set to be the sorted list of edges in *G*. An edge *e* = *g*_{i}*g*_{j} is considered to be a seed if and only if:

- at least one of its genes *g*_{i} and *g*_{j} is not in any previously identified bicluster, or
- *g*_{i} and *g*_{j} are in different biclusters *B*_{1} = (*I*_{1}, *J*_{1}) and *B*_{2} = (*I*_{2}, *J*_{2}) with *I*_{1} ∩ *I*_{2} = Ø and *w*(*e*) ≥ max{|*I*_{1}|, |*I*_{2}|},

where *w*(*e*) is the weight of edge *e*. The algorithm builds an initial bicluster (*I*, *J*) based on a selected seed, and then it expands the bicluster along both the vertical and horizontal directions without violating the preset consistency level, and outputs a bicluster when it cannot be further expanded. Details follow.
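The seed test can be written as a small predicate. In the Python sketch below, biclusters are represented as pairs of sets (genes, conditions); the name `is_seed` is hypothetical, and because biclusters may overlap, the second condition is interpreted as ‘there exists a pair of gene-disjoint biclusters containing the two genes’, which is an assumption on our part.

```python
def is_seed(gi, gj, weight, biclusters):
    """Decide whether edge (gi, gj) with the given weight qualifies as a seed.

    biclusters: list of (gene_set, condition_set) pairs found so far.
    """
    with_gi = [genes for genes, _ in biclusters if gi in genes]
    with_gj = [genes for genes, _ in biclusters if gj in genes]
    # Condition 1: at least one gene is not in any previous bicluster.
    if not with_gi or not with_gj:
        return True
    # Condition 2: the genes sit in gene-disjoint biclusters and the edge is
    # at least as heavy as the larger of the two gene sets.
    for i1 in with_gi:
        for i2 in with_gj:
            if not (i1 & i2) and weight >= max(len(i1), len(i2)):
                return True
    return False


found = [({0, 1, 2}, {0, 1}), ({3, 4}, {2, 3})]
print(is_seed(0, 5, 1, found))  # True: gene 5 is in no bicluster
print(is_seed(0, 3, 3, found))  # True: disjoint biclusters, weight 3 >= 3
print(is_seed(0, 3, 2, found))  # False: weight too small
print(is_seed(0, 1, 5, found))  # False: both genes share a bicluster
```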

#### Step 1 (Seeding on the representing graph)

If *S* is empty, stop; otherwise, check if the first element of *S* is a seed. If it is not, remove it from *S*, and repeat this step; otherwise use it to create a new bicluster as follows: find all the conditions under which the two genes of the seed have identical nonzero integer values, set these columns of the two genes as the current bicluster *B* = (*I*, *J*), and go to Step 2.

Note that the consistency level of the current bicluster is 1.0. The following step attempts to increase min{|*I*|, |*J*|} of the current bicluster by adding additional genes, while maintaining the consistency level at 1.0.

#### Step 2 (Expansion while maintaining total column-wise consistency)

Expand the current bicluster *B* = (*I*, *J*) by adding a new gene (if any) from outside of *I* which is most consistent with *B*, giving rise to a new bicluster *B*′ = (*I*′, *J*′), where *I*′ is *I* after adding the new gene and *J*′ is obtained from *J* by deleting those columns where the total consistency is lost. If min{|*I*′|, |*J*′|} ≥ min{|*I*|, |*J*|}, set *B* to *B*′, then repeat Step 2; otherwise, if the preset consistency level is 1.0, output *B* and remove the current seed from *S*; else go to Step 3.
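Steps 1 and 2 can be sketched together in Python as follows. This is an illustration only: the function names are hypothetical, and ‘most consistent with *B*’ is interpreted as the gene that lets the bicluster keep the largest number of its current columns, which is an assumption rather than a statement about the QUBIC source code.

```python
def seed_bicluster(matrix, gi, gj):
    """Step 1: columns where the two seed genes share the same nonzero integer."""
    cols = [k for k in range(len(matrix[gi]))
            if matrix[gi][k] != 0 and matrix[gi][k] == matrix[gj][k]]
    return [gi, gj], cols


def expand_total_consistency(matrix, rows, cols):
    """Step 2: greedily add genes while keeping consistency level 1.0."""
    rows, cols = list(rows), list(cols)
    while True:
        ref = matrix[rows[0]]           # all current rows agree on cols
        best_gene, best_cols = None, []
        for g in range(len(matrix)):
            if g in rows:
                continue
            kept = [k for k in cols if matrix[g][k] == ref[k]]
            if len(kept) > len(best_cols):
                best_gene, best_cols = g, kept
        if best_gene is None:
            return rows, cols           # no gene shares any column
        if min(len(rows) + 1, len(best_cols)) < min(len(rows), len(cols)):
            return rows, cols           # adding the gene would shrink min{|I|, |J|}
        rows.append(best_gene)
        cols = best_cols                # drop columns that lost total consistency


toy = [
    [1, 1, -1, 1, 0],    # gene 0
    [1, 1, -1, 1, 1],    # gene 1
    [1, 1, -1, 0, 1],    # gene 2
    [1, 1, -1, 0, 0],    # gene 3
    [0, -1, 1, 0, 0],    # gene 4
]
I, J = seed_bicluster(toy, 0, 1)
print(I, J)                                   # [0, 1] [0, 1, 2, 3]
print(expand_total_consistency(toy, I, J))    # ([0, 1, 2, 3], [0, 1, 2])
```

In the toy run, adding gene 2 removes column 3 but raises min{|*I*|, |*J*|} from 2 to 3, so the move is accepted; gene 4 never matches any column and is left out.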

#### Step 3 (Expansion allowing less than total consistency)

Expand the current bicluster *B* by adding as many columns as possible without having the consistency level of the bicluster go below *c* as follows: for each column not in *B*, if the ratio between the number of identical nonzero integers in the rows of *I* and |*I*| is ≥ *c*, add it to *J*. Let *B*′ = (*I*′, *J*′) be the new bicluster and *T* be the consensus sequence of *B*′ consisting of the dominating elements of the columns of *B*′, where the dominating element is the element with the highest frequency in the column; add as many rows as possible to *B*′ such that each new row has at least |*I*′|*c* identical nonzero integers to those of *T*. Go to Step 4.
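A hedged Python sketch of Step 3 follows. The function names are hypothetical, the dominating element is restricted to nonzero values (an assumption), and the row-recruitment threshold is exposed as the argument `min_matches` so that the caller decides how to derive it from *c*; none of this is taken from the QUBIC source code.

```python
from collections import Counter


def dominant_nonzero(matrix, rows, col):
    """Most frequent nonzero value of a column within the given rows, and its count."""
    counts = Counter(matrix[i][col] for i in rows if matrix[i][col] != 0)
    if not counts:
        return 0, 0
    return counts.most_common(1)[0]


def expand_with_tolerance(matrix, rows, cols, min_matches, c=0.95):
    """Step 3: add columns with consistency >= c, then rows matching the consensus."""
    rows, cols = list(rows), list(cols)
    for k in range(len(matrix[0])):
        if k not in cols:
            _, count = dominant_nonzero(matrix, rows, k)
            if count / len(rows) >= c:
                cols.append(k)
    # Consensus sequence T: dominating (nonzero) element of each column.
    consensus = {k: dominant_nonzero(matrix, rows, k)[0] for k in cols}
    for g in range(len(matrix)):
        if g in rows:
            continue
        matches = sum(1 for k in cols
                      if consensus[k] != 0 and matrix[g][k] == consensus[k])
        if matches >= min_matches:
            rows.append(g)
    return rows, cols, consensus


toy3 = [
    [1, 1, -1, 1],    # gene 0
    [1, 1, -1, 1],    # gene 1
    [1, 1, -1, 1],    # gene 2
    [1, 1, -1, 0],    # gene 3
    [0, -1, 1, 0],    # gene 4
]
rows3, cols3, T = expand_with_tolerance(toy3, [0, 1, 2], [0, 1, 2], min_matches=3)
print(rows3, cols3)  # [0, 1, 2, 3] [0, 1, 2, 3]
```

Column 3 is fully consistent among genes 0 to 2 and is added; gene 3 then agrees with the consensus on three columns and is recruited, while gene 4 agrees on none.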

We also include negatively co-regulated genes, if any, into our biclusters by executing the following step.

#### Step 4 (Expansion by adding oppositely regulated genes)

Continue to expand the current bicluster *B* by adding oppositely regulated genes to it: let *T* be the consensus sequence of *B*; add as many rows as possible to *B* such that each added row has at least |*I*′|*c* identical nonzero integers but with opposite signs to those of *T*. Output *B* and go to Step 1.
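Step 4 mirrors the row recruitment of Step 3 with the signs flipped. In the Python sketch below, the consensus sequence is passed in as a mapping from column index to its dominating nonzero value, and the threshold is again an explicit argument; the name `add_opposite_rows` and the toy data are hypothetical.

```python
def add_opposite_rows(matrix, rows, cols, consensus, min_matches):
    """Step 4: recruit genes whose entries are the sign-flipped consensus
    in at least min_matches of the bicluster's columns."""
    rows = list(rows)
    for g in range(len(matrix)):
        if g in rows:
            continue
        opposite = sum(1 for k in cols
                       if consensus[k] != 0 and matrix[g][k] == -consensus[k])
        if opposite >= min_matches:
            rows.append(g)
    return rows


toy4 = [
    [1, 1, -1, 1],      # gene 0
    [1, 1, -1, 1],      # gene 1
    [-1, -1, 1, -1],    # gene 2: mirror image of the consensus
    [-1, 0, 0, 1],      # gene 3: opposite in one column only
]
consensus4 = {0: 1, 1: 1, 2: -1, 3: 1}
print(add_opposite_rows(toy4, [0, 1], [0, 1, 2, 3], consensus4, min_matches=3))
# [0, 1, 2]
```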

The algorithm has a few unique and strong features worth mentioning: (i) if a significant bicluster is being built but not completed in Step 2 for some reason, so that the bicluster is not recognized, this problem can be remedied later, with multiple chances, by using other edges of the bicluster as seeds; (ii) the algorithm is able to find biclusters not only of positively co-regulated genes but also of negatively co-regulated genes; (iii) the program allows a user to provide a set of seeds and build biclusters based on the provided seeds. This capability is included based on the consideration that a biologist may be interested in finding genes related to a specific set of genes; and (iv) although the algorithm is greedy in nature, it does not in general suffer from the issue of getting stuck in local optima since it uses multiple starting points (seeds) to find each bicluster. Our application results strongly indicate this is the case for the program. The pseudo code of the algorithm is provided in the Supplementary Data.

### Parameters of QUBIC

QUBIC has a number of parameters, namely, the range *r* of possible ranks, the percentage *q* of the regulating conditions for each gene, the required consistency level *c* for a bicluster, the desired number *o* of the output biclusters, and the control parameter *f* for overlaps among to-be-identified biclusters. For each of these parameters, we allow the user to adjust the default value to provide some flexibility.

The parameters *r* and *q* affect the granularity of the biclusters. A user can start with a small value of *r* (the default value is 1 so the corresponding data matrix consists of values ‘+1’, ‘–1’ and ‘0’), evaluate the results, and then use larger values (should not be larger than half of the number of the columns) to look for fine structures within the identified biclusters. The choice of *q*'s value depends on the specific application goals; that is if the goal is to find genes that are responsive to local regulators, we should use a relatively small *q*-value; otherwise we may want to consider larger *q*-values. The default value of *q* is 0.06 in QUBIC (this value is selected based on the optimal biclustering results on simulated data). The default value of *c* is 0.95, and *o*'s default value is 100. In addition, we have a parameter *f* to control the level of overlaps between to-be-identified biclusters (not discussed in the above algorithm); its default value is set to 1 to ensure that no two reported biclusters overlap more than *f*. QUBIC also provides the option that a user can skip the step of using ranks to represent the actual gene expression values to go directly to the biclustering step on the provided matrix.

## RESULTS

We now show the application results of QUBIC first on a number of benchmark data sets developed by Prelic *et al.* (13) and on some simulated data sets constructed by ourselves. The application results on these data sets indicate that our program outperforms the existing and popular biclustering tools, such as SAMBA (14), ISA (20), BIMAX (13), RMSBE (22) and a hierarchical clustering method (HCL) in both the identification accuracy and the computational efficiency. To test the boundaries of our program, we have constructed simulated data sets with tens of thousands of genes under thousands of conditions. The algorithm can find all the embedded biclusters from such large data sets within several minutes on a desktop PC workstation. We then applied the algorithm to actual biological data, and derived a number of new insights about these microarray data. For all the tests, we have used the following parameters: *r* = 1, *q* = 0.06, *c* = 0.95, *o* = 100, *f* = 1 (unless stated otherwise), and all results are tested on a 64-bit machine.

### Applications on Prelic's benchmark data sets

We have tested QUBIC on a benchmark set proposed by Prelic *et al.* (13), which consists of two types of biclusters, constant biclusters and coherent biclusters (23). It is easy to check that both are special cases of our definition of a bicluster and the details about the construction of the benchmark sets can be found in (13).

We have compared our algorithm with four existing algorithms, BIMAX (13), Iterative Signature Algorithm (ISA) (20), SAMBA (14) and HCL but did not include three earlier biclustering algorithms, Cheng–Church method (CC) (8), xMotif (24) and OPSM (12), since they were shown to have rather low performance accuracy (below 50%) in recovering implanted biclusters by previous studies (13,22). In this study, we have used the BIMAX, ISA and HCL algorithms implemented in BICAT (25) and the SAMBA algorithm implemented in EXPANDER (26); both software packages are publicly available. In addition, we included a recently published biclustering algorithm RMSBE (22). The parameters for running these biclustering algorithms were taken either from their default settings or following the parameters suggested by the original authors (see the Supplementary Data on our website at: http://csbl.bmb.uga.edu/∼maqin/bicluster/benchmark.html). Preprocessing and postprocessing were performed in a consistent manner with the previous benchmark study (13).

Overall on the Prelic data sets, we found that QUBIC has consistently performed the best in the most general case. Although ISA has a marginal advantage (8%) over QUBIC on the ‘noisy’ case, its performance drops by up to 90% compared to its performance without overlaps when the degree of overlap among coherent biclusters is 10 [see details in Figure 4D in the Supplementary Data]. A more detailed description of the methods’ performance on all the Prelic data sets can be found in Figure 4 in the Supplementary Data.

### Applications on our simulated data sets

As discussed earlier, biclusters with scaling patterns were considered to be a very challenging problem for any of the existing biclustering algorithms (15). It should be noted that a bicluster with scaling patterns is a special case of our definition of biclusters because a bicluster with scaling patterns in the original expression data matrix corresponds to a bicluster with identical rows in its representing matrix. Here we consider two scenarios similar to those of Prelic's benchmark: (i) matrices with varying levels of noise, and (ii) matrices with varying degrees of overlap among the biclusters. For scenario 1, we have constructed two sets of gene expression data with scaling patterns. For scenario 2, we have constructed one set of gene expression data where the background variation parameter *σ* was set to 0, and all entries of the first (last) two rows were set to 1 (–1) so that we can simulate the situation where some transcription factors regulate more than one transcriptional module, i.e. all the implanted biclusters shared the first two and the last two genes. Further construction details can be found in the Supplementary Data.

On all these biclustering problems, our method achieves the optimal identification results in almost every case and always has the best performance among the five programs listed in the ‘Applications on Prelic's benchmark data sets’ section. In Figure 1A, we can see that all the methods except for RMSBE (with accuracy lower than 20%) achieve almost the optimal identification results. This is not surprising since the problem given in Figure 1A is not much different from the previous test case in Figure 4A in the Supplementary Data. On the more challenging case, as shown in Figure 1B, we start to see some substantial differences in identification accuracies between our program and the others. For example, when *σ* = 0.25, QUBIC with *r* = 2 can achieve almost the optimal identification results (note the accuracy of QUBIC with *r* = 1 is 69%), while the other programs have rather low identification accuracies. Specifically, the accuracies by SAMBA, ISA and RMSBE are all below 50% while BIMAX and HCL are relatively better, at 72% and 52%, respectively. When oppositely regulated genes are considered, we see an even larger difference between our program and the others. Specifically, when we implanted submatrices (see Figure 3 in the Supplementary Data) having some rows or columns with their sums being (approximately) zero, QUBIC found the implanted submatrices with ∼99% accuracy, while none of the other programs had better than 40% identification accuracy (achieved by HCL). The detailed information on this is provided as the Supplementary Data on our website (see the performance results in Figure 6 in the Supplementary Data).

We have also compared our performance with a recently published program, BUBBLE (19), which is designed to solve biclustering problems with scaling patterns. We attempted the same comparisons as above but found the BUBBLE program rather difficult to use and to run on a large number of samples, so we compared our program with BUBBLE only on three data sets, representing three different patterns. Overall, QUBIC substantially outperforms BUBBLE on all these data sets, and the detailed performance comparisons are given in Table 9 in the Supplementary Data.

### Computational efficiency of QUBIC

To demonstrate the computational efficiency of QUBIC, we have generated a number of large gene-expression data sets ranging from 2000 to 20 000 genes and 1000 conditions (these data sets are available from our website for download; and further details about these data sets can be found in the Supplementary Data). We have run our program and the other five programs on these large test sets on a desktop computer (2.66 GHz Intel Core 2 Duo CPU and 4 GB memory). Figure 7 in the Supplementary Data gives the computing time by our program. QUBIC finds the correct biclusters in a few minutes, essentially independent of any parameters used in the program except for the parameter *o*, while none of the other programs can solve the identification problem when the number of genes goes beyond 12 000. We also tested all the five programs on a real microarray data set with 54 675 transcripts and 18 conditions (an ovarian cancer microarray data set generated by our lab, and it will become available on our website when that paper is published) (Cui *et al.*, manuscript to be submitted). QUBIC finds 100 biclusters in about 5 min.

### Applications on global transcriptional data sets

We now evaluate QUBIC on global microarray gene-expression data collected from two different organisms (*Escherichia coli* and yeast). When analyzing the whole transcriptome microarray data, one challenging problem is to find the ‘transcriptional modules’, which represent modular components in the (global) gene regulatory network, defined as a set of tightly co-regulated genes along with a set of associated conditions that trigger the co-regulation (20), making it a natural application problem for the biclustering methods. It is known that some transcriptional modules show co-regulations only under a narrow range of conditions and have weak global correlations among their gene expression patterns, therefore not easily detectable by the traditional clustering methods. In addition, some transcriptional modules may overlap due to the combinatorial regulation by multiple transcriptional factors (20), which would also complicate the use of the traditional clustering techniques. The goal of this exercise is to test the effectiveness of our biclustering algorithm in identifying such transcriptional modules.

Our first test case includes the microarray gene expression data for 4217 *E. coli* genes collected under 264 conditions from the M3D database (*E. coli* array version 4 build 3) (27). The values in the original microarray data set are log2 values of the fluorescence intensities. The goal of our analysis is to identify biclusters hidden in the microarray data, and study their relationships with known biological pathways, as defined by the GO functional classification scheme (28), as well as by the KEGG pathways (29) and the ‘EcoCyc’ database (30).

For each identified bicluster, we use the *P*-value of its most enriched functional class (biological process) as the *P*-value of the bicluster. Specifically, the probability of having *r* genes of the same functional class in a bicluster of size *n* from a genome with a total of *N* genes can be computed using the following hypergeometric function (31), where *p* is the proportion of that functional class among all genes encoded in the whole genome:

$$\Pr(r \mid N, n, p) = \frac{\binom{pN}{r}\binom{(1-p)N}{n-r}}{\binom{N}{n}}$$

For each functional class *C*, we calculate the *P*-value of our current bicluster enriched with *C* genes as the probability of selecting at least *r* genes of the same functional class in the bicluster, where *r* is the actual number of *C* genes present in the current bicluster. We then use the smallest *P*-value among all possible functional classes *C* as the *P*-value of the current bicluster. Clearly, the smaller the *P*-value of a bicluster *B* is, the more likely that *B*'s genes are from the same biological process. We have run the six biclustering algorithms with their default parameters on this data set, as introduced in ‘Applications on Prelic's benchmark data sets’ section.
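The enrichment *P*-value described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the `annotations` dictionary (functional class name mapped to a set of gene identifiers) and the use of exact integer arithmetic are all assumptions made for the example.

```python
from math import comb


def enrichment_p_value(N, K, n, r):
    """Upper-tail hypergeometric probability P(X >= r).

    N: total number of genes in the genome
    K: number of genes in the genome belonging to the functional class
    n: number of genes in the bicluster
    r: number of class genes observed in the bicluster
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    upper = min(K, n)  # cannot observe more class genes than K or n
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    tail = sum(comb(K, k) * comb(N - K, n - k)
               for k in range(max(r, 0), upper + 1))
    return tail / total


def bicluster_p_value(bicluster_genes, annotations, N):
    """Smallest enrichment P-value over all functional classes.

    `annotations` is a hypothetical mapping {class_name: set_of_genes}.
    """
    genes = set(bicluster_genes)
    best = 1.0
    for members in annotations.values():
        r = len(genes & members)
        best = min(best, enrichment_p_value(N, len(members), len(genes), r))
    return best
```

For instance, `enrichment_p_value(4217, 38, 52, 33)` would score a 52-gene bicluster containing 33 genes of a 38-gene class, taking the 4217 genes of the data set as the genome size (an assumption made here for illustration).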

To compare the biclustering results of the different algorithms, we applied the clean-up procedure introduced in Prelic *et al.* (13) to remove substantially overlapping biclusters, so that among the surviving biclusters no two overlap by more than 25% of their sizes. For each algorithm, we calculated the proportion of biclusters that have significant *P*-values (below a pre-selected *P*-value cutoff) among the biclusters surviving the clean-up step. We then score each algorithm using the ratio between the number of significant biclusters and the number of surviving biclusters.
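The clean-up and scoring steps can be sketched as follows. The details are assumptions made for illustration and are not taken from Prelic *et al.*: a bicluster is represented as a pair `(genes, conditions)` of sets, its size is the number of matrix cells it covers, overlap is measured in shared cells relative to the smaller of the two biclusters, and biclusters are kept greedily from largest to smallest.

```python
def size(bicluster):
    """Number of matrix cells covered by a (genes, conditions) pair."""
    genes, conditions = bicluster
    return len(genes) * len(conditions)


def cell_overlap(a, b):
    """Number of matrix cells shared by two biclusters."""
    return len(a[0] & b[0]) * len(a[1] & b[1])


def remove_overlapping(biclusters, max_overlap=0.25):
    """Greedy filter (assumed scheme): visit biclusters from largest to
    smallest and keep one only if it shares at most `max_overlap` of the
    smaller bicluster's cells with every bicluster already kept."""
    kept = []
    for cand in sorted(biclusters, key=size, reverse=True):
        if all(cell_overlap(cand, k) <= max_overlap * min(size(cand), size(k))
               for k in kept):
            kept.append(cand)
    return kept


def significant_fraction(p_values, cutoff=0.01):
    """Ratio of biclusters with P-value below the cutoff to all
    surviving biclusters (0.0 if none survive)."""
    if not p_values:
        return 0.0
    return sum(p < cutoff for p in p_values) / len(p_values)
```

Measuring overlap against the smaller bicluster is the stricter of the two natural choices; measuring it against the larger one would let small biclusters nested inside large ones survive.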

Among the six tested algorithms, QUBIC consistently shows the highest enrichment ratios except for the regulon classification from the ‘EcoCyc’ database. Specifically, when the *P*-value cutoff is 0.01, 89% of the QUBIC biclusters show substantial enrichment with GO biological processes, 89% show significant overlap with known regulons and 78% are enriched in KEGG pathways (29). The detailed comparisons with other programs are given in Figure 2A. Although BIMAX (96%) performs better than QUBIC in the regulon classification category, we found that 59% of the QUBIC biclusters have *P*-values <10^{–6} while only 48% of the BIMAX biclusters have *P*-values <10^{–6}. This suggests that individual QUBIC biclusters are more significant than those generated by BIMAX. Indeed, on a case-by-case basis, the biclusters from QUBIC have higher enrichment ratios for more functional classes than those from all the other algorithms (Table 1). As an example, the flagella assembly pathway in *E. coli* is known to consist of 38 genes. One QUBIC bicluster contains 33 of these genes among its 52 genes, compared with 20 out of 28 for BIMAX, 35 out of 92 for ISA, 22 out of 220 for MSBE and 36 out of 202 for SAMBA. This comparison highlights the overall better performance of QUBIC among all the programs in terms of combined identification sensitivity and specificity.

In our second test, we used a yeast (*Saccharomyces cerevisiae*) microarray data set (32). As in the *E. coli* data analysis, we evaluated each bicluster generated by the different algorithms (after removing the substantially overlapping biclusters) in terms of its functional enrichment based on GO biological processes, MIPS yeast functional categories (33) and KEGG pathways. From Figure 2B, we can see that QUBIC has the highest functional enrichment among all the tested algorithms for all three classifications.

Through the above comparative analyses on the performance of six algorithms, we have shown that QUBIC is capable of revealing high quality biclusters in both prokaryotic and eukaryotic microarray expression data, and the genes in each bicluster show strong correlations with known functions and pathways. This study thus suggests the potential in extracting the substructures of metabolic and regulatory networks from gene expression data under multiple conditions using a biclustering method, providing a new and useful tool for biological pathway and network reconstruction.

One potential issue with the above *P*-value based analysis is that the *P*-value depends on bicluster size, and hence larger biclusters tend to have more significant *P*-values. This problem is clearly not unique to the analysis of biclustering results, as other bioinformatics problems, such as *cis* regulatory motif finding, face the same issue. Further studies will be carried out with the aim of making our *P*-value calculation size independent.

### Signature identification for cancer subtyping

We now extend the application of our biclustering algorithm to the problem of cancer subtype classification. The basis of this analysis is that pathways unique to specific cancer subtypes may be activated across the majority of the patients of those subtypes, and hence the genes in these pathways can possibly be used as signatures for specific subtypes. This problem can naturally be formulated as a biclustering problem on microarray gene expression data. Indeed, several studies have used biclustering as part of a larger analysis pipeline for cancer subtyping (34).

We used the leukemia data collected by Armstrong *et al.* (35) and searched for biclusters that might be characteristic of different leukemia subtypes (ALL, MLL and AML). This data set consists of 12 533 probes from 72 patients with different subtypes of leukemia (24 ALL, 20 MLL and 28 AML patients, respectively), produced on Affymetrix U95A oligonucleotide arrays. We pre-processed the data based on the experiment background, as detailed in the Supplementary Data.

Using QUBIC, we identified a total of 192 biclusters in the data set (the parameter *o* is set to 500 and the output results are available on our website). We made the following observations about the predicted biclusters: 17 biclusters contain samples (conditions) from only one cancer subtype, 89 biclusters contain samples from two subtypes and 86 biclusters contain samples from all three subtypes (see Figure 8 in the Supplementary Data). Although only 17 biclusters were found to be specific to a particular subtype, these biclusters are highly significant and distinct. Figure 3 gives an example of three selected biclusters that each show subtype specificity (BC000, BC002 and BC074). In this example, QUBIC identifies the classical ‘checker-board’ substructures inside the original microarray data, where each of the three selected biclusters corresponds to a particular leukemia subtype, with BC000 specific to ALL, BC074 specific to MLL and BC002 specific to AML (Figure 3).
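The tally of biclusters by the number of subtypes their samples cover can be sketched as below. This is only an illustration under assumed data structures: `bicluster_samples` (bicluster name mapped to a set of sample identifiers) and `sample_subtype` (sample identifier mapped to a subtype label) are hypothetical names, and the identifiers in the usage example are invented toy values, not taken from the leukemia data set.

```python
from collections import Counter


def count_by_subtype_span(bicluster_samples, sample_subtype):
    """Group biclusters by how many distinct subtypes their samples cover.

    Returns (span_counts, specific) where span_counts[k] is the number of
    biclusters whose samples come from exactly k subtypes, and `specific`
    maps each single-subtype bicluster to that subtype.
    """
    span_counts = Counter()
    specific = {}
    for name, samples in bicluster_samples.items():
        subtypes = {sample_subtype[s] for s in samples}
        span_counts[len(subtypes)] += 1
        if len(subtypes) == 1:
            specific[name] = next(iter(subtypes))
    return span_counts, specific


# Toy usage with invented sample and bicluster names.
toy_subtype = {"s1": "ALL", "s2": "ALL", "s3": "MLL", "s4": "AML", "s5": "AML"}
toy_biclusters = {
    "toy_a": {"s1", "s2"},        # ALL only
    "toy_b": {"s1", "s4"},        # ALL and AML
    "toy_c": {"s2", "s3", "s5"},  # all three subtypes
}
toy_counts, toy_specific = count_by_subtype_span(toy_biclusters, toy_subtype)
```

Biclusters returned in `specific` are the candidates for subtype signatures; the remaining ones can still be inspected for opposite expression trends between subtypes.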

We found that these subtype-specific biclusters are informative and in most cases consistent with results reported in previous studies (35,36). For example, the MLL cluster (BC071; Figure 3) contains genes involved in multiple hematopoietic lineages, including PROM1 and FLT3 in progenitor cells and CCNA1 in myeloid cells, which were also observed in (35). While some of the genes in these subtype-specific biclusters may not necessarily make good marker genes for hematopoietic lineages, others do, such as those that encode proteins critical for cell-cycle transitions, including CCNA1, CCND3 and CDK5R1/p35. It is also worth noting that we identified two negatively regulated genes in BC002. Specifically, the last two genes (SEPT9 and CCND3) in BC002 are downregulated while the other genes in BC002 are upregulated. This has been observed for CCND3 (36), but the observation on SEPT9 is new, to the best of our knowledge. We believe that these three subtype-specific biclusters are information rich and that further analyses could potentially lead to an improved understanding of the molecular mechanisms underlying these three subtypes.

The biclusters that contain samples from more than one subtype are probably just as clinically informative as the above subtype-specific biclusters. For example, we found that among the resulting biclusters, three (BC011, BC040 and BC148) show opposite trends for ALL and AML, and one (BC025) shows opposite trends for MLL and AML. In particular, within bicluster BC011, samples from ALL patients are all downregulated, while samples from AML patients are all upregulated; BC148 shows exactly the opposite pattern, with ALL samples upregulated and AML samples downregulated. These biclusters would contain candidates of selectively expressed genes for needed molecular targets. Note that this would not be possible using some other biclustering algorithms such as BIMAX, since BIMAX only deals with binary data (change versus no-change) (13) as opposed to the multi-level data used in our analysis.

As a result of biclustering on the cancer data, we have shown that QUBIC is capable of uncovering genes that are unique to clinically known subtypes of cancers. Our future work will be focused on mining the subtype-specific biclusters, as well as on integration of the program with additional tools into a classification and characterization pipeline in support of cancer studies.

## DISCUSSION

The biclustering strategy has been widely used in analyses of gene expression data since it was first proposed in 2000 because it provides a much increased flexibility and analysis power for identifying co-expressed genes under some but not necessarily all conditions, compared to the traditional clustering methods. As of now, most of the existing biclustering algorithms were designed to solve a rather special class of the biclustering problem, specifically attempting to find biclusters that minimize the so-called mean squared residue value. The QUBIC algorithm has proven to be a useful tool for analyzing gene expression data of tens of thousands of genes for discovering complex relationships among genes and conditions that are difficult to detect using existing biclustering methods. The high computational efficiency and the ability to detect subtly correlated expression patterns among genes under certain conditions will make QUBIC a powerful tool for analyses of microarray gene expression data, particularly large data sets. Furthermore, it can be a useful tool in transcriptional regulation network prediction.
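For readers unfamiliar with the mean squared residue mentioned above, the following sketch computes it for a small submatrix using the standard definition: the average of the squared values of a_ij − (row mean) − (column mean) + (overall mean). The function name and the toy matrices are assumptions made for illustration.

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a bicluster given as a list of rows.

    Each entry contributes (a_ij - row_mean_i - col_mean_j + overall_mean)^2;
    the score is the average of these values over all entries.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Rows differ by a constant offset (shifting pattern): residue is zero.
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
# Rows differ by a constant factor (scaling pattern): residue is positive.
scaling = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]
```

On untransformed values, a perfectly additive bicluster scores zero while a perfectly multiplicative one does not, which illustrates why a minimum-residue criterion alone does not capture scaling patterns.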

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## FUNDING

The National Science Foundation (#NSF/DBI-0354771, #NSF/ITR-IIS-0407204, #NSF/DBI-0542119, and #NSF/CCF-0621700); the U.S. Department of Energy's BioEnergy Science Center (BESC) grant through the Office of Biological and Environmental Research; and grants (60873207, 10631070 and 60373025 to G.J.L.) from NSFC and the Taishan Scholar Fund from Shandong Province, China. Funding for open access charge: NSF DBI-0542119.

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

We would like to thank Dr Dongsheng Che and Mr Kun Xu for their help and insightful discussions on the work. We also thank the two anonymous reviewers for their useful suggestions.

## REFERENCES

**Oxford University Press**


QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research. 2009 Aug; 37(15):e101
