Proc Natl Acad Sci U S A. Jun 8, 1999; 96(12): 6745–6750.
Cell Biology

Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays


Oligonucleotide arrays can provide a broad picture of the state of the cell, by monitoring the expression level of thousands of genes at the same time. It is of interest to develop techniques for extracting useful information from the resulting data sets. Here we report the application of a two-way clustering method for analyzing a data set consisting of the expression patterns of different cell types. Gene expression in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. An efficient two-way clustering algorithm was applied to both the genes and the tissues, revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues. Coregulated families of genes clustered together, as demonstrated for the ribosomal proteins. Clustering also separated cancerous from noncancerous tissue and cell lines from in vivo tissues on the basis of subtle distributed patterns of genes even when expression of individual genes varied only slightly between the tissues. Two-way clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on gene expression.

Recently introduced experimental techniques based on oligonucleotide or cDNA arrays now allow the expression level of thousands of genes to be monitored in parallel (1–9). To use the full potential of such experiments, it is important to develop the ability to process and extract useful information from large gene expression data sets. Elegant methods recently have been applied to analyze gene expression data sets that consist of a time course of expression levels. Examples of such time-course experiments include following a developmental process or changes as the cell undergoes a perturbation such as a shift in growth conditions. The analysis methods were based on clustering of genes according to similarity in their temporal expression (5, 6, 9–11). Such clustering has been demonstrated to identify functionally related families of genes, both in yeast and human cell lines (5, 6, 9, 11). Other methods have been proposed for analyzing time-course gene expression data, attempting to model underlying genetic circuits (12, 13).

Here we report the application of methods for analyzing data sets comprised of snapshots of the expression pattern of different cell types, rather than detailed time-course data. The data set used is composed of 40 colon tumor samples and 22 normal colon tissue samples, analyzed with an Affymetrix oligonucleotide array (8) complementary to more than 6,500 human genes and expressed sequence tags (ESTs) (14). We focus here on generally applicable analysis methods; a more detailed discussion of the cancer-specific biology associated with this study will be presented elsewhere (D.A.N. and A.J.L., unpublished work). The correlation in expression levels across different tissue samples is demonstrated to help identify genes that regulate each other or have similar cellular function. To detect large groups of related genes and tissues we applied two-way clustering, an effective technique for detecting patterns in data sets (see e.g., refs. 15 and 16). The main result is that an efficient clustering algorithm revealed broad, coherent patterns of genes whose expression is correlated, suggesting a high degree of organization underlying gene expression in these tissues. It is demonstrated, for the case of ribosomal proteins, that clustering can classify genes into coregulated families. It is further demonstrated that tissue types (e.g., cancerous and noncancerous samples) can be separated on the basis of subtle distributed patterns of genes, which individually vary only slightly between the tissues. Two-way clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on their gene expression similarity.


Tissues and Hybridization to Affymetrix Oligonucleotide Arrays.

Colon adenocarcinoma specimens (snap-frozen in liquid nitrogen within 20 min of removal) were collected from patients (D.A.N. and A.J.L., unpublished work). From some of these patients, paired normal colon tissue also was obtained. Cell lines used (EB and EB-1) have been described (17). RNA was extracted and hybridized to the array as described (1, 8).

Treatment of Raw Data from Affymetrix Oligonucleotide Arrays.

The Affymetrix Hum6000 array contains about 65,000 features, each containing ≈$10^7$ strands of a DNA 25-mer oligonucleotide (8). Sequences from about 3,200 full-length human cDNAs and 3,400 ESTs that have some similarity to other eukaryotic genes are represented on a set of four chips. In the following, we refer to either a full-length gene or an EST that is represented on the chip as EST. Each EST is represented on the array by about 20 feature pairs. Each feature contains a 25-bp sequence, which is either a perfect match (PM) to the EST, or a single central-base mismatch (MM). The hybridization signal fluctuates between different features that represent different 25-mer oligonucleotide segments of the same EST. This fluctuation presumably reflects the variation in hybridization kinetics of different sequences, as well as the presence of nonspecific hybridization by background RNAs. Some of the features display a hybridization signal that is many times stronger than their neighbors (≈4% of the intensities are >3 SD away from the mean for their EST). These outliers appear with roughly equal incidence in PM or MM features. If not filtered out, outliers contribute significantly to the reading of the average intensity of the gene. Because most features overlap in sequence with their neighbors, we used a modified median filter to eliminate outliers from local neighborhoods of features, while preserving step-like changes in intensity. The features were arranged in the order they appear in the EST sequence, the PM-MM intensities in a moving window of five features were sorted, and the filtered intensity was given by the mean of the middle three sorted intensities. The total intensity of the EST was given by the mean filtered PM-MM intensity. To compensate for possible variations between arrays, the intensity of each EST on an array was divided by the mean intensity of all ESTs on that array and multiplied by a nominal average intensity of 50.
The data set is available on the web at http://www.molbio.princeton.edu/colondata.
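
The filtering and normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the handling of the ends of each EST (only full five-feature windows are used) is an assumption, because the text does not say how edge features were treated.

```python
def filtered_est_intensity(pm_minus_mm, window=5, keep=3):
    """Mean filtered PM-MM intensity of one EST.

    pm_minus_mm: PM-MM differences, ordered as the features appear
    along the EST sequence. Assumption: only full windows are used.
    """
    if len(pm_minus_mm) < window:
        raise ValueError("need at least `window` features")
    trim = (window - keep) // 2  # values dropped at each end of a sorted window
    filtered = []
    for start in range(len(pm_minus_mm) - window + 1):
        block = sorted(pm_minus_mm[start:start + window])
        middle = block[trim:trim + keep]  # middle three of five
        filtered.append(sum(middle) / keep)
    return sum(filtered) / len(filtered)


def normalize_array(est_intensities, nominal=50.0):
    """Rescale one array so that its mean EST intensity equals `nominal`."""
    mean = sum(est_intensities) / len(est_intensities)
    return [nominal * x / mean for x in est_intensities]
```

With this sketch a single extreme feature, such as the value 500 in `[10, 12, 11, 500, 9, 10, 11]`, never enters the average, because it is always one of the two values trimmed from its window.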

Correlations of Pairs of Genes.

To estimate the statistical significance of the correlation between genes, the distribution of correlation coefficients within $10^4$ randomized data sets was calculated. To control for the difference in mean expression in the two tissue types, the randomization preserved tissue identity (normal tissues were randomized with normal tissues, and tumors with tumors). This type of randomization also was used to obtain the dashed curve in Fig. 1C. The probability that the randomized data showed a higher correlation coefficient for the gene of interest than the nonrandomized data was used as an estimate of the statistical significance P.
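
A randomization test of this kind might look like the sketch below. The function names, the choice to shuffle only one gene of the pair, and the use of `>=` when counting randomized correlations are assumptions made for illustration; the text specifies only that tissue identity was preserved and that $10^4$ randomized data sets were used.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shuffle_within_groups(values, labels, rng):
    """Permute values among samples that share a label (tumor or normal)."""
    shuffled = list(values)
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        vals = [values[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            shuffled[i] = v
    return shuffled


def permutation_p_value(x, y, labels, n_perm=10_000, seed=0):
    """Return (observed r, fraction of randomized r at least as large)."""
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    exceed = 0
    for _ in range(n_perm):
        if pearson_r(x, shuffle_within_groups(y, labels, rng)) >= observed:
            exceed += 1
    return observed, exceed / n_perm
```

Shuffling within each tissue type keeps the tumor and normal means of the gene unchanged, so a correlation that arises only from a shared tumor-versus-normal shift is not counted as significant.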

Figure 1
Correlation between pairs of genes across the 62 tissue types. ○, tumor tissues; □, normal tissue; line, best fit (least-mean squares) with correlation coefficient r. (A) Correlation between 60S ribosomal protein L22 (EST number ...

Data Clustering.

We used an algorithm, based on the deterministic-annealing algorithm (18, 19), to organize the data in a binary tree. To cluster the genes, each gene, $k$, was represented by a vector, $V_k$, whose components correspond to the intensity of the gene in each sample. Each vector was normalized so that the sum over its components is zero and the magnitude is one, $|V_k| = 1$. The genes were split into two clusters as follows: two cluster centroids $C_j$, $j = 1, 2$, were defined. A probability was assigned for belonging to cluster $j$: $P_j(V_k) = \exp(-\beta |V_k - C_j|^2) / \sum_{j'} \exp(-\beta |V_k - C_{j'}|^2)$. This equation effectively fits the data with two Gaussians of variance $(2\beta)^{-1}$. The cluster centroids were determined by the self-consistent equation $C_j = \sum_k V_k P_j(V_k) / \sum_k P_j(V_k)$, which was solved by iterations. For $\beta = 0$ there is only one cluster, $C_1 = C_2$. We increased $\beta$ in small steps until two distinct, converged centroids emerged. Each gene $k$ then was assigned to the cluster with the larger $P_j(V_k)$. Each of the resulting two clusters then was separated into two by repeating the same procedure. The final result was an organization of the genes into a binary tree. To cluster the tissues the same algorithm was used, where each tissue, $k$, was represented as a vector, $V_k$, whose components correspond to the intensity of the genes for that tissue. Note that because of the normalization, the Euclidean distance between two vectors $x$ and $y$ is related to $r$, the correlation coefficient of $x$ and $y$: $|x - y|^2 = 2(1 - r)$.
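
One binary split of this procedure can be sketched as below. This is an illustrative reading of the description, not the authors' released program. The step size in $\beta$, the fixed number of iterations per step, the separation threshold `min_sep` used to decide that two distinct centroids have emerged, and the small random `jitter` that lets the two coincident centroids move apart are all assumptions.

```python
import math
import random


def normalize(v):
    """Shift to zero mean and scale to unit length, as in the text."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    norm = math.sqrt(sum(a * a for a in centered))
    return [a / norm for a in centered]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(v, centroids, beta):
    """P_j(v) = exp(-beta |v - C_j|^2) / sum_j' exp(-beta |v - C_j'|^2)."""
    d = [sq_dist(v, c) for c in centroids]
    d_min = min(d)  # subtracting the minimum leaves the ratio unchanged
    w = [math.exp(-beta * (x - d_min)) for x in d]
    total = sum(w)
    return [x / total for x in w]


def split_in_two(vectors, beta_step=0.5, beta_max=200.0, min_sep=0.1,
                 n_iter=100, jitter=1e-3, seed=0):
    """Split normalized vectors into two clusters; return (labels, centroids)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centroids = [list(mean), list(mean)]  # beta = 0: C_1 = C_2
    beta = 0.0
    while beta < beta_max:
        beta += beta_step
        # Assumption: nudge the centroids so that they can separate
        # once beta is large enough to support two clusters.
        centroids = [[c + jitter * rng.gauss(0.0, 1.0) for c in cj]
                     for cj in centroids]
        for _ in range(n_iter):  # iterate the self-consistent equation
            probs = [memberships(v, centroids, beta) for v in vectors]
            for j in range(2):
                weight = sum(p[j] for p in probs)
                if weight > 0.0:
                    centroids[j] = [
                        sum(p[j] * v[i] for p, v in zip(probs, vectors)) / weight
                        for i in range(dim)
                    ]
        if sq_dist(centroids[0], centroids[1]) > min_sep ** 2:
            break  # two distinct centroids have emerged
    labels = [0 if memberships(v, centroids, beta)[0] >= 0.5 else 1
              for v in vectors]
    return labels, centroids
```

Applying `split_in_two` again to each of the two resulting groups, and so on, would build the binary tree. Because `normalize` produces zero-mean unit vectors, the dot product of two normalized vectors is their correlation coefficient, which is why `sq_dist` between them equals $2(1 - r)$.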

The binary trees obtained by the above procedure were used to reorganize the matrix of gene expression (Figs. 2 and 3). To this end, we included a routine that orders the tree branches in a deterministic way: Each pair of sibling branches was ordered according to the proximity of their centroids to the centroid of their parent’s sibling.
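
One possible reading of this ordering rule is sketched below. The `Node` class and `ordered_leaves` function are hypothetical, and two details are assumptions: the child whose centroid is closer to the parent's sibling is placed on the side facing that sibling, and the two children of the root, which have no such neighbor, keep their given order.

```python
class Node:
    """Tree node with a centroid; leaves carry a label, branches two children."""

    def __init__(self, centroid, children=(), label=None):
        self.centroid = centroid
        self.children = children
        self.label = label


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ordered_leaves(node, neighbor=None, neighbor_on_right=True):
    """Return leaf labels left to right after ordering sibling branches.

    neighbor is the sibling of `node` (the parent's sibling of its
    children); neighbor_on_right says on which side that sibling lies.
    """
    if not node.children:
        return [node.label]
    left, right = node.children
    if neighbor is not None:
        left_closer = (sq_dist(left.centroid, neighbor.centroid)
                       < sq_dist(right.centroid, neighbor.centroid))
        # Put the closer child on the side that faces the neighbor.
        if left_closer == neighbor_on_right:
            left, right = right, left
    return (ordered_leaves(left, neighbor=right, neighbor_on_right=True)
            + ordered_leaves(right, neighbor=left, neighbor_on_right=False))
```

Under this reading, branches that resemble each other end up adjacent across the boundary between two larger clusters, which is what makes the reordered matrix easier to read.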

Figure 2
Data set of intensities of 2,000 genes in 22 normal and 40 tumor colon tissues. The genes chosen are the 2,000 genes with highest minimal intensity across the samples. The vertical axis corresponds to genes, and the horizontal axis to tissues. Each gene ...
Figure 3
(A) Expanded view of clustered data set of 2,000 genes in 22 normal and 40 tumor colon tissues. The genes chosen are the 2,000 genes with highest minimal intensity across the samples. Tumor tissues are marked with arrows on the left. Normal tissues are ...

The present clustering algorithm is quite efficient. The computation time scales as the number of objects clustered times the number of layers in the tree, $N \log N$, rather than as $N^2$ to $N^4$ in commonly used phylogenetic tree construction algorithms (15). In particular, the method does not require the computation of all distances between pairs of objects. The clustering programs are available on the web at http://www.molbio.princeton.edu/colondata.


Genes with Correlated Expression.

The intensity of each gene across the tissues can be thought of as a pattern that can be correlated with expression patterns of other genes. Graphically, correlation between genes can be seen by plotting the expression of one gene against the expression of another gene, as demonstrated for two ribosomal proteins in Fig. 1A. For this pair of genes, the correlation coefficient is relatively high ($r = 0.73$), and the correlation appears to be statistically significant ($P < 10^{-3}$). Most genes show no significant correlation across tissues (Fig. 1 B and C). On average, each gene shows a strong correlation with on the order of 1% of the other genes on the array (Fig. 1C). A correlation between two genes could result either from a direct up-regulation of one by the other, or because they are similarly regulated by the physiological state of the cell. The correlation between pairs of genes, and an analogous correlation between pairs of tissues, is the basis for the two-way data clustering described below.

Two-Way Data Clustering.

To detect groups of correlated genes and tissues we used a clustering approach to the data set. Clustering can be thought of as forming a phylogenetic tree of genes or tissues. Genes are near each other on the “gene tree” if they show a strong correlation across experiments, and tissues are near each other on the “tissue tree” if they have similar gene expression patterns. Technically, we developed a fast algorithm, based on the deterministic annealing algorithm (18, 19), which separates a set of objects (genes or tissues) into two groups, then separates each group into two subgroups, and so on, until all the objects are arranged on a binary tree. Because this algorithm yields an unordered tree, we supplied a method for imposing an order on the tree branches so that a final, ordered list is obtained. This procedure was applied to both the genes and the tissues, using the same algorithm. We then used this two-way ordering of genes and tissues to rearrange the rows and columns of the data set, so that correlated genes and tissues are displayed near each other.

To help visualize the data, we plotted it by using a color code, with gene intensity varying from red (high intensity) to blue (low intensity) (Fig. 2A). The intensity of each gene is normalized so that the relative variation in intensity is emphasized, rather than the absolute intensity. The two-way clustering method applied to the gene expression data set yielded a matrix that appears to bear patterns (Figs. 2B and 3). The areas of high or low intensity correspond to groups of tens to hundreds of genes whose expression is coordinated to a substantial degree across groups of tissue samples. In contrast, the same algorithm applied to a randomized data set (Fig. 2 C and D) yielded a matrix with little apparent structure. This difference in patterning reflects the underlying organization of gene expression in the real data set.

Gene Clusters.

The clustering of the genes in the data set reveals groups of genes whose expression is correlated across tissue types. For example, 48 ESTs homologous to ribosomal proteins are represented within the set of 2,000 high-intensity genes used for the clustering. Most of these genes cluster together, as expected for genes that are regulated coordinately (Fig. 3A, arrows on the bottom). The intensity of the ribosomal protein genes is relatively low (blue) in the normal colon tissues and high (red) in the colon tumor tissues. This finding is in agreement with previous observations (20). Interspersed within the ribosomal protein cluster are ESTs homologous to genes that appear to be related to cellular metabolism such as an ATP-synthase component and an elongation factor (Table 1). A more detailed discussion of the gene clusters will be presented elsewhere (D.A.N. and A.J.L., unpublished work).

Table 1
Part of the ribosomal protein cluster

Tissue Clusters.

The clustering algorithm separated tumor and normal tissues into two distinct clusters (Figs. 3 and 4), probably primarily because of tissue composition. It is expected that the normal tissue samples include a mixture of tissue types, while the tumor samples are biased to epithelial tissue of the carcinoma. For example, among the 20 genes with the most statistically significant difference between tumors and normal tissues (by t test), were five muscle genes (not shown). To obtain a qualitative measure of the muscle content of each sample, we calculated a muscle index, an average over the intensity of 17 ESTs in the array that are homologous to smooth muscle genes (Fig. 4). Normal tissues had high muscle index, while tumors had low muscle index. The outlying tumors that clustered with the normal tissues proved to be the five tumors with the highest muscle index (Fig. 4), perhaps representing tumor samples with a high content of nonepithelial tissues. Similarly, the three outlying normal tissues in the tumor cluster appear to have relatively low smooth-muscle content. Thus the outliers in the tissue clustering might be accounted for by tissue composition.
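
The muscle index is a per-sample average over a chosen set of ESTs, which can be sketched in a few lines. The data layout (a dictionary from EST name to a list of per-sample intensities), the function name, and the EST names in the usage note are hypothetical; the actual 17 smooth-muscle ESTs are not listed in the text.

```python
def muscle_index(expression, muscle_ests):
    """Average intensity of the chosen ESTs, computed for each sample.

    expression: dict mapping EST name to a list of intensities,
    one entry per tissue sample, in the same sample order for every EST.
    """
    n_samples = len(expression[muscle_ests[0]])
    return [
        sum(expression[est][s] for est in muscle_ests) / len(muscle_ests)
        for s in range(n_samples)
    ]
```

For example, with two hypothetical ESTs `"est_a": [10.0, 100.0]` and `"est_b": [30.0, 300.0]`, the index of the two samples is `[20.0, 200.0]`, and the second sample would be read as the more muscle-rich one.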

Figure 4
Clustering tree for the tissue samples. Tumors (T) and normal tissue (n) numbered such that tumor and normal tissues with the same serial number originate from the same patient. Tissue T18 is a tumor and tissue T19 is a metastasis from the same patient. ...

Does the separation between tumor and normal tissues depend on only a few genes (e.g., muscle-specific genes), or is it reflected in the majority of genes used to cluster? To test this, we performed clustering by using only a partial gene set, which lacks the genes that individually best separate tumor and normal tissues (using a 500-gene set that does not include genes with the most significant differences between tumors and normal tissue). Even if one removes the 1,500 genes with the most significant differences between tumor and normal tissues, the clustering algorithm still effectively separates tumor from normal tissues (Fig. 5). Thus, clustering distinguishes tumor and normal samples even when the genes used have a small average difference between tumor and normal samples. This finding suggests that for many genes there is a subtle, systematic difference between tumor and normal samples, forming a distributed pattern.
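
Selecting such a window of genes can be sketched as follows. The text says only that genes were sorted by the significance of a t test, so the pooled-variance (Student) form of the statistic and the ranking by its absolute value are assumptions, as are the function names and data layout.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance (an assumption)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def gene_window(expression, is_tumor, offset, size):
    """Rank genes by |t| (most significant first) and return a window.

    expression: dict mapping gene name to per-sample intensities.
    is_tumor: list of booleans, one per sample, in the same order.
    offset: number of top-ranked genes to skip; size: genes to keep.
    """
    def abs_t(gene):
        values = expression[gene]
        tumor = [v for v, t in zip(values, is_tumor) if t]
        normal = [v for v, t in zip(values, is_tumor) if not t]
        return abs(t_statistic(tumor, normal))

    ranked = sorted(expression, key=abs_t, reverse=True)
    return ranked[offset:offset + size]
```

In terms of this sketch, the test described above corresponds to `gene_window(expression, is_tumor, offset=1500, size=500)` on the 2,000-gene set, with the returned genes then passed to the tissue clustering.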

Figure 5
Separation of tumor and normal tissues by clustering over a set of 500 genes. Genes were sorted by statistical significance (t test) of the difference in normal and tumors. Tissues were clustered by using a window of 500 genes selected from the sorted ...

Similarly, when cell lines derived from colon carcinoma (ref. 17 and M. Murphy, D.A.N., and A.J.L., unpublished work) were included in the data set, the clustering algorithm separated the cell lines into a cluster of their own, which is distinct from the colon tumor tissue samples (Fig. 3B, stars). The cell-line cluster was placed closer to the tumors than the normal tissue. Note that including the cell lines modifies the patterns obtained by clustering, because the expression patterns in the cell lines are so markedly different from those of either the tumor or normal in vivo tissues. The ribosomal proteins still cluster, with their relative intensity low in normal tissue, high in tumors, and very high in cell lines.


This work reports the application of techniques that proved useful in analyzing a large gene expression data set. A fast two-way clustering algorithm was developed to help identify families of genes and tissues based on expression patterns in the data set. Recent work demonstrated that genes of related function could be grouped together by clustering according to similar temporal evolution under various conditions (5, 6, 9–11). Here, it was demonstrated that gene grouping also could be achieved on the basis of variation between tissue samples from different individuals. Further, it was demonstrated that clustering of the tissues could detect differences between tumors of epithelial origin and muscle-rich normal tissue samples, even when the genes with significant bias (tumor-normal differences) were removed from the data set. Similarly, colon tumor cell lines were readily distinguished from in vivo colon tumors. Displaying the data with both samples and genes clustered revealed wide-scale patterns that hint at an extensive underlying organization of gene expression in these tissues.

It is worth noting that although the data set was designed for studying colon tumors, the present analysis appears to allow access to additional information that may be relevant to the general regulation circuitry of the cell. Clustering can be thought of as a tool for reducing the dimensionality of the system. Instead of using thousands of gene intensities to describe the state of a tissue, one might, as a first approximation, use only the mean intensity of a few large clusters of genes (11). Clustering methods thus may help supply some of the basic elements for a compact, coarse-grained description of the state of the cell.
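
The coarse-grained description suggested here amounts to replacing each tissue's gene intensities by one mean value per gene cluster. A minimal sketch, with a hypothetical function name and data layout (dictionaries keyed by gene name), is:

```python
def cluster_means(expression, cluster_of):
    """Mean intensity of each gene cluster in each sample.

    expression: dict mapping gene name to per-sample intensities.
    cluster_of: dict mapping gene name to a cluster identifier.
    Returns a dict mapping cluster identifier to per-sample means.
    """
    sums, counts = {}, {}
    for gene, values in expression.items():
        c = cluster_of[gene]
        if c not in sums:
            sums[c] = [0.0] * len(values)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], values)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}
```

Each sample is then described by as many numbers as there are clusters, rather than by one number per gene.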

Finally, this study highlights the importance of improving tissue purity in the collection of in vivo samples. Improved purity will allow a more reliable classification of tumors on the basis of gene expression patterns and will help characterize the differences between normal and tumor expression patterns. Because it appears likely that genomic instability in cancers can optimize gene expression for cell growth, the differences between normal and tumor expression patterns might help us understand what is being selected for as cancerous tissues evolve.


We thank S. Friend, S. Leibler, D. Lockhart, M. Mittman, R. Stoughton, and E. Tom for discussions, and J. Pipas for discussions and comments on the manuscript. We acknowledge the contribution of the Cooperative Human Tissue Network in providing tissue samples.


Abbreviation: EST, expressed sequence tag.


1. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E. Nat Biotechnol. 1996;14:1675–1680.
2. DeRisi J, Penland L, Brown P, Bittner M, Meltzer P, Ray M, Chen Y, Su Y, Trent J. Nat Genet. 1996;14:457–460.
3. Pietu G, Alibert O, Guichard V, Lamy B, Bois F, Leroy E, Mariage-Sampson R, Houlgatte R, Soularue P, Auffray C. Genome Res. 1996;6:492–503.
4. Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D. Nat Biotechnol. 1997;15:1359–1367.
5. DeRisi J, Iyer V, Brown P. Science. 1997. 680–686.
6. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I. Science. 1998;282:699–705.
7. Marton M, DeRisi J, Bennett H, Iyer V, Meyer M, Roberts C, Stoughton R, Burchard J, Slade D, Dai H, et al. Nat Med. 1998;4:1293–1301.
8. Mack D H, Tom E Y, Mahadev M, Dong H, Mittman M, Dee S, Levine A J, Gingeras T R, Lockhart D J. In: Biology of Tumors. Mihich K, Croce C, editors. New York: Plenum; 1998. pp. 123–131.
9. Iyer V R, Eisen M B, Ross D T, Schuler G, Moore T, Lee J, Trent J M, Staudt L M, Hudson J, Boguski M S, et al. Science. 1999;283:83–87.
10. Wen X, Fuhrman S, Michaels G, Carr D, Smith S, Barker J, Somogyi R. Proc Natl Acad Sci USA. 1998;95:334–339.
11. Eisen M B, Spellman P T, Brown P O, Botstein D. Proc Natl Acad Sci USA. 1998;95:14863–14868.
12. Thomas R. J Theor Biol. 1973;42:563–585.
13. Thomas R, Thieffry D, Kaufman M. Bull Math Biol. 1995. 247–276.
14. Boguski M, Lowe T, Tolstoshev C. Nat Genet. 1993;4:332–333.
15. Hartigan J. Clustering Algorithms. New York: Wiley; 1975.
16. Weinstein J, Myers T, O’Connor P, Friend S, Fornace A J, Kohn K, Fojo T, Bates S, Rubinstein L, Anderson N, et al. Science. 1997;275:343–349.
17. Shaw P, Bovey R, Tardy S, Salhi R, Sordat B, Costa J. Proc Natl Acad Sci USA. 1992;89:4495–4499.
18. Rose K, Gurewitz E, Fox G. Phys Rev Lett. 1990;65:945–948.
19. Rose K. Proc IEEE. 1998;96:2210–2239.
20. Pogue-Geile K, Geiser J, Shu M, Miller C, Wool I, Meisler A, Pipas J. Mol Cell Biol. 1991;11:3842–3849.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences