Nucleic Acids Res. Jul 1, 2009; 37(Web Server issue): W587–W592.
Published online May 29, 2009. doi:  10.1093/nar/gkp435
PMCID: PMC2703939

VisHiC—hierarchical functional enrichment analysis of microarray data

Abstract

Measuring gene expression levels with microarrays is one of the key technologies of modern genomics. Clustering of microarray data is an important application, as genes with similar expression profiles may be regulated by common pathways and involved in related functions. Gene Ontology (GO) analysis and visualization allows researchers to study the biological context of discovered clusters and characterize genes with previously unknown functions. We present VisHiC (Visualization of Hierarchical Clustering), a web server for clustering and compact visualization of gene expression data combined with automated function enrichment analysis. The main output of the analysis is a dendrogram and visual heatmap of the expression matrix that highlights biologically relevant clusters based on enriched GO terms, pathways and regulatory motifs. Clusters with most significant enrichments are contracted in the final visualization, while less relevant parts are hidden altogether. Such a dense representation of microarray data gives a quick global overview of thousands of transcripts in many conditions and provides a good starting point for further analysis. VisHiC is freely available at http://biit.cs.ut.ee/vishic.

INTRODUCTION

Microarrays have become the standard way of producing genome-scale measurements of gene expression levels (1). Since the first experimental studies (2), microarrays have been used for answering a large variety of questions, such as characterizing gene expression patterns in tumour cell lines and healthy tissues (3,4), identifying key mechanisms of stem cell differentiation (5), and reconstructing global transcriptional networks in model organisms (6). Databases like ArrayExpress and GEO (7,8) have become goldmines of transcriptomic information with thousands of publicly available microarray datasets.

Interpretation and visualization is a crucial step of microarray analysis, as measurements are abundant and the level of experimental noise is high (9). A common reasoning behind microarray analysis is ‘guilt by association’, as genes with similar expression profiles may have common regulatory circuits and functions (10). Unsupervised clustering presented as a heatmap and dendrogram is a common approach for detecting coexpressed groups of genes (11,12). Gene Ontology (GO) annotations are often used for the biological interpretation of detected clusters (13).

Clustering has several well-identified drawbacks that affect interpretation and reproducibility (14). Popular clustering methods rely on input parameters: for example, hierarchical clustering (11) applies a fixed dendrogram cut-off value, and K-means (12) requires predefining the number (and hence, the structure) of expected groups. Enrichment tools that relate gene groups to GO categories need to be accessed separately, which complicates the analysis of hundreds of clusters. Analysing results of hierarchical clustering is complicated, since each node of the dendrogram represents a potential cluster. Moreover, given the hundreds of potentially relevant datasets in public databases, the manual work would be unreasonable. Data visualization is also technically challenging, since heatmaps with thousands of transcripts hardly fit on computer screens. These problems are still commonly tackled with ad hoc means, e.g. removing genes that are ‘not interesting’ due to constant expression levels.

VisHiC (Visualization of Hierarchical Clustering) is a web server for the analysis of gene expression data that provides an agile all-in-one service for hierarchical clustering, functional enrichment analysis and visualization. The tool provides a global overview of a given expression matrix and highlights its most significant functional aspects using GO analysis. VisHiC builds a compact clustering using functional enrichments rather than fixed user-defined thresholds, by pruning clusters where no enrichments are found.

GO enrichment analysis is a common means of interpreting gene clusters, and a wide range of related tools has been created in recent years (15–17). Several microarray analysis pipelines are available; notably, Expression Profiler (18), GeneXPress (19) and AMEN (20) incorporate clustering methods with downstream analysis of annotations, sequence information and protein–protein interactions. The ambiguity of clustering methods has created a need for algorithms that assess multiple clusters (21). Some previously published tools also use functional information for clustering (22–25). More recently, Ovaska et al. (26) combined clustering of genes based on semantic similarity of GO with heatmap visualization. However, the above comprise downloadable software that requires additional data and expensive local computations. Our web server, on the other hand, provides the latest information from public databases and uses the speed-optimized algorithms of HappieClust (27) and g:Profiler (16) to provide fast clustering and functional profiling even for larger datasets. In conclusion, we believe that our server provides an enhanced and useful service to the community.

THE VisHiC SERVER

VisHiC (http://biit.cs.ut.ee/vishic, Figure 1) is a web server for integrated cluster analysis, interpretation and visualization of microarray data that:

  1. performs a fast approximate hierarchical clustering of a user-provided gene expression dataset;
  2. computes functional enrichments of all discovered clusters using GO, pathways and regulatory motifs;
  3. creates a compact heatmap dendrogram of the expression dataset, revealing the most important functional enrichments and hiding poorly annotated expression profiles.

Figure 1.
A biological case study with VisHiC. (a) Gene expression matrix and annotated dendrogram with significant clusters; (b) mitochondrion cluster (ID:31732), (c) muscle cluster (ID:36899), (d) annotation box of the mitochondrion cluster, appears when moving ...

The input of VisHiC is a gene expression matrix in plain tab-delimited or Gene Expression Omnibus SOFT format. Alternatively, one may use an expression matrix from our selection of example datasets. VisHiC supports a wide variety of gene, protein and probeset identifiers for human as well as most eukaryotic model organisms. The output of VisHiC is a compact gene expression matrix represented as a heatmap dendrogram, similar to the format used in many gene expression analysis applications. The analysis consists of three consecutive steps as described below.

Novel approximate algorithm allows rapid hierarchical clustering of gene expression data

The first stage of VisHiC analysis involves clustering of the input gene expression matrix.

Agglomerative hierarchical clustering (AHC) organizes the data into a dendrogram, i.e. a tree where every node represents a gene cluster (28). Nodes in the bottom of the hierarchy (i.e. leaf nodes) represent single-gene clusters, all nodes except leaves are made up of two smaller clusters, and the root node contains all genes in the dataset. The AHC algorithm starts from single-gene clusters, iteratively merges most similar neighbours and results in a hierarchical structure of N − 1 non-trivial clusters given a dataset of N genes.
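The merging procedure described above can be illustrated with a naive sketch in Python. This is not the HappieClust algorithm used by the server; it is a plain cubic-time version for illustration. The text does not state a linkage criterion, so average linkage is an assumption here, as are the Euclidean default distance and the function names.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles (assumed default)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def naive_ahc(profiles, dist=euclidean):
    """Naive agglomerative clustering with average linkage (an assumption).

    Returns the merged clusters in merge order, each as a frozenset of row
    indices. A dataset of N profiles yields N - 1 merged clusters.
    """
    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(dist(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([i]) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Find and merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        merged = a | b
        clusters.append(merged)
        merges.append(merged)
    return merges
```

The last element of the returned list is the root cluster containing all genes, and the list as a whole corresponds to the non-leaf nodes of the dendrogram.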

Computational speed is an important consideration of AHC, as the standard algorithm requires all pairwise distances between expression profiles. This amounts to around 200 million distances for an average mammalian genome. The VisHiC server incorporates HappieClust, our novel approximate version of the AHC algorithm (27). Instead of computing all pairwise distances, HappieClust takes advantage of pivot-based similarity heuristics to calculate all distances between similarly expressed genes as well as a random subset of more distant pairs. Since only a subset of all pairwise distances is calculated, HappieClust approximates the full AHC based on the pairwise distances that have been calculated during the process. Computational experiments with public microarray data show that HappieClust produces a biologically comparable analysis an order of magnitude faster than standard AHC.
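The figure of around 200 million distances can be checked with the count of unordered gene pairs. The value N = 20 000 is an illustrative assumption for the number of genes in a mammalian genome; it is not stated in the text.

```latex
\binom{N}{2} = \frac{N(N-1)}{2},
\qquad
N = 20\,000 \;\Rightarrow\; \frac{20\,000 \times 19\,999}{2} = 199\,990\,000 \approx 2 \times 10^{8}
```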

Pearson correlation is the default measure in VisHiC for determining similarity between expression profiles. Alternatively, one may apply the negative correlation measure that detects inverse correlation patterns such as those shared by a repressor and its targets. Absolute correlation is a combination of the two, as it detects both direct and inverse similarity.
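One way to read the three measures is as distances derived from the Pearson correlation coefficient r. The mapping below (1 − r, 1 + r and 1 − |r|) is a common convention and an assumption of this sketch; the exact formulas used by the server are not given in the text, and the function and mode names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.

    Raises ZeroDivisionError for a constant profile, where r is undefined.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y, mode="pearson"):
    """Distance between profiles under an assumed correlation-to-distance mapping."""
    r = pearson(x, y)
    if mode == "pearson":
        return 1 - r        # small when profiles rise and fall together
    if mode == "negative":
        return 1 + r        # small when profiles are inversely correlated
    if mode == "absolute":
        return 1 - abs(r)   # small for either direct or inverse correlation
    raise ValueError(f"unknown mode: {mode}")
```

Under this mapping, a repressor and its target with mirrored profiles would be far apart in the default mode but close in the negative and absolute modes.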

Functional enrichment analysis reveals optimal gene clusters of biological relevance

The second stage of VisHiC analysis involves functional enrichment analysis of all detected clusters to infer the optimal clustering.

A common strategy for partitioning a hierarchical clustering involves a dendrogram cut-off. However, it is difficult to provide a biologically plausible cut-off value, as gene expression profiles are not uniformly distributed and a fixed cut-off for different datasets does not guarantee stability.

In this work, we take a different approach and infer clusters using statistical analysis of functional annotations [refer to (29) for a relevant review]. We use our g:Profiler software (16) to profile all discovered clusters for GO terms (13), pathways of Reactome and KEGG (30,31), regulatory motifs of Transfac (32) and microRNA target sites of miRBase (33).

VisHiC applies the cumulative hypergeometric test to detect the significance of a functional annotation α, given that there are k genes in a cluster of n genes with an annotation α, and there are K annotated genes among the total of N genes in the genome:

\[
P(\alpha) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
\]
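The cumulative hypergeometric test described above can be computed directly from binomial coefficients. The following is a minimal sketch using the symbols k, n, K and N as defined in the text; the function name is hypothetical and this is not the g:Profiler implementation.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated genes in a cluster of n genes,
    when K of the N genes in the genome carry the annotation."""
    upper = min(n, K)
    # Sum the upper tail P(X = i) for i = k .. min(n, K) using exact integers.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

For example, with N = 10 genes of which K = 5 are annotated, a cluster of n = 2 genes that are both annotated (k = 2) gives 10/45, i.e. about 0.22. Exact integer arithmetic keeps the sketch simple, but a production implementation would more likely work with log-probabilities for speed.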

To evaluate the total enrichment in a given cluster, VisHiC computes a size-weighted annotation score q that summarizes enrichments of GO as well as pathways and regulatory motifs:

equation image

Alternatively, one may opt for a strategy that assigns the best log P-value to each cluster, giving more preference to clusters with specific annotations:

equation image

In order to reduce the amount of false positives resulting from numerous enrichment tests, VisHiC computes a special multiple testing correction that accounts for the hierarchical structure of GO (34). Standard corrections such as Bonferroni and Benjamini–Hochberg False Discovery Rate are also applicable.
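The two standard corrections named above can be sketched as follows. The hierarchy-aware correction of (34) is not described in enough detail here to reproduce, so it is not shown; the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An enrichment would then be kept when its adjusted P-value falls below the chosen significance threshold.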

Enrichment-driven pruning of clustering dendrogram creates a compact view of expression data

The final stage of VisHiC analysis creates a compact and biologically motivated clustering of the expression dataset to reveal its functional essence.

Hierarchical clustering places gene groups in a parent–child structure, where clusters up in the hierarchy naturally contain smaller clusters as subsets. Similarly, the GO comprises a structured vocabulary where smaller groups of specific annotations are contained in large general groups. Hence, one expects to see specific enrichments in child clusters and corresponding general annotations in parent clusters. As the clustering dendrogram contains a spectrum of hierarchically contained clusters from single genes to the whole genome, choosing an optimal cluster involves maximizing certain criteria within a branch.

We have devised the following two-stage greedy algorithm that determines the cluster structure based on functional annotations.

  • First, we look for dense clusters, i.e. clusters with a high annotation score q, or alternatively, the term with the strongest P-value m. We scan all groups of genes that have functional enrichments, greedily starting from the one that provides the strongest annotation score. A cluster is not considered if any of its child or parent clusters is already a dense cluster. Dense clusters are shown in the final output.
  • Second, we detect sparse clusters, i.e. groups of genes that have poor or no functional enrichments. We start the analysis from the root of the dendrogram and pass it recursively, compressing all clusters except the ones that contain dense clusters as child nodes. Sparse clusters are cut-off from the dendrogram and corresponding expression profiles are hidden in the heatmap.
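The two stages above can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's code: the dendrogram is represented as a hypothetical dict from each internal node to its two children, the scores dict holds only clusters with significant enrichments, and 'child or parent' is read as any descendant or ancestor.

```python
def select_dense(children, scores):
    """Stage 1: greedily pick non-nested clusters, best annotation score first.

    children: dict mapping each internal node to its (left, right) child nodes.
    scores:   dict mapping each significantly enriched node to its score.
    Returns the dense clusters and the set of nodes lying above them.
    """
    parent = {c: p for p, kids in children.items() for c in kids}

    def ancestors(node):
        while node in parent:
            node = parent[node]
            yield node

    dense, above_dense = set(), set()
    for node in sorted(scores, key=scores.get, reverse=True):
        if node in above_dense:
            continue  # some descendant is already a dense cluster
        if any(a in dense for a in ancestors(node)):
            continue  # some ancestor is already a dense cluster
        dense.add(node)
        above_dense.update(ancestors(node))
    return dense, above_dense


def collapse_sparse(children, root, dense, above_dense):
    """Stage 2: walk down from the root and collect the subtrees to hide."""
    sparse, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in dense:
            continue  # shown in full in the final output
        if node not in above_dense:
            sparse.append(node)  # no dense cluster below: compress
            continue
        stack.extend(children.get(node, ()))
    return sparse
```

In a toy tree where the root R has children A and B, B has children C and a leaf b3, and the scores are A = 12, C = 9 and B = 5, the first stage selects A and C (B is skipped because C lies below it) and the second stage marks only b3 as sparse.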

Our annotation-driven clustering algorithm is fully automated and does not depend on user-defined cut-offs. Cluster boundaries are determined only from significant enrichments of functional terms. VisHiC excludes small (<5 genes) and large (>1000 genes) clusters from enrichment analysis for optimal running time. The user may choose a different range of cluster sizes, or disable all compression to view the full expression matrix with all related enrichments. All functional terms that remain significant after multiple testing correction are used for computing the optimal clustering. However, one may apply a more stringent P-value threshold to reduce the number of contributing enrichments and compress the matrix to a greater extent.

The resulting expression matrix is presented as a heatmap of gene activation and repression patterns, complete with a dendrogram that highlights functional groups of coexpressed genes. Colour-coded rectangles in the dendrogram denote dense clusters and related functional categories (GO, KEGG, Reactome, regulatory motifs, microRNA target sites). Cluster-specific functional annotations are additionally presented in a table and also appear when hovering over the dendrogram. The main window displays the compact heatmap with all highlighted clusters, while one may also ‘zoom in’ to view any cluster separately. In compact view, vertical branch stumps of the dendrogram mark places where sparse clusters are compressed. The user may search for genes of interest, or conduct further analysis via hyperlinks to external resources, e.g. browse related functional categories via the GO web site or g:Profiler.

Results: expression profiles of heart tissue of cardiovascular patients contain clusters related to muscle, mitochondria and extracellular matrix

We present a case study to demonstrate the use of VisHiC in biological analyses (Figure 1). The example comprises a microarray dataset of myocardial remodelling, including 38 samples from 3 clinical groups of patients with ischemic, non-ischemic and myocardial infarction, taken before and after left ventricular assist device implantation [available in GEO as part of the series GSE974 (35)]. We clustered the dataset, detected optimal clusters with best enrichments and visualized the resulting expression matrix (Figure 1a). We used a custom stringent P-value threshold (P < 10^−7) and ‘best annotation’ cluster selection strategy with Pearson correlation measure to compress the matrix into a reasonable publication-sized format.

The best scoring clusters are related to mitochondrion (Figure 1b), muscle tissue (Figure 1c) and extracellular matrix, all of which are expected to be present in heart tissue expression profiles. Mitochondria produce adenosine triphosphate (ATP) and are the primary cellular energy generators. A recent publication underlines the importance of mitochondria in the heart and relates mitochondrial mutations to heart disorders (36).

The cluster with muscle tissue enrichments (ID:36899, see Figure 1e for expression profiles and Figure 1f for functional annotations) contains 420 probesets for 251 genes and has several strong enrichments (contractile fibre: P < 10^−28, muscle system process: P < 10^−22, cytoskeletal protein binding: P < 10^−19). In addition, our analysis reveals an enrichment for the binding site of serum response factor (SRF) (Transfac M01007, P < 10^−9). SRF is a known heart transcription factor with increased expression in congestive heart failure (37).

The case study shows that VisHiC successfully extracts relevant functional aspects of a dataset, and compresses it into an easily perceivable compact format that fits well on screen and paper.

DISCUSSION AND CONCLUSION

VisHiC (http://biit.cs.ut.ee/vishic/) is a public web server for clustering and interpreting gene expression data. The tool is designed to extract the most significant biological features of a microarray dataset in a single run. The main output is a compact global view of the expression matrix with only the most significant clusters shown and less pronounced patterns hidden away, while its interactive format leaves open ends for more detailed analyses. VisHiC provides stability to otherwise ambiguous clustering and performs the labour-intensive task of evaluating hundreds of redundant clusters in a rapid automated manner. The approximate hierarchical clustering and rapid functional analysis guarantee meaningful results even if the datasets are large.

Functional assessment of microarray datasets is an immediate application of VisHiC analysis, as annotations of highlighted clusters should relate to proposed hypotheses. Our approach is likely to be useful for large expression data warehouses, so that first broad overviews could be offered to users who are routinely browsing hundreds of datasets. One may use VisHiC to compare different datasets in the context of experimental conditions, global expression patterns and functional aspects. Integrating expression clusters with other types of experimental data like protein–DNA and protein–protein interactions may provide researchers with additional clues about gene regulation.

FUNDING

EU FP6 grants (ENFIN LSHG-CT-2005-518254 and COBRED LSHB-CT-2007-037730). Funding for open access charge: ERDF through the Estonian Centre of Excellence in Computer Science project.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We would like to thank Dr Alvis Brazma for early discussions on the method, as well as Anton Litvinenko for technical support, consulting and programming assistance. J.R. and M.K. acknowledge the Tiger University Program of the Estonian Information Technology Foundation.

REFERENCES

1. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 2001;29:365–371. [PubMed]
2. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–470. [PubMed]
3. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 2000;24:227–235. [PubMed]
4. Ge X, Yamamoto S, Tsutsumi S, Midorikawa Y, Ihara S, Wang SM, Aburatani H. Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. Genomics. 2005;86:127–141. [PubMed]
5. Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 2006;38:431–440. [PubMed]
6. Hu Z, Killion PJ, Iyer VR. Genetic reconstruction of a functional transcriptional regulatory network. Nat. Genet. 2007;39:683–687. [PubMed]
7. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, et al. ArrayExpress update–from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009;37:D868–D872. [PMC free article] [PubMed]
8. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res. 2007;35:D760–D765. [PMC free article] [PubMed]
9. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 2006;7:55–65. [PubMed]
10. Troyanskaya OG. Putting microarrays in a context: integrated analysis of diverse biological data. Brief Bioinformatics. 2005;6:34–43. [PubMed]
11. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–14868. [PMC free article] [PubMed]
12. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat. Genet. 1999;22:281–285. [PubMed]
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. [PMC free article] [PubMed]
14. Garge NR, Page GP, Sprague AP, Gorman BS, Allison DB. Reproducible clusters from microarray research: whither? BMC Bioinformatics. 2005;6(Suppl. 2):S10. [PMC free article] [PubMed]
15. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA. Global functional profiling of gene expression. Genomics. 2003;81:98–104. [PubMed]
16. Reimand J, Kull M, Peterson H, Hansen J, Vilo J. g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007;35:W193–W200. [PMC free article] [PubMed]
17. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009;4:44–57. [PubMed]
18. Kapushesky M, Kemmeren P, Culhane AC, Durinck S, Ihmels J, Körner C, Kull M, Torrente A, Sarkans U, Vilo J, Brazma A. Expression Profiler: next generation–an online platform for analysis of microarray data. Nucleic Acids Res. 2004;32:W465–W470. [PMC free article] [PubMed]
19. Segal E, Yelensky R, Kaushal A, Pham T, Regev A, Koller D, Friedman N. GeneXPress: a visualization and statistical analysis tool for gene expression and sequence data. Proceedings of the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB). 2004 Glasgow, UK.
20. Chalmel F, Primig M. The Annotation, Mapping, Expression and Network (AMEN) suite of tools for molecular systems biology. BMC Bioinformatics. 2008;9:86. [PMC free article] [PubMed]
21. Torrente A, Kapushesky M, Brazma A. A new algorithm for comparing and visualizing relationships between hierarchical and flat gene expression data clusterings. Bioinformatics. 2005;21:3993–3999. [PubMed]
22. Adryan B, Schuh R. Gene-Ontology-based clustering of gene expression data. Bioinformatics. 2004;20:2851–2852. [PubMed]
23. Okada Y, Sahara T, Mitsubayashi H, Ohgiya S, Nagashima T. Knowledge-assisted recognition of cluster boundaries in gene expression data. Artif. Intell. Med. 2005;35:171–183. [PubMed]
24. Seo J, Gordish-Dressman H, Hoffman EP. An interactive power analysis tool for microarray hypothesis testing and generation. Bioinformatics. 2006;22:808–814. [PubMed]
25. Dotan-Cohen D, Melkman AA, Kasif S. Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics. 2007;23:3335–3342. [PubMed]
26. Ovaska K, Laakso M, Hautaniemi S. Fast Gene Ontology based clustering for microarray experiments. BioData Min. 2008;1:11. [PMC free article] [PubMed]
27. Kull M, Vilo J. Fast approximate hierarchical clustering using similarity heuristics. BioData Min. 2008;1:9. [PMC free article] [PubMed]
28. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput. Surv. 1999;31:264–323.
29. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37:1–13. [PMC free article] [PubMed]
30. Vastrik I, D'Eustachio P, Schmidt E, Joshi-Tope G, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, et al. Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 2007;8:R39. [PMC free article] [PubMed]
31. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36:D480–D484. [PMC free article] [PubMed]
32. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34:D108–D110. [PMC free article] [PubMed]
33. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006;34:D140–D144. [PMC free article] [PubMed]
34. Reimand J. Gene ontology mining tool GOSt. University of Tartu: MSc thesis; 2006.
35. Hall JL, Grindle S, Han X, Fermin D, Park S, Chen Y, Bache RJ, Mariash A, Guan Z, Ormaza S, et al. Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiol. Genomics. 2004;17:283–291. [PubMed]
36. Fan W, Waymire KG, Narula N, Li P, Rocher C, Coskun PE, Vannan MA, Narula J, Macgregor GR, Wallace DC. A mouse model of mitochondrial disease reveals germline selection against severe mtDNA mutations. Science. 2008;319:958–962. [PMC free article] [PubMed]
37. Azhar G, Zhang X, Wang S, Zhong Y, Quick CM, Wei JY. Maintaining serum response factor activity in the older heart equal to that of the young adult is associated with better cardiac response to isoproterenol stress. Basic Res. Cardiol. 2007;102:233–244. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press