Copyright © The Author 2005. Published by Oxford University Press. All rights reserved.

T-profiler: scoring the activity of predefined groups of genes using gene expression data

Swammerdam Institute for Life Sciences–Microbiology, University of Amsterdam, Biocentrum Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
1 Department of Biological Sciences, Columbia University, New York, NY 10027, USA
2 Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA

*To whom correspondence should be addressed. Tel: +1 212 854 9932; Fax: +1 212 865 8246; Email: Harmen.Bussemaker/at/columbia.edu

Received February 9, 2005; Revised April 18, 2005; Accepted April 18, 2005.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org

Abstract

One of the key challenges in the analysis of gene expression data is how to relate the expression level of individual genes to the underlying transcriptional programs and cellular state. Here we describe T-profiler, a tool that uses the t-test to score changes in the average activity of predefined groups of genes. The gene groups are defined based on Gene Ontology categorization, ChIP-chip experiments, upstream matches to a consensus transcription factor binding motif or location on the same chromosome.
If desired, an iterative procedure can be used to select a single, optimal representative from sets of overlapping gene groups. T-profiler makes it possible to interpret microarray data in a way that is both intuitive and statistically rigorous, without the need to combine experiments or choose parameters. Currently, gene expression data from Saccharomyces cerevisiae and Candida albicans are supported. Users can upload their microarray data for analysis on the web at http://www.t-profiler.org.

INTRODUCTION

An important technique in the post-genomic era is the simultaneous measurement of the transcript levels of all genes from a genome by microarray experiments (1,2). In recent years, the amount of data from such experiments has rapidly increased (3,4). Furthermore, the combination of chromatin-immunoprecipitation and microarray technology (‘ChIP-chip’) has made it possible to globally measure the binding of transcription factors to gene promoters (5,6).

There has also been an explosion in the number of computational methods for analyzing microarray data. Among the most popular are algorithms such as hierarchical clustering (7), K-means clustering (8) and self-organizing maps (9). A limitation of these clustering methods is the need to have gene expression profiles across multiple hybridizations. Alternative methods have been developed that can take a single genome-wide expression pattern as input, such as motif-based correlation or regression (10–12).

To obtain easily interpretable information on changes in the cellular state in terms of functional annotation, methods such as FunSpec (13), GO term finder (14), GOAL (15) and GeneXpress (http://genexpress.stanford.edu) score the significance of overlap between predefined gene groups [from Gene Ontology (GO) (16) or the MIPS database (17)] and the subset of induced or repressed genes. These methods are based on the cumulative hypergeometric distribution (also referred to as Fisher's exact test).
A disadvantage of these methods is that they require individual genes to be significantly up- or down-regulated in order to contribute to the score. We previously developed a method that can score GO categories without the need to apply cut-offs to the expression level of individual genes (18). This algorithm, now named T-profiler, uses the t-test to score the difference between the mean expression level of predefined groups of genes and that of all other genes on the microarray (see Methods). A similar approach was independently pioneered by Pavlidis et al. (19). T-profiler is currently suitable for the analysis of Saccharomyces cerevisiae and Candida albicans gene expression profiles, and in the near future will be extended to other organisms.

METHODS

For a given gene group G, the t-value is given by the following formula:
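A minimal sketch of this calculation, assuming the formula is the standard two-sample t-statistic with pooled variance (the text only states that the mean expression of the group is compared with that of all other genes): t_G = (mean_G − mean_G') / (s · sqrt(1/N_G + 1/N_G')), where G' is the set of all genes not in G, N is the number of genes in each set and s is the pooled standard deviation. The function name `group_t_value` and the toy expression values are hypothetical and serve only as illustration.

```python
import math
from statistics import mean, variance


def group_t_value(expression, group):
    """Two-sample t-value (pooled variance, an assumption) for the genes
    in `group` versus all other genes in `expression`.

    expression: dict mapping ORF name -> log-transformed expression value
    group: set of ORF names forming the predefined gene group G
    """
    in_group = [x for orf, x in expression.items() if orf in group]
    others = [x for orf, x in expression.items() if orf not in group]
    n_g, n_o = len(in_group), len(others)
    if n_g < 2 or n_o < 2:
        raise ValueError("need at least two genes inside and outside the group")

    # Pooled variance across the group and its complement.
    pooled_var = ((n_g - 1) * variance(in_group)
                  + (n_o - 1) * variance(others)) / (n_g + n_o - 2)
    if pooled_var == 0:
        raise ValueError("expression values have zero variance")

    return (mean(in_group) - mean(others)) / math.sqrt(
        pooled_var * (1 / n_g + 1 / n_o))


# Toy data: made-up log ratios for six ORFs, three of them in the group.
expression = {
    "YAL001C": 1.0, "YAL002W": 2.0, "YAL003W": 3.0,
    "YBR001C": -1.0, "YBR002C": 0.0, "YBR003W": 1.0,
}
group = {"YAL001C", "YAL002W", "YAL003W"}
t = group_t_value(expression, group)
print(round(t, 3))
```

With these toy values both sets have variance 1 and the means differ by 2, so the sketch gives t = √6 ≈ 2.449. A positive t-value indicates that the group is on average more highly expressed than the remaining genes, a negative one that it is less expressed.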
Gene groups sharing a common motif in their upstream region

Motif groups are defined as genes with a match to a particular consensus motif within 600 base pairs upstream of the open reading frame (ORF) (21), allowing no overlap with neighboring ORFs. The consensus motifs used in T-profiler are derived from three different sources. First, motifs were extracted from the SCPD database (http://cgsigma.cshl.org/jian/). Next, motifs were found by comparing the genome sequences of highly related yeast species (22,23). Finally, motifs discovered from various microarray experiments using the REDUCE algorithm (11,24) were added. Most of these motifs are similar or identical to motifs described in the literature. In total, 153 motif groups are included in the T-profiler calculation.

Far less information is available about regulatory sequences of C. albicans. It was recently reported that about one-third of S. cerevisiae regulatory elements are conserved in C. albicans (25). T-profiler therefore uses the list of S. cerevisiae motifs, supplemented with newly discovered C. albicans regulatory motifs, to score C. albicans expression data.

Gene groups bound by a common transcription factor based on ChIP-chip data

The binding of transcription factors to their global DNA targets can be measured by ChIP-chip experiments. In S. cerevisiae this technique has been explored on a large scale by Lee et al. (5) and Harbison et al. (6). We used the transcription factor binding (TFB) data for 203 transcription factors from Harbison et al. (6) as input into T-profiler; the binding of 84 of these regulators was measured under various environmental conditions. A gene was considered to be part of a TFB group if the P-value reported by the authors was <0.001. In addition, TFB groups were required to have at least seven gene members. This resulted in 252 TFB groups that were used for T-profiler analysis.

GO categories

The third type of gene group is based on membership of a specific GO category (16).
In GO, each gene is classified according to biological process, molecular function and cellular component. The GO gene group contains the genes associated with a specific GO category as well as all of its child categories. Only GO groups with more than six members were used for calculation. This resulted in 1389 GO-derived gene groups that were used for T-profiler analysis. Significant scores of GO groups give direct information about which functions or cellular processes are expected to have changed as a result of the altered gene expression. It should be kept in mind, however, that, unlike in the case of motif and ChIP-chip based gene groups, the t-values for GO categories are not directly related to a molecular mechanism.

Iterative removal of redundant gene groups

Several of the predefined gene groups scored by T-profiler show strong mutual overlap: the GO categories used by T-profiler are hierarchically organized; consensus motifs can match similar sequences; and ChIP-chip experiments can reveal that similar sets of genes are bound by different transcription factors and/or under different conditions. The t-values for overlapping gene groups are strongly correlated and therefore mutually redundant. Following the idea of forward selection of non-redundant motifs in REDUCE (11), we implemented an iterative procedure to select a non-redundant set of gene groups among those that have t-values significantly different from zero. At each step, we subtract the mean expression level of the genes in the gene group with the highest absolute t-value from all genes in that gene group. The t-values are then recalculated for all other gene groups, and the procedure is repeated until even the most significantly regulated gene group has a P-value > 0.05. In the case of nested GO categories at different levels in the hierarchy, this procedure will naturally select the most appropriate level for a given branch of annotation.

Aneuploidy test

Hughes et al. (26) described the discovery of chromosomal aberrations in yeast deletion mutants based on gene expression profiles. These are often duplications or deletions of an entire chromosome. By applying T-profiler at the level of whole chromosomes, where gene groups are defined as the set of all genes on a specific chromosome, it is possible to detect such aneuploidy. A statistically significant chromosomal t-value does not necessarily point, however, to aneuploidy, as it may also be caused by normal differential regulation by a transcription factor whose targets are preferentially located on the same chromosome. In the aneuploid dataset from Hughes et al. (26) we observed an absolute t-value > 10 for almost all deleted or duplicated chromosomes; such extreme t-values are therefore a good indicator of aneuploidy.

AN EXAMPLE

Gene expression datasets can be uploaded as a tab-delimited text file with the systematic ORF name in the first column and the log-transformed expression data in the second column. The upload of an expression profile comparing cells 80 min after a heat shift from 30 to 37°C from the Environmental Stress Response data set of Gasch et al. (23) will serve as an example. After uploading, the user is presented with some basic information about the dataset, including the number of genes, the average and the standard deviation (Figure 1A).
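The input format described above can be read with a few lines of Python. This is a sketch, not the server's own parser: the function names `read_profile` and `summarize`, the skipping of non-numeric rows (such as a header line), and the use of the sample standard deviation are all assumptions, and the three expression values are made up.

```python
import csv
import io
from statistics import mean, stdev


def read_profile(handle):
    """Parse a tab-delimited profile: systematic ORF name in column 1,
    log-transformed expression value in column 2."""
    profile = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # blank or incomplete line
        orf, value = row[0].strip(), row[1].strip()
        try:
            profile[orf] = float(value)
        except ValueError:
            continue  # header line or missing value (assumed to be skipped)
    return profile


def summarize(profile):
    """Basic dataset information: gene count, average, standard deviation."""
    values = list(profile.values())
    return {"genes": len(values), "mean": mean(values), "sd": stdev(values)}


# Toy upload with a header row and three made-up log ratios.
example = io.StringIO(
    "ORF\tlog_ratio\n"
    "YAL001C\t0.5\n"
    "YAL002W\t-1.5\n"
    "YAL003W\t1.0\n"
)
profile = read_profile(example)
summary = summarize(profile)
print(summary)
```

For a real file, `read_profile(open(path, newline=""))` would take the place of the in-memory example; the resulting dictionary has the shape expected by any per-group scoring function that maps ORF names to expression values.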
Next, the user can follow links to results for four different types of predefined gene groups: genes whose promoter region matches a specific consensus motif (Figure 1B)

CONCLUSION

T-profiler analyzes genome-wide expression patterns one experiment at a time, without the need to tune any parameters. Our use of the t-test to score gene groups eliminates the need to impose a threshold on the expression level of individual genes. A group can be scored as significantly induced or repressed even if the expression of none of its individual member genes changes significantly. This feature greatly increases the sensitivity to small-amplitude coordinate changes in the expression of groups of genes. Representing a transcriptome by a relatively small set of statistically robust and easily interpretable t-values allows for seamless comparison between hybridizations, even across different platforms and laboratories. We plan to extend the functionality of T-profiler to multiple experiments in the near future.

Acknowledgments

We would like to thank Merijn Schuurmans and Ania Zakrzewska for helpful discussions and for testing T-profiler, Reka Letso for a critical reading of the manuscript, and Xiang-Jun Lu for assistance in setting up the webserver. This work was supported by grants from the Netherlands Foundation for Technical Research (STW) to F.K. (APB.5504) and from the National Institutes of Health to H.J.B. (R01HG003008). Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

1. Schena M., Shalon D., Davis R.W., Brown P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–470.
2. Lockhart D.J., Dong H., Byrne M.C., Follettie M.T., Gallo M.V., Chee M.S., Mittmann M., Wang C., Kobayashi M., Horton H., et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 1996;14:1675–1680.
3. Brazma A., Parkinson H., Sarkans U., Shojatalab M., Vilo J., Abeygunawardena N., Holloway E., Kapushesky M., Kemmeren P., Lara G.G., et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003;31:68–71.
4. Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E., Fujibuchi W., Edgar R. NCBI GEO: mining millions of expression profiles—database and tools. Nucleic Acids Res. 2005;33:D562–D566.
5. Lee T.I., Rinaldi N.J., Robert F., Odom D.T., Bar-Joseph Z., Gerber G.K., Hannett N.M., Harbison C.T., Thompson C.M., Simon I., et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799–804.
6. Harbison C.T., Gordon D.B., Lee T.I., Rinaldi N.J., Macisaac K.D., Danford T.W., Hannett N.M., Tagne J.B., Reynolds D.B., Yoo J., et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104.
7. Eisen M.B., Spellman P.T., Brown P.O., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–14868.
8. Sharan R., Shamir R. CLICK: a clustering algorithm with applications to gene expression analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000;8:307–316.
9. Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S., Golub T.R. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA. 1999;96:2907–2912.
10. Jensen L.J., Knudsen S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics. 2000;16:326–333.
11. Bussemaker H.J., Li H., Siggia E.D. Regulatory element detection using correlation with expression. Nature Genet. 2001;27:167–171.
12. Conlon E.M., Liu X.S., Lieb J.D., Liu J.S. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA. 2003;100:3339–3344.
13. Robinson M.D., Grigull J., Mohammad N., Hughes T.R. FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics. 2002;3:35.
14. Boyle E.I., Weng S., Gollub J., Jin H., Botstein D., Cherry J.M., Sherlock G. GO:TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20:3710–3715.
15. Volinia S., Evangelisti R., Francioso F., Arcelli D., Carella M., Gasparini P. GOAL: automated Gene Ontology analysis of expression profiles. Nucleic Acids Res. 2004;32:W492–W499.
16. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000;25:25–29.
17. Mewes H.W., Heumann K., Kaps A., Mayer K., Pfeiffer F., Stocker S., Frishman D. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 1999;27:44–48.
18. Lascaris R., Bussemaker H.J., Boorsma A., Piper M., van der Spek H., Grivell L., Blom J. Hap4p overexpression in glucose-grown Saccharomyces cerevisiae induces cells to enter a novel metabolic state. Genome Biol. 2003;4:R3.
19. Pavlidis P., Lewis D.P., Noble W.S. Exploring gene expression data with class scores. Pac. Symp. Biocomput. 2002:474–485.
20. Heyer L.J., Kruglyak S., Yooseph S. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 1999;9:1106–1115.
21. van Helden J., Andre B., Collado-Vides J. A web site for the computational analysis of yeast regulatory sequences. Yeast. 2000;16:177–187.
22. Kellis M., Patterson N., Birren B., Berger B., Lander E.S. Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J. Comput. Biol. 2004;11:319–355.
23. Gasch A.P., Spellman P.T., Kao C.M., Carmel-Harel O., Eisen M.B., Storz G., Botstein D., Brown P.O. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell. 2000;11:4241–4257.
24. Roven C., Bussemaker H.J. REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res. 2003;31:3487–3490.
25. Gasch A.P., Moses A.M., Chiang D.Y., Fraser H.B., Berardini M., Eisen M.B. Conservation and evolution of cis-regulatory systems in ascomycete fungi. PLoS Biol. 2004;2:e398.
26. Hughes T.R., Roberts C.J., Dai H., Jones A.R., Meyer M.R., Slade D., Burchard J., Dow S., Ward T.R., Kidd M.J., et al. Widespread aneuploidy revealed by DNA microarray expression profiling. Nature Genet. 2000;25:333–337.
27. Hughes J.D., Estep P.W., Tavazoie S., Church G.M. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 2000;296:1205–1214.