Copyright © The Author 2005. Published by Oxford University Press. All rights reserved

RACE: Remote Analysis Computation for gene Expression data

DNA Array Facility, Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
1Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
*To whom correspondence should be addressed. Tel: +1 41 21 692 3909; Fax: +1 41 21 692 3905; Email: Beate.Sick/at/unil.ch

Received February 14, 2005; Revised April 12, 2005; Accepted April 25, 2005.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org

Abstract

The Remote Analysis Computation for gene Expression data (RACE) suite is a collection of bioinformatics web tools designed for the analysis of DNA microarray data. RACE performs probe-level data preprocessing, extensive quality checks, data visualization and data normalization for Affymetrix GeneChips. In addition, it offers differential expression analysis on normalized expression levels from any array platform. RACE estimates the false discovery rates of lists of potentially regulated genes and provides a Gene Ontology-term analysis tool for GeneChip data to support the biological interpretation and annotation of results. The analysis is fully automated but can be customized by flexible parameter settings.
To offer a convenient starting point for subsequent analyses, and to provide maximum transparency, the R scripts used to generate the results can be downloaded along with the output files. RACE is freely available for use at http://race.unil.ch.

INTRODUCTION

DNA microarrays are standard tools in biology and medicine. An increasingly long list of applications includes the identification of gene expression changes associated with changes in cell state (1,2), classifying clinical samples based on the underlying pathological characteristics (3,4), drug development (5) and the functional annotation of genes (6). A typical microarray experiment might measure expression levels of tens of thousands of genes, and systematic variations introduced into the datasets (e.g. variations in labeling efficiencies or scanner settings) can often obscure the biological variation that is of real interest. Furthermore, once differentially expressed genes have been identified, inferring function based simply on their expression pattern can be both arduous and ineffective. Hence, bioinformatics tools that facilitate rigorous data analysis and interpretation are of the highest importance. Presented here is Remote Analysis Computation for gene Expression data (RACE), a web server which provides some solutions to these problems.

Microarray data analysis typically begins with data quality checks and data normalization (7). Once normalized expression levels are determined, expression ratios can be calculated and differentially expressed genes identified. At this stage of data analysis, the magnitude and significance of gene expression changes as well as the false discovery rate are important measures. Once a list of differentially expressed genes is identified, the task often turns to describing and interpreting the biological significance of the results. An often useful approach is to compile a list of the Gene Ontology (GO) terms associated with the differentially expressed genes (8).
This provides an overview of the biological, physiological and cellular processes potentially involved in the biological phenomena and suggests directions for further studies.

A number of very useful web tools for microarray data analysis exist (e.g. 9–13). RACE contributes to the field by providing access to a wide range of quality checks, probe-level methods and state-of-the-art normalization techniques for Affymetrix raw data. To the best of our knowledge, these are not provided by any other publicly available server. Additionally, RACE provides tools to identify lists of differentially expressed genes and to determine and investigate the associated GO-term composition of those genes. To facilitate subsequent analyses and guarantee maximal transparency and reproducibility, the R script used to generate the results is provided.

SYSTEM AND MODULE DESIGN

RACE is divided into two components: the user interface and the analysis part. RACE uses basic authentication provided by Apache. HTTP communication is exclusively via port 80, making the system easily accessible through a firewall. Submitted jobs are queued and a customized analysis script is generated by a set of Perl scripts. The analysis script is executed in a subprocess. All statistical analysis is performed using the free high-level interpreted statistical language R (R Core, 2004, http://www.R-project.org) and various Bioconductor packages (http://www.Bioconductor.org). The design of the software is modular to facilitate the addition of further analysis tools.

User accounts

RACE can be used with an anonymous guest account but personal password-protected access is recommended. Registered users can store data in a personal account on the server, making it possible to run multiple tasks without the need to re-upload input files. Moreover, waiting times are avoided as the user is automatically emailed at the completion of a job.
For each job, RACE creates a directory for storing the input files, the selected parameters, the R script used and the results.

File handling

The upload files module allows users to upload and store files, decompress ZIP files and organize the data in different subdirectories. After setting the parameters in the analysis tools, the user is given the option of providing the input data either by a new upload or by copying or splitting previously uploaded or generated data. The download files module allows users to access their password-protected directories, to browse their data and to download or delete files. Every file is deleted automatically by the system 1 week after its creation.

RACE ANALYSIS TOOLS

RACE currently offers three analysis tools accessible via the web interface, namely Data Quality Checks & Normalization, Statistical Tests and GO-term Analysis. Each tool is structured into three sections. The first section contains links to three help pages describing the purpose and implemented methods, the required input data format and the output files generated. Parameters which are required for the analysis are set in the second section. Parameters which can be optionally changed to customize the output files generated are set in the third section. At the bottom of the second section the user can provide the data to be analyzed. After the submission of an analysis request, a confirmation message, including a link to the output page, is displayed. When the job is completed, authenticated users will receive the link to the output page by email. The output page contains the user data, the customized R script used for the analysis, all result files, ZIP archives and a log file which tracks job start and completion as well as problems that may have occurred during the run.
Data Quality Checks & Normalization tool

Purpose and required data input format

The Data Quality Checks & Normalization tool is dedicated to the visualization, quality checking and normalization of Affymetrix GeneChip data. Data should be provided as Affymetrix CEL files in ASCII format, optionally zipped.

Description

The Data Quality Checks & Normalization tool primarily uses methods implemented in the Bioconductor packages ‘affy’ (14) and ‘affyPLM’. For quality checking, the perfect match (PM) probe levels are summarized in spatial and density plots. Individual probes in each probe set are numbered starting from the 5′ end of the transcript, and the mean 5′ to 3′ probe intensity bias for each array is determined. The probe-level intensities for probe sets are summarized to define a measure of the individual gene expression. To make data from different arrays comparable, RACE provides several normalization methods. The first of these is MAS 5.0, the current Affymetrix default algorithm. However, several studies (15,16) suggest that measures based only on the PM probes outperform the MAS 5.0 algorithm. For this reason RACE also provides access to two of the most prominent PM-based algorithms: RMA (Robust Multichip Average; 17) and gcRMA (see the Bioconductor website: http://www.Bioconductor.org). RMA includes quantile normalization and a robust multi-array probe-level fit, and gcRMA additionally exploits sequence information for the background adjustment. Based on the normalized expression values, the Pearson correlation and the standard deviation of gene-wise expression differences between two arrays are calculated to evaluate the similarity of the gene expression profiles for each pair of samples. Moreover, a hierarchical clustering of the samples is built using Ward's minimum variance method.

Output

The principal output of this tool is a file containing normalized gene expression levels.
In addition, multiple data visualizations are provided to assist in judging the quality of the data and the success of the normalization.

Figure 1
An example of the output type ‘Bias 5′ to 3′end plot’ is shown in Figure 2
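The quantile normalization step that RMA includes, as mentioned in the Description of this tool, can be illustrated with a minimal Python sketch. This is not the Bioconductor ‘affy’ code that RACE runs; the function name `quantile_normalize` and the toy intensities are hypothetical, and ties are broken naively by position.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize equally long intensity lists (one list per array).

    Illustrative sketch only: ties are broken by position, whereas
    production implementations treat tied values more carefully.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(column) for column in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # order[k] is the index of the k-th smallest value in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three arrays and four probes.
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(chips)
for row in normalized:
    print(row)
```

After this step every array has the same set of values, so differences between arrays reflect only the rank of each probe within its own array.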
Owing to space limitations, the content and purpose of all other output graphs can be only briefly summarized in Table 1.

Statistical Tests tool

Purpose and required data input format

The Statistical Tests tool identifies genes which are differentially expressed between two groups. The input files for this tool are two expression matrices provided as tab-delimited ASCII files. The first column of both files must contain unique gene identifiers and all other columns contain normalized expression values of the samples corresponding to the different groups. The input files can be generated on the server by splitting the output file ‘NormExprLevels.txt’ from the first tool into two groups.

Description

The design of gene expression experiments can be represented in terms of a linear model (18). At the moment RACE supports designs where two groups are compared to identify genes changing expression across the groups. RACE uses the Bioconductor package ‘limma’ (http://bioinf.wehi.edu.au/limma/usersguide.pdf), which makes use of an empirical Bayesian approach, to fit the linear model. This approach outperforms a conventional t-test under conditions typical for microarray experiments (18–20). Owing to the large number of genes analyzed in a typical microarray experiment, an assessment of the effect of multiple testing is necessary. Therefore, we estimate from the distribution of raw p-values the fraction of the non-changing genes among all tested genes, as well as the false discovery rate (FDR) for each p-value threshold using the Bioconductor package ‘qvalue’ (21,22).

Output

The principal output consists of lists of potentially differentially expressed genes chosen according to user-specified fold-change and p-value thresholds. A separate overview list containing all genes, complemented by statistical measures and additional gene annotations (e.g. GeneSymbol and LocusID), is also provided.
RACE determines for each gene the fold-change, the logarithm of the fold-change (M), the mean expression level (A), the uncorrected p-value, the estimated FDR, the regularized t-value, the log odds ratio (B) and the standard deviations of the expression levels in each group. RACE provides multiple ways of visualizing these values. See Table 2 for an overview of the output graphs.
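As a rough illustration of some of the per-gene measures listed above, the following Python sketch computes M, A, the fold-change and the group standard deviations for one gene. It assumes that the expression values are on a log2 scale and that A is the mean over all samples of both groups; both points are assumptions, and `summarize_gene` is a hypothetical helper rather than part of RACE or ‘limma’. The regularized t-value and the log odds ratio (B) are not covered here.

```python
from statistics import mean, stdev


def summarize_gene(group1, group2):
    """Return M, A, fold-change and per-group SDs for a single gene.

    Assumes both groups hold log2-scale expression values (an assumption
    made for this sketch, not a statement about RACE's internals).
    """
    m_value = mean(group2) - mean(group1)   # log2 fold-change (M)
    a_value = mean(group1 + group2)         # mean log2 expression (A)
    return {
        "M": m_value,
        "A": a_value,
        "fold_change": 2.0 ** m_value,      # back-transform M to a ratio
        "sd_group1": stdev(group1),
        "sd_group2": stdev(group2),
    }


# Hypothetical log2 expression values for one gene in two groups.
control = [7.0, 7.2, 6.8]
treated = [9.0, 9.1, 8.9]
result = summarize_gene(control, treated)
print(result)
```

With these toy values the gene is about fourfold higher in the second group, which corresponds to M of about 2.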
Figure 3
Figure 4
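The multiple-testing step of the Statistical Tests tool, which estimates the fraction of non-changing genes and an FDR from the raw p-values, can be sketched roughly as follows. This is a simplified variant in the spirit of the approach in (21), not the ‘qvalue’ package itself: it uses a single fixed tuning value `lam` (an assumption made here for brevity), and the function names and p-values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the fraction of non-changing genes from p-values above lam.

    Under the null hypothesis p-values are uniform, so the density of
    p-values above lam approximates the share of non-changing genes.
    """
    m = len(pvalues)
    n_above = sum(1 for p in pvalues if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(pvalues, lam=0.5):
    """Return an estimated FDR (q-value) for each p-value used as threshold."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        q[idx] = running_min
    return q


# Hypothetical raw p-values for ten genes.
pvals = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95]
pi0 = estimate_pi0(pvals)
qvals = q_values(pvals)
print(pi0)
print(qvals)
```

A gene list cut at a given p-value threshold can then be annotated with the q-value of its least significant member as an estimate of the proportion of false positives in the list.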
GO-term Analysis tool

Purpose and required data input format

The aim of the GO-term Analysis tool is to assist in the biological interpretation of gene lists by identifying functional annotations (GO terms) which are enriched among the user-provided input genes. Users can choose among the different ontology categories and GO-term levels and can select threshold combinations for list coverage (minimum number of genes corresponding to each GO term) and statistical significance (p-value) for the overrepresentation of each GO term. GO terms which meet these criteria are reported together with the corresponding genes. A tab-delimited file containing Affymetrix identifiers in one column is required as input. Optionally, another column may contain log ratios, which can then be used to analyze the GO terms according to the under- or overexpression of the genes being analyzed. Gene lists generated by the Statistical Tests tool can be used directly as input files.

Description

GO (23) provides three structured, controlled vocabularies (ontologies) that describe gene products species-independently in terms of their associated biological processes, cellular components and molecular functions. GO terms are organized in directed acyclic graphs, representing networks where each term may be a ‘child’ (more specialized term) of one or more ‘parents’ (less specialized terms). The networks define the ‘is a’ or ‘part of’ relationships between terms and allow the grouping of all GO terms into different levels. As the GO term level increases, the informational specificity increases and the genome coverage decreases (24; also see http://www.geneontology.org/ for a more detailed description). RACE uses the Bioconductor meta-data packages for the mappings of Affymetrix identifiers to LocusLink identifiers and of LocusLink identifiers to GO terms. GO-term levels are derived from the ‘gene_ontology.obo’ text file provided by the Gene Ontology Consortium.
Based on the GO-term composition of all genes on the array used, a p-value is determined using a hypergeometric distribution for the overrepresentation of each GO term among the specified gene list. The ‘GOstats’ Bioconductor package was used to implement this method. For more information, see http://Bioconductor.org/Docs/Papers/2003/Compendium/GOstats.pdf.

Output

According to the user-specified parameters (GO-term type, GO-term specificity level, minimum number of genes annotated with a certain GO term, p-value threshold), a list of enriched GO terms is generated for the genes provided. For each enriched GO term, the numbers of supporting genes from the list as well as from the entire chip are reported and visualized. The counts of annotated and unannotated genes are reported as well. If the gene list corresponds to differentially expressed genes which are supplied with log ratios, the numbers of over- and underexpressed genes among the regulated genes are presented. To generate a ranking based on statistical significance, a p-value is calculated for the overrepresentation of GO terms based on the hypergeometric distribution. The results are summarized in bar graphs and tables.

Figure 5
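The hypergeometric overrepresentation test described for the GO-term Analysis tool can be written out as a short Python sketch. It computes the upper-tail probability of seeing at least the observed number of annotated genes in the list, given the annotation counts on the whole array. The function name, argument names and counts are hypothetical, and the sketch is not the ‘GOstats’ implementation.

```python
from math import comb


def hypergeom_enrichment_p(n_array, n_term_array, n_list, n_term_list):
    """P(X >= n_term_list) for a gene list drawn from the array.

    n_array      : annotated genes on the array
    n_term_array : array genes carrying the GO term
    n_list       : annotated genes in the user's list
    n_term_list  : list genes carrying the GO term
    """
    upper = min(n_term_array, n_list)
    total = comb(n_array, n_list)
    # Sum the probabilities of all outcomes at least as extreme as observed.
    tail = sum(
        comb(n_term_array, k) * comb(n_array - n_term_array, n_list - k)
        for k in range(n_term_list, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes on the array, 5 with the term,
# 4 genes in the list, 2 of them with the term.
p_value = hypergeom_enrichment_p(n_array=20, n_term_array=5, n_list=4, n_term_list=2)
print(p_value)
```

Ranking GO terms by this p-value, together with a minimum count of supporting genes, mirrors the threshold combination that the tool lets the user choose.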
SUMMARY

RACE offers an easy-to-use collection of bioinformatics web tools to analyze DNA microarray data, without requiring any installation or maintenance on the user side. By using various R subroutines and Bioconductor packages, RACE provides users with access to powerful statistical analysis tools without the need for specific expertise in their use. It offers different users or laboratories the possibility of performing data QC, normalization and analysis in a standardized way, which is likely to lead to more consistent and reproducible results.

Acknowledgments

We thank all the people providing and maintaining the excellent open source software on which RACE is based, i.e. Linux, Apache, Perl and R, together with CRAN and Bioconductor. We also thank Olivier Schaad for beta-testing RACE and providing numerous suggestions, Roberto Fabbretti from the Vital-IT Group of the Swiss Institute of Bioinformatics for hosting the RACE server, Otto Hagenbüchle and Johann Weber for comments on the manuscript, and Thierry Sengstag and the members of the DAFL for support. The DNA Array Facility is supported by the Etat de Vaud. Funding to pay the Open Access publication charges for this article was provided by the Etat de Vaud.

Conflict of interest statement. None declared.

REFERENCES

1. Chi J.-T., Chang H.Y., Haraldsen G., Jahnsen F.L., Troyanskaya O.G., Chang D.S., Wang Z., Rockson S.G., van de Rijn M., Botstein D., Brown P.O. Endothelial cell diversity revealed by global expression profiling. Proc. Natl Acad. Sci. USA. 2003;100:10623–10628.
2. Magee J.A., Abdulkadir S.A., Milbrandt J. Haploinsufficiency at the Nkx3.1 locus: a paradigm for stochastic, dosage-sensitive gene regulation during tumor initiation. Cancer Cell. 2003;3:273–283.
3. Roepman P., Wessels L.F., Kettelarij N., Kemmeren P., Miles A.J., Lijnzaad P., Tilanus M.G., Koole R., Hordijk G.J., van der Vliet P.C., et al.
An expression profile for diagnosis of lymph node metastases from primary head and neck squamous cell carcinomas. Nature Genet. 2005;37:182–186.
4. Chung C.H., Bernard P.S., Perou C.M. Molecular portraits and the family tree of cancer. Nature Genet. 2002;32:533–540.
5. Gerhold D.L., Jensen R.V., Gullans S.R. Better therapeutics through microarrays. Nature Genet. 2002;32(Suppl.):547–551.
6. Zhang W., Morris Q.D., Chang R., Shai O., Bakowski M.A., Mitsakakis N., Mohammad N., Robinson M.D., Zirngibl R., Somogyi E., et al. The functional landscape of mouse gene expression. J. Biol. 2004;3:21.
7. Yang Y.H., Dudoit S., Luu P., Lin D.M., Peng V., Ngai J., Speed T.P. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002;30:e15.
8. Desagher S., Severac D., Lipkin A., Bernis C., Ritchie W., Le Digarcher A., Journot L. Genes regulated in neurons undergoing transcription-dependent apoptosis belong to signaling pathways rather than the apoptotic machinery. J. Biol. Chem. 2005;280:5693–5702.
9. Colantuoni C., Henry G., Zeger S., Pevsner J. SNOMAD (Standardization and NOrmalization of MicroArray Data): web accessible gene expression data analysis. Bioinformatics. 2002;18:1540–1541.
10. Herrero J., Al-Shahrour F., Díaz-Uriarte R., Mateos A., Vaquerizas J.M., Santoyo J., Dopazo J. GEPAS: a web-based resource for microarray gene expression data analysis. Nucleic Acids Res. 2003;31:3461–3467.
11. Kapushesky M., Kemmeren P., Culhane A.C., Durinck S., Ihmels J., Korner C., Kull M., Torrente A., Sarkans U., Vilo J., Brazma A. Expression Profiler: next generation—an online platform for analysis of microarray data. Nucleic Acids Res. 2004;32:W465–W470.
12. Luscombe N.M., Royce T.E., Bertone P., Echols N., Horak C.E., Chang J.T., Snyder M., Gerstein M.
ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 2003;31:3477–3482.
13. Knudsen S., Workman C., Sicheritz-Ponten T., Friis C. GenePublisher: automated analysis of DNA microarray data. Nucleic Acids Res. 2003;31:3471–3476.
14. Gautier L., Cope L., Bolstad B.M., Irizarry R.A. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20:307–315.
15. Bolstad B.M., Irizarry R.A., Astrand M., Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193.
16. Naef F., Socci N.D., Magnasco M. A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics. 2003;19:178–184.
17. Irizarry R.A., Bolstad B.M., Collin F., Cope L.M., Hobbs B., Speed T.P. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15.
18. Smyth G.K. Linear Models and Empirical Bayes Methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004;3 Article 3.
19. Hatfield G.W., Hung S., Baldi P. Differential analysis of DNA microarray gene expression data. Mol. Microbiol. 2003;47:871–877.
20. Baldi P., Long A.D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001;17:509–519.
21. Storey J.D., Tibshirani R. Statistical significance for genome-wide experiments. Proc. Natl Acad. Sci. USA. 2003;100:9440–9445.
22. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B. 1995;57:289–300.
23. Harris M.A., Clark J., Ireland A., Lomax J., Ashburner M., Foulger R., Eilbeck K., Lewis S., Marshall B., Mungall C., et al.
The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–D261.
24. Dennis G., Sherman B.T., Hosack D.A., Yang J., Gao W., Lane H.C., Lempicki R.A. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol. 2003;4:P3.