Bioinformatics. Sep 15, 2009; 25(18): 2425–2429.
Published online Jul 14, 2009. doi:  10.1093/bioinformatics/btp430
PMCID: PMC2735672

WebArrayDB: cross-platform microarray data analysis and public data repository

Abstract

Motivation: Cross-platform microarray analysis is an increasingly important research tool, but researchers still lack open source tools for storing, integrating and analyzing large amounts of microarray data obtained from different array platforms.

Results: An open source integrated microarray database and analysis suite, WebArrayDB (http://www.webarraydb.org), has been developed that features convenient uploading of data for storage in a MIAME (Minimal Information about a Microarray Experiment) compliant fashion, and allows data to be mined with a large variety of R-based tools, including data analysis across multiple platforms. Different methods for probe alignment, normalization and statistical analysis are included to account for systematic bias. Student's t-test, moderated t-tests, non-parametric tests and analysis of variance or covariance (ANOVA/ANCOVA) are among the choices of algorithms for differential analysis of data. Users also have the flexibility to define new factors and create new analysis models to fit complex experimental designs. All data can be queried or browsed through a web browser. The computations can be performed in parallel on symmetric multiprocessing (SMP) systems or Linux clusters.

Availability: The software package is available for use on a public web server (http://www.webarraydb.org) or can be downloaded.

Contact: xqxia70@gmail.com; mcclelland.michael@gmail.com; yipengw@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Large amounts of microarray experimental data are stored in public repositories, making cross-platform analysis of data from different sources (different laboratories, different platforms or both) an increasingly attractive and important research tool (Moreau et al., 2003). Such analyses are possible because biological treatments usually have a greater impact on measured expression than the noise of a cross-platform analysis (Chen et al., 2008; Larkin et al., 2005; Shippy et al., 2004). Moreover, the combined use of multiple platforms can overcome the inherent biases of individual platforms and so identify the more robust changes in gene expression profiles (Bosotti et al., 2007).

Currently available analysis packages do not provide all the required functions for cross-platform integration, normalization and statistical analysis of data from different sources. Integrative Array Analyzer (iArray; Pan et al., 2006) offers statistical cross-platform analysis functions but does not have probe alignment or data normalization features. MatchMiner (Bussey et al., 2003) is a powerful tool for matching genes and gene products from two platforms, but is not designed for statistical analysis. The Gene Expression Pattern Analysis Suite (GEPAS; Tárraga et al., 2008) integrates many tools for microarray data analysis, but it does not have data storage capability or cross-platform analysis functions. Other online platforms and public repositories are designed mainly for data storage and lack probe matching and cross-platform analysis functions: prominent examples include Expression Profiler (Kapushesky et al., 2004), ArrayExpress (Parkinson et al., 2007), the Stanford Microarray Database (SMD; Demeter et al., 2007), the Longhorn Array Database (LAD; Killion et al., 2003) and the BioArray Software Environment (BASE; Saal et al., 2002; Troein et al., 2006).

An earlier open source online platform for microarray data analysis, WebArray (Xia et al., 2005), did not offer a cross-platform analysis function, but provided an excellent framework for extension to WebArrayDB (http://www.webarraydb.org)—a database system and analysis suite that provides this function. In addition to traditional methods such as median and quantile for between-array normalization, WebArrayDB has integrated median rank scores (MRS), quantile discretization (QD; Warnat et al., 2005), gene quantile (GQ)—a quantile normalization for each individual gene among different platforms, and principal component analysis (PCA; Stoyanova et al., 2004). WebArrayDB provides standard statistical analysis methods, such as Student's t-test, eBayes-moderated t-test, Significance Analysis of Microarrays (SAM; Tusher et al., 2001), analysis of variance or covariance (ANOVA/ANCOVA) and non-parametric tests, as options for users to explore.

2 DATABASE INFRASTRUCTURE

WebArrayDB includes all fields required for MIAME-compliant microarray data storage (Brazma et al., 2001). Data are classified into five categories: ‘project’, ‘array’, ‘platform’, ‘protocol’ and ‘sample’. Each record in these tables is given a unique ID (‘MPMDB ID’), and all five categories have to be filled for MIAME compliance and subsequent data analysis. All tables in the database have been indexed to speed up queries even when the size of the dataset becomes very large.

The project table serves as the hub of information—most information is linked to a specific project in the database (Fig. 1 and Supplementary Fig. 1). Intrinsic relationships among project, array, platform, protocol and sample are directly linked by references between tables, which permits fast cross-table searching. When defining a platform, users may supply probe information, including user-defined IDs and gene IDs from other public databases, such as RefSeq, UniGene, etc. All of these IDs can serve as references for cross-platform probe alignment. Since there are extensive gene annotations in GO (Gene Ontology database, http://www.geneontology.org/; Ashburner and Lewis, 2002), WebArrayDB is also designed to facilitate the use of GO for probe searching. The GO database in WebArrayDB is updated monthly.
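The table relationships described above can be pictured with a small relational sketch. The table names, column names and sample rows below are hypothetical (the actual WebArrayDB schema is not reproduced in this article), and SQLite is used only to keep the example self-contained, whereas WebArrayDB itself runs on MySQL.

```python
import sqlite3

# Hypothetical, simplified schema: the project table is the hub, and
# arrays reference their project, platform and sample directly.
SCHEMA = """
CREATE TABLE projects  (id INTEGER PRIMARY KEY, title TEXT, release_date TEXT);
CREATE TABLE platforms (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protocols (id INTEGER PRIMARY KEY,
                        project_id INTEGER REFERENCES projects(id), name TEXT);
CREATE TABLE samples   (id INTEGER PRIMARY KEY,
                        project_id INTEGER REFERENCES projects(id), name TEXT);
CREATE TABLE arrays    (id INTEGER PRIMARY KEY,
                        project_id  INTEGER REFERENCES projects(id),
                        platform_id INTEGER REFERENCES platforms(id),
                        sample_id   INTEGER REFERENCES samples(id),
                        name TEXT);
CREATE INDEX idx_arrays_project ON arrays(project_id);
"""


def build_demo_database():
    """Create an in-memory database holding one illustrative record per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO projects VALUES (1, 'demo project', '2011-07-14')")
    conn.execute("INSERT INTO platforms VALUES (1, 'cDNA 20K')")
    conn.execute("INSERT INTO samples VALUES (1, 1, 'tumor 1')")
    conn.execute("INSERT INTO arrays VALUES (1, 1, 1, 1, 'array A')")
    return conn


def arrays_for_project(conn, project_id):
    """Cross-table search: arrays of one project together with their platform."""
    rows = conn.execute(
        "SELECT a.name, p.name FROM arrays a "
        "JOIN platforms p ON p.id = a.platform_id "
        "WHERE a.project_id = ? ORDER BY a.id",
        (project_id,),
    )
    return rows.fetchall()
```

Because each array row carries direct references to its project, platform and sample, a cross-table search is a single indexed join.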

Fig. 1.
Information organization in WebArrayDB.

The project table is linked to the ‘users’ table that contains the user information including user name and password (Fig. 1), enabling data access to be controlled based on user privileges. Every project has an associated release date which determines the public accessibility of the project. By default the project release date will be 2 years from the data deposit date to protect data privacy. The user can change the release date at the time the data is deposited or at any time thereafter.
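The release-date rule can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that a project becomes public on the release date itself, which the text does not specify.

```python
from datetime import date


def default_release_date(deposit: date, years: int = 2) -> date:
    """Default release date: a fixed number of years after the deposit date."""
    try:
        return deposit.replace(year=deposit.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return deposit.replace(year=deposit.year + years, day=28)


def is_public(release: date, today: date) -> bool:
    """Assumption: the project is public from the release date onward."""
    return today >= release
```

A user-chosen release date would simply replace the default value before the `is_public` check.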

WebArrayDB is powered by the affy (Gautier et al., 2004) and the Linear Models for Microarray Data (LIMMA, http://bioinf.wehi.edu.au/limma; Smyth, 2005) packages from Bioconductor (http://www.bioconductor.org/), an open source and open development software project for the analysis and comprehension of genomic data. Thus, many different formats of intensity files are recognized, including data from Affymetrix CEL files, Agilent Feature Extraction, ArrayVision, BlueFuse, GenePix, QuantArray (Version 3 or later), SMD and SPOT. Formats that affy and LIMMA do not recognize can be accepted when defined by the user in a tab-delimited text file, including data with more than two scanned channels.

WebArrayDB stores parsed data in database tables. The image files, intensity files, probe files, protocol files and other user-supplied raw data files are stored in the file system on servers with indices in the database.

3 DATA ANALYSIS

Data queried from the database can be directly subjected to analysis. WebArrayDB presents a variety of options for data preprocessing and differential analysis. Conservative default analysis methods and parameters are set so that novice users will be less likely to use flawed analysis strategies.

3.1 Data preprocessing

Data preprocessing includes cross-platform probe alignment, background correction and normalization. For cross-platform analysis, the primary concern is how to match probes from different platforms. Based on the intrinsic relationships between platforms, we offer three approaches to this issue.

  • Direct match Direct match is used when all probes are identical across microarray platforms.
  • Match by reference IDs Probes from two different platforms can be aligned if they share the same reference ID. IDs from well-known public databases, for example, UniGene ID or Ensembl ID, can serve as reference IDs, as can any user-defined category.
  • Match by file Users can align probes by providing a probe-mapping file, in which homologous probes are explicitly mapped.
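As an illustration of the second approach, matching by reference IDs amounts to grouping the probes of each platform by their reference ID and keeping the IDs present on both platforms. This is a minimal sketch with hypothetical function and variable names, not the WebArrayDB implementation.

```python
def align_by_reference(platform_a, platform_b):
    """Align probes of two platforms that share a reference ID.

    Each argument maps probe ID -> reference ID (None when unannotated).
    Returns reference ID -> (probes on platform A, probes on platform B)
    for every reference ID found on both platforms.
    """
    def group(platform):
        groups = {}
        for probe, ref in platform.items():
            if ref is not None:
                groups.setdefault(ref, []).append(probe)
        return groups

    groups_a, groups_b = group(platform_a), group(platform_b)
    shared = sorted(groups_a.keys() & groups_b.keys())
    return {ref: (sorted(groups_a[ref]), sorted(groups_b[ref])) for ref in shared}
```

With made-up probe names, `align_by_reference({"c1": "Hs.1", "c2": "Hs.1", "c3": "Hs.2"}, {"100_at": "Hs.1"})` keeps only the shared ID `"Hs.1"`, pairing the two cDNA probes with the single probe set. A user-supplied mapping file (the third approach) would replace the ID lookup with explicit probe pairs.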

If multiple platforms are involved, normalization within or between arrays of the same platform can be done directly on the raw data before probe alignment. After alignment, the whole data set can be normalized.
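To make the normalization step concrete, the following sketch shows textbook between-array quantile normalization on aligned data. It is not the code used by WebArrayDB, which relies on R packages, and it breaks ties by input order rather than averaging them.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    `arrays` is a list of equal-length lists, one per array, with probes
    aligned by position. Each value is replaced by the mean, across
    arrays, of the values that share its rank.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same aligned probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After this step each array contains exactly the same set of values, distributed according to the within-array ranks of the original measurements.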

3.2 Differential analysis

Users can analyze data based on either ratio or intensity. The ratio-based model is R = μ + ε, where R is the observed ratio, μ is the intercept (the expected ratio between the two groups) and ε is the Gaussian random error. Two samples are considered different if μ differs significantly from the value expected under the null hypothesis.
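For a single probe, the ratio-based model corresponds to a one-sample test on the replicate ratios. The sketch below assumes that R is a log2 ratio, so that the null hypothesis is μ = 0; both the assumption and the function name are illustrative rather than taken from WebArrayDB.

```python
import math
from statistics import mean, stdev


def one_sample_t(log_ratios, mu0=0.0):
    """t statistic and degrees of freedom for H0: mu == mu0 under R = mu + eps."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("at least two replicate ratios are required")
    standard_error = stdev(log_ratios) / math.sqrt(n)
    # Note: identical replicate values give a zero standard error and
    # therefore a ZeroDivisionError; real code would need to handle this.
    return (mean(log_ratios) - mu0) / standard_error, n - 1
```

The statistic is then compared with a t distribution on n − 1 degrees of freedom; moderated variants such as the eBayes t-test replace the per-probe variance with a shrunken estimate.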

More than one comparison among groups of data can be requested simultaneously. Furthermore, users may apply ‘+’, ‘−’ and parentheses to make more specific comparisons. For instance, given four groups, ‘(group1 + group2) − (group3 + group4)’ computes the global difference between array data supplied in the first two groups compared with array data supplied in the second two groups.
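One way to read such an expression is as a set of signed coefficients applied to the group means. The sketch below parses the expression with Python's `ast` module. Treating each group as contributing its mean with coefficient +1 or −1 is an assumption about the semantics, and the function names are hypothetical.

```python
import ast
from statistics import mean


def contrast_coefficients(expr):
    """Turn e.g. '(g1 + g2) - (g3 + g4)' into {'g1': 1, 'g2': 1, 'g3': -1, 'g4': -1}."""
    tree = ast.parse(expr.replace("\u2212", "-"), mode="eval")  # accept the typographic minus

    def walk(node, sign, coeffs):
        if isinstance(node, ast.Name):
            coeffs[node.id] = coeffs.get(node.id, 0) + sign
        elif isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Sub)):
            walk(node.left, sign, coeffs)
            walk(node.right, sign if isinstance(node.op, ast.Add) else -sign, coeffs)
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
            walk(node.operand, -sign if isinstance(node.op, ast.USub) else sign, coeffs)
        else:
            raise ValueError("only group names, '+', '-' and parentheses are supported")
        return coeffs

    return walk(tree.body, 1, {})


def contrast_value(expr, group_values):
    """Apply the coefficients to the mean of each group's values."""
    coeffs = contrast_coefficients(expr)
    return sum(c * mean(group_values[g]) for g, c in coeffs.items())
```

The expression is only parsed, never evaluated, so unsupported constructs are rejected instead of being executed.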

Fold-change analysis, Student's t-test, eBayes-moderated t-test (Smyth, 2004; Smyth et al., 2005), SAM test (Tusher et al., 2001), non-parametric tests (including Wilcoxon rank sum test, Kruskal–Wallis rank sum test and Friedman rank sum test) and ANOVA/ANCOVA are among the choices of algorithms for differential analysis of data in WebArrayDB.
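As an example of the non-parametric options, the Wilcoxon rank sum test can be sketched as follows. This version uses midranks for ties and a normal approximation without tie or continuity correction. It is a simplified illustration rather than the routine called by WebArrayDB, and statistical packages often compute exact P-values for small samples instead.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic of x, two-sided P-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    expected = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - expected) / sd
    return u1, math.erfc(abs(z) / math.sqrt(2))
```

Because the test uses only ranks, it is insensitive to monotone differences in scale between platforms.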

Mixed-effect model ANOVA plays a very important role in microarray data analysis (Churchill, 2002). ANOVA is capable of dealing with multiple factors. The default model in WebArrayDB is

E = μ + G + P + A + D + S + I + ε

where E is the observed log-transformed intensity value, μ is the theoretical ‘real’ log-transformed intensity value, ε represents the Gaussian random error with 0 as expected value and G is the group factor, which leads to effects of interest, e.g. treatment effects. P, A, D, S and I represent effects of platform, array, dye, sample and individual, respectively, among which array and individual are considered random effect factors. Based on the data to be analyzed, more or fewer factors might be used in specific analysis processes.

Experienced users can define new factors and create complicated analysis models. This enables WebArrayDB to analyze data from virtually any experimental design and thereby to retain relevance as methods continue to evolve.

3.3 Other analysis tools

Both raw and differentially analyzed data can be used for further analysis, including hierarchical clustering, correspondence analysis, between group analysis and plotting using genome position. A variety of high-quality charts in PDF and EPS formats can be produced to visualize analysis results.

3.4 Example

3.4.1 Data sources

A demonstration of a cross-platform analysis is used as a training example in every WebArrayDB account. This example uses two publicly available prostate cancer microarray datasets. One set was obtained using a custom-made cDNA microarray (20K chip, platform MPMDB ID:42) that contains 19 947 sequence-verified PCR-amplified human cDNAs representing 15 495 UniGene clusters (Dhanasekaran et al., 2005, project MPMDB ID:76). The other was obtained using a commercially available oligonucleotide microarray (Affymetrix U95A array, platform MPMDB ID:9) that contains 12 626 probe sets consisting of 25-base oligonucleotide probes (Welsh et al., 2001, project MPMDB ID:78). From the two datasets, 49 tumor samples (prostate cancer) and 21 non-tumor samples are analyzed in this example.

3.4.2 Options for analysis

Analysis options selected for this demonstration are illustrated in Figure 2. The IDs from the UniGene database (http://www.ncbi.nlm.nih.gov/UniGene) are used to match cDNA clones and Affymetrix probe sets between platforms. Within each study, the median value is used for expression values corresponding to probes of the same UniGene cluster. Genes not mapping to a UniGene cluster present in both microarray platforms are not considered for cross-platform analysis. For the integration and normalization of microarray measurements from different platforms, we apply quantile discretization (Warnat et al., 2005). A common reference sample is used in the two-color cDNA microarray study and the log2 ratios of the intensity values from experimental samples over the common reference sample are calculated for each individual array and used for further analysis. A non-parametric analysis method, the Wilcoxon rank sum test, is used for differential analysis.
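The per-cluster summarization described above can be sketched as follows: within one study, probes of the same UniGene cluster are collapsed to their median, and only clusters present in both studies are carried forward. Function names and probe identifiers are hypothetical.

```python
from statistics import median


def collapse_by_cluster(expression, probe_to_cluster):
    """Median expression per cluster for one array.

    `expression` maps probe ID -> value; `probe_to_cluster` maps
    probe ID -> cluster ID. Probes without a cluster are dropped.
    """
    grouped = {}
    for probe, value in expression.items():
        cluster = probe_to_cluster.get(probe)
        if cluster is not None:
            grouped.setdefault(cluster, []).append(value)
    return {cluster: median(values) for cluster, values in grouped.items()}


def shared_clusters(collapsed_a, collapsed_b):
    """Clusters measured in both studies, in a stable sorted order."""
    return sorted(collapsed_a.keys() & collapsed_b.keys())
```

For instance, three probes of one cluster with values 1.0, 3.0 and 5.0 collapse to 3.0, and a cluster seen in only one study is excluded by `shared_clusters`.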

Fig. 2.
Options selected in an analysis of two publicly available prostate cancer microarray datasets. See text for details.

3.4.3 Results

A total of 4690 probes are identified as common to both datasets. Of these, 661 are reported as differentially expressed between tumor and non-tumor samples at P < 0.01, and 267 are retained after false discovery rate adjustment by the step-up method of Benjamini and Hochberg (1995). Hierarchical clustering is performed on the 30 most significantly differentially expressed probes (Fig. 3). The clustering separates the samples into two major groups that correlate with their biological origin (tumor versus non-tumor) rather than with their platform. In general, discriminative gene sets found in two datasets on different platforms are likely to characterize tumor status more reliably than genes obtained from either dataset alone (Warnat et al., 2005).
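The false discovery rate adjustment mentioned above follows the Benjamini and Hochberg (1995) step-up procedure, which can be written compactly. This is a generic sketch of the procedure with a hypothetical function name, not code taken from WebArrayDB.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order.

    The k-th smallest of m P-values is scaled by m / k; a running minimum
    taken from the largest P-value downward keeps the result monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A probe is retained at a chosen false discovery rate when its adjusted P-value falls at or below that level.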

Fig. 3.
Heat map of the 30 most significantly differentially expressed probes between tumor and non-tumor samples. The tumor samples are marked at the top of the plot by a brown bar and the non-tumor group by a yellow bar. Arrays of the 20K platform are named ...

4 IMPLEMENTATION

WebArrayDB has been implemented on a LAMP system (a Linux server with Apache, MySQL and Python) in a typical browser/server model (Fig. 4). In a deployment, the WebArrayDB web server, database server and file server can be located on a single machine or on separate machines. Most modules are written in Python (http://www.python.org), while analysis functions are powered by the R language (http://www.r-project.org) (R Development Core Team, 2006) and Bioconductor (Gentleman et al., 2004). Our WebArrayDB installation is hosted on a Dell server with four CPU cores with hyper-threading technology, 24 GB of RAM, a 1 TB main hard disk and a 1 TB hard disk for backup. The configuration will be upgraded as the computational load and the volume of stored data increase.

Fig. 4.
Architecture of WebArrayDB.

Parallel computation can be done at two levels:

  • Multiple analysis requests from users can be processed simultaneously. In order to avoid too many active requests, WebArrayDB will automatically determine a maximum number of requests that can be processed simultaneously, limiting both the number per user and the total number, while keeping other requests waiting in the queue. The default values can be adjusted by the administrator.
  • Even in a single analysis request, computation can be distributed into many processes that run in parallel. The number of processes can be adjusted by the administrator. The package SNOW (Rossini et al., 2003) was adopted for this purpose, so Message Passing Interface (MPI), Parallel Virtual Machine (PVM) or SOCKET can be used for communication in parallel computation.
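The first level of parallelism, request admission with a per-user and a global limit, can be sketched as a small scheduler. The class name, method names and default limits are hypothetical; the sketch only illustrates the queueing behavior described above and does not start any real processes.

```python
from collections import Counter, deque


class RequestScheduler:
    """Admit analysis requests subject to a total and a per-user limit."""

    def __init__(self, max_total=4, max_per_user=2):
        self.max_total = max_total
        self.max_per_user = max_per_user
        self.running = Counter()  # user -> number of active requests
        self.waiting = deque()    # (user, request_id) in arrival order

    def submit(self, user, request_id):
        """Queue a request and return the IDs that may start now."""
        self.waiting.append((user, request_id))
        return self._dispatch()

    def finish(self, user):
        """Mark one of the user's requests as done and admit waiting ones."""
        self.running[user] -= 1
        return self._dispatch()

    def _dispatch(self):
        started, still_waiting = [], deque()
        while self.waiting:
            user, request_id = self.waiting.popleft()
            total_active = sum(self.running.values())
            if total_active < self.max_total and self.running[user] < self.max_per_user:
                self.running[user] += 1
                started.append(request_id)
            else:
                still_waiting.append((user, request_id))
        self.waiting = still_waiting
        return started
```

Requests that cannot start keep their arrival order in the queue, and an administrator would tune `max_total` and `max_per_user` to the available hardware.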

Although WebArrayDB is presented as a web server on the internet, a package is downloadable for those who want to build their own dedicated servers with Win32 or POSIX (Portable Operating System Interface) on symmetric multiprocessing (SMP) systems or Linux clusters. WebArrayDB is designed as a lightweight database with a user-friendly web interface that facilitates use by bench scientists. Although a curator is desirable, one is not required. WebArrayDB is well suited to individual researchers, laboratories or small research institutes that need to store, share and analyze microarray data. Installation and maintenance of the WebArrayDB server are likely to require only a few hours of assistance from IT staff.

5 TUTORIAL AND EXAMPLES

A web-based tutorial, presented in English, Chinese and Spanish at the WebArrayDB web site (http://www.webarraydb.org), shows how to upload data and to process a simple example. The input data and analysis results used in the tutorial (simple analysis) and this article (complex cross-platform comparison) are available for viewing by all WebArrayDB users. Analysis methods other than the preselected ones can be chosen for these examples, and results of these changes can be viewed and stored in the user-specific accounts. Thus, all new users have the opportunity to familiarize themselves with the powerful capabilities of WebArrayDB by browsing and editing both the simple and the complex examples in the ‘demo’ account upon first entry into the system.

ACKNOWLEDGEMENTS

This work was made possible by the generous support of Sidney Kimmel, Ira Lechner, Eileen Haag and Ron Neeley. We also thank Yong Jiang, Krzysztof Studziński, Rocio Canals and Sang-Ho Choi for testing WebArrayDB, and Fred Long for maintaining the server. This work was performed in the laboratory of Michael McClelland.

Funding: Prostate Cancer Foundation and Mary Kay Ash Foundation (in parts); National Institutes of Health (grants R01AI034829, R01AI052237, R01CA68822 and U01CA114810, in parts).

Conflict of Interest: none declared.

REFERENCES

  • Ashburner M, Lewis S. On ontologies for biologists: the gene ontology–untangling the web. Novartis Found Symp. 2002;247:66–80; discussion 80–83, 84–90, 244–52.
  • Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B. 1995;57:289–300.
  • Bosotti R, et al. Cross platform microarray analysis for robust identification of differentially expressed genes. BMC Bioinformatics. 2007;8(Suppl. 1):S5.
  • Brazma A, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 2001;29:365–371.
  • Bussey KJ, et al. MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol. 2003;4:R27.
  • Chen Q-R, et al. An integrated cross-platform prognosis study on neuroblastoma patients. Genomics. 2008;92:195–203.
  • Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat. Genet. 2002;32(Suppl. 2):490–495.
  • Demeter J, et al. The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res. 2007;35:D766–D770.
  • Dhanasekaran SM, et al. Molecular profiling of human prostate tissues: insights into gene expression patterns of prostate development during puberty. FASEB J. 2005;19:243–245.
  • Gautier L, et al. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20:307–315.
  • Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
  • Kapushesky M, et al. Expression Profiler: next generation–an online platform for analysis of microarray data. Nucleic Acids Res. 2004;32:W465–W470.
  • Killion PJ, et al. The Longhorn Array Database (LAD): an open-source, MIAME compliant implementation of the Stanford Microarray Database (SMD). BMC Bioinformatics. 2003;4:32.
  • Larkin JE, et al. Independence and reproducibility across microarray platforms. Nat. Methods. 2005;2:337–344.
  • Moreau Y, et al. Comparison and meta-analysis of microarray data: from the bench to the computer desk. Trends Genet. 2003;19:570–577.
  • Pan F, et al. Integrative array analyzer: a software package for analysis of cross-platform and cross-species microarray data. Bioinformatics. 2006;22:1665–1667.
  • Parkinson H, et al. ArrayExpress–a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007;35:D747–D750.
  • Rossini A, et al. UW Biostatistics Working Paper Series, Paper 193. WA: University of Washington; 2003. Simple parallel statistical computing in R.
  • R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2006. ISBN 3-900051-07-0.
  • Saal LH, et al. BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 2002;3:SOFTWARE0003.
  • Shippy R, et al. Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Genomics. 2004;5:61.
  • Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004;3:Article 3.
  • Smyth GK. LIMMA: linear models for microarray data. In: Gentleman R, et al., editors. Bioinformatics and Computational Biology Solutions using R and Bioconductor. New York: Springer; 2005. pp. 397–420.
  • Smyth GK, et al. The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005;21:2067–2075.
  • Stoyanova R, et al. Normalization of single-channel DNA array data by principal component analysis. Bioinformatics. 2004;20:1772–1784.
  • Tárraga J, et al. GEPAS, a web-based tool for microarray data analysis and interpretation. Nucleic Acids Res. 2008;36:W308–W314.
  • Troein C, et al. An introduction to BioArray Software Environment. Methods Enzymol. 2006;411:99–119.
  • Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA. 2001;98:5116–5121.
  • Warnat P, et al. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics. 2005;6:265.
  • Welsh JB, et al. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res. 2001;61:5974–5978.
  • Xia X, et al. WebArray: an online platform for microarray data analysis. BMC Bioinformatics. 2005;6:306.

Articles from Bioinformatics are provided here courtesy of Oxford University Press