# Comprehensive literature review and statistical considerations for microarray meta-analysis

^{1}Department of Biostatistics, ^{2}Department of Human Genetics, University of Pittsburgh, Pittsburgh and ^{3}Department of Statistics, Pennsylvania State University, University Park, PA, USA

## Abstract

With the rapid advances of various high-throughput technologies, generation of ‘-omics’ data is commonplace in almost every biomedical field. Effective data management and analytical approaches are essential to fully decipher the biological knowledge contained in the tremendous amount of experimental data. Meta-analysis, a set of statistical tools for combining multiple studies of a related hypothesis, has become popular in genomic research. Here, we perform a systematic search from PubMed and manual collection to obtain 620 genomic meta-analysis papers, of which 333 microarray meta-analysis papers are summarized as the basis of this paper and the other 249 GWAS meta-analysis papers are discussed in the next companion paper. The review in the present paper focuses on various biological purposes of microarray meta-analysis, databases and software and related statistical procedures. Statistical considerations of such an analysis are further scrutinized and illustrated by a case study. Finally, several open questions are listed and discussed.

## INTRODUCTION

With the rapid advances in biological high-throughput technology, generation of various kinds of genomic data is commonplace in almost every biomedical field. Effective data management and analytical approaches are essential to fully decipher the biological knowledge contained in the tremendous amount of experimental data. In the past decade, the accumulation of transcriptomic data mainly from microarray experiments was particularly significant, and resulted in several large public data depositories (such as Gene Expression Omnibus and ArrayExpress). Similarly, genome-wide association studies (GWAS) are another example: thousands of GWAS have been performed world-wide and results and/or raw data for many are publicly available (see companion review paper for GWAS meta-analysis). It is common that multiple transcriptomic studies or GWAS are available for the same or related disease condition and each study has relatively small sample size with limited statistical power. Combining information from these studies to increase sensitivity and validate conclusions is a natural step. Such genomic information integration is akin to the classical meta-analysis in statistics where results of multiple studies of a similar research hypothesis are combined for a conclusive finding.

A major distinction in the genome-wide setting compared with the classical one is that we are typically analyzing data on thousands of genes. We term genomic information integration in which we combine results from multiple transcriptomic studies or GWAS as ‘horizontal genomic meta-analysis’ (Figure 1A). Figure 1B demonstrates another type of multi-dimensional integrative analysis that combines multiple sources of -omics information on a given cohort of patients. The multi-dimensional -omics data usually include, but are not limited to, transcriptome profile, genotypes, DNA copy number variation, methylation, microRNA, proteome and phenome. Publicly available databases that provide this type of information include the Cancer Genome Atlas (TCGA; cancergenome.nih.gov) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET; target.cancer.gov). Integration of this type of data is called ‘vertical genomic integrative analysis’. In this article, we will focus on horizontal genomic meta-analysis through an extensive search of the PubMed database and manual literature referencing. Of the 582 papers related to genomic meta-analysis, we will concentrate on 333 microarray meta-analysis papers in this article. The other 249 GWAS meta-analysis papers are discussed in the companion paper. The goal of this article is threefold. First, we aim to provide a summary of the methodologies used in the microarray meta-analysis papers. In this light, the article can be viewed as a ‘meta’–meta-analysis paper. The second goal of the article is to provide a critique of the methodologies used in the literature. Finally, we outline some further issues in the field that need more attention.

**Figure 1.** (**A**) Horizontal genomic meta-analysis that combines different sample cohorts for the same molecular event. (**B**) Vertical genomic integrative analysis that combines different molecular events usually in ...

The article is structured as follows. ‘Comprehensive review’ section summarizes details of the comprehensive literature review. In ‘Purposes of Microarray Meta-Analysis’ and ‘Databases and Software’ sections, we discuss various purposes of microarray meta-analysis and related software and database resources. In ‘Meta-Analysis for DE Gene Detection’ section, we discuss statistical considerations behind meta-analysis for differentially expressed (DE) gene detection, an analysis commonly encountered in microarray meta-analysis. ‘Open questions’ section describes a list of open questions and further discussions. ‘Conclusion and discussion’ section provides final conclusions.

## COMPREHENSIVE REVIEW

Papers under review came from two sources: PubMed search and manual collection. A total of 745 papers were obtained by keyword search of the PubMed database on 29 December 2010 (see legend of Figure 2), and 102 papers were identified from cross-referencing accumulated in our research activities. After removing duplicates and irrelevant papers, a total of 620 distinct papers were formally reviewed and summarized. Among them, 22 papers belong to the vertical genomic integrative analysis category and 598 papers were horizontal genomic meta-analysis. Of the 598 papers, 333 papers were related to microarray meta-analysis, 256 papers were in the GWAS meta-analysis category and 9 papers were meta-analysis of other categories (e.g. copy-number variation or genome-wide linkage scan). The flow diagram is shown in Figure 2.

**Figure 2.** Legend (PubMed search keywords): (“meta-analysis”[Title/Abstract]) ...

Figure 3 illustrates a summary of our microarray meta-analysis review. Detailed information on the paper list and the categorization used to generate Figure 3 is available in the Supplementary Data. Of the 333 microarray meta-analysis papers, 7 (2%) were descriptive reviews without quantitative information integration, 42 (13%) were meta-analysis on one or several targeted genes (not at genome-wide scale) and the remaining 284 (85%) represented genome-wide meta-analysis on a global basis (Figure 3A). In Figure 3B, the 333 papers were categorized into review papers (11 papers; 3%), biological applications (201 papers; 60%), novel methodologies (83 papers; 25%) and database/software (38 papers; 12%). For different purposes of meta-analysis shown in Figure 3C, the majority of papers targeted DE gene or pathway detection (218 papers; 66%). Other purposes include ‘network or co-expression analysis’ (32 papers; 10%), ‘classification analysis’ (25 papers; 8%), ‘reproducibility or bias analysis’ (19 papers; 6%) and ‘others’ (34 papers; 10%). We will further survey these various meta-analysis purposes later in ‘Purposes of microarray meta-analysis’ section. Since two-thirds (218 papers; 66%) of the microarray meta-analysis papers were related to DE gene or pathway detection, which is conceptually an extension of traditional meta-analysis, we scrutinized this category and summarized four types of statistical methodologies used (Figure 3D). Of the 191 papers that could be clearly categorized, 81 papers (42%) used meta-analysis methods that combine *P*-values from individual studies, while 41 papers (22%) combined effect sizes, 18 papers (9%) combined ranks and 51 papers (27%) directly merged data after proper normalization. ‘Types of meta-analysis methods’ section will go over these four types of statistical methodologies in more detail.

## PURPOSES OF MICROARRAY META-ANALYSIS

When the term ‘microarray meta-analysis’ is used, it usually means meta-analysis for DE gene (or marker) detection. Although two-thirds of identified publications (Figure 3C) were of this type, microarray studies have also been combined for many other biological purposes, as described below.

### DE gene detection (218 papers)

DE gene detection is a commonly used downstream analysis in microarray studies that identifies genes differentially expressed across two or more conditions with statistical significance and/or biological significance (e.g. fold change). In the simple case of a single gene, this type of analysis is usually performed using a two-sample *t*-test or a Wilcoxon rank-sum test. However, when this analysis is performed genome-wide, a major issue is that many spurious associations are expected by chance. To counteract this problem, some type of multiple comparisons adjustment is usually done; a popular one is to use the *q*-value (1). The task is usually a first step to identify gene targets for understanding genetic mechanisms underlying a disease or for guiding the search for treatment targets. From Figure 3C, detection of DE genes covers two-thirds of papers (218 papers) in the microarray meta-analysis literature. Most existing methods or applications are for two-class comparison (e.g. identify DE genes comparing cases versus controls). Other types of outcome variables (e.g. multi-class, continuous, censored survival or time series) have also been considered in microarray meta-analysis (2). Details of these methods will be further described in ‘Types of meta-analysis Methods’ section.
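The multiple-comparison step can be sketched in a few lines of Python. This is a minimal illustration only: it uses the Benjamini–Hochberg adjustment, a related FDR-controlling procedure swapped in for simplicity rather than the *q*-value method cited above, and the function name and the per-gene *P*-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original gene order."""
    m = len(p_values)
    # Gene indices sorted from the smallest to the largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P-values from a two-sample test (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)

# Genes declared DE at an FDR threshold of 5%.
de_genes = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

In this toy input, the third and fourth genes have raw *P*-values below 0.05 but are no longer selected after adjustment, which is the behavior the adjustment is meant to provide.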

### Pathway analysis

Pathway analysis (a.k.a. gene set analysis) is a statistical tool to infer correlation of differential expression evidence in the data with pathway knowledge from established databases (3,4). The idea behind pathway analysis is to determine if there is enrichment in the detected DE genes based on an *a priori* defined biological category. Such a category might come from one or multiple databases such as Gene Ontology (GO; www.geneontology.org), the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/), Biocarta Pathways (http://www.biocarta.com/) and the comprehensive Molecular Signatures Database (MSigDB; http://www.broadinstitute.org/gsea/msigdb/). For the majority of recent microarray meta-analysis applications, pathway analysis has been a standard follow-up to identify pathways associated with detected DE genes [e.g. (5) and many others]. The result provides more insightful biological interpretation and it has been reported that pathway analysis results are usually more consistent and reproducible across studies than DE gene detection (6). Shen and Tseng (7) developed a systematic framework of Meta-Analysis for Pathway Enrichment (MAPE) by combining information at gene level, at pathway level and a hybrid of the two.
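One common way to quantify such enrichment is a one-sided hypergeometric test on the overlap between the detected DE genes and a pathway. The sketch below assumes that choice, which the methods cited above do not necessarily use; the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random from
    n_genes genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_de)
    total = comb(n_genes, n_de)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes on the array, 5 in the pathway,
# 4 detected as DE, 3 of which fall in the pathway.
p_enrich = enrichment_p_value(n_genes=20, n_pathway=5, n_de=4, n_overlap=3)
```

In a meta-analysis setting, such a test could be applied either to the DE gene list from the combined analysis or to each study separately before combining the pathway-level evidence, which corresponds to the gene-level and pathway-level strategies mentioned above.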

### Network and co-expression analysis (32 papers)

Co-expression analysis and network analysis of microarray data are used to investigate potential transcriptional co-regulation and gene interactions. Network analyses typically work with the gene–gene co-expression matrix, which represents the correlation between each pair of genes in the study. A crucial assumption is that a larger magnitude of co-expression between any pair of genes is associated with a greater likelihood that the two genes interact. Thus, networks of interactions between genes are inferred from the co-expression matrix. Many papers have extended this analysis to the meta-analysis scenario. Of the 32 papers identified, some directly merge multiple studies to construct a network as if from a single study (8–15). Others combine pairwise gene interaction evidence across studies by vote counting (16–18) or Fisher's (19,20) method, similar to meta-analysis for DE gene detection. Segal *et al*. (21) was probably the first large-scale microarray meta-analysis for network or co-expression analysis. They developed a ‘module map’ by combining 1975 arrays in 26 cancer studies to characterize expression behavior of 2849 modules collected from various sources (e.g. Gene Ontology, KEGG pathways and gene expression clusters). Wang *et al*. (22) formulated a regularized approach to combine multiple time-course microarray studies for inferring gene regulatory networks. Zhou *et al*. (23) proposed a 2nd-order correlation analysis to construct network and functional annotation by combining 39 yeast data sets. Huttenhower *et al*. (24) used a scalable Bayesian framework to combine studies for pairwise meta-correlation and predicted functional relationship. Wang *et al*. (25) developed a semi-parametric meta-analysis approach for combining co-expression relationships from multiple expression profile data sets to evaluate similarity and dissimilarity of gene network across species. Steele *et al*. (26) proposed a weighted meta-analysis Bayesian network based on combining statistical confidences attached to network edges and a consensus Bayesian network to identify consistent network features across all studies.

### Inter-study prediction analysis (25 papers)

Prediction analysis (a.k.a. classification analysis or supervised machine learning) is probably the most commonly applied microarray analysis that leads to clinical utility. In this type of analysis, the goal is to construct an improved discrimination between two or more study populations with accuracy beyond existing criteria in clinical practice (27). There now exists an extensive literature on classification methods for gene expression data; we refer the reader to Perez-Diaz *et al*. (28) for a recent review. In a single microarray study analysis, cross-validation has been routinely used by splitting the entire cohort into training and testing groups, constructing a prediction rule in the training group and finally validating in the test group. To demonstrate validity of microarray signatures or prediction models in other studies, two major strategies for developing prognostic signatures have been pursued. The first approach focuses on validity of biomarkers in external data. The prognostic signatures (a small number of genes) generated from training data are usually subsequently developed from a more traditional platform such as qRT–PCR. Reasons for failure of external validation in this regard have been widely surveyed and discussed in the literature (27,29–35). The second type of external validation focuses on inter-study prediction (i.e. construct a prediction model in one study and use the model to make predictions in another study). Although external validation of a gene expression-based prediction model has been shown valid in some publications (36,37), it has been found to be difficult in general. The failure of direct inter-study prediction is mainly due to discrepancy of probe design and experimental protocols across array platforms, plus possible heterogeneous patient cohorts across studies. 
Some reports avoided the major cross-platform obstacle by directly merging studies of the same platform (usually Affymetrix) to construct a prediction signature (38–42) and conventional cross-validation can be performed. Others developed sophisticated normalization techniques to solve or alleviate such a problem, including cross-platform normalization (XPN) (43), distance-weighted discrimination (DWD) (44), ratio-adjusted gene-wise normalization (rGN) (45) and module-based prediction (MBP) (46). In these approaches, data are normalized across studies so the prediction model can be applied across studies (47–50). Rank-based robust approaches have also been used (41,51).

### Reproducibility and bias analysis (19 papers)

Evaluating reproducibility and bias across microarray studies was an important topic, especially when array technology and experimental protocols were in an early developmental stage. Simple Pearson correlation and Venn diagrams have been widely used (52–55). Other sophisticated statistical measures have been proposed to quantify similarity of any two microarray studies, including integrative correlation coefficient (56), similarities of ordered gene lists (SOGL) (57,58), BayesGen (59) and co-inertia analysis (CIA) (60).

### Others (34 papers)

Additional purposes of microarray meta-analysis include: (i) discover or validate disease subtypes (61–65); (ii) predict unknown gene functions (66,67) or transcriptional regulations (13); (iii) dimension reduction (68); (iv) gene clustering (69). Targeted gene detections other than classical DE gene analysis have also been pursued. For example, phase-coupled models (70) or Bayesian approaches (71) have been used to combine multiple studies to detect periodic or cell cycle-related genes. Sequence information and gene expression have been combined for cyclic gene detection (72). Others have also combined large-scale microarray studies to identify house-keeping genes (defined as genes having consistent expression across various cellular or environmental changes) (73–75) or conversely highly variable genes (76,77).

## DATABASES AND SOFTWARE

### Databases

Many web databases are available for public storage and meta-analysis of microarray data sets. Gene Expression Omnibus (GEO) from NCBI and ArrayExpress from EBI are probably the two largest public repositories. On 3 April 2011, GEO contained 22170 data series and 546633 samples. Several other databases are housed in specific universities or groups, including Stanford Microarray Database (SMD), caArray at NCI, UPenn RAD Database, UNC Microarray Database, Yale Microarray Database, MUSC Database and UPSC-BASE. These websites are considered primary databases, where the main purpose is to provide downloadable and searchable microarray data sets. Other secondary databases import data sets from primary data archives, preprocess the data, perform in-depth analyses and deliver it through convenient interfaces for fast query, data mining and information integration. GEO Profiles and Gene Expression Atlas (78) are two secondary databases that accompany GEO and ArrayExpress. Other secondary databases include Genevestigator (79), ArrayTrack (80), Gemma, NextBio (81), LOLA (82), L2L (83), A-MADMAN (84), PrognoScan (85), MiMiR (86), Microarray retriever (87), TranscriptomeBrowser (88), M^{2}DB (89), MAMA (90) and GeneSigDB (91). These tools contain various types of gene signature, regulatory network and differential expression information available for fast query, retrieval and evaluation.

In addition to the general-purpose microarray databases listed above, many databases are specialized to particular disease or species, including aging databases [AGEMAP (92) and Gene Age Nexus (93)], Pancreatic Expression database (94), COXPRESdb for gene networks in mammals (95), CYCLONET for cell cycle regulation (96), HCNet for heart and calcium functional network (14), and general cancer databases [Oncomine (97) and Cancer Genome Workbench (CGWB) (98)]. Of these, Oncomine has been used and cited widely in cancer research particularly when only a few targeted genes are scrutinized. While the statistical methods in these databases are relatively simple, a major advantage of these is the ease of use for biological scientists who are generating microarray data sets.

### Software

Despite the availability of many web databases and many microarray meta-analysis methods (to be discussed in detail in the ‘Types of meta-analysis methods’ section), there exist surprisingly few user-friendly software packages for microarray meta-analysis implementation, in terms of their documentation and workflow. Compared with popular microarray packages (e.g. SAM, LIMMA or BRB array tool), existing meta-analysis packages are relatively primitive and difficult to use. In the R and Bioconductor environment, GeneMeta (implements fixed and random effects model; http://www.bioconductor.org/packages/release/bioc/html/GeneMeta.html; version 1.24.20), metaMA (implements random effects model and Stouffer's method; http://cran.r-project.org/web/packages/metaMA/; version 2.1), metaArray (implements meta-analysis of probability of expression, POE; http://www.bioconductor.org/packages/release/bioc/html/metaArray.html; version 1.28.20) (99), OrderedList (compares ordered gene lists; http://www.bioconductor.org/packages/release/bioc/html/OrderedList.html; version 1.24.20) (100), SequentialMA (for determining sensitivity and judging whether more samples are needed to assure a firm conclusion) (101), RankProd (implements the rank product method; http://www.bioconductor.org/packages/release/bioc/html/RankProd.html; version 2.24.20) (102) and RankAggreg (implements various rank aggregation methods; http://cran.r-project.org/web/packages/RankAggreg/; version 0.4-2) (103) are available. GODiff (104) (http://fishgenome.org/bioinfo/godiff/index.htm version 1.2) allows investigation of functional differentiation across studies using Gene Ontology annotation. Integrative Array Analyzer (105) (http://zhoulab.usc.edu/iArrayAnalyzer.htm; version 1.1.13) provides data mining and visualization tools to combine studies for simple co-expression analysis and differential expression analysis. 
For visualization, UCSC Genome Browser (106) and Genome Graphs provide flexible tools to compare and explore multiple genomic studies. Other commercial packages, including JMP Genomics from SAS (http://www.jmp.com/software/genomics/index.shtml; version 5.1) and Partek Genomic Suite (http://www.partek.com/software), also provide similar or more advanced visualization and graphical tools but with less statistical information integration capabilities.

In addition to the scarcity of software packages in the field, the quality of software packages should be enhanced. The concept of ‘literate programming’ (107) (e.g. Sweave in R) has been developed for reproducible research and should be promoted in future software development. For example, all packages available in Bioconductor now meet this requirement. Such a programming practice allows users to easily understand the program design and rationale in the source code and allows other researchers to reproduce the results.

## META-ANALYSIS FOR DE GENE DETECTION

Ramasamy *et al*. (108) outlined seven-step practical guidelines for conducting microarray meta-analysis: ‘(1) Identify suitable microarray studies; (2) Extract the data from studies; (3) Prepare the individual datasets; (4) Annotate the individual datasets; (5) Resolve the many-to-many relationship between probes and genes; (6) Combine the study-specific estimates; (7) Analyze, present, and interpret results’. In the section below, we will focus on steps 6 and 7 for DE gene detection of microarray meta-analysis. We will discuss four major types of statistical meta-analysis methods in the ‘Types of meta-analysis methods’ section. In the ‘Statistical considerations behind the methods’ and ‘A case study’ sections, related statistical considerations and a case study are discussed to illustrate the issue of choosing a suitable method.

### Types of meta-analysis methods

As shown in Figure 3C, microarray meta-analysis for DE gene detection is a commonly encountered application. In this sub-section, we will discuss four categories of methods to combine information for DE gene detection: combine *P*-values, combine effect sizes, combine ranks and directly merge after normalization. In addition to these major categories, sophisticated latent variable approaches have also been developed.

#### Combining *P*-values (81 papers)

Combining *P*-values from multiple studies for information integration has a long history in statistical science. Compared with the other popular category of combining effect sizes (described below), it has two major advantages: simplicity and extensibility to different kinds of outcome variables. When the outcome variable is not binary (e.g. multi-class, continuous or censored survival), effect sizes may not be well defined, while association *P*-values can still be calculated. Below, we briefly introduce five *P*-value combination methods and use the examples in the ‘A case study’ section for illustration later. A further advantage of the *P*-value-based approaches is that they allow for standardization of the associations from genomic studies to a common scale.

Rhodes *et al*. (109) was among the earliest to demonstrate use of sophisticated statistical meta-analysis for DE gene detection. They applied the famous Fisher's method that sums minus log-transformed *P*-values. For example, two-sided *P*-values of the PTTG1 gene were obtained from differential expression analysis in four prostate cancer studies separately in Table 1. Fisher's statistic was calculated as *S*_{Fisher}=−2×[log(1.6×10^{−3})+log(4.7×10^{−7})+log(1.7×10^{−4})+log(4.7×10^{−7})]=88.52, where a larger Fisher score reflects stronger aggregated differential expression evidence. Instead of log-transformation, Stouffer's method (110) adopts an inverse normal transformation. In the PTTG1 example, *S*_{Stouffer}=[Φ^{−1}(1−*p*_{1})+Φ^{−1}(1−*p*_{2})+Φ^{−1}(1−*p*_{3})+Φ^{−1}(1−*p*_{4})]/√4 [where *p*_{k} is the *P*-value of study *k* and Φ^{−1} is the inverse cumulative distribution function of the standard normal distribution]. Similar to the Fisher score, smaller *P*-values result in larger Φ^{−1}(1−*p*_{k}) values and thus generate a larger Stouffer score to reflect stronger aggregated statistical evidence. For the third and fourth methods, minimum or maximum *P*-values are taken as the test statistics: *S*_{minP}=min(1.9E-5, 1E-20, 2E-5, 1E-20)=1E-20 and *S*_{maxP}=max(1.9E-5, 1E-20, 2E-5, 1E-20)=2E-5. Smaller minP or maxP statistics reflect stronger differential expression evidence. Conceptually, minP claims a DE gene if any of the combined studies has a small *P*-value, while maxP tends to be more conservative in that detected DE genes should have small *P*-values in all studies combined. Differences of these two methods that correspond to the two hypothesis settings will be discussed in the ‘Statistical considerations behind the methods’ section. Recently, Li and Tseng (111) introduced an adaptively weighted Fisher's method (AW) that characterizes effective studies contributing to the meta-analysis so that the meta-analysis result has better biological interpretation. Take the ‘TPM2’ gene in Table 1 as an example. AW searched all possible 0-1 weights for the four studies (a total of 2^{4}−1=15 possibilities) and identified (1,0,1,1) as the best adaptive weight, meaning that combination of the three effective studies (Lapointe, Varambally and Yu) contributes the best to the DE evidence in the meta-analysis. For all the five methods, statistical inference can be performed parametrically under the assumption that *P*-values are uniformly distributed under the null hypothesis or can be done non-parametrically by permutation-based analysis (109,112).
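The first four statistics can be sketched in Python using only the standard library. This is a minimal illustration under the parametric assumption that *P*-values are independent and uniformly distributed under the null hypothesis; the function names are hypothetical, and the input is the set of four PTTG1 *P*-values quoted in the Fisher calculation above.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def fisher(p_values):
    """Fisher's statistic and its P-value (chi-square with 2K df)."""
    k = len(p_values)
    stat = -2.0 * sum(log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom.
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, exp(-half) * total


def stouffer(p_values):
    """Stouffer's statistic and its one-sided normal P-value."""
    norm = NormalDist()
    # -inv_cdf(p) equals inv_cdf(1 - p) but is stable for very small p.
    stat = sum(-norm.inv_cdf(p) for p in p_values) / sqrt(len(p_values))
    return stat, norm.cdf(-stat)


def min_p(p_values):
    """minP statistic; under the null it follows Beta(1, K)."""
    stat = min(p_values)
    return stat, 1.0 - (1.0 - stat) ** len(p_values)


def max_p(p_values):
    """maxP statistic; under the null it follows Beta(K, 1)."""
    stat = max(p_values)
    return stat, stat ** len(p_values)


pttg1 = [1.6e-3, 4.7e-7, 1.7e-4, 4.7e-7]
fisher_stat, fisher_p = fisher(pttg1)      # fisher_stat is about 88.52
stouffer_stat, stouffer_p = stouffer(pttg1)
```

Permutation-based inference would replace the closed-form null distributions used here with an empirical null obtained by recomputing each statistic on label-permuted data.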

Despite availability of powerful statistical tools described above, many biological applications we surveyed chose to apply naïve Venn diagram (used in 21 papers in our survey) or vote counting methods (used in 24 papers) for convenience. Venn diagram is a useful visualization tool, when combining few (usually 2–4) studies, to demonstrate the intersection and union distribution of DE gene lists detected by each individual study under a fixed threshold (e.g. FDR=5%). The naïve diagram, however, does not perform real information integration but only displays a consistency summary. When many studies are combined, naïve vote counting is often chosen by biologists instead. For each gene, the method simply counts the number of studies with *P*-values under a given threshold (e.g. *P*<0.05). In the statistical literature, it is well known that vote counting is statistically inefficient (113,114). On the other hand, vote counting is useful when raw data and complete *P*-value information of all genes are unavailable while only a list of DE genes under certain *P*-value threshold is available. This happened frequently in many early microarray studies, in which DE gene lists were summarized in supplemental tables of publications but raw data were not uploaded to public domain. Due to the significant loss of information and efficiency, the vote counting method should be avoided whenever possible in the applications.
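A minimal Python sketch of naïve vote counting makes the loss of information concrete. The function name, gene labels and *P*-values are hypothetical.

```python
def vote_count(p_by_study, threshold=0.05):
    """For each gene, count the studies whose P-value is below threshold."""
    votes = {}
    for study in p_by_study:
        for gene, p in study.items():
            votes[gene] = votes.get(gene, 0) + int(p < threshold)
    return votes


# Hypothetical P-values for two genes in three studies.
studies = [
    {"geneA": 0.04, "geneB": 1e-20},
    {"geneA": 0.03, "geneB": 0.20},
    {"geneA": 0.049, "geneB": 0.30},
]
votes = vote_count(studies)
```

Here geneA receives three votes from three borderline *P*-values, whereas geneB receives a single vote even though one study reports *P*=1E-20. The magnitude of the evidence is discarded once each *P*-value is reduced to a yes/no call.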

#### Combining effect sizes (41 papers)

Many meta-analysis methods have been based on the assumption that the standardized effect sizes are combinable across studies. Fixed and random effects models (FEM & REM) are the two most popular approaches in this category. In FEM, the estimated effect size in each study is assumed to come from an underlying true effect size plus measurement error (that may come from experimental or population sampling error). In REM, each study further contains a random effect that can incorporate unknown cross-study heterogeneities in the model. Choi *et al*. (115) was among the first to apply these models to microarray meta-analysis. In a given application, a *Q*-statistic was used to determine the need for a random effects model and the underlying effect size was estimated under FEM or REM. Bayesian meta-analysis was also developed with Markov Chain Monte Carlo (MCMC) simulation to estimate the underlying effect size. Others have also developed different variations of effect size models (116–118).
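The FEM, the *Q*-statistic and the REM can be sketched for a single gene as follows. This is an illustration under stated assumptions rather than the implementation used in the cited papers: the between-study variance is estimated with the DerSimonian–Laird moment estimator, which is one common choice, and the function names, effect sizes and variances are hypothetical.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted estimate and its variance (FEM)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mu = sum(w * y for w, y in zip(weights, effects)) / total
    return mu, 1.0 / total


def cochran_q(effects, variances):
    """Cochran's Q statistic for cross-study heterogeneity."""
    mu, _ = fixed_effect(effects, variances)
    return sum((y - mu) ** 2 / v for y, v in zip(effects, variances))


def random_effects(effects, variances):
    """REM estimate using the DerSimonian-Laird between-study variance."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (cochran_q(effects, variances) - (k - 1)) / c)
    # Each study's variance is inflated by the between-study variance tau2.
    mu, var = fixed_effect(effects, [v + tau2 for v in variances])
    return mu, var, tau2


# Hypothetical standardized effect sizes and variances for one gene.
effects = [0.2, 0.8, 0.5]
variances = [0.04, 0.04, 0.04]
q_stat = cochran_q(effects, variances)
mu_fem, var_fem = fixed_effect(effects, variances)
mu_rem, var_rem, tau2 = random_effects(effects, variances)
```

Under homogeneity, *Q* is approximately chi-square distributed with *K*−1 degrees of freedom, so a large *Q* suggests using the REM. In this toy input both models give the same point estimate, but the REM variance is larger because it absorbs the cross-study heterogeneity.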

#### Combining ranks (18 papers)

One apparent downside of methods combining *P*-values or effect sizes is that the results can often be dominated by outliers. This can be a significant problem when thousands of genes are analyzed simultaneously, given the noisy nature of microarray experiments. Methods combining robust rank statistics are used to alleviate this problem. Instead of *P*-values or effect sizes, the ranks of DE evidence are calculated for each gene in each study. The product, mean (119) or equivalently sum (120) of ranks from all studies is then calculated as the test statistic. Permutation analysis can be performed to assess the statistical significance and to control FDR. Hong *et al*. (102) proposed a more advanced RankProd algorithm that calculates the product of the ranks of fold change in each inter-group pair of samples. In a follow-up comparative study, they showed its better performance as compared to Fisher's method and the random effects model (121). DeConde *et al*. (122) applied various ‘rank aggregation’ methods, which were developed for the meta-search problem for combining top-k lists in the computer science literature. The methods effectively aggregate the rankings of, say, the top 100 most upregulated or downregulated genes in each study.
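The basic rank sum and rank product statistics can be sketched as follows. This is the simple per-study ranking described above, not the RankProd algorithm of Hong *et al*., which ranks fold changes over inter-group sample pairs; the function names and DE scores are hypothetical, and ties are not handled.

```python
from math import prod


def ranks_descending(scores):
    """Rank 1 is the strongest DE evidence (largest score); no tie handling."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def combine_ranks(score_by_study):
    """Per-gene rank sum and geometric-mean rank product across studies."""
    per_study = [ranks_descending(s) for s in score_by_study]
    k = len(per_study)
    n_genes = len(score_by_study[0])
    rank_sum = [sum(r[g] for r in per_study) for g in range(n_genes)]
    rank_prod = [
        prod(r[g] for r in per_study) ** (1.0 / k) for g in range(n_genes)
    ]
    return rank_sum, rank_prod


# Hypothetical DE scores (e.g. absolute t-statistics) for three genes
# in two studies; the third gene is an outlier in the second study.
scores = [
    [3.0, 1.0, 2.0],
    [2.5, 0.5, 9.0],
]
rank_sum, rank_prod = combine_ranks(scores)
```

The outlying score of 9.0 only moves the third gene to rank 1 in that study, so it ends with the same combined rank as the first gene rather than dominating the result. Smaller combined ranks indicate stronger evidence, and significance would be assessed by permutation.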

#### Directly merging the raw data (51 papers)

Despite the concern of heterogeneity across studies, many microarray meta-analysis applications chose to normalize across studies and directly merge data sets for DE gene detection. This approach is often called ‘mega-analysis’, especially in GWAS meta-analysis. In microarray meta-analysis, such applications usually restrict selection of studies to the same or similar array platforms, e.g. a single Affymetrix U133 or multiple Affymetrix platforms (38,123). The collection of only Affymetrix arrays allows pre-processing by model-based robust multi-array (RMA) normalization (124) on the CEL files of all samples simultaneously. Others have developed advanced normalization techniques to eliminate cross-study discrepancy and allow direct merging of studies [e.g. XPN (43), DWD (44) and rGN (45)]. Although direct merging can be attractive in applications for its convenience, caution is needed because normalization is not guaranteed to remove all cross-study discrepancies. In fact, Goldstein *et al*. (125) demonstrated that RMA does not remove batch effects even when two studies are from the same lab and the same Affymetrix platform but performed at different times.

#### Latent variable approaches

There are more sophisticated approaches that attempt to model the pre-processed microarray data sets using latent variable-based models, with attendant inference using either expectation–maximization routines or Markov Chain Monte Carlo algorithms. For example, the probability of expression (POE) was a latent variable used in several papers that was not observable in the data but could be inferred from other observed variables. Papers of this category include metaArray (99), which employs two types of inferential strategies, frequentist and Bayesian (see the ‘Statistical considerations behind the methods’ section), for modeling data from multiple platforms, and XDE (126), which fits a joint parametric Bayesian model for multi-study meta-analysis. In particular, the latter paper shows some compelling simulation evidence for a joint modeling strategy using these latent variable models. For more specialized settings, Conlon *et al*. (127) and Fan *et al*. (71) have presented Bayesian modeling approaches for combining data from multiple microarray studies. The hierarchical models used in these papers are statistically more sophisticated than the methods described in the previous sections and offer the potential of pooling information across genes to sharpen inferences about which genes are differentially expressed. However, due to their complexity, they have not been used much in practice. One notable exception is Shen *et al*. (128), which applied a precursor of the metaArray algorithm to the identification of gene expression signatures for aggressive breast cancer.

### Statistical considerations behind the methods

#### Null and alternative hypothesis assumptions behind the methods

Although the concept of combining studies for meta-analysis is seemingly straightforward, the targeted biomarker characteristics implicitly reflected by the different statistical hypothesis settings behind the methods can vary. Following the convention of Birnbaum (129), Li and Tseng (111) presented two major hypothesis settings behind the microarray meta-analysis methods described in the ‘Types of meta-analysis methods’ section. Suppose *K* studies are combined and *θ*_{k} is the effect size of study *k*. The first hypothesis setting (HS_{A}) detects candidate genes differentially expressed in ‘all’ studies (*H*_{0}: *θ*_{k}=0 for one or more *k* versus *H*_{a}: *θ*_{k}≠0, 1≤*k*≤*K*), whereas HS_{B} identifies markers differentially expressed in ‘partial’ (one or more) studies (*H*_{0}: *θ*_{1}=…=*θ*_{K}=0 versus *H*_{a}: *θ*_{k}≠0 for one or more *k*). For example, Fisher's method takes the sum of log-transformed *P*-values as the test statistic. If, for a given gene, one study has a very significant *P*-value (e.g. *P*=1E-20) but none of the other studies has a significant *P*-value (e.g. the FOLR3 gene in the ‘A case study’ section), Fisher's method still yields a large Fisher's score and declares the gene a DE gene. As a result, Fisher's method pursues the second hypothesis setting, HS_{B}. Similarly, Stouffer, minP and AW, as well as rank sum and RankProd, adopt the hypothesis setting HS_{B}. On the other hand, the maxP method takes the maximum *P*-value as the test statistic. It requires that the *P*-values from all studies are small and thus pursues the first hypothesis setting, HS_{A}. The random effects model has the same hypothesis setting, in that all studies share the same overall effect size while each study may contain an additional random effect component. One might somewhat relax HS_{A} to detect genes differentially expressed in the ‘majority’ of studies (denoted as HS_{A−}). The vote counting method follows this relaxed hypothesis setting. The hypothesis setting of each method is presented in Table 1.
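
The contrast between HS_{B} and HS_{A} can be seen in a small numerical sketch. The Python code below is illustrative only and is not the implementation used in this review: it assumes independent studies and uniformly distributed null *P*-values, so that Fisher's statistic follows a chi-square distribution with 2*K* degrees of freedom and the minimum and maximum *P*-values follow Beta(1, *K*) and Beta(*K*, 1) distributions. The function names and the two example *P*-value vectors are hypothetical.

```python
import math


def fisher_pvalue(pvals):
    """Fisher's method: -2 * sum(log p) is chi-square with 2K df under the null.

    Assumes every P-value is strictly positive (zeros must be floored first).
    """
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # Fisher statistic divided by 2
    # Closed-form chi-square upper tail for even degrees of freedom (2K),
    # accumulated term by term to avoid computing large factorials.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def minp_pvalue(pvals):
    """minP: the smallest P-value, which is Beta(1, K) under the null."""
    smallest = min(pvals)
    if smallest >= 1.0:
        return 1.0
    # 1 - (1 - min p)^K, written with log1p/expm1 to stay accurate for tiny P-values.
    return -math.expm1(len(pvals) * math.log1p(-smallest))


def maxp_pvalue(pvals):
    """maxP: the largest P-value, which is Beta(K, 1) under the null."""
    return max(pvals) ** len(pvals)


# Hypothetical P-values for one gene across K = 4 studies.
sporadic = [1e-20, 0.5, 0.7, 0.9]      # significant in a single study only
consistent = [0.01, 0.02, 0.01, 0.03]  # moderately significant in every study

for label, pvals in (("sporadic", sporadic), ("consistent", consistent)):
    print(label,
          "Fisher=%.2e" % fisher_pvalue(pvals),
          "minP=%.2e" % minp_pvalue(pvals),
          "maxP=%.2e" % maxp_pvalue(pvals))
```

Under these assumptions the sporadic gene receives a very small combined *P*-value from Fisher and minP but a large one from maxP, whereas the consistent gene is supported by all three. These are nominal per-gene *P*-values before any multiple comparison correction.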

#### Frequentist versus Bayesian inference

Implicit in the discussion about inference has been the use of a frequentist framework. In particular, we assume that there is a test statistic, larger values of which indicate stronger evidence against the null hypothesis. However, one could also perform Bayesian hypothesis testing using these hypotheses. This is done by considering the posterior probabilities of the specific hypotheses [e.g. P(*θ*_{1}=…=*θ*_{K}=0|data) versus P(*θ*_{k}≠0 ∀*k*|data)]. Computation of these posterior probabilities requires the use of a likelihood for the parameters of interest along with prior probabilities of the specific hypotheses being tested. The prior probabilities are typically selected based on the relative costs of a type I error (rejecting the null hypothesis when it is true) versus a type II error (accepting the null hypothesis when it is false). The larger the relative cost, the larger the prior probability for the null hypothesis should be. Bayesian hypothesis testing procedures are compatible with the latent variable models for meta-analysis described in the ‘Databases and software’ section. In the literature, another advantage of the Bayesian approach is the use of the Bayes factor, which does not require prior probabilities for the two hypotheses and can serve as an alternative to classical hypothesis testing.
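
The relationship between prior probabilities, the Bayes factor and the posterior probability can be written out explicitly. The identity below is the standard one and is given only as a sketch. It assumes that exactly two hypotheses, $H_0$ and $H_a$, are compared and that they carry prior probabilities $\pi_0$ and $1-\pi_0$; the symbols $\pi_0$ and $\mathrm{BF}_{a0}$ are notation introduced here for illustration.

$$
\frac{P(H_a \mid \text{data})}{P(H_0 \mid \text{data})}
= \underbrace{\frac{P(\text{data} \mid H_a)}{P(\text{data} \mid H_0)}}_{\mathrm{BF}_{a0}}
\times \frac{1-\pi_0}{\pi_0},
\qquad
P(H_0 \mid \text{data}) = \frac{\pi_0}{\pi_0 + (1-\pi_0)\,\mathrm{BF}_{a0}}.
$$

For instance, with $\pi_0 = 0.9$ and $\mathrm{BF}_{a0} = 100$, the posterior probability of the null is $0.9/(0.9 + 0.1 \times 100) \approx 0.083$. The Bayes factor itself does not involve $\pi_0$, which is why it can be reported without specifying prior probabilities for the hypotheses.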

#### Consistent up or downregulation

Among the first three categories of meta-analysis methods in the ‘Types of meta-analysis methods’ section, combining effect sizes (e.g. a random or fixed effects model) automatically identifies genes that have consistent up- or downregulation in all studies. This may not be the case for methods combining *P*-values or ranks if the *P*-values and ranks are obtained from two-sided hypothesis testing. In this case, up- and downregulation are treated as equally strong evidence, and a gene may be detected in the meta-analysis with strong upregulation evidence in one study but strong downregulation evidence in another, which leads to confusing conclusions. Theoretically, the discordance may reflect underlying biological truth due to population heterogeneity, but it may as well be a result of technical artifacts such as gene annotation mistakes or cross-hybridization. Distinguishing the two is often a difficult, if not impossible, task. A convenient way to avoid detecting genes with such discordance is to combine *P*-values or ranks from one-sided tests. For example, a modified Stouffer's method can apply a z-transformation that automatically utilizes one-sided tests and splits up- and downregulation evidence into positive and negative z-scores, respectively. Owen (130) applied a similar Pearson one-sided test adjustment for Fisher's method, and the modification can be extended to minP, maxP and other methods. Note that the consistent up- or downregulation issue exists only in two-class comparison in DE gene detection and does not apply to other types of response variables (e.g. multi-class, continuous or survival).
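
The one-sided Stouffer idea can be sketched in a few lines of Python. This is an illustrative, unweighted version and not necessarily the exact modification used in the literature. It assumes that each study reports a two-sided *P*-value from a symmetric test statistic together with the direction of change, so that halving the two-sided *P*-value gives the one-sided *P*-value in the observed direction. The function names, the clamping constant `eps` and the example values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def one_sided_up(two_sided_p, upregulated):
    """One-sided P-value for the 'upregulated' alternative.

    Assumes a symmetric test statistic, so half of the two-sided P-value
    lies in the observed direction.
    """
    half = two_sided_p / 2.0
    return half if upregulated else 1.0 - half


def stouffer(one_sided_pvals, eps=1e-15):
    """Unweighted Stouffer combination of one-sided P-values.

    Returns the combined z-score and its two-sided P-value. P-values are
    clamped to [eps, 1 - eps] (an assumption made here for numerical safety).
    """
    z_scores = []
    for p in one_sided_pvals:
        p = min(max(p, eps), 1.0 - eps)
        # Small one-sided P-values map to large positive z (upregulation),
        # P-values near 1 map to large negative z (downregulation).
        z_scores.append(-STD_NORMAL.inv_cdf(p))
    z = sum(z_scores) / math.sqrt(len(z_scores))
    two_sided = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    return z, two_sided


# Hypothetical gene: the same two-sided P-values in four studies,
# with either concordant or discordant directions of change.
two_sided_pvals = [0.001, 0.002, 0.001, 0.004]
concordant = [True, True, True, True]
discordant = [True, True, False, False]

z_conc, p_conc = stouffer(
    [one_sided_up(p, up) for p, up in zip(two_sided_pvals, concordant)])
z_disc, p_disc = stouffer(
    [one_sided_up(p, up) for p, up in zip(two_sided_pvals, discordant)])
print("concordant: z=%.2f, P=%.2e" % (z_conc, p_conc))
print("discordant: z=%.2f, P=%.2e" % (z_disc, p_disc))
```

With concordant directions the z-scores accumulate, whereas with discordant directions positive and negative z-scores largely cancel and the gene is no longer detected.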

### A case study

To illustrate some properties of the methods described in the ‘Types of meta-analysis methods’ section, we performed a simple case study. The motivation of this small case study was to help understand how the algorithm of each method works and to explain the pros and cons of each method. The result provides general insight for selecting an adequate method in applications. This case study is, however, neither comprehensive nor conclusive enough as a comparative study to judge the performance of the methods. In this case study, four prostate cancer expression profiles (Lapointe, Tomlins, Varambally and Yu) containing metastasis versus primary tumor samples were combined for meta-analysis. After gene matching by official gene symbols, pre-processing and filtering, 4260 genes were analyzed in the meta-analysis. We used the R package ‘siggenes’ to perform DE gene analysis in each study. ‘siggenes’ implements the Significance Analysis of Microarrays (SAM) method and the Empirical Bayes Analyses of Microarrays (EBAM) method. For simplicity, we applied the popular SAM method with B=500 permutations. According to Phipson and Smyth (131), the *P*-values from permutation analysis should never be zero, but the ‘siggenes’ package does occasionally generate zero *P*-values. If *P*=0 was obtained for a certain gene in an individual study, we set it to *P*=1E-20 to avoid failure of the logarithmic or inverse normal transformation in Fisher's and Stouffer's methods. After the *P*-values were generated, the Benjamini–Hochberg procedure was applied to calculate *q*-values and correct for multiple comparison (the ‘p.adjust’ function in R was used). The random effects model was implemented using the ‘GeneMeta’ package in R. The RankSum and RankProd methods were performed in the R package ‘RankProd’. In the ‘RankProd’ package, the RankSum and RankProd methods could only be implemented with up- and downregulation analyzed separately.
Theoretically, it is easy to modify the algorithm to analyze up- and downregulation simultaneously. For the vote counting method, a gene is declared DE if its *P*-value is smaller than a threshold *P* in at least *S* of the four studies combined. In Table 1, we list results for *P*=0.01 or 0.05 and *S*=3 or 4. Table 1 shows the results of the four single-study analyses and the nine meta-analysis methods for four selected genes.
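
The two *P*-value handling steps above, flooring zero *P*-values and Benjamini–Hochberg adjustment, can be sketched as follows. The case study itself used R; this Python version is only an illustration of the same arithmetic. The function names and the five example *P*-values are hypothetical, and the adjustment follows the step-up formula that R's ‘p.adjust’ applies for method "BH".

```python
def floor_zero_pvalues(pvals, floor=1e-20):
    """Replace P-values below `floor` (including exact zeros) by `floor`."""
    return [max(p, floor) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down to the smallest, so that each
    # adjusted value is the minimum of p * m / rank over all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P-values for five genes in one study; the first is an exact zero.
raw = [0.0, 0.01, 0.04, 0.03, 0.5]
floored = floor_zero_pvalues(raw)
qvals = benjamini_hochberg(floored)
for p, q in zip(floored, qvals):
    print("P=%.3g  q=%.3g" % (p, q))
```

Flooring before any logarithmic or inverse normal transformation keeps the downstream statistics finite, and the running minimum keeps the adjusted values monotone in the original *P*-values.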

The first example gene, ‘PTTG1’, was up-regulated in the metastatic group with strong statistical significance in all four studies (*P*=1.9E-5, 1E-20, 2E-5 and 1E-20). As expected, all nine meta-analysis methods concluded very strong statistical significance even after multiple comparison correction. As a comparison, the second selected gene ‘FOLR3’ was down-regulated with strong statistical significance in the Tomlins study (*P*=1E-20; fold change FC=0.58) but was not statistically significant in the other three studies (*P*=0.65, 0.96 and 0.43). Such sporadic high statistical significance in a subset of studies might be a result of unknown experimental artifacts (e.g. non-specific probe design that causes cross-hybridization in the cDNA array design) but might instead be a biological truth in the specific cohort. Fisher, minP, AW, RankSum and RankProd all obtained strong to moderate statistical significance after meta-analysis for this gene (see FOLR3 column in Table 1). This reflected the underlying HS_{B} hypothesis setting of these methods to detect a DE gene if the gene is differentially expressed in one or more studies (see ‘Statistical considerations behind the methods’ section). On the other hand, vote counting, the random effects model and maxP required a gene to be differentially expressed in all or ‘majority’ of the studies (i.e. hypothesis setting HS_{A}) and thus did not generate significant *q*-values. The third gene, ‘TPM2’, was differentially expressed in three studies (*P*=9.4E-7, 1E-20 and 1E-20 in Lapointe, Varambally and Yu) but not differentially expressed in Tomlins (*P*=0.92). Among the nine methods, it was detected by seven methods, excepting only maxP (*q*=0.13) and vote counting (S=4). This result shows that methods to detect genes differentially expressed in ‘all’ studies might be too stringent and could ignore an important marker gene when many studies are combined. 
It was interesting that the random effects model, although aimed at HS_{A}, provided enough robustness through its random effects assumption that TPM2 was statistically significant (*q*=0.02). The fourth example gene, ‘BRAF’, was differentially expressed in all four studies but, surprisingly, was down-regulated in two studies and up-regulated in the other two. Among the nine methods, Fisher, minP, AW, vote counting and maxP detected BRAF as a DE gene because these methods combined two-sided *P*-values without distinguishing DE direction. RankSum and RankProd, although they consider DE direction in the algorithm, still identified BRAF as an up-regulated DE gene. Stouffer and the random effects model were two methods that considered DE direction in the algorithm and generated non-significant *q*-values. Whether detecting a discordant gene like BRAF is favorable or not depends on the inferential goals of the experiment. It may be that BRAF is an important marker and the discordance arises from an unknown but meaningful confounding variable (e.g. race; say, BRAF is up-regulated in Black patients but down-regulated in White patients). It is equally possible that the discordance comes from unknown technical artifacts.
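
As an illustration of the vote counting rule, the sketch below applies it to the single-study *P*-values quoted above for PTTG1, FOLR3 and TPM2 (BRAF is omitted because its *P*-values are not quoted in the text). The code is a minimal Python rendering of the decision rule only, not the analysis code behind Table 1. The function name is hypothetical, and the order of the studies within each list is an assumption that does not affect the vote count.

```python
def vote_count(pvals, threshold, min_studies):
    """Declare a gene DE if P < threshold in at least `min_studies` studies."""
    votes = sum(1 for p in pvals if p < threshold)
    return votes >= min_studies


# Single-study P-values as quoted in the text (zero P-values already set to 1E-20).
reported = {
    "PTTG1": [1.9e-5, 1e-20, 2e-5, 1e-20],
    "FOLR3": [1e-20, 0.65, 0.96, 0.43],
    "TPM2": [9.4e-7, 1e-20, 1e-20, 0.92],
}

detected = {
    gene: {s: vote_count(pvals, threshold=0.01, min_studies=s) for s in (3, 4)}
    for gene, pvals in reported.items()
}
for gene, calls in detected.items():
    print(gene, "S=3:", calls[3], "S=4:", calls[4])
```

At a threshold of 0.01, PTTG1 passes with both *S*=3 and *S*=4, FOLR3 passes with neither, and TPM2 passes only with *S*=3, which matches the qualitative pattern described for these genes.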

Below, we further scrutinize the biological functions of the four genes using the NCBI database. PTTG1 has been related to DNA repair, cell division and the mitotic cell cycle and has been correlated with tumor aggressiveness in multiple tumors. The strong statistical significance in all four studies is therefore biologically supported. In contrast, no direct evidence of cancer association was found for FOLR3. The strong DE statistical significance in the Tomlins study might indeed be an artifact. For TPM2, a recent paper has identified a novel splice variant of TPM2 related to prostate cancer cell lines (132). The high statistical significance in three out of four studies might be strong enough evidence for its association with metastasis. The fourth gene, ‘BRAF’, plays a role in regulating the MAP kinase/ERKs signaling pathway, has been associated with multiple cancers and is in the KEGG prostate cancer pathway (05215). Indeed, the confusing discordant direction of fold changes might be the result of unknown confounding factors such as age or race. Further investigation of demographic or experimental information for the four studies might help elucidate the mystery. We also note that the interpretation of detected DE genes depends on other genes because of gene dependency.

## OPEN QUESTIONS

Despite the popularity of microarray meta-analysis, many issues remain unresolved that can hamper the effectiveness of its application. In this section, we discuss a few open questions and related problems.

### Quality assessment and inclusion/exclusion criteria

To date, the decision to include or exclude microarray studies in a meta-analysis has been mostly ad hoc and subjective in the literature. Researchers usually apply arbitrary criteria, such as number of samples or array platforms [e.g. (112,133,134) and many others], to make the decision. Inclusion of a low-quality or outlying study in the meta-analysis, however, can greatly reduce the statistical power or even result in a false conclusion. As a first step, keyword searching in primary data repositories can provide a useful initial screening to identify studies to combine. Some biological terminology systems (e.g. Unified Medical Language System, UMLS) may help provide a refined and unbiased selection of more homogeneous studies. Ramaswamy *et al*. (108) suggested applying the integrative correlation technique of Parmigiani *et al*. (56) to select ‘reproducible’ genes for meta-analysis. This approach can potentially be extended to objective inclusion/exclusion decisions. In general, a data-driven quantitative evaluation for inclusion/exclusion criteria is still an open question in the field. This is tied to the classical question of between-study variation. In the case of a single gene, the issue of between-study variation has been carefully studied; a review of available methods can be found in (135). How to adapt this to the genomic, high-dimensional data setting is still an open question. This issue is also discussed in the companion paper for GWAS meta-analysis, under the terminology of ‘heterogeneity’.

### Practical guidelines from large-scale comparative study and simulation

Among the papers we have surveyed, only two performed a systematic comparative analysis of microarray meta-analysis methods: Hong *et al*. (121) and Campain and Yang (136). Although the two studies provided insightful conclusions, the number of methods compared (three and five methods, respectively) and the number of real examples examined (two and three examples, respectively, with each example combining only 2–5 microarray studies) were very limited. Some key conclusions from the two papers were even contradictory. A large-scale comparative study and simulation study with adequate evaluation measures will help provide insights and practical guidelines for choosing the ‘best’ meta-analysis method(s) in practice.

### Combining studies with censored information

As mentioned in ‘Types of meta-analysis methods’ section, vote counting has a natural advantage to combine information from studies with censored *P*-value information (i.e. raw data are not accessible but only a top ranked DE gene list under certain *P*-value threshold is available), though it suffers greatly from low statistical power. Although many grant agencies and journals now enforce data sharing policies, many old studies or new studies funded by private foundations are still not openly accessible. Studies with censored information can be an obstacle for meta-analysis. Researchers are forced to either drop studies with censored information or use inefficient vote counting methods in the meta-analysis. In the literature, Bushman and Wang (137) have transformed *P*-values to pseudo effect sizes to combine vote counting and effect size combination methods. Extension of other existing methods, such as Fisher, Stouffer and maxP, to analyze such censored *P*-value data in partial studies will provide a more powerful solution to this practical problem.

### Meta-analysis to guide and design future studies

In modern evidence-based medicine, meta-analysis is often used (or required) to combine existing evidence in the literature when planning for a new study. Similarly, genomic meta-analysis should be used more frequently to narrow down gene targets or scope of study when designing new studies (e.g. targeted sequencing).

### Meta-analysis on a pathway basis

While the work of authors such as Shen *et al*. (37) and Shen and Tseng (7) has led the way in combining information from multiple studies at the pathway level, several issues remain to be addressed. Adjusting inference for dependence among pathways remains an important open problem, as this dependence might invalidate many of the statistical methods available for multiple testing (e.g. *q*-values/false discovery rate control).

### Development of user-friendly software

In our review, only a few microarray meta-analysis methods were released as R packages. When we tested the packages, most of them either did not have clear manuals or had functions that were not easy to apply (especially compared with mature and popular microarray packages such as SAM, PAM, LIMMA, BRB-ArrayTools or GSEA). Convenient R packages or packages in a programmable environment will allow researchers to test and compare methods and motivate further methodological development. Software with friendly graphical user interfaces (GUI) will further assist biologists in daily applications.

### Adjust for potential confounding variables

Heterogeneities caused by demographic, clinical and technical variables often exist within and across studies. Failure to consider these variables in the statistical models and meta-analysis can result in reduced statistical power or false positives. In a microarray meta-analysis, these systematic variabilities should be considered and incorporated in the analysis whenever possible. Leek and Storey (138) proposed surrogate variable analysis (SVA) to further account for unmeasured and unmodeled factors in a genome-wide expression analysis. The result has shown improved sensitivity and accuracy. Similar techniques can be extended to microarray meta-analysis.

## CONCLUSION AND DISCUSSION

In this article, we performed a comprehensive review of microarray meta-analysis and discussed the related statistical issues. Although many methods have been proposed and used in published applications, the detailed meta-analysis workflow and the hypothesis behind the analysis need more attention. Selection of a suitable method depends on the type of analysis desired (various purposes described in the ‘Purposes of microarray meta-analysis’ section) and the hypothesis setting behind each method (‘Statistical considerations behind the methods’ section). In our review, we noticed that easy-to-use software packages are rare in the field. We have also addressed several important open questions (‘Open questions’ section), including developing a quantitative inclusion/exclusion evaluation, performing a comparative study for a practical guideline and adjusting for confounding variables. As many high-throughput experimental technologies are rapidly being developed and widely applied, data management and effective integrative analysis will become increasingly essential to fully utilize the rich information contained in the tremendous amount of data. The analytical techniques and concepts may also extend to information integration of other types of genomic data.

One limitation of this review article is the restricted scope of literature search by PubMed. We have attempted to include 102 manually collected references. The inclusion, however, cannot be exhaustive. For example, many related approaches are termed ‘integrative analysis’ in the literature and thus cannot be included in the review. This is especially true in categories other than DE gene analysis (e.g. pathway analysis, prediction analysis or network analysis). We attempted to include ‘integrative analysis’ in the keyword search but failed because it generated thousands of publications with most of them irrelevant to the purpose of this article.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table.

## FUNDING

National Institutes of Health (NIH) (R01MH077159 and RC2HL101715, to G.C.T.); (R01HD38979 and R01DE14899, to E.F. and F.B.); NIH (R01GM72007, to D.B.); Huck Institute for Life Sciences (to D.B.). Funding for open access charge: University of Pittsburgh.

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

The authors thank C. Song, X. Wang and G. Liao for collecting and printing papers.

## REFERENCES

**Oxford University Press**

Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Research. May 2012; 40(9):3785.
