Brief Bioinform. 2020 Feb 6. pii: bbz158. doi: 10.1093/bib/bbz158. [Epub ahead of print]

Toward a gold standard for benchmarking gene set enrichment analysis.

Author information

1. Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA.
2. Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, USA.
3. Institute for Bioinformatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany.
4. Roswell Park Cancer Institute, Buffalo, NY 14203, USA.
5. Graduate School of Arts and Sciences, Boston University, Boston, MA 02215, USA.
6. Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3052, Australia.
7. Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia.
8. Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA.
9. Harvard Medical School, Boston, MA 02215, USA.

Abstract

MOTIVATION:

Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets.

RESULTS:

We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance.
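
To make "recovery of the predefined relevance rankings" concrete, the following is a minimal illustrative sketch, not the GSEABenchmarkeR API and not necessarily the scoring used in the paper. It assumes a method's output is an ordered list of gene set IDs and that the precompiled ranking assigns each gene set a non-negative relevance score; the function name `relevance_score`, the linear rank weighting, the gene set labels and all numbers are hypothetical toy choices.

```python
def relevance_score(ea_ranking, relevance):
    """Score how well an enrichment ranking recovers a relevance ranking.

    ea_ranking: gene set IDs ordered from most to least enriched.
    relevance:  dict mapping gene set ID -> non-negative relevance score
                (gene sets absent from the dict count as irrelevant).
    Assumption: rank weights fall linearly from 1.0 at the top of the ranking.
    """
    n = len(ea_ranking)
    total = 0.0
    for position, gene_set in enumerate(ea_ranking):
        weight = 1.0 - position / n  # top-ranked gene set gets weight 1.0
        total += weight * relevance.get(gene_set, 0.0)
    return total


# Toy relevance scores for three hypothetical gene sets (made-up values).
relevance = {"GS1": 10.0, "GS2": 4.0, "GS3": 1.0}

# Two hypothetical methods ranking the same four gene sets.
method_a = ["GS1", "GS2", "GS3", "GS4"]  # relevant sets ranked first
method_b = ["GS4", "GS3", "GS2", "GS1"]  # relevant sets ranked last

score_a = relevance_score(method_a, relevance)
score_b = relevance_score(method_b, relevance)
print(score_a, score_b)
```

Under these assumptions, a method that places highly relevant gene sets near the top of its ranking receives a higher score than one that places them near the bottom, which is the sense in which methods can be compared on phenotype relevance.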

AVAILABILITY:

http://bioconductor.org/packages/GSEABenchmarkeR.

CONTACT:

ludwig.geistlinger@sph.cuny.edu.

KEYWORDS:

RNA-seq; gene expression data; gene set analysis; microarray; pathway analysis

PMID: 32026945
DOI: 10.1093/bib/bbz158
