Feature-specific quantile normalization and feature-specific mean-variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data

Daniel Skubleny; Sunita Ghosh; Jennifer Spratlin; Daniel E Schiller; Gina R Rayat

doi:10.1186/s12859-024-05759-w

BMC Bioinformatics. 2024 Mar 29;25(1):136. doi: 10.1186/s12859-024-05759-w.

Authors

Daniel Skubleny¹, Sunita Ghosh² ³, Jennifer Spratlin², Daniel E Schiller⁴, Gina R Rayat⁴

Affiliations

¹ Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada. skubleny@ualberta.ca.
² Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
³ Department of Mathematical and Statistical Sciences, Faculty of Science, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
⁴ Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

PMID: 38549046
DOI: 10.1186/s12859-024-05759-w

Abstract

Background: Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method that we label Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection.
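For readers unfamiliar with the two approaches, the following is a minimal sketch of what per-feature normalization toward a target platform can look like. It is not the authors' implementation. The function names `fsmvn` and `fsqn`, the use of the population standard deviation, the linear interpolation of target quantiles, and the tie handling are all assumptions made for illustration. Both functions take samples as rows and genes as columns, with genes in the same order in both matrices.

```python
from statistics import mean, pstdev


def fsmvn(test, target):
    """Hypothetical sketch: per gene, standardize the test values, then
    rescale them to the target platform's mean and standard deviation."""
    out = [row[:] for row in test]
    for g in range(len(target[0])):
        t_col = [row[g] for row in target]
        x_col = [row[g] for row in test]
        mu_t, sd_t = mean(t_col), pstdev(t_col)  # assumption: population SD
        mu_x, sd_x = mean(x_col), pstdev(x_col)
        for i, x in enumerate(x_col):
            z = (x - mu_x) / sd_x if sd_x > 0 else 0.0  # guard constant genes
            out[i][g] = mu_t + z * sd_t
    return out


def _quantile(sorted_vals, p):
    """Linear interpolation between order statistics for p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def fsqn(test, target):
    """Hypothetical sketch: per gene, replace each test value with the
    target-platform quantile that matches its rank among the test samples.
    Ties are broken by sample order (an illustrative simplification)."""
    n = len(test)
    out = [row[:] for row in test]
    for g in range(len(target[0])):
        t_sorted = sorted(row[g] for row in target)
        order = sorted(range(n), key=lambda i: test[i][g])
        for rank, i in enumerate(order):
            p = rank / (n - 1) if n > 1 else 0.5
            out[i][g] = _quantile(t_sorted, p)
    return out
```

In this sketch the mean-variance version matches only the first two moments of each gene, whereas the quantile version maps the whole per-gene distribution onto the target. Either function can be called in both directions, for example with microarray data as the target and RNAseq data as the test set, or the reverse.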

Results: FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determine that these methods eliminate batch effects related to technological platforms. Without feature selection, no statistically significant difference was identified between the performance of FSQN- and FSMVN-normalized cross-platform data and that of within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to within-platform distribution performance in multivariable linear regression analysis. FSQN and FSMVN also provided similar performance to within-platform distributions as the number of selected genes used to create models decreased.
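Balanced accuracy, the metric named above, is conventionally defined as the mean of per-class recall, which keeps large subtypes from dominating the score. The short sketch below follows that standard definition. The function name `balanced_accuracy` and the example labels are illustrative assumptions, and the study may have computed the metric with a different tool.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative labels only: three CMS1 samples and one CMS2 sample.
y_true = ["CMS1", "CMS1", "CMS1", "CMS2"]
y_pred = ["CMS1", "CMS1", "CMS2", "CMS2"]
score = balanced_accuracy(y_true, y_pred)  # recall 2/3 and 1, averaged
```

Here plain accuracy would be 3/4, while the balanced score averages the two class recalls, so the single minority-class sample carries as much weight as the three majority-class samples.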

Conclusions: In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, FSQN and FSMVN provide model accuracy on cross-platform normalized data equivalent to that on within-platform data. Cross-platform data should still be used with caution, as subtle performance differences may exist depending on the classification problem and on the training and testing distributions.

Keywords: Cross-platform normalization; FSMVN; FSQN; Feature selection; Mean; Microarray; Molecular classification; Quantile normalization; RNAseq; Variance.

MeSH terms

Gene Expression Profiling* / methods
Linear Models
Microarray Analysis
Transcriptome*