Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

Chansik An; Yae Won Park; Sung Soo Ahn; Kyunghwa Han; Hwiyoung Kim; Seung-Koo Lee

doi:10.1371/journal.pone.0256152

PLoS One. 2021 Aug 12;16(8):e0256152. doi: 10.1371/journal.pone.0256152. eCollection 2021.

Authors

Chansik An¹ ², Yae Won Park³, Sung Soo Ahn³, Kyunghwa Han³, Hwiyoung Kim³, Seung-Koo Lee³

Affiliations

¹ Department of Radiology, National Health Insurance Service Ilsan Hospital, Goyang, Korea.
² Research Institute, National Health Insurance Service Ilsan Hospital, Goyang, Korea.
³ Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea.

Abstract

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastomas [n = 109] vs. brain metastasis [n = 58], and (2) a "difficult" task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as the evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing, or generalization gap, was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
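The repeated-splitting protocol described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' code. Everything in it is an assumption made for illustration: the synthetic Gaussian features, the sample sizes, the 30% test fraction, the 100 splits (the study used 1,000), and in particular the toy model, which picks the single feature with the highest training AUC as a simple stand-in for LASSO. The sketch only shows how the absolute training-test AUC difference can be collected across many random splits and compared between a dataset and a 50% random subsample of it.

```python
import random
import statistics


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n_pos, n_neg, n_features, shift, rng):
    """Hypothetical synthetic data: Gaussian noise features, with the first
    three features shifted by `shift` in the positive class."""
    X, y = [], []
    for label, n in ((1, n_pos), (0, n_neg)):
        for _ in range(n):
            X.append([rng.gauss(shift if label == 1 and j < 3 else 0.0, 1.0)
                      for j in range(n_features)])
            y.append(label)
    return X, y


def stratified_split(y, test_frac, rng):
    """Random split of row indices that keeps the class ratio in both sets."""
    train, test = [], []
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        k = round(len(idx) * test_frac)
        test.extend(idx[:k])
        train.extend(idx[k:])
    return train, test


def fit_best_feature(X, y, train):
    """Toy model (NOT LASSO): pick the feature and sign with the highest training AUC."""
    y_train = [y[i] for i in train]
    best = (0.0, 0, 1)  # (training AUC, feature index, sign)
    for j in range(len(X[0])):
        a = auc([X[i][j] for i in train], y_train)
        sign = 1 if a >= 0.5 else -1
        a = max(a, 1.0 - a)
        if a > best[0]:
            best = (a, j, sign)
    return best


def repeated_splits(X, y, n_splits=100, test_frac=0.3, seed=0):
    """Return one (training AUC, test AUC) pair per random training-test split."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_splits):
        train, test = stratified_split(y, test_frac, rng)
        train_auc, j, sign = fit_best_feature(X, y, train)
        test_auc = auc([sign * X[i][j] for i in test], [y[i] for i in test])
        pairs.append((train_auc, test_auc))
    return pairs


def summarize(pairs):
    """Mean and standard deviation of the absolute training-test AUC difference."""
    gaps = [abs(tr - te) for tr, te in pairs]
    return statistics.mean(gaps), statistics.stdev(gaps)


if __name__ == "__main__":
    rng = random.Random(42)
    X, y = make_data(n_pos=100, n_neg=60, n_features=20, shift=0.5, rng=rng)
    # Undersampled version: a random 50% of the rows, as in the abstract.
    half = rng.sample(range(len(y)), len(y) // 2)
    X_half, y_half = [X[i] for i in half], [y[i] for i in half]
    for name, (Xd, yd) in {"full": (X, y), "undersampled": (X_half, y_half)}.items():
        mean_gap, sd_gap = summarize(repeated_splits(Xd, yd))
        print(f"{name}: |train AUC - test AUC| = {mean_gap:.3f} (+/-{sd_gap:.3f})")
```

Running the script prints the mean and standard deviation of the absolute AUC difference for each dataset. The numbers depend entirely on the synthetic data and the toy model, so they are not comparable to the values reported in the abstract; the point is that the spread across splits, not a single split, is what gets reported.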

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Area Under Curve
Brain Neoplasms / diagnosis*
Glioblastoma / diagnosis*
Humans
Image Processing, Computer-Assisted / methods*
Machine Learning*
Magnetic Resonance Imaging / methods*
Retrospective Studies

Grants and funding

This research received funding from the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, Information and Communication Technologies & Future Planning (2020R1A2C1003886, to S.S.A.). It was also supported by the Basic Science Research Program through the NRF, funded by the Ministry of Education (2020R1I1A1A01071648, to Y.W.P.).