Department of Radiology, National Health Insurance Service Ilsan Hospital, Goyang, Korea.
Research Institute, National Health Insurance Service Ilsan Hospital, Goyang, Korea.
PLoS One. 2021 Aug 12;16(8):e0256152. doi: 10.1371/journal.pone.0256152. eCollection 2021.
This study aims to determine how randomly splitting a dataset into training and test sets affects a machine learning model's estimated performance and the gap between that estimate and the test performance, under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastomas [n = 109] vs. brain metastases [n = 58], and (2) a "difficult" task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated using various validation methods in the training set and then tested on the test set, using the area under the curve (AUC) as the evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair for the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair for the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing, or generalization gap, was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially those with small sample sizes.
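The repeated-splitting experiment described above can be illustrated with a minimal sketch. This is not the authors' exact pipeline: it assumes a synthetic feature matrix standing in for the radiomics features, uses L1-penalized logistic regression as the LASSO-type classifier, and uses 5-fold cross-validation as one example of a validation method for estimating training performance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a radiomics feature matrix (X) and binary labels (y);
# sample size and class balance are illustrative, not the study's data.
X, y = make_classification(n_samples=167, n_features=100, n_informative=10,
                           weights=[0.65, 0.35], random_state=0)

n_repeats = 1000          # number of random training-test splits, as in the study
train_aucs, test_aucs = [], []

for seed in range(n_repeats):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # L1-penalized (LASSO-like) logistic regression after feature standardization.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0))

    # Estimate training performance with 5-fold cross-validation (one of several
    # possible validation methods), then evaluate on the held-out test set.
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc").mean()
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    train_aucs.append(cv_auc)
    test_aucs.append(test_auc)

gaps = np.array(train_aucs) - np.array(test_aucs)
print(f"Mean (SD) generalization gap across splits: {gaps.mean():.3f} ({gaps.std():.3f})")
```

The spread of `gaps` across the 1,000 splits is what the abstract refers to as the variability of the generalization gap; with smaller or harder datasets this spread widens.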