Shenouda Mena, Whitney Heather M, Giger Maryellen L, Armato Samuel G
The University of Chicago, Committee on Medical Physics, Department of Radiology, Chicago, Illinois, United States.
J Med Imaging (Bellingham). 2024 Nov;11(6):064503. doi: 10.1117/1.JMI.11.6.064503. Epub 2024 Dec 26.
This study aimed to investigate the impact of different model retraining schemes and data partitioning on model performance in the task of COVID-19 classification on standard chest radiographs (CXRs), in the context of model generalizability.
Two datasets from the same institution were used: Set A (9860 patients, collected from 02/20/2020 to 02/03/2021) and Set B (5893 patients, collected from 03/15/2020 to 01/01/2022). An original deep learning (DL) model, trained and tested for COVID-19 classification on the initial partition of Set A, achieved an area under the curve (AUC) value of 0.76, whereas on Set B it yielded a significantly lower value of 0.67. To explore this discrepancy, four separate strategies were applied to the original model: (1) retraining using Set B, (2) fine-tuning using Set B, (3) regularization, and (4) repartitioning the training set from Set A 200 times and reporting the resulting AUC values.
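Strategy (4) can be illustrated with a minimal sketch, which is not the authors' code. It assumes a patient-level random split that is redrawn on every repetition. The abstract does not state the split fraction or whether the test partition was held fixed, so `test_fraction`, `repartition_aucs`, and the `fit_and_score` callback (a placeholder for training the DL model and scoring the held-out patients) are hypothetical names chosen for illustration.

```python
import random


def auc(labels, scores):
    """AUC via the Mann-Whitney statistic; tied pairs count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repartition(patient_ids, test_fraction, rng):
    """Shuffle patient IDs and split them into (train_ids, test_ids)."""
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_test = int(round(test_fraction * len(ids)))
    return ids[n_test:], ids[:n_test]


def repartition_aucs(patient_ids, labels, fit_and_score,
                     n_repeats=200, test_fraction=0.2, seed=0):
    """Redraw the split n_repeats times and collect one test AUC per split.

    fit_and_score(train_ids, test_ids) is a hypothetical callback that
    trains a model on train_ids and returns one score per test patient,
    in the same order as test_ids. The 0.2 test fraction is an assumption.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        train_ids, test_ids = repartition(patient_ids, test_fraction, rng)
        scores = fit_and_score(train_ids, test_ids)
        aucs.append(auc([labels[i] for i in test_ids], scores))
    return aucs
```

Reporting the minimum, median, and maximum of the returned list shows how much of the spread in AUC comes from the choice of partition alone.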
The model achieved the following AUC values (95% confidence interval) for the four methods: (1) 0.61 [0.56, 0.66] on Set B; (2) 0.70 [0.66, 0.73] on Set B; (3) 0.76 [0.72, 0.79] on the initial test partition of Set A and 0.68 [0.66, 0.70] on Set B; and (4) a range of values across the repartitions of Set A, the lowest of which (0.66 [0.62, 0.69]) was no longer significantly different from the initial 0.67 achieved on Set B.
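The abstract does not say how the 95% confidence intervals were obtained. One common choice is a percentile bootstrap over test cases, sketched below purely as an assumption; `bootstrap_auc_ci`, `n_boot`, and `alpha` are illustrative names, and labels are assumed to be coded 0/1.

```python
import random


def auc(labels, scores):
    """AUC via the Mann-Whitney statistic; tied pairs count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap.

    Cases are resampled with replacement; resamples that contain only
    one class are skipped because the AUC is undefined for them.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes present (labels are 0/1)
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lower, upper
```

Overlapping intervals, as with 0.66 [0.62, 0.69] and the 0.67 reported for Set B, are suggestive only; a formal comparison of two AUC values requires a dedicated statistical test, which this sketch does not implement.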
Different data repartitions of the same dataset used to train a DL model demonstrated significantly different performance values that helped explain the discrepancy between Set A and Set B and further demonstrated the limitations of model generalizability.