Jiang Yifan, Manem Venkata S K
Centre de Recherche du CHU de Québec, Université Laval, Québec, QC, Canada.
Département de Biologie Moléculaire, Biochimie Médicale et Pathologie, Université Laval, Québec, QC, Canada.
Front Oncol. 2025 Feb 25;15:1492758. doi: 10.3389/fonc.2025.1492758. eCollection 2025.
In lung cancer screening, the scarcity of well-labeled medical images poses a significant challenge to implementing supervised deep learning methods. While data augmentation is an effective technique for countering the difficulties caused by insufficient data, it has not been fully explored in this setting. In this study, we analyzed state-of-the-art (SOTA) data augmentation techniques for binary lung cancer prediction.
To comprehensively evaluate the effectiveness of data augmentation approaches, we used the nested case-control cohort of the National Lung Screening Trial (NLST), comprising 253 individuals who underwent the commonly used non-contrast CT scans. The CT scans were pre-processed into three-dimensional volumes based on the lung nodule annotations. Subsequently, we evaluated five basic (online) and two generative model-based (offline) data augmentation methods with ten SOTA 3D deep learning-based lung cancer prediction models.
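As an illustration of what "online" augmentation means for a 3D CT volume, the following minimal sketch applies a random flip each time a sample is drawn, so nothing is stored on disk in advance. The abstract does not list the five basic methods, so random flipping is an assumption chosen for simplicity, and the function name `random_flip_3d` and its parameters are hypothetical.

```python
import numpy as np


def random_flip_3d(volume, rng, p=0.5):
    """Flip a 3D volume along each axis independently with probability p.

    Hypothetical helper: flipping is assumed here as an example of a basic
    online augmentation; it is not confirmed to be one of the study's methods.
    """
    out = volume
    for axis in range(volume.ndim):
        if rng.random() < p:
            out = np.flip(out, axis=axis)
    # np.flip returns a view; copy into a contiguous array for downstream use.
    return np.ascontiguousarray(out)


rng = np.random.default_rng(0)
nodule_volume = rng.normal(size=(32, 32, 32))   # stand-in for a cropped nodule
augmented = random_flip_3d(nodule_volume, rng)  # called per sample, per epoch
```

An offline method, by contrast, would generate and save additional volumes before training starts, which is how a generative model can also be used to rebalance the classes.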
Our results demonstrated that the performance improvement from data augmentation was highly dependent on the approach used. The CutMix method resulted in the highest average performance improvement across all three metrics: 1.07%, 3.29%, and 1.19% for accuracy, F1 score, and AUC, respectively. MobileNetV2 with a simple data augmentation approach achieved the best AUC of 0.8719 among all lung cancer predictors, a 7.62% improvement compared to baseline. Furthermore, the MED-DDPM data augmentation approach was able to improve prediction performance by rebalancing the training set and adding a moderate amount of synthetic data.
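For readers unfamiliar with CutMix, the sketch below shows one way the idea could be extended to 3D volumes: a cuboid from one volume is pasted into another, and the two labels are mixed in proportion to the pasted fraction. This is a simplified variant written for illustration, not the study's implementation; the function name `cutmix_3d`, the `alpha` default, and the choice to keep the cuboid fully inside the volume are all assumptions.

```python
import numpy as np


def cutmix_3d(vol_a, vol_b, label_a, label_b, rng, alpha=1.0):
    """Paste a random cuboid of vol_b into vol_a and mix labels by volume.

    Simplified, assumed variant: the cuboid is placed so that it always fits
    inside the volume, rather than being clipped at the borders.
    """
    assert vol_a.shape == vol_b.shape
    lam = rng.beta(alpha, alpha)          # target share kept from vol_a
    cut = (1.0 - lam) ** (1.0 / 3.0)      # per-axis scale of the cuboid
    size = [int(round(dim * cut)) for dim in vol_a.shape]
    start = [int(rng.integers(0, dim - s + 1))
             for dim, s in zip(vol_a.shape, size)]
    region = tuple(slice(s0, s0 + s) for s0, s in zip(start, size))

    mixed = vol_a.copy()
    mixed[region] = vol_b[region]

    # Use the fraction actually pasted (after rounding) as the label weight.
    pasted = float(np.prod(size)) / vol_a.size
    mixed_label = (1.0 - pasted) * label_a + pasted * label_b
    return mixed, mixed_label


rng = np.random.default_rng(0)
benign = np.zeros((16, 16, 16))      # toy volume with label 0
malignant = np.ones((16, 16, 16))    # toy volume with label 1
mixed, soft_label = cutmix_3d(benign, malignant, 0.0, 1.0, rng)
```

With these toy inputs, the mean of `mixed` equals `soft_label`, which is a quick check that the label weight matches the pasted region. The resulting soft label would be used with a loss that accepts non-integer targets.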
The effectiveness of online and offline data augmentation methods was highly sensitive to the prediction model, highlighting the importance of carefully selecting the data augmentation method. Our findings suggest that certain traditional methods can provide more stable and higher performance than SOTA online data augmentation approaches. Overall, these results offer meaningful insights for the development and clinical integration of data-augmented deep learning tools for lung cancer screening.