Department of Pure and Applied Chemistry, Thomas Graham Building, 295 Cathedral Street, University of Strathclyde, Glasgow, G1 1XL, UK.
Dxcover Ltd, Royal College Building, 204 George Street, Glasgow, G1 1XW, UK.
Analyst. 2023 Aug 7;148(16):3860-3869. doi: 10.1039/d3an00669g.
In recent years, deep learning (DL) has become more widely used in cancer diagnostics. However, DL often requires large training datasets to prevent overfitting, and such datasets can be difficult and expensive to acquire. Data augmentation generates new data points that can be used to train DL models. In this study, we use attenuated total reflectance Fourier-transform infrared (ATR-FTIR) spectra of dried patient serum samples and compare non-generative data augmentation methods with Wasserstein generative adversarial networks (WGANs) in their ability to improve the performance of a convolutional neural network (CNN) that differentiates between pancreatic cancer and non-cancer samples in a total cohort of 625 patients. The results show that WGAN-augmented spectra improve CNN performance more than non-generatively augmented spectra. Compared with a model trained without augmented spectra, adding WGAN-augmented spectra to a CNN with the same architecture and the same parameters increased the area under the receiver operating characteristic curve (AUC) from 0.661 to 0.757, a relative increase of about 15% in diagnostic performance. In a separate test on a colorectal cancer dataset, data augmentation using a WGAN increased the AUC from 0.905 to 0.955. This demonstrates the impact that data augmentation can have on DL performance for cancer diagnosis when the amount of real data available for model training is limited.
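To make the idea of non-generative augmentation concrete, the following is a minimal sketch in Python. The abstract does not state which non-generative methods were compared, so the three perturbations used here (multiplicative scaling, baseline offset, and additive Gaussian noise), the function name `augment_spectra`, and all parameter values are assumptions chosen for illustration, not the authors' method. The spectra are synthetic stand-ins rather than real ATR-FTIR data.

```python
import numpy as np

def augment_spectra(spectra, n_copies=2, noise_sd=0.01,
                    max_offset=0.02, max_scale=0.05, seed=0):
    """Return n_copies randomly perturbed versions of each spectrum.

    spectra: 2D array, one spectrum per row.
    All perturbation sizes are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    # Repeat each real spectrum n_copies times before perturbing it.
    base = np.repeat(spectra, n_copies, axis=0)
    m, p = base.shape
    # One random intensity scale and one baseline offset per copy.
    scale = 1.0 + rng.uniform(-max_scale, max_scale, size=(m, 1))
    offset = rng.uniform(-max_offset, max_offset, size=(m, 1))
    # Independent Gaussian noise at every wavenumber.
    noise = rng.normal(0.0, noise_sd, size=(m, p))
    return base * scale + offset + noise

# Synthetic stand-in for 4 real spectra with 50 wavenumber points each.
real = np.tile(np.abs(np.sin(np.linspace(0, 3, 50))), (4, 1))
augmented = augment_spectra(real, n_copies=3)

# Augmented spectra are added to the training set only; the test set
# should contain real spectra so that the AUC reflects real-world data.
train = np.vstack([real, augmented])
print(train.shape)  # (16, 50)
```

A generative approach such as a WGAN would replace `augment_spectra` with samples drawn from a generator network trained on the real spectra, while the rest of the training pipeline stays the same.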