Farhangi Mohammad Mehdi, Maynord Michael, Fermüller Cornelia, Aloimonos Yiannis, Sahiner Berkman, Petrick Nicholas
FDA, CDRH, OSEL, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States.
University of Maryland, Iribe Center for Computer Science and Engineering, Computer Science Department, College Park, Maryland, United States.
J Med Imaging (Bellingham). 2024 Jul;11(4):044507. doi: 10.1117/1.JMI.11.4.044507. Epub 2024 Aug 7.
Synthetic datasets hold the potential to offer cost-effective alternatives to clinical data, ensuring privacy protections and potentially addressing biases in clinical data. We present a method leveraging such datasets to train a machine learning algorithm applied as part of a computer-aided detection (CADe) system.
Our approach trains a machine learning algorithm on clinically acquired computed tomography (CT) scans of a physical anthropomorphic phantom into which manufactured lesions were inserted. We treated the training database obtained from the anthropomorphic phantom as a simplified representation of clinical data and increased the variability in this dataset using a set of randomized and parameterized augmentations. Furthermore, to mitigate the inherent differences between phantom and clinical datasets, we investigated adding unlabeled clinical data to the training pipeline.
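As a rough illustration of what "randomized and parameterized augmentations" can look like for a CT region of interest, the sketch below applies random flips, an in-plane rotation, an intensity offset, and additive noise. The specific transforms, the function name `augment_roi`, and the parameters `max_shift_hu` and `noise_std_hu` are assumptions chosen for illustration; the abstract does not state which augmentations were used.

```python
import numpy as np


def augment_roi(roi, rng, max_shift_hu=20.0, noise_std_hu=10.0):
    """Return a randomly augmented copy of a 3D ROI (hypothetical helper).

    Assumes `roi` is indexed (z, y, x) in Hounsfield units and that the
    in-plane (y, x) dimensions are equal, so rotations preserve the shape.
    """
    out = roi.astype(np.float32, copy=True)

    # Random flip along each axis with probability 0.5.
    for axis in range(out.ndim):
        if rng.random() < 0.5:
            out = np.flip(out, axis=axis)

    # Random in-plane rotation by a multiple of 90 degrees.
    k = int(rng.integers(0, 4))
    out = np.rot90(out, k=k, axes=(-2, -1))

    # Random global intensity offset, bounded by max_shift_hu.
    out = out + rng.uniform(-max_shift_hu, max_shift_hu)

    # Additive Gaussian noise with standard deviation noise_std_hu.
    out = out + rng.normal(0.0, noise_std_hu, size=out.shape)

    return out.astype(np.float32)
```

Passing a seeded `np.random.default_rng` makes each augmented sample reproducible, and the two parameters control how far the augmented data may drift from the original phantom scan.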
We apply the proposed method to the false positive reduction stage of a lung nodule CADe system for CT scans, in which regions of interest containing potential lesions are classified as nodule or non-nodule. The system trained on labeled data from physical phantom scans and unlabeled clinical data achieves a sensitivity of 90% at eight false positives per scan. The results also demonstrate the benefit of the physical phantom: performance in terms of the competitive performance metric increased by 6% when a training set of 50 clinical CT scans was enlarged with scans obtained from the physical phantom.
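To make the operating point "sensitivity at eight false positives per scan" concrete, the sketch below computes sensitivity at a given false positive budget from scored candidates. It also averages sensitivity over seven rates (1/8 to 8 false positives per scan), which is how this kind of summary metric is commonly defined in lung nodule detection challenges. That definition, the function names, and the simplification that every true nodule appears among the candidates are assumptions; the abstract does not define the metric.

```python
import numpy as np


def sensitivity_at_fp_rate(scores, labels, n_scans, fps_per_scan):
    """Sensitivity at the loosest threshold within the false positive budget.

    scores: classifier score per candidate (higher means more nodule-like).
    labels: 1 for a true nodule candidate, 0 otherwise.
    Assumes every true nodule appears exactly once among the candidates
    and ignores tied scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    if n_pos == 0:
        return 0.0

    # Sweep the threshold from the highest score downward.
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)

    # Keep thresholds whose total false positives fit the budget.
    allowed = fp <= fps_per_scan * n_scans
    if not allowed.any():
        return 0.0
    return float(tp[allowed].max() / n_pos)


def mean_sensitivity(scores, labels, n_scans,
                     rates=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity over the assumed set of false positive rates."""
    return float(np.mean([
        sensitivity_at_fp_rate(scores, labels, n_scans, r) for r in rates
    ]))
```

For example, `sensitivity_at_fp_rate(scores, labels, n_scans, 8)` corresponds to the operating point reported above, given per-candidate scores and labels pooled over all test scans.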
The scalability of synthetic datasets can lead to improved CADe performance, particularly in scenarios in which the size of the labeled clinical data is limited or subject to inherent bias. Our proposed approach demonstrates an effective utilization of synthetic datasets for training machine learning algorithms.