MGH & BWH Center for Clinical Data Science, Mass General Brigham, Boston, Massachusetts, United States of America.
Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America.
PLoS One. 2022 Apr 29;17(4):e0267213. doi: 10.1371/journal.pone.0267213. eCollection 2022.
A standardized, objective evaluation method is needed to compare machine learning (ML) algorithms as these tools become available for clinical use. We therefore designed, built, and tested an evaluation pipeline to normalize performance measurement of independently developed algorithms against a common test dataset drawn from our clinical imaging. Three vendor applications for detecting solid, part-solid, and ground-glass lung nodules in chest CT examinations were assessed in this retrospective study using our data-preprocessing and algorithm assessment chain. The pipeline included tools for image cohort creation and de-identification; report and image annotation for ground-truth labeling; server partitioning to receive vendor "black box" algorithms and to enable model testing on our internal clinical data (100 chest CTs with 243 nodules) from within our security firewall; model validation and result visualization; and performance assessment calculating algorithm recall, precision, and receiver operating characteristic (ROC) curves. Algorithm true positives, false positives, false negatives, recall, and precision for detecting lung nodules were as follows: Vendor-1 (194, 23, 49, 0.80, 0.89); Vendor-2 (182, 270, 61, 0.75, 0.40); Vendor-3 (75, 120, 168, 0.32, 0.39). The AUCs for detection of solid (0.61-0.74), ground-glass (0.66-0.86), and part-solid (0.52-0.86) nodules varied between the three vendors. Our ML model validation pipeline enabled testing of multi-vendor algorithms within the institutional firewall. The wide variation in algorithm performance for both detection and classification of lung nodules justifies the premise for a standardized, objective ML algorithm evaluation process.
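For concreteness, the short Python sketch below shows how the reported per-vendor recall and precision figures follow from the raw true-positive, false-positive, and false-negative counts (recall = TP/(TP+FN); precision = TP/(TP+FP)). It is illustrative only, not part of the published pipeline; the counts are taken from the abstract, and minor rounding differences from the reported values are possible.

# Minimal sketch, not the authors' pipeline code: derive detection metrics
# from the raw counts reported in the abstract. Small rounding differences
# from the published recall/precision figures are possible.

def detection_metrics(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (recall, precision) from TP, FP, and FN counts."""
    recall = tp / (tp + fn)     # fraction of the 243 annotated nodules found
    precision = tp / (tp + fp)  # fraction of detections that were real nodules
    return recall, precision

# (TP, FP, FN) per vendor, as reported in the abstract
vendor_counts = {
    "Vendor-1": (194, 23, 49),
    "Vendor-2": (182, 270, 61),
    "Vendor-3": (75, 120, 168),
}

for vendor, (tp, fp, fn) in vendor_counts.items():
    recall, precision = detection_metrics(tp, fp, fn)
    print(f"{vendor}: recall={recall:.2f}, precision={precision:.2f}")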