Herber Sarah K, Müller Lukas, Pinto Dos Santos Daniel, Jorg Tobias, Souschek Fabio, Bäuerle Tobias, Foersch Sebastian, Galata Christian, Mildenberger Peter, Halfmann Moritz C
Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany.
Institute of Pathology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany.
Eur Radiol. 2025 Jul 25. doi: 10.1007/s00330-025-11845-1.
Lung cancer is the leading cause of cancer-related mortality. While early detection improves survival, distinguishing malignant from benign pulmonary nodules remains challenging. Artificial intelligence (AI) has been proposed to enhance diagnostic accuracy, but its clinical reliability is still under investigation. Here, we aimed to evaluate the diagnostic performance of AI models in classifying pulmonary nodules.
This single-center retrospective study analyzed pulmonary nodules (4-30 mm) detected on CT scans, using three AI software models. Sensitivity, specificity, false-positive and false-negative rates were calculated. The diagnostic accuracy was assessed using the area under the receiver operating characteristic (ROC) curve (AUC), with histopathology serving as the gold standard. Subgroup analyses were based on nodule size and histopathological classification. The impact of imaging parameters was evaluated using regression analysis.
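As an illustration of the metrics named above, the following is a minimal sketch of how sensitivity, specificity, false-positive rate, false-negative rate and AUC can be computed against a histopathological reference. It uses the conventional textbook definitions; the abstract does not state which denominators or software the authors used, so the names `Counts`, `rates` and `auc`, and all numbers in the usage note, are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical 2x2 table: AI call versus histopathology (reference)."""
    tp: int  # malignant on histopathology, called malignant by the AI
    fn: int  # malignant on histopathology, called benign by the AI
    tn: int  # benign on histopathology, called benign by the AI
    fp: int  # benign on histopathology, called malignant by the AI


def rates(c: Counts) -> dict:
    """Conventional rates; assumes at least one malignant and one benign case."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "false_negative_rate": 1 - sensitivity,
    }


def auc(scores_malignant, scores_benign) -> float:
    """AUC as the probability that a randomly chosen malignant nodule gets a
    higher risk score than a randomly chosen benign one (ties count as half).
    This is the Mann-Whitney formulation of the area under the ROC curve."""
    wins = 0.0
    for m in scores_malignant:
        for b in scores_benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(scores_malignant) * len(scores_benign))
```

For example, `rates(Counts(tp=8, fn=2, tn=6, fp=4))` gives a sensitivity of 0.8 and a specificity of 0.6 for these made-up counts. An AUC near 0.5 from `auc` would mean the risk scores separate malignant from benign nodules little better than chance.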
A total of 158 nodules (n = 30 benign, n = 128 malignant) were analyzed. One AI model classified most nodules as intermediate risk, preventing further accuracy assessment. The other models demonstrated moderate sensitivity (53.1-70.3%) but low specificity (46.7-66.7%), leading to a high false-positive rate (45.5-52.4%). AUC values were between 0.5 and 0.6 (95% CI). Subgroup analyses revealed decreased sensitivity (47.8-61.5%) but increased specificity (100%), highlighting inconsistencies. In total, up to 49.0% of the pulmonary nodules were classified as intermediate risk. CT scan type influenced performance (p = 0.03), with better classification accuracy on breath-held CT scans.
AI-based software models are not ready for standalone clinical use in pulmonary nodule classification due to low specificity, a high false-positive rate and a high proportion of intermediate-risk classifications.
Question: How accurate are commercially available AI models for the classification of pulmonary nodules compared to the gold standard of histopathology? Findings: The evaluated AI models demonstrated moderate sensitivity, low specificity and high false-positive rates. Up to 49% of pulmonary nodules were classified as intermediate risk. Clinical relevance: The high false-positive rates could influence radiologists' decision-making, leading to an increased number of interventions or unnecessary surgical procedures.