Arzamasov Kirill, Vasilev Yuriy, Zelenova Maria, Pestrenin Lev, Busygina Yulia, Bobrovskaya Tatiana, Chetverikov Sergey, Shikhmuradov David, Pankratov Andrey, Kirpichev Yury, Sinitsyn Valentin, Son Irina, Omelyanskaya Olga
State Budget-Funded Health Care Institution of the City of Moscow "Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department", Moscow, Russian Federation.
MIREA - Russian Technological University, Moscow, Russian Federation.
Quant Imaging Med Surg. 2024 Aug 1;14(8):5288-5303. doi: 10.21037/qims-24-160. Epub 2024 Jul 25.
The integration of artificial intelligence (AI) into medicine is growing, with some experts predicting its standalone use soon. However, skepticism remains due to limited positive outcomes from independent validations. This research evaluates AI software's effectiveness in analyzing chest X-rays (CXR) to identify lung nodules, a possible lung cancer indicator.
This retrospective study analyzed 7,670,212 record pairs from radiological exams conducted between 2020 and 2022 during the Moscow Computer Vision Experiment, focusing on CXR and computed tomography (CT) scans. All images were acquired during clinical routine. The final dataset comprised 100 CXR images (50 with lung nodules, 50 without), selected consecutively according to inclusion and exclusion criteria, and was used to evaluate all five AI-based solutions that participated in the Moscow Computer Vision Experiment and analyzed CXR. The evaluation was performed in three stages. In the first stage, the probability of a lung nodule returned by each AI service was compared with the ground truth (1: nodule present; 0: no nodule). In the second stage, three radiologists evaluated the nodule segmentation produced by the AI services (1: nodule correctly segmented; 0: nodule incorrectly segmented or not segmented at all). In the third stage, the same radiologists additionally evaluated the classification of the nodules (1: nodule correctly segmented and classified; 0: all other cases). The results obtained in stages 2 and 3 were compared with the same ground truth used in stage 1. For each stage, diagnostic accuracy metrics were calculated for each AI service.
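The stage 1 comparison of AI probabilities against binary ground truth can be sketched as follows. This is a minimal illustration and not the study's analysis code: the function names, the 0.5 decision threshold, and the toy data are all assumptions introduced here. AUC is computed as the Mann-Whitney statistic, i.e. the fraction of (nodule, no-nodule) pairs in which the nodule case receives the higher score.

```python
from typing import List, Sequence, Tuple


def auc_mann_whitney(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn probabilities into 0/1 calls; the 0.5 threshold is an assumption."""
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy data only: 1 = nodule present, 0 = no nodule.
    truth = [1, 1, 0, 0]
    probs = [0.9, 0.4, 0.5, 0.1]
    print("AUC:", auc_mann_whitney(truth, probs))
    print("sens/spec:", sensitivity_specificity(truth, binarize(probs)))
```

For stages 2 and 3, the same two metric functions could be applied with the radiologists' 0/1 scores in place of the thresholded probabilities; how the three readers' scores were combined is not stated in the abstract.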
Three software solutions (Celsus, Lunit INSIGHT CXR, and qXR) demonstrated diagnostic metrics that matched or surpassed the vendor specifications, with the highest area under the receiver operating characteristic curve (AUC) reaching 0.956 [95% confidence interval (CI): 0.918 to 0.994]. However, when three radiologists evaluated the accuracy of nodule segmentation and classification, all solutions performed below the vendor-declared metrics, with the highest AUC reaching 0.812 (95% CI: 0.744 to 0.879). At the same time, all AI services demonstrated 100% specificity at stages 2 and 3 of the study.
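The abstract reports each AUC with a 95% CI but does not state how the interval was obtained. One common option is a stratified percentile bootstrap, sketched below under that assumption; the function names, the number of resamples, and the fixed seed are hypothetical choices made for illustration, and the authors may have used a different method.

```python
import random
from typing import Sequence, Tuple


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    y_true: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for AUC, resampling each class separately."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    stats = []
    for _ in range(n_boot):
        # Stratified resampling with replacement keeps both classes in every draw.
        sample = [rng.choice(pos_idx) for _ in pos_idx]
        sample += [rng.choice(neg_idx) for _ in neg_idx]
        stats.append(auc([y_true[i] for i in sample], [scores[i] for i in sample]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Toy data only: 1 = nodule present, 0 = no nodule.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    probs = [0.9, 0.8, 0.4, 0.7, 0.5, 0.2, 0.1, 0.3]
    print("AUC:", auc(truth, probs))
    print("95% CI:", bootstrap_auc_ci(truth, probs))
```

With only 50 positive and 50 negative images, intervals of this kind are fairly wide, which is consistent with the ranges quoted above.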
To ensure the reliability and applicability of AI-based software, it is crucial to validate performance metrics on high-quality datasets and to engage radiologists in the evaluation process. Developers are advised to improve the accuracy of the underlying models before the software is allowed to be used standalone for lung nodule detection. The dataset created during the study may be accessed at https://mosmed.ai/datasets/mosmeddatargogksnalichiemiotsutstviemlegochnihuzlovtipvii/.