Lee Ro Woon, Lee Kyu Hong, Yun Jae Sung, Kim Myung Sub, Choi Hyun Seok
Department of Radiology, Inha University College of Medicine, Incheon 22332, Republic of Korea.
Department of Radiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea.
J Clin Med. 2024 Nov 22;13(23):7057. doi: 10.3390/jcm13237057.
This study investigated the diagnostic capabilities of two AI-based tools, M4CXR (research-only version) and ChatGPT-4o, in chest X-ray interpretation. M4CXR is a specialized cloud-based system that uses large language models (LLMs) to generate comprehensive radiology reports, while ChatGPT-4o, built on the GPT-4 architecture, offers potential utility in settings with limited radiological expertise. The study evaluated 826 anonymized chest X-ray images from Inha University Hospital. Two experienced radiologists independently assessed the performance of M4CXR and ChatGPT-4o across multiple diagnostic parameters: diagnostic accuracy, false findings, location accuracy, count accuracy, and the presence of hallucinations. Interobserver agreement was quantified with Cohen's kappa coefficient. M4CXR consistently outperformed ChatGPT-4o across all evaluation metrics. For diagnostic accuracy, M4CXR achieved acceptability ratings of approximately 60-62%, versus 42-45% for ChatGPT-4o. Both systems showed high interobserver agreement, with M4CXR generally displaying stronger consistency. Notably, M4CXR was markedly more accurate in anatomical localization (76-77.5% vs. 36-36.5% for ChatGPT-4o) and produced fewer hallucinations. These findings highlight the complementary potential of the two AI technologies in medical diagnostics: while M4CXR is stronger in specialized radiological analysis, integrating both systems could optimize diagnostic workflows. The study emphasizes the role of AI in augmenting rather than replacing human expertise, suggesting that combining AI capabilities with clinical judgment could improve patient care outcomes.