Mola Serkan, Yıldırım Alp, Gül Enis Burak
Cardiovascular Surgery Department, Ankara Bilkent City Hospital, 06800 Ankara, Turkey.
Cardiovascular Surgery Department, Ankara Atatürk Sanatoryum Training and Research Hospital, 06290 Ankara, Turkey.
Rev Cardiovasc Med. 2025 Aug 19;26(8):38705. doi: 10.31083/RCM38705. eCollection 2025 Aug.
This study aimed to investigate the performance of two versions of ChatGPT (o1 and 4o) in making decisions about coronary revascularization and to compare their recommendations with those of a multidisciplinary Heart Team. It also assessed whether the decisions generated by ChatGPT, drawing on the system's internal knowledge base and clinical guidelines, align with expert recommendations in real-world coronary artery disease management. Given the increasing prevalence and processing capabilities of large language models such as ChatGPT, this comparison offers insight into the potential applicability of these systems to complex clinical decision-making.
We conducted a retrospective study at a single center, which included 128 patients who underwent coronary angiography between August and September 2024. The demographics, medical history, current medications, echocardiographic findings, and angiographic findings for each patient were provided to the two ChatGPT versions. The two models were then asked to choose one of three treatment options: coronary artery bypass grafting (CABG), percutaneous coronary intervention (PCI), or medical therapy, and to justify their choice. Performance was assessed using metrics including accuracy, sensitivity, specificity, precision, F1 score, Cohen's kappa, and Shannon entropy.
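For readers wanting the mechanics behind these metrics, the following is a minimal sketch, not the study's actual analysis code, of how such multi-class metrics can be computed from the Heart Team's reference labels and a model's recommendations. It assumes scikit-learn and NumPy; the example label vectors are illustrative only.

```python
# Sketch: multi-class performance metrics for treatment recommendations.
# Illustrative labels only -- not patient data from the study.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

LABELS = ["CABG", "PCI", "medical"]

heart_team = ["CABG", "CABG", "PCI", "medical", "CABG", "PCI"]  # reference standard
model_pred = ["CABG", "PCI", "PCI", "CABG", "CABG", "CABG"]     # e.g., one model's output

accuracy = accuracy_score(heart_team, model_pred)
# Per-class sensitivity (recall), precision, and F1; zero_division guards
# classes the model never predicted.
sensitivity = recall_score(heart_team, model_pred, labels=LABELS,
                           average=None, zero_division=0)
precision = precision_score(heart_team, model_pred, labels=LABELS,
                            average=None, zero_division=0)
f1 = f1_score(heart_team, model_pred, labels=LABELS,
              average=None, zero_division=0)
kappa = cohen_kappa_score(heart_team, model_pred)  # chance-corrected agreement

# Per-class specificity from the confusion matrix: TN / (TN + FP).
cm = confusion_matrix(heart_team, model_pred, labels=LABELS)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
specificity = tn / (tn + fp)

# Shannon entropy of the model's recommendation distribution
# (higher = recommendations spread more evenly across the options).
counts = np.bincount([LABELS.index(x) for x in model_pred], minlength=len(LABELS))
p = counts / counts.sum()
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
```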
The Heart Team recommended CABG for 78.1% of the patients, PCI for 12.5%, and medical therapy for 9.4%. ChatGPT o1 showed higher sensitivity in identifying patients who needed CABG (82%) but lower sensitivity for PCI (43.7%), whereas ChatGPT 4o recognized PCI candidates better (68.7%) but showed lower sensitivity for CABG cases (43%). Both models struggled to identify patients suitable for medical therapy, with no correct predictions in this category. Agreement with the Heart Team was low (Cohen's kappa: 0.17 for o1 and 0.03 for 4o). Notably, these errors were often attributable to the models' limited grasp of clinical context and their inability to analyze angiographic images directly.
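For interpretation (added context, not from the original abstract): Cohen's kappa corrects raw agreement for the agreement expected by chance given each rater's label distribution, so a value near 0, such as the 0.03 reported for 4o, indicates agreement barely above what random assignment over the same three categories would produce:

```latex
% p_o = observed proportion of matching recommendations
% p_e = agreement expected by chance from the marginal label frequencies
\kappa = \frac{p_o - p_e}{1 - p_e}
```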
While ChatGPT-based artificial intelligence (AI) models show promise in assisting with cardiac care decisions, their current limitations underscore the need for further development. Incorporating imaging data and enhancing comprehension of clinical context are essential to improve the reliability of these AI models in real-world medical settings.