
Comparative Accuracy Assessment of Large Language Models in Cardiothoracic Anesthesia: A Performance Analysis of Claude and ChatGPT-4 on Subspecialty Board-Style Questions.

Author Information

Beil Jordan, Aggarwal Ishan, Devarapalli Mallikarjuna

Affiliations

Anesthesiology and Perioperative Medicine, Augusta University Medical College of Georgia, Augusta, USA.

Publication Information

Cureus. 2025 Jul 23;17(7):e88591. doi: 10.7759/cureus.88591. eCollection 2025 Jul.

Abstract

Background: The integration of artificial intelligence (AI) into healthcare has accelerated rapidly since the public release of ChatGPT (OpenAI, San Francisco, California, United States) in 2022. While large language models (LLMs) have demonstrated proficiency in general medical knowledge and licensing examinations, their performance in medical subspecialties remains largely unexplored.

Objective: This study compared the accuracy of two prominent LLMs, Claude (Anthropic PBC, San Francisco, California, United States) and ChatGPT-4, in answering cardiothoracic anesthesia board-style questions and evaluated their potential for clinical decision support in this subspecialty.

Methods: We developed a Python-based framework to systematically evaluate LLM performance on 100 custom multiple-choice questions covering cardiothoracic anesthesia topics, including arrhythmia management, electrophysiology procedures, pacemaker programming, and perioperative complications. Questions were presented to both Claude and ChatGPT-4 via their respective application programming interfaces (APIs), and responses were compared against expert-validated correct answers. The primary outcome was the overall accuracy percentage for each model.

Results: Claude achieved 32% accuracy (32/100 questions), while ChatGPT-4 achieved 23% accuracy (23/100 questions), a nine-percentage-point difference (p < 0.05). Both models performed below the threshold typically considered acceptable for clinical decision-making (≥80%). Performance varied across question domains, with both models showing marked difficulty on questions requiring complex electrophysiological reasoning and visual data interpretation (e.g., ECG- and imaging-based cases).

Conclusions: Current LLMs demonstrate limited accuracy in subspecialty-level cardiothoracic anesthesia knowledge, highlighting the need for specialized training datasets and model refinement before clinical implementation. These findings underscore the importance of subspecialty-specific validation before deploying AI tools in specialized medical domains.


