
Comparative Accuracy Assessment of Large Language Models in Cardiothoracic Anesthesia: A Performance Analysis of Claude and ChatGPT-4 on Subspecialty Board-Style Questions.

Author Information

Beil Jordan, Aggarwal Ishan, Devarapalli Mallikarjuna

Affiliations

Anesthesiology and Perioperative Medicine, Augusta University Medical College of Georgia, Augusta, USA.

Publication Information

Cureus. 2025 Jul 23;17(7):e88591. doi: 10.7759/cureus.88591. eCollection 2025 Jul.

Abstract

Background: The integration of artificial intelligence (AI) into healthcare has accelerated rapidly since the public release of ChatGPT (OpenAI, San Francisco, California, United States) in 2022. While large language models (LLMs) have demonstrated proficiency in general medical knowledge and licensing examinations, their performance in medical subspecialties remains largely unexplored.

Objective: This study compared the accuracy of two prominent LLMs, Claude (Anthropic PBC, San Francisco, California, United States) and ChatGPT-4, in answering cardiothoracic anesthesia board-style questions and evaluated their potential for clinical decision support in this subspecialty.

Methods: We developed a Python-based framework to systematically evaluate LLM performance on 100 custom multiple-choice questions covering cardiothoracic anesthesia topics, including arrhythmia management, electrophysiology procedures, pacemaker programming, and perioperative complications. Questions were presented to both Claude and ChatGPT-4 via their respective application programming interfaces (APIs), and responses were compared against expert-validated correct answers. The primary outcome was the overall accuracy percentage for each model.

Results: Claude achieved 32% accuracy (32/100 questions), while ChatGPT-4 achieved 23% accuracy (23/100 questions), a nine-percentage-point difference (p < 0.05). Both models performed below the threshold typically considered acceptable for clinical decision-making (≥80%). Performance varied across question domains, with both models showing marked difficulty on questions requiring complex electrophysiological reasoning and visual data interpretation (e.g., ECG- and imaging-based cases).

Conclusions: Current LLMs demonstrate limited accuracy in subspecialty-level cardiothoracic anesthesia knowledge, highlighting the need for specialized training datasets and model refinement before clinical implementation. These findings underscore the importance of subspecialty-specific validation before deploying AI tools in specialized medical domains.


