Beil Jordan, Aggarwal Ishan, Devarapalli Mallikarjuna
Anesthesiology and Perioperative Medicine, Augusta University Medical College of Georgia, Augusta, USA.
Cureus. 2025 Jul 23;17(7):e88591. doi: 10.7759/cureus.88591. eCollection 2025 Jul.
Background: The integration of artificial intelligence (AI) into healthcare has accelerated rapidly since the public release of ChatGPT (OpenAI, San Francisco, California, United States) in 2022. While large language models (LLMs) have demonstrated proficiency in general medical knowledge and licensing examinations, their performance in specialized medical subspecialties remains largely unexplored.

Objective: The objective of this study was to compare the accuracy of two prominent LLMs, Claude (Anthropic PBC, San Francisco, California, United States) and ChatGPT-4, in answering cardiothoracic anesthesia board-style questions and to evaluate their potential for clinical decision support in this subspecialty.

Methods: We developed a Python-based framework to systematically evaluate LLM performance on 100 custom multiple-choice questions covering cardiothoracic anesthesia topics, including arrhythmia management, electrophysiology procedures, pacemaker programming, and perioperative complications. Questions were presented to both Claude and ChatGPT-4 via their respective application programming interfaces (APIs), and responses were compared against expert-validated correct answers. The primary outcome was the overall accuracy percentage for each model.

Results: Claude achieved 32% accuracy (32/100 questions), while ChatGPT-4 achieved 23% accuracy (23/100 questions), a nine-percentage-point difference (p < 0.05). Both models performed below the threshold typically considered acceptable for clinical decision-making (≥80%). Performance varied across question domains, with both models demonstrating marked difficulty on questions requiring complex electrophysiological reasoning and visual data interpretation (e.g., ECG and imaging-based cases).

Conclusions: Current LLMs demonstrate limited accuracy in subspecialty-level cardiothoracic anesthesia knowledge, highlighting the need for specialized training datasets and model refinement before clinical implementation. These findings underscore the importance of subspecialty-specific validation before deploying AI tools in specialized medical domains.
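The study's evaluation framework is not published here; the following is a minimal sketch of the kind of Python harness the Methods describes (presenting each multiple-choice question to both models via their APIs and scoring against an answer key). The prompt template, model identifiers, question-file format, and helper names are assumptions for illustration, not details taken from the paper.

```python
"""Hypothetical sketch of a board-question evaluation harness (not the authors' code)."""
import json

from openai import OpenAI          # pip install openai
from anthropic import Anthropic    # pip install anthropic

openai_client = OpenAI()           # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()     # reads ANTHROPIC_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are answering a cardiothoracic anesthesia board-style question.\n"
    "Reply with the single letter of the best answer.\n\n{stem}\n\n{options}"
)

def ask_gpt4(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def ask_claude(prompt: str) -> str:
    msg = anthropic_client.messages.create(
        model="claude-3-opus-20240229",  # assumed model ID; the study does not name one
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()

def grade(response: str, key: str) -> bool:
    # Score a response as correct if its first letter matches the answer key.
    return response[:1].upper() == key.strip().upper()

def evaluate(question_file: str = "questions.json") -> None:
    # Hypothetical file format: a list of {"stem", "options", "answer"} records.
    with open(question_file) as f:
        questions = json.load(f)
    scores = {"Claude": 0, "ChatGPT-4": 0}
    for q in questions:
        prompt = PROMPT_TEMPLATE.format(stem=q["stem"], options=q["options"])
        scores["Claude"] += grade(ask_claude(prompt), q["answer"])
        scores["ChatGPT-4"] += grade(ask_gpt4(prompt), q["answer"])
    n = len(questions)
    for model, correct in scores.items():
        print(f"{model}: {correct}/{n} = {100 * correct / n:.0f}% accuracy")

if __name__ == "__main__":
    evaluate()
```

The reported significance (p < 0.05) would be tested on the resulting paired per-question correctness data; the abstract does not state which statistical test was used.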