Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023.

Authors

Khalpey Zain, Kumar Ujjawal, King Nicholas, Abraham Alyssa, Khalpey Amina H

Affiliations

Khalpey AI Lab, Department of Cardiothoracic Surgery, HonorHealth, Scottsdale, USA.

Department of Research, Applied & Translational AI Research Institute (ATARI), Scottsdale, USA.

Publication information

Cureus. 2024 Jul 22;16(7):e65083. doi: 10.7759/cureus.65083. eCollection 2024 Jul.

DOI:10.7759/cureus.65083
PMID:39171020
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11337141/
Abstract

Objectives: Large language models (LLMs), for example, ChatGPT, have performed exceptionally well in various fields. Of note, their success in answering postgraduate medical examination questions has been previously reported, indicating their possible utility in surgical education and training. This study evaluated the performance of four different LLMs on the American Board of Thoracic Surgery's (ABTS) Self-Education and Self-Assessment in Thoracic Surgery (SESATS) XIII question bank to investigate the potential applications of these LLMs in the education and training of future surgeons.

Methods: The dataset in this study comprised 400 best-of-four questions from the SESATS XIII exam. This included 220 adult cardiac surgery questions, 140 general thoracic surgery questions, 20 congenital cardiac surgery questions, and 20 cardiothoracic critical care questions. The GPT-3.5 (OpenAI, San Francisco, CA) and GPT-4 (OpenAI) models were evaluated, as well as Med-PaLM 2 (Google Inc., Mountain View, CA) and Claude 2 (Anthropic Inc., San Francisco, CA), and their respective performances were compared. The subspecialties included were adult cardiac, general thoracic, congenital cardiac, and critical care. Questions requiring visual information, such as clinical images or radiology, were excluded.

Results: GPT-4 demonstrated a significant improvement over GPT-3.5 overall (87.0% vs. 51.8% of questions answered correctly, p < 0.0001). GPT-4 also exhibited consistently improved performance across all subspecialties, with accuracy rates ranging from 70.0% to 90.0%, compared to 35.0% to 60.0% for GPT-3.5. When using the GPT-4 model, ChatGPT performed significantly better on the adult cardiac and general thoracic subspecialties (p < 0.0001).

Conclusions: Large language models, such as ChatGPT with the GPT-4 model, demonstrate impressive skill in understanding complex cardiothoracic surgical clinical information, achieving an overall accuracy rate of nearly 90.0% on the SESATS question bank. Our study shows significant improvement between successive GPT iterations. As LLM technology continues to evolve, its potential use in surgical education, training, and continuous medical education is anticipated to enhance patient outcomes and safety in the future.
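
The headline comparison in the Results (87.0% vs. 51.8%, p < 0.0001) can be sanity-checked with a short calculation. The sketch below is illustrative only and rests on assumptions the abstract does not confirm: that all 400 questions were scored for both models, that the reported percentages round to 348 and 207 correct answers, and that a pooled two-proportion z-test is an acceptable stand-in for whatever test the authors actually used. Because both models answered the same questions, a paired test such as McNemar's would arguably fit better, but the abstract does not report the paired counts it needs. The function names are hypothetical.

```python
import math

# Question counts by subspecialty, as listed in the Methods.
SUBSPECIALTY_COUNTS = {
    "adult cardiac": 220,
    "general thoracic": 140,
    "congenital cardiac": 20,
    "cardiothoracic critical care": 20,
}
TOTAL = sum(SUBSPECIALTY_COUNTS.values())


def correct_count(accuracy_pct, n):
    """Number of correct answers implied by a reported percentage.

    Assumption: the percentage is a rounded value of correct / n.
    """
    return round(accuracy_pct / 100 * n)


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference of two proportions.

    Assumption: the two samples are treated as independent, which is a
    simplification here because both models saw the same questions.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


gpt4_correct = correct_count(87.0, TOTAL)
gpt35_correct = correct_count(51.8, TOTAL)
z, p = two_proportion_z_test(gpt4_correct, TOTAL, gpt35_correct, TOTAL)

print(f"total questions: {TOTAL}")
print(f"GPT-4 correct: {gpt4_correct}, GPT-3.5 correct: {gpt35_correct}")
print(f"z = {z:.2f}, p < 0.0001: {p < 0.0001}")
```

Under these assumptions the subspecialty counts sum to the stated 400 questions, and the test statistic is large enough that the p-value falls well below the reported 0.0001 threshold.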

