Khalpey Zain, Kumar Ujjawal, King Nicholas, Abraham Alyssa, Khalpey Amina H
Khalpey AI Lab, Department of Cardiothoracic Surgery, HonorHealth, Scottsdale, USA.
Department of Research, Applied & Translational AI Research Institute (ATARI), Scottsdale, USA.
Cureus. 2024 Jul 22;16(7):e65083. doi: 10.7759/cureus.65083. eCollection 2024 Jul.
Objectives Large language models (LLMs) such as ChatGPT have performed exceptionally well in various fields. Of note, their success in answering postgraduate medical examination questions has been previously reported, indicating their possible utility in surgical education and training. This study evaluated the performance of four different LLMs on the American Board of Thoracic Surgery's (ABTS) Self-Education and Self-Assessment in Thoracic Surgery (SESATS) XIII question bank to investigate the potential applications of these LLMs in the education and training of future surgeons.
Methods The dataset in this study comprised 400 best-of-four questions from the SESATS XIII exam: 220 adult cardiac surgery questions, 140 general thoracic surgery questions, 20 congenital cardiac surgery questions, and 20 cardiothoracic critical care questions. The GPT-3.5 (OpenAI, San Francisco, CA) and GPT-4 (OpenAI) models were evaluated, along with Med-PaLM 2 (Google Inc., Mountain View, CA) and Claude 2 (Anthropic Inc., San Francisco, CA), and their respective performances were compared. The subspecialties included were adult cardiac, general thoracic, congenital cardiac, and critical care. Questions requiring visual information, such as clinical images or radiology, were excluded.
Results GPT-4 demonstrated a significant improvement over GPT-3.5 overall (87.0% vs. 51.8% of questions answered correctly, p < 0.0001). GPT-4 also exhibited consistently improved performance across all subspecialties, with accuracy rates ranging from 70.0% to 90.0%, compared to 35.0% to 60.0% for GPT-3.5. When using the GPT-4 model, ChatGPT performed significantly better on the adult cardiac and general thoracic subspecialties (p < 0.0001).
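As an illustration of the overall comparison, the reported GPT-4 vs. GPT-3.5 gap can be checked with a two-proportion z-test. Note this is a sketch, not the authors' analysis: the abstract does not state which statistical test was used, and the correct-answer counts (348/400 and 207/400) are reconstructed here from the reported percentages (87.0% and 51.8%).

```python
from math import sqrt, erfc

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided pooled two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal, via the
    # complementary error function (avoids scipy dependency).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Counts reconstructed from the reported accuracies on 400 questions:
# GPT-4: 87.0% -> ~348 correct; GPT-3.5: 51.8% -> ~207 correct.
z, p = two_proportion_z(348, 400, 207, 400)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With these reconstructed counts the test gives z ≈ 10.8 and a p-value far below 0.0001, consistent with the significance level reported in the abstract.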
Conclusions Large language models, such as ChatGPT with the GPT-4 model, demonstrate impressive skill in understanding complex cardiothoracic surgical clinical information, achieving an overall accuracy rate of nearly 90.0% on the SESATS question bank. Our study shows significant improvement between successive GPT iterations. As LLM technology continues to evolve, its potential use in surgical education, training, and continuing medical education is anticipated to enhance patient outcomes and safety in the future.