
Clinical Knowledge and Reasoning Abilities of AI Large Language Models in Anesthesiology: A Comparative Study on the American Board of Anesthesiology Examination.

Author Information

Angel Mirana C, Rinehart Joseph B, Cannesson Maxime P, Baldi Pierre

Affiliations

From the Department of Computer Science, University of California Irvine, Irvine, California.

Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, California.

Publication Information

Anesth Analg. 2024 Aug 1;139(2):349-356. doi: 10.1213/ANE.0000000000006892. Epub 2024 Apr 19.

Abstract

BACKGROUND

Over the past decade, artificial intelligence (AI) has expanded significantly with increased adoption across various industries, including medicine. Recently, AI-based large language models such as Generative Pretrained Transformer-3 (GPT-3), Bard, and Generative Pretrained Transformer-4 (GPT-4) have demonstrated remarkable language capabilities. While previous studies have explored their potential in general medical knowledge tasks, here we assess their clinical knowledge and reasoning abilities in a specialized medical context.

METHODS

We studied and compared the performance of all 3 models on both the written and oral portions of the comprehensive and challenging American Board of Anesthesiology (ABA) examination, which evaluates candidates' knowledge and competence in anesthesia practice.

RESULTS

Our results reveal that only GPT-4 successfully passed the written examination, achieving an accuracy of 78% on the basic section and 80% on the advanced section. In comparison, the less recent or smaller GPT-3 and Bard models scored 58% and 47% on the basic examination, and 50% and 46% on the advanced examination, respectively. Consequently, only GPT-4 was evaluated in the oral examination, with examiners concluding that it had a reasonable possibility of passing the structured oral examination. Additionally, we observe that these models exhibit varying degrees of proficiency across distinct topics, which could serve as an indicator of the relative quality of information contained in the corresponding training datasets. This may also act as a predictor for determining which anesthesiology subspecialty is most likely to witness the earliest integration with AI.

CONCLUSIONS

GPT-4 outperformed GPT-3 and Bard on both basic and advanced sections of the written ABA examination, and actual board examiners considered GPT-4 to have a reasonable possibility of passing the real oral examination; these models also exhibit varying degrees of proficiency across distinct topics.


