
Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada.

Author information

Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada.

Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada.

Publication information

Fam Med Community Health. 2024 May 28;12(Suppl 1):e002626. doi: 10.1136/fmch-2023-002626.

Abstract

INTRODUCTION

The application of large language models such as generative pre-trained transformers (GPTs) has shown promise in medical education, and their performance has been tested on different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC).

METHOD

Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMP questions from the CFPC website. Two independent certified family physician reviewers scored the AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds.
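An ordinal logistic GEE analysis of this kind can be sketched with statsmodels' `OrdinalGEE`. The snippet below is an illustrative reconstruction only, not the authors' code: the scores are synthetic, and the 0-2 scoring scale, variable names, and covariance structure are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import OrdinalGEE
from statsmodels.genmod.cov_struct import Independence

# Synthetic stand-in data: 77 questions x 5 rounds x 2 models,
# each response scored on an assumed 0-2 ordinal scale.
rng = np.random.default_rng(42)
rows = []
for question in range(77):
    for rnd in range(5):
        for is_gpt4 in (0, 1):
            latent = rng.normal(1.0 + 0.5 * is_gpt4, 0.8)
            score = int(np.clip(round(latent), 0, 2))
            rows.append({"question": question, "round": rnd,
                         "is_gpt4": is_gpt4, "score": score})
df = pd.DataFrame(rows)

# Ordinal logistic GEE: repeated measures clustered by question.
mod = OrdinalGEE(df["score"], df[["is_gpt4"]], groups=df["question"],
                 cov_struct=Independence())
res = mod.fit()

# Exponentiating the model coefficient (the last parameter, after the
# ordinal threshold intercepts) gives the odds ratio for GPT-4 vs GPT-3.5.
odds_ratio = float(np.exp(np.asarray(res.params)[-1]))
print(f"Odds ratio (GPT-4 vs GPT-3.5): {odds_ratio:.2f}")
```

Clustering by question accounts for the repeated measurements of each item across rounds; the paper reports the resulting odds ratios with 95% CIs.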

RESULT

According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. The reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of those by GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC score percentage for GPT-4 was 2.31 times that of GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p&lt;0.001). Similarly, the Reviewers' score percentage for responses provided by GPT-4 over the five rounds was 2.23 times more likely to exceed that of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Running the GPTs after a one-week interval, regenerating the prompt, or using versus not using the prompt did not significantly change the CFPC score percentage.

CONCLUSION

In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions of the CFPC exam and showed that more than 70% of the answers were accurate, and that GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates of the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further study.
