Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada.
Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada.
Fam Med Community Health. 2024 May 28;12(Suppl 1):e002626. doi: 10.1136/fmch-2023-002626.
The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested on different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample short-answer management problems (SAMPs) from the certification examination of the College of Family Physicians of Canada (CFPC).
Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMP questions from the CFPC website. Two independent certified family physician reviewers scored the AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their own knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse the repeated measures across the five rounds.
According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. The Reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of those by GPT-4 were correct. The GEE analysis confirmed that, over the five rounds, the odds of achieving a higher CFPC score percentage were 2.31 times greater for GPT-4 than for GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the Reviewers' score percentage for responses provided by GPT-4 over the five rounds was 2.23 times more likely to exceed that of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Re-running the GPTs after a one-week interval, regenerating the prompt, or using versus not using the prompt did not significantly change the CFPC score percentage.
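The reported odds ratios and confidence intervals above come from coefficients on the log-odds scale. As a quick consistency check (using only the numbers reported in the abstract), the point estimate should sit at the midpoint of the CI on the log scale, and the standard error can be backed out from the CI width:

```python
import math

# Reported result from the abstract: OR 2.31, 95% CI 1.53 to 3.47.
beta = math.log(2.31)                 # coefficient on the log-odds scale
lo, hi = math.log(1.53), math.log(3.47)
se = (hi - lo) / (2 * 1.96)           # back out the standard error from the 95% CI
print(round(beta, 3), round(se, 3))
```

The log of the point estimate (about 0.84) matches the midpoint of the log-scale CI, which is what one expects from a Wald-type interval on a GEE coefficient.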
In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions from the CFPC exam and showed that more than 70% of the answers were accurate, with GPT-4 outperforming GPT-3.5. Large language models such as GPTs seem promising for assisting candidates for the CFPC exam by providing potential answers. However, their use in family medicine education and exam preparation requires further study.