Can Artificial Intelligence Deceive Residency Committees? A Randomized Multicenter Analysis of Letters of Recommendation.

Author Information

Simister Samuel K, Huish Eric G, Tsai Eugene Y, Le Hai V, Halim Andrea, Tuason Dominick, Meehan John P, Leshikar Holly B, Saiz Augustine M, Lum Zachary C

Affiliations

From the University of California, Davis, Sacramento, CA (Simister, Le, Meehan, Leshikar, Saiz, and Lum); San Joaquin General Hospital, French Camp, CA (Huish); Cedars-Sinai, Los Angeles, CA (Tsai); and Yale University, New Haven, CT (Halim and Tuason).

Publication Information

J Am Acad Orthop Surg. 2025 Mar 15;33(6):e348-e355. doi: 10.5435/JAAOS-D-24-00438. Epub 2024 Dec 12.

Abstract

INTRODUCTION

The introduction of generative artificial intelligence (AI) may have a profound effect on residency applications. In this study, we explore the capabilities of AI-generated letters of recommendation (LORs) by evaluating how accurately orthopaedic surgery residency selection committee members can identify LORs written by human or AI authors.

METHODS

In a multicenter, single-blind trial, a total of 45 LORs (15 human-written, 15 ChatGPT-generated, and 15 Google BARD-generated) were curated. Seven faculty reviewers from four residency programs graded each of the 45 LORs, presented in random order, on the 11 characteristics outlined in the American Orthopaedic Association's standardized LOR, on 1-to-10 scales for how they would rank the applicant and how strongly they would want the applicant in their program, and on whether they believed the letter was written by a human or an AI author. Analysis included descriptive statistics, ordinal regression, and a receiver operating characteristic curve to assess whether identification accuracy changed with the number of letters reviewed.
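The abstract does not include the authors' analysis code; the sketch below is a minimal, hypothetical illustration of how an ordinal regression and a receiver operating characteristic analysis of this design could be set up in Python. The data are synthetic, and the variable names (rank_score, author, letters_reviewed, correct) and model choices are assumptions for illustration, not the study's actual analysis.

```python
# Minimal, hypothetical sketch (not the authors' code): ordinal regression of
# reviewer scores on the true letter author, plus an AUC for whether reviewers
# became more accurate as they reviewed more letters. All data are synthetic.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 45 * 7  # 45 letters, each graded by 7 faculty reviewers

df = pd.DataFrame({
    # hypothetical 1-10 "how would you rank the applicant" score
    "rank_score": pd.Categorical(rng.integers(1, 11, n),
                                 categories=range(1, 11), ordered=True),
    # true author of each letter
    "author": rng.choice(["human", "chatgpt", "bard"], n),
    # order in which each reviewer saw the letter (1..45)
    "letters_reviewed": np.tile(np.arange(1, 46), 7),
    # whether the reviewer correctly guessed human vs. AI authorship
    "correct": rng.integers(0, 2, n),
})

# Ordinal (proportional-odds) regression: rank_score ~ author
exog = pd.get_dummies(df["author"], drop_first=True).astype(float)
res = OrderedModel(df["rank_score"], exog, distr="logit").fit(method="bfgs", disp=False)
print(res.summary())

# Does identification accuracy improve with the number of letters reviewed?
auc = roc_auc_score(df["correct"], df["letters_reviewed"])
print(f"AUC (letters reviewed vs. correct identification): {auc:.3f}")
```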

RESULTS

Faculty reviewers correctly identified 40% (42/105) of human-generated and 63% (132/210) of AI-generated letters (P < 0.001), which did not increase over time (AUC 0.451, P = 0.102). When analyzed by perceived author, letters marked as human-generated had significantly higher means for all variables (P = 0.01). BARD did markedly better than human authors in accuracy (3.25 [1.79 to 5.92], P < 0.001), adaptability (1.29 [1.02 to 1.65], P = 0.034), and perceived commitment (1.56 [0.99 to 2.47], P < 0.055). Additional analysis controlling for reviewer background showed no differences in outcomes based on experience or familiarity with the AI programs.
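As a sanity check on the reported proportions, the short sketch below recomputes the identification rates from the stated counts and runs an illustrative chi-square test of independence on the resulting 2x2 table; the choice of test is an assumption, since the abstract reports only the P value.

```python
# Recompute the reported identification rates from the stated counts and run
# an illustrative chi-square test of independence on the 2x2 table.
# The specific test is an assumption; the abstract reports only P < 0.001.
from scipy.stats import chi2_contingency

human_correct, human_total = 42, 105   # 40% of human-written letters identified
ai_correct, ai_total = 132, 210        # 63% of AI-generated letters identified

print(f"human letters correctly identified: {human_correct / human_total:.0%}")
print(f"AI letters correctly identified:    {ai_correct / ai_total:.0%}")

table = [[human_correct, human_total - human_correct],
         [ai_correct, ai_total - ai_correct]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4g}")
```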

CONCLUSION

Faculty members failed to distinguish human-generated from AI-generated LORs roughly 50% of the time, which suggests that AI can generate LORs comparable to those written by human authors. This highlights the need for selection committees to reconsider the role and influence of LORs in residency applications.
