Department of Surgery, Community Medical Center, RWJ/Barnabas Health, Toms River, New Jersey.
Department of Surgery, Robert Wood Johnson Medical School, New Brunswick, New Jersey.
J Surg Educ. 2024 Jun;81(6):780-785. doi: 10.1016/j.jsurg.2024.02.009. Epub 2024 Apr 27.
Advances in artificial intelligence (AI) have given rise to sophisticated algorithms capable of generating human-like text. The goal of this study was to evaluate the ability of human reviewers to reliably differentiate personal statements (PS) written by human authors from those generated by AI software.
Four personal statements from the archives of two surgical program directors were de-identified and used as the human samples. Two AI platforms were used to generate nine additional PS.
Four surgeons from the residency selection committees of two surgical residency programs of a large multihospital system served as blinded reviewers. AI was also asked to evaluate each PS sample for authorship.
Sensitivity, specificity, and accuracy of the reviewers in identifying the PS author were calculated. The kappa statistic for agreement between the hypothesized author and the true author was calculated. Inter-rater reliability was calculated using the kappa statistic with Light's modification, given more than two reviewers in a fully crossed design. Logistic regression was performed to model the impact of perceived creativity, writing quality, and authorship on the likelihood of offering an interview.
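To make these analyses concrete, the following is a minimal Python sketch, not the authors' code, of how metrics of this kind could be computed; the function names, variables, and demo data are hypothetical placeholders, with 1 denoting an AI-authored PS.

# Minimal sketch of the reported analyses; all data below are hypothetical
# placeholders, not the study data. Labels: 1 = AI-authored, 0 = human.
from itertools import combinations
import numpy as np

def sens_spec_acc(truth, guess):
    """Sensitivity, specificity, and accuracy for binary labels."""
    truth, guess = np.asarray(truth), np.asarray(guess)
    tp = np.sum((truth == 1) & (guess == 1))  # AI-authored PS correctly flagged
    tn = np.sum((truth == 0) & (guess == 0))  # human PS correctly identified
    fp = np.sum((truth == 0) & (guess == 1))
    fn = np.sum((truth == 1) & (guess == 0))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / truth.size

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                                           # observed agreement
    pe = np.mean(a) * np.mean(b) + np.mean(1 - a) * np.mean(1 - b) # chance agreement
    return (po - pe) / (1 - pe)

def lights_kappa(ratings):
    """Light's kappa: mean pairwise Cohen's kappa over more than two raters
    in a fully crossed design (every reviewer rates every statement)."""
    return np.mean([cohen_kappa(a, b) for a, b in combinations(ratings, 2)])

# Hypothetical demo: 4 reviewers classifying 13 PS (4 human + 9 AI).
rng = np.random.default_rng(0)
truth = np.array([0] * 4 + [1] * 9)
ratings = [rng.integers(0, 2, size=13) for _ in range(4)]
print(sens_spec_acc(truth, ratings[0]))  # per-reviewer metrics
print(lights_kappa(ratings))             # inter-rater reliability

The logistic regression step could be fit with, for example, statsmodels.api.Logit on indicators for perceived creativity, writing quality, and perceived authorship; it is omitted here for brevity.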
Human reviewer sensitivity for identifying an AI-generated PS was 0.87, with a specificity of 0.37 and an overall accuracy of 0.55. Agreement between the reviewers' hypothesized authorship and the true authorship was slight, with a kappa statistic of 0.19. The reviewers themselves had poor inter-rater reliability (kappa 0.067), with complete agreement (four of four reviewers) on only two PS, both authored by humans. The odds of offering an interview (versus a composite of "backup" status or no interview) to a perceived human author were 7 times those for a perceived AI author (95% confidence interval 1.5276 to 32.0758, p = 0.0144). AI hypothesized human authorship for twelve of the thirteen PS and was "unsure" on the remaining one.
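As a rough consistency check, not taken from the paper, the reported interval can be back-solved on the log-odds scale: the geometric mean of the CI bounds recovers the point estimate, and the implied standard error gives a Wald p-value in the neighborhood of the one reported.

# Hedged arithmetic check of the reported OR = 7 (95% CI 1.5276 to 32.0758).
import math

lo, hi = 1.5276, 32.0758
or_hat = math.sqrt(lo * hi)                      # geometric mean, approx. 7.0
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # implied log-scale standard error
z = math.log(or_hat) / se                        # Wald z-statistic
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided p-value
print(or_hat, z, p)                              # p comes out near 0.012

The small gap between this back-calculated p-value and the reported 0.0144 is plausibly due to rounding in the published estimates or a different test statistic.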
The increasing pervasiveness of AI will have far-reaching effects, including on the resident application and recruitment process. Identifying AI-generated personal statements is exceedingly difficult. With the decreasing availability of objective data for assessing applicants, a review and potential restructuring of the approach to resident recruitment may be warranted.