
Similar Articles

1. Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.
JB JS Open Access. 2024 Oct 24;9(4). doi: 10.2106/JBJS.OA.24.00028. eCollection 2024 Oct-Dec.

2. Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis.
J Med Internet Res. 2024 Jun 26;26:e52001. doi: 10.2196/52001.

3. Distinguishing Authentic Voices in the Age of ChatGPT: Comparing AI-Generated and Applicant-Written Personal Statements for Plastic Surgery Residency Application.
Ann Plast Surg. 2023 Sep 1;91(3):324-325. doi: 10.1097/SAP.0000000000003653.

4. Residency Application Selection Committee Discriminatory Ability in Identifying Artificial Intelligence-Generated Personal Statements.
J Surg Educ. 2024 Jun;81(6):780-785. doi: 10.1016/j.jsurg.2024.02.009. Epub 2024 Apr 27.

5. Detecting Artificial Intelligence-Generated Personal Statements in Professional Physical Therapist Education Program Applications: A Lexical Analysis.
Phys Ther. 2024 Apr 2;104(4). doi: 10.1093/ptj/pzae006.

6. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients.
Vascular. 2025 Feb;33(1):229-237. doi: 10.1177/17085381241240550. Epub 2024 Mar 18.

7. Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard Against Traditional Information Resources.
J Endourol. 2024 Aug;38(8):843-851. doi: 10.1089/end.2023.0696. Epub 2024 May 17.

8. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.
J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.

9. Does using artificial intelligence take the person out of personal statements? We can't tell.
Surgery. 2024 Dec;176(6):1610-1616. doi: 10.1016/j.surg.2024.08.018. Epub 2024 Sep 19.

10. Digital Ink and Surgical Dreams: Perceptions of Artificial Intelligence-Generated Essays in Residency Applications.
J Surg Res. 2024 Sep;301:504-511. doi: 10.1016/j.jss.2024.06.020. Epub 2024 Jul 22.

Cited By

1. Artificial Intelligence vs Human Authorship in Spine Surgery Fellowship Personal Statements: Can ChatGPT Outperform Applicants?
Global Spine J. 2025 May 20:21925682251344248. doi: 10.1177/21925682251344248.

References

1. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers.
NPJ Digit Med. 2023 Apr 26;6(1):75. doi: 10.1038/s41746-023-00819-6.

2. Will ChatGPT Match to Your Program?
Am J Phys Med Rehabil. 2023 Jun 1;102(6):545-547. doi: 10.1097/PHM.0000000000002238. Epub 2023 Mar 13.

3. Perspectives of Orthopedic Surgery Program Directors on the USMLE Step 1 Scoring Change.
Orthopedics. 2022 Sep-Oct;45(5):e257-e262. doi: 10.3928/01477447-20220425-03. Epub 2022 Apr 29.

4. Perceptions of USMLE Step 1 Pass/Fail Score Reporting Among Orthopedic Surgery Residency Program Directors.
Orthopedics. 2022 Jan-Feb;45(1):e30-e34. doi: 10.3928/01477447-20211124-08. Epub 2021 Dec 2.

5. Evaluating the Standardized Letter of Recommendation Form in Applicants to Orthopaedic Surgery Residency.
J Am Acad Orthop Surg. 2020 Oct 1;28(19):814-822. doi: 10.5435/JAAOS-D-19-00423.

6. Use of Standardized Letters of Recommendation for Orthopaedic Surgery Residency Applications: A Single-Institution Retrospective Review.
J Bone Joint Surg Am. 2020 Feb 19;102(4):e14. doi: 10.2106/JBJS.19.00130.

7. Matching in Orthopaedic Surgery.
J Am Acad Orthop Surg. 2020 Feb 15;28(4):135-144. doi: 10.5435/JAAOS-D-19-00313.

8. Have personal statements become impersonal? An evaluation of personal statements in anesthesiology residency applications.
J Clin Anesth. 2010 Aug;22(5):346-51. doi: 10.1016/j.jclinane.2009.10.007.

Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.

Authors

Lum Zachary C, Guntupalli Lohitha, Saiz Augustine M, Leshikar Holly, Le Hai V, Meehan John P, Huish Eric G

Affiliations

Department of Surgery, Kiran Patel School of Osteopathic and Allopathic Medicine, Nova Southeastern University, Davie, Florida.

Department of Orthopaedic Surgery, School of Medicine, University of California, Davis, Sacramento, California.

Publication

JB JS Open Access. 2024 Oct 24;9(4). doi: 10.2106/JBJS.OA.24.00028. eCollection 2024 Oct-Dec.

DOI: 10.2106/JBJS.OA.24.00028
PMID: 39450246
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11498924/
Abstract

INTRODUCTION

The potential capabilities of generative artificial intelligence (AI) tools have been relatively unexplored, particularly in the realm of creating personalized statements for medical students applying to residencies. This study aimed to investigate the ability of generative AI, specifically ChatGPT and Google BARD, to generate personal statements and assess whether faculty on residency selection committees could (1) evaluate differences between real and AI statements and (2) determine differences based on 13 defined and specific metrics of a personal statement.

METHODS

Fifteen real personal statements were used to prompt ChatGPT and BARD to generate 15 unique statements each, for a total of 45 statements. The statements were then randomized, blinded, and presented to a group of faculty reviewers serving on residency selection committees. Reviewers assessed each statement on 14 metrics, including whether it was AI-generated or real. All metrics were compared across sources.
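
To make the blinding design concrete, here is a minimal sketch in Python (an illustration only, not the authors' actual protocol; the variable names and fixed seed are assumptions): pool the 15 real, 15 ChatGPT, and 15 BARD statements, hide the source labels from reviewers, and shuffle the presentation order.

import random

# Illustrative sketch of the pooling/blinding step (hypothetical names).
# 15 real + 15 ChatGPT + 15 BARD statements = 45 total.
statements = (
    [{"text": f"real_{i}", "source": "real"} for i in range(15)]
    + [{"text": f"chatgpt_{i}", "source": "chatgpt"} for i in range(15)]
    + [{"text": f"bard_{i}", "source": "bard"} for i in range(15)]
)

rng = random.Random(42)   # fixed seed so the shuffle is reproducible
rng.shuffle(statements)   # randomize presentation order

# Reviewers see only the text; the answer key stays with the study team.
blinded_order = [s["text"] for s in statements]
answer_key = {s["text"]: s["source"] for s in statements}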

RESULTS

Faculty correctly identified 88% (79/90) of real statements, 90% (81/90) of BARD statements, and 44% (40/90) of ChatGPT statements. Accuracy across real and BARD statements was 89%, but this dropped to 74% when ChatGPT statements were included. In addition, accuracy did not increase as faculty members reviewed more personal statements (area under the curve [AUC] 0.498, p = 0.966). BARD scored lower than both real and ChatGPT statements across all metrics (p < 0.001). Comparing real with ChatGPT, most metrics showed no difference, except for Personal Interests, Reasons for Choosing Residency, Career Goals, Compelling Nature, and Originality, all favoring the real personal statements (p = 0.001, p = 0.002, p < 0.001, p < 0.001, and p < 0.001, respectively).
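
The pooled accuracy figures follow directly from the reported counts (assuming 90 judgments per source, as the denominators indicate); a quick arithmetic check:

# Correct identifications out of 90 judgments per source, as reported above.
correct = {"real": 79, "bard": 81, "chatgpt": 40}

real_and_bard = (correct["real"] + correct["bard"]) / 180  # 160/180 ≈ 0.889
all_sources = sum(correct.values()) / 270                  # 200/270 ≈ 0.741
chatgpt_missed = 1 - correct["chatgpt"] / 90               # 50/90  ≈ 0.556

print(f"{real_and_bard:.0%} {all_sources:.0%} {chatgpt_missed:.0%}")
# -> 89% 74% 56%, matching the 89%, 74%, and "deceived 56% of the time" figures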

CONCLUSION

Faculty members accurately identified real and BARD statements, but ChatGPT deceived them 56% of the time. Although AI can craft convincing statements that are sometimes indistinguishable from real ones, replicating the humanistic experience, personal nuances, and individualistic elements found in real personal statements is difficult. Residency selection committees might want to prioritize these particular metrics while assessing personal statements, given the growing capabilities of AI in this arena.

CLINICAL RELEVANCE

Residency selection committees may want to prioritize certain metrics unique to the human element such as personal interests, reasons for choosing residency, career goals, compelling nature, and originality when evaluating personal statements.
