Department of Biomedical Engineering, University of Rochester, Rochester, New York.
School of Medicine and Dentistry, University of Rochester, Rochester, New York.
J Surg Res. 2024 Sep;301:504-511. doi: 10.1016/j.jss.2024.06.020. Epub 2024 Jul 22.
Large language models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly used in academic writing, and faculty may consider the use of artificial intelligence (AI)-generated responses a form of cheating. We sought to determine whether general surgery residency faculty could distinguish AI-generated from human-written responses to a text prompt, hypothesizing that faculty would not be able to reliably differentiate between them.
Ten essays were generated from the text prompt "Tell us in 1-2 paragraphs why you are considering the University of Rochester for General Surgery residency" (current trainees: n = 5, ChatGPT: n = 5). Ten blinded faculty reviewers rated each essay on a ten-point Likert scale for desire to interview, relevance to the general surgery residency, and overall impression, and judged whether each essay was AI- or human-generated; scores and identification error rates were compared between groups.
There were no differences between groups in the percentage of total points (ChatGPT 66.0 ± 13.5%, human 70.0 ± 23.0%, P = 0.508) or in identification error rates (ChatGPT 40.0 ± 35.0%, human 20.0 ± 30.0%, P = 0.175). All but one essay were misidentified by at least two reviewers. Essays identified as human-generated received higher overall impression scores (area under the curve: 0.82 ± 0.04, P < 0.01).
Whether the use of AI tools for academic purposes constitutes academic dishonesty remains controversial. We demonstrate that human- and AI-generated essays are similar in quality, yet essays presumed to be AI-generated are rated less favorably. Because faculty cannot reliably differentiate human from AI-generated essays, this bias may be misdirected. AI tools are becoming ubiquitous and their use is not easily detected; faculty must expect these tools to play an increasing role in medical education.