

New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology.

Affiliations

MD/PhD Scholars Program, University of Nebraska Medical Center, Omaha, Nebraska.

College of Medicine, University of Nebraska Medical Center, Omaha, Nebraska.

Publication Information

Urol Pract. 2023 Jul;10(4):409-415. doi: 10.1097/UPJ.0000000000000406. Epub 2023 Jun 5.

Abstract

INTRODUCTION

Large language models have demonstrated impressive capabilities, but application to medicine remains unclear. We seek to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.

METHODS

One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open-ended or multiple-choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
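The grading protocol above (a fresh session per item, with up to 2 regenerations when a response is indeterminate) can be sketched in code. This is a hypothetical illustration only: the `ask` and `grade` callables stand in for the study's manual querying and adjudication process and are not part of any published implementation.

```python
# Hypothetical sketch of the study's regeneration protocol:
# each exam item gets a fresh session, and an indeterminate
# response is regenerated up to 2 more times before being
# recorded as indeterminate.

CORRECT, INCORRECT, INDETERMINATE = "correct", "incorrect", "indeterminate"

def grade_question(question, ask, grade, max_regenerations=2):
    """Return (final_code, attempt_number) for one exam item.

    `ask` simulates posing the question in a new session and
    returning ChatGPT's answer; `grade` maps an answer to one of
    the three codes. Both are placeholders for the study's
    manual, human-adjudicated process.
    """
    for attempt in range(1, max_regenerations + 2):  # 1 initial try + 2 retries
        answer = ask(question)            # new session per entry
        result = grade(question, answer)
        if result != INDETERMINATE:
            return result, attempt
    return INDETERMINATE, attempt
```

Note that regeneration only resolves indeterminacy; as the results below report, it did not raise the proportion of correct responses.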

RESULTS

ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated in 40 (29.6%) and 4 (3.0%), respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were on initial output, 8 (22.2%) and 1 (2.6%) on second output, and 4 (11.1%) and 1 (2.6%) on final output, respectively. Although regeneration decreased indeterminate responses, proportion of correct responses did not increase. For open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.

CONCLUSIONS

ChatGPT previously demonstrated promise on medical licensing exams; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. More important were the persistent justifications for incorrect responses; left unchecked, utilization of ChatGPT in medicine may facilitate medical misinformation.

