MD/PhD Scholars Program, University of Nebraska Medical Center, Omaha, Nebraska.
College of Medicine, University of Nebraska Medical Center, Omaha, Nebraska.
Urol Pract. 2023 Jul;10(4):409-415. doi: 10.1097/UPJ.0000000000000406. Epub 2023 Jun 5.
Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We sought to evaluate the performance of ChatGPT on the 2022 American Urological Association Self-assessment Study Program and its potential as an educational adjunct for urology trainees and practicing physicians.
One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining 135 questions were presented to ChatGPT in open-ended and multiple-choice formats. ChatGPT's output was coded as correct, incorrect, or indeterminate; indeterminate responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
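To make the per-question protocol concrete, the sketch below shows how one item might be submitted and scored, assuming programmatic access through OpenAI's chat completions client; the study itself used the ChatGPT web interface, and the model name, the grade_response placeholder, and the retry loop shown here are illustrative assumptions rather than the authors' actual workflow.

    # Minimal sketch of the per-question protocol described above (assumptions noted).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MAX_REGENERATIONS = 2  # indeterminate responses were regenerated up to 2 times

    def ask_once(question: str) -> str:
        # Each attempt starts with no prior messages, mirroring the new session
        # opened for every entry to avoid crossover learning.
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative; the study used the ChatGPT web interface
            messages=[{"role": "user", "content": question}],
        )
        return reply.choices[0].message.content

    def score_question(question: str, grade_response) -> str:
        # grade_response stands in for the 3 independent researchers and
        # 2 physician adjudicators; it returns "correct", "incorrect", or "indeterminate".
        verdict = "indeterminate"
        for _attempt in range(1 + MAX_REGENERATIONS):
            verdict = grade_response(ask_once(question))
            if verdict != "indeterminate":
                break  # stop regenerating once a determinate response is obtained
        return verdict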
ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.1%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were given on the initial output, 8 (22.2%) and 1 (2.6%) on the second output, and 4 (11.1%) and 1 (2.6%) on the final output, respectively. Although regeneration decreased the number of indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.
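As a quick arithmetic check, the fragment below recomputes the reported proportions from the raw counts given above; the one-decimal rounding convention is assumed to match the text.

    # Recompute each reported percentage from its numerator and denominator.
    counts = {
        "open-ended correct": (36, 135),                        # 26.7%
        "multiple-choice correct": (38, 135),                   # 28.1%
        "open-ended indeterminate": (40, 135),                  # 29.6%
        "multiple-choice indeterminate": (4, 135),               # 3.0%
        "open-ended correct on initial output": (24, 36),       # 66.7%
        "multiple-choice correct on initial output": (36, 38),  # 94.7%
    }
    for label, (numerator, denominator) in counts.items():
        print(f"{label}: {100 * numerator / denominator:.1f}%")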
ChatGPT previously demonstrated promise on medical licensing examinations; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. Of greater concern were the persistent justifications given for incorrect responses; left unchecked, use of ChatGPT in medicine may facilitate the spread of medical misinformation.