Ng Olivia, Phua Dong Haur, Chu Jowe, Wilding Lucy V E, Mogali Sreenivasulu Reddy, Cleland Jennifer
Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.
Emergency Department, Tan Tock Seng Hospital, Singapore, Singapore.
Med Sci Educ. 2024 Nov 26;35(2):629-632. doi: 10.1007/s40670-024-02232-4. eCollection 2025 Apr.
While large language models (LLMs) are often used to generate and answer exam questions, little work has compared their performance across multiple iterations using item statistics. This study aims to fill that gap by investigating how LLMs answer single-best-answer (SBA) questions and comparing their performance to that of students. Forty-one SBA questions written for first-year medical students were administered to the most easily accessible, free-to-use LLMs, GPT-3.5 and Gemini, across 100 iterations. Both LLMs exhibited more repetitive and clustered answering patterns than students, which can be problematic because it may compound mistakes by repeatedly selecting the same wrong options. Distractor analysis revealed that students were better at weighing multiple options in the SBA format. We found that these free-to-use LLMs are inferior to well-trained students or specialists in handling technical questions. We also highlight concerns about LLMs' contextual interpretation of these items and the need for human oversight in the medical education assessment process.
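The abstract does not specify which item statistic was used to quantify "repetitive and clustered answering patterns." One simple way to operationalize this idea, purely as an illustrative sketch and not necessarily the authors' metric, is the Shannon entropy of the answer distribution for each item across the 100 iterations: a low entropy indicates responses clustered on few options (as reported for the LLMs), while a higher entropy indicates responses spread across options (as one might expect from a student cohort). All response data below are hypothetical.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the answer distribution for one SBA item.

    Low entropy: responses cluster on a few options (repetitive pattern).
    High entropy: responses spread across the available options.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical responses to one SBA item across 100 iterations (LLM)
# and 100 examinees (students); letters are the answer options chosen.
llm_answers = ["B"] * 92 + ["C"] * 8
student_answers = ["A"] * 10 + ["B"] * 55 + ["C"] * 20 + ["D"] * 15

print(f"LLM entropy:     {answer_entropy(llm_answers):.2f} bits")      # ~0.40
print(f"Student entropy: {answer_entropy(student_answers):.2f} bits")  # ~1.68
```

Under this sketch, the LLM's near-zero entropy would flag the clustered pattern the study describes, including cases where the cluster sits on an incorrect option.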