Assessing the Performance of Chatbots on the Taiwan Psychiatry Licensing Examination Using the Rasch Model.

Authors

Chang Yu, Su Chu-Yun, Liu Yi-Chun

Affiliations

Department of Psychiatry, Changhua Christian Hospital, Changhua 500, Taiwan.

Taichung Municipal Taichung Special Education School for The Hearing Impaired, Taichung 407, Taiwan.

Publication information

Healthcare (Basel). 2024 Nov 18;12(22):2305. doi: 10.3390/healthcare12222305.

DOI:10.3390/healthcare12222305

PMID:39595502

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11594248/

Abstract

BACKGROUND/OBJECTIVES

The potential and limitations of chatbots in medical education and clinical decision support, particularly in specialized fields like psychiatry, remain unknown. Using the Rasch model, our study aimed to evaluate the performance of various state-of-the-art chatbots on psychiatry licensing exam questions to explore their strengths and weaknesses.

METHODS

We assessed the performance of 22 leading chatbots, selected based on LMArena benchmark rankings, using 100 multiple-choice questions from the 2024 Taiwan psychiatry licensing examination, a nationally standardized test required for psychiatric licensure in Taiwan. Chatbot responses were scored for correctness, and we used the Rasch model to evaluate chatbot ability.
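For readers unfamiliar with the Rasch model, the sketch below shows the standard dichotomous form, in which the probability of a correct answer depends only on the difference between examinee ability (theta) and item difficulty (b), both in logits. It is a minimal illustration, not the authors' analysis pipeline: the function names, the Newton-Raphson estimator, and the five-item data are all assumptions made for the example, and item difficulties are treated as already known.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(responses, difficulties, iterations=50, tol=1e-8):
    """Maximum-likelihood ability estimate for one examinee (hypothetical helper).

    responses    -- list of 0/1 item scores
    difficulties -- item difficulties in logits, assumed known
    """
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        # The likelihood has no finite maximum for all-wrong or all-right patterns.
        raise ValueError("ability is not estimable for a zero or perfect score")

    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = raw_score - sum(probs)             # observed minus expected score
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information                 # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    return theta


# Illustrative data only: five items, four answered correctly.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 0, 1]
theta_hat = estimate_ability(responses, difficulties)
print(f"estimated ability: {theta_hat:.2f} logits")
```

At the estimate, the expected score implied by the model equals the observed raw score, which is why the raw score is a sufficient statistic for ability in the Rasch model.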

RESULTS

Chatbots released after February 2024 passed the exam, with ChatGPT-o1-preview achieving the highest score of 85. ChatGPT-o1-preview showed statistically significant superiority in ability (p < 0.001), with a 1.92-logit improvement over the passing threshold. It demonstrated strengths in complex psychiatric problems and ethical understanding, yet it showed limitations in recent legal updates and specialized psychiatry knowledge, such as recent amendments to the Mental Health Act, psychopharmacology, and advanced neuroimaging.
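To give a sense of scale for a 1.92-logit difference, the sketch below converts a logit gap into an odds ratio and into a success probability on a reference item. It assumes the standard Rasch interpretation of logits; the function names and the choice of reference item (one that an examinee at the passing threshold answers correctly half the time) are illustrative assumptions, not figures reported in the study.

```python
import math


def odds_ratio_from_logits(delta):
    """Multiplicative change in the odds of a correct answer for a logit gap."""
    return math.exp(delta)


def success_probability(theta, b):
    """Rasch probability of a correct answer for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


delta = 1.92  # logit gap over the passing threshold reported in the abstract

# Odds of answering any given item correctly are multiplied by exp(delta).
print(round(odds_ratio_from_logits(delta), 2))        # 6.82

# Hypothetical reference item: difficulty equal to the threshold ability,
# so a threshold-level examinee succeeds with probability 0.5.
print(round(success_probability(delta, 0.0), 2))      # 0.87
```

Under these assumptions, an examinee 1.92 logits above the threshold would be expected to answer such a reference item correctly about 87% of the time.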

CONCLUSIONS

Chatbot technology could be a valuable tool for medical education and clinical decision support in psychiatry, and as technology continues to advance, these models are likely to play an increasingly integral role in psychiatric practice.
