Chow Julie Chi, Cheng Teng Yun, Chien Tsair-Wei, Chou Willy
Department of Pediatrics, Chi Mei Medical Center, Tainan, Taiwan.
Department of Pediatrics, School of Medicine, College of Medicine, Chung Shan Medical University, Taichung, Taiwan.
JMIR Form Res. 2024 Aug 8;8:e46800. doi: 10.2196/46800.
BACKGROUND: ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, few studies have assessed ChatGPT's competence in answering multiple-choice questions (MCQs) using the KIDMAP of Rasch analysis, a web-based tool for evaluating MCQ performance. OBJECTIVE: This study aims to (1) showcase the utility of the website (Rasch analysis, specifically RaschOnline) and (2) determine the grade achieved by ChatGPT when compared with a normally distributed sample. METHODS: ChatGPT's capability was evaluated using 10 items from the English tests of the 2023 Taiwan college entrance examinations. Under a Rasch model, 300 students with normally distributed abilities were simulated for comparison with ChatGPT's responses. RaschOnline was used to generate 5 visual presentations, namely item difficulties, differential item functioning, item characteristic curves, a Wright map, and a KIDMAP, to address the research objectives. RESULTS: The findings were as follows: (1) the difficulty of the 10 items increased monotonically from easiest to hardest, with logits of -2.43, -1.78, -1.48, -0.64, -0.10, 0.33, 0.59, 1.34, 1.70, and 2.47; (2) evidence of differential item functioning between gender groups was observed for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT's capability was graded A, surpassing grades B to E. CONCLUSIONS: Using RaschOnline, this study provides evidence that ChatGPT achieves a grade of A when compared with a normally distributed sample, exhibiting excellent proficiency in answering MCQs from the English tests of the 2023 Taiwan college entrance examinations.
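The simulation design described in the Methods follows the dichotomous Rasch model, in which the probability that a person with ability theta answers an item of difficulty b correctly is exp(theta - b) / (1 + exp(theta - b)), with both parameters on the logit scale. The sketch below illustrates that design using the item difficulties reported in the Results; the standard-normal ability distribution (mean 0, SD 1) and the random seed are assumptions, since the abstract does not state the simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is arbitrary, for reproducibility only

# Item difficulties reported in the abstract (logits), easiest to hardest.
difficulties = np.array([-2.43, -1.78, -1.48, -0.64, -0.10,
                         0.33, 0.59, 1.34, 1.70, 2.47])

# 300 simulated examinees; the normal(0, 1) ability distribution is an assumption.
abilities = rng.normal(loc=0.0, scale=1.0, size=300)

# Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)),
# computed for every person-item pair via broadcasting.
logits = abilities[:, None] - difficulties[None, :]
prob_correct = 1.0 / (1.0 + np.exp(-logits))

# Draw dichotomous (0/1) responses from those probabilities.
responses = rng.binomial(n=1, p=prob_correct)

# Raw scores give a rough ranking; a full Rasch calibration
# (e.g., in RaschOnline) would estimate abilities from these responses.
raw_scores = responses.sum(axis=1)
print(raw_scores[:10])
```

A response matrix generated this way, augmented with ChatGPT's answer pattern as one additional row, is the kind of input a Rasch calibration tool would use to place ChatGPT among the simulated examinees and assign it a grade.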