Department of Medical Statistics and Informatics, Faculty of Medicine, University of Niš, Niš, Serbia.
Faculty of Medicine, University of Niš, Niš, Serbia.
J Educ Eval Health Prof. 2023;20:28. doi: 10.3352/jeehp.2023.20.28. Epub 2023 Oct 16.
This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool for solving biostatistical problems and to identify potential drawbacks of using ChatGPT in medical education, particularly for practical biostatistical problems.
In this descriptive study, ChatGPT (versions 3.5 and 4) was tested on its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT.
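To make the table-to-text step concrete, the sketch below (hypothetical, not the authors' procedure; the counts and labels are invented for illustration) shows how a contingency table from a textbook problem could be flattened into a plain-text question suitable as ChatGPT input.

```python
# Minimal sketch: flatten a contingency table into a textual question.
# All data and labels here are hypothetical, for illustration only.
def table_to_question(rows, row_labels, col_labels, prompt):
    lines = [prompt, ""]
    for label, counts in zip(row_labels, rows):
        cells = ", ".join(f"{c}: {n}" for c, n in zip(col_labels, counts))
        lines.append(f"{label} -> {cells}")
    return "\n".join(lines)

question = table_to_question(
    rows=[[30, 10], [20, 25]],  # hypothetical counts
    row_labels=["Exposed", "Not exposed"],
    col_labels=["Disease", "No disease"],
    prompt="Using the data below, test whether exposure and disease are associated.",
)
print(question)
```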
GPT-3.5 solved 5 of the practical problems on the first attempt, covering categorical data, a cross-sectional study, measurement of reliability, probability properties, and the t-test. It failed to provide correct answers on analysis of variance, the chi-square test, and sample size within 3 attempts. GPT-4 additionally solved a task on the confidence interval on the first attempt and, with precise guidance and monitoring, solved all questions within 3 attempts.
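Because ChatGPT's statistical output can be wrong, answers of the kind tested here are best verified independently. The following sketch (not from the study; the data are hypothetical and scipy is assumed to be available) shows how a student could check a t-test, a chi-square test, and a confidence interval, three of the problem types mentioned above.

```python
# Independent checks for the problem types tested in the study.
# All data below are hypothetical, chosen only to illustrate the calls.
import numpy as np
from scipy import stats

# Independent-samples t-test (a problem type GPT-3.5 solved)
group_a = np.array([5.1, 4.9, 5.6, 5.2, 4.8])
group_b = np.array([5.9, 6.1, 5.7, 6.3, 5.8])
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")

# Chi-square test on a 2x2 contingency table (a problem type GPT-3.5 failed)
table = np.array([[30, 10], [20, 25]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")

# 95% confidence interval for a mean (solved by GPT-4 on the first attempt)
sample = np.array([7.2, 6.8, 7.5, 7.0, 6.9, 7.3])
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```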
The assessment of both versions of ChatGPT on 10 biostatistical problems showed below-average performance, with correct response rates of 5 and 6 out of 10 on the first attempt for GPT-3.5 and GPT-4, respectively. GPT-4 provided all correct answers within 3 attempts. These findings indicate that ChatGPT can be wrong even when it presents and calculates various statistical analyses; students should be aware of its limitations and exercise caution when incorporating this model into medical education.