Kaneda Yudai, Takahashi Ryo, Kaneda Uiri, Akashima Shiori, Okita Haruna, Misaki Sadaya, Yamashiro Akimi, Ozaki Akihiko, Tanimoto Tetsuya
College of Medicine, Hokkaido University, Hokkaido, JPN.
Department of Rehabilitation Medicine, Sonodakai Joint Replacement Center Hospital, Tokyo, JPN.
Cureus. 2023 Aug 3;15(8):e42924. doi: 10.7759/cureus.42924. eCollection 2023 Aug.
Purpose The purpose of this study was to evaluate the changes in capabilities between the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 versions of the large language model ChatGPT in a Japanese medical context.

Methods ChatGPT versions 3.5 and 4 answered questions from the 112th Japanese National Nursing Examination (JNNE). The study comprised three analyses: calculation of correct answer rates and score rates, comparison between GPT-3.5 and GPT-4, and comparison of correct answer rates for conversation questions.

Results ChatGPT versions 3.5 and 4 responded to 237 of the 238 Japanese questions on the 112th JNNE. GPT-3.5 achieved an overall accuracy rate of 59.9% and failed to meet the passing standards, scoring 58.0% on compulsory questions and 58.3% on general/scenario-based questions. GPT-4 achieved an overall accuracy rate of 79.7% and satisfied the passing standards, scoring 90.0% and 77.7%, respectively. GPT-4 showed a higher accuracy rate than GPT-3.5 for every question type: accuracy on compulsory questions improved from 58.0% with GPT-3.5 to 90.0% with GPT-4, on general questions from 64.6% to 75.6%, and on scenario-based questions from 51.7% to 80.0%. For conversation questions, GPT-3.5 had an accuracy rate of 73.3% and GPT-4 had an accuracy rate of 93.3%.

Conclusions The GPT-4 version of ChatGPT performed well enough to pass the JNNE, a substantial improvement over GPT-3.5. This suggests that, with specialized medical training, such models could be beneficial in Japanese clinical settings by aiding decision-making. However, user awareness and training are crucial, given the potential for inaccuracies in ChatGPT's responses. Responsible use, grounded in an understanding of the model's capabilities and limitations, is therefore vital to best support healthcare professionals and patients.
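The correct answer rate and score rate calculations described in the methods amount to tallying graded responses per question category and comparing the resulting rates against passing cutoffs. The following is a minimal sketch, not the authors' code; the grading format, category labels, and threshold values are illustrative placeholders, not the official JNNE cutoffs.

```python
# Minimal sketch: per-category accuracy for model answers graded against an
# answer key. Categories and thresholds below are illustrative assumptions.
from collections import defaultdict

def accuracy_by_category(graded):
    """graded: iterable of (category, is_correct) pairs; returns {category: rate}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in graded:
        total[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

# Hypothetical graded results for one model run; in the study, each ChatGPT
# answer was scored against the official examination answer key.
graded = [
    ("compulsory", True), ("compulsory", True), ("compulsory", False),
    ("general", True), ("scenario", True), ("scenario", False),
]

rates = accuracy_by_category(graded)
print(rates)  # e.g. {'compulsory': 0.67, 'general': 1.0, 'scenario': 0.5}

# Illustrative pass check with placeholder thresholds (not the official cutoffs).
PASS_THRESHOLDS = {"compulsory": 0.80, "general": 0.60, "scenario": 0.60}
for category, rate in rates.items():
    print(category, "pass" if rate >= PASS_THRESHOLDS[category] else "fail")
```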