

Assessing the Performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination.

Author Information

Kaneda Yudai, Takahashi Ryo, Kaneda Uiri, Akashima Shiori, Okita Haruna, Misaki Sadaya, Yamashiro Akimi, Ozaki Akihiko, Tanimoto Tetsuya

Affiliations

College of Medicine, Hokkaido University, Hokkaido, JPN.

Department of Rehabilitation Medicine, Sonodakai Joint Replacement Center Hospital, Tokyo, JPN.

Publication Information

Cureus. 2023 Aug 3;15(8):e42924. doi: 10.7759/cureus.42924. eCollection 2023 Aug.

Abstract

Purpose: The purpose of this study was to evaluate the changes in capabilities between the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 versions of the large-scale language model ChatGPT within a Japanese medical context.

Methods: The study involved ChatGPT versions 3.5 and 4 responding to questions from the 112th Japanese National Nursing Examination (JNNE). The study comprised three analyses: correct answer rate and score rate calculations, comparisons between GPT-3.5 and GPT-4, and comparisons of correct answer rates for conversation questions.

Results: ChatGPT versions 3.5 and 4 responded to 237 out of 238 Japanese questions from the 112th JNNE. While GPT-3.5 achieved an overall accuracy rate of 59.9%, failing to meet the passing standards in compulsory and general/scenario-based questions, scoring 58.0% and 58.3%, respectively, GPT-4 had an accuracy rate of 79.7%, satisfying the passing standards by scoring 90.0% and 77.7%, respectively. For each problem type, GPT-4 showed a higher accuracy rate than GPT-3.5. Specifically, the accuracy rates for compulsory questions improved from 58.0% with GPT-3.5 to 90.0% with GPT-4. For general questions, the rates went from 64.6% with GPT-3.5 to 75.6% with GPT-4. In scenario-based questions, the accuracy rates improved substantially from 51.7% with GPT-3.5 to 80.0% with GPT-4. For conversation questions, GPT-3.5 had an accuracy rate of 73.3% and GPT-4 had an accuracy rate of 93.3%.

Conclusions: The GPT-4 version of ChatGPT displayed performance sufficient to pass the JNNE, significantly improving from GPT-3.5. This suggests specialized medical training could make such models beneficial in Japanese clinical settings, aiding decision-making. However, user awareness and training are crucial, given potential inaccuracies in ChatGPT's responses. Hence, responsible usage with an understanding of its capabilities and limitations is vital to best support healthcare professionals and patients.
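The reported figures are simple correct-answer rates: correct responses divided by questions answered, expressed as a percentage. A minimal sketch of that calculation, using hypothetical question tallies chosen only to reproduce two of the reported GPT-4 percentages (the paper itself reports percentages, not raw counts):

```python
def accuracy_rate(num_correct: int, num_answered: int) -> float:
    """Correct-answer rate as a percentage, rounded to one decimal place."""
    return round(100.0 * num_correct / num_answered, 1)

# Hypothetical (correct, answered) tallies per question type; the actual
# per-type counts are not given in the abstract.
results = {
    "compulsory": (45, 50),       # reproduces the reported 90.0%
    "scenario_based": (48, 60),   # reproduces the reported 80.0%
}

for question_type, (correct, answered) in results.items():
    print(f"{question_type}: {accuracy_rate(correct, answered)}%")
```

The overall GPT-3.5 figure is consistent with this definition as well: 142 correct out of the 237 answered questions gives 59.9%.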



Cited By

Performance evaluation of large language models for the national nursing examination in Japan.
Digit Health. 2025 May 27;11:20552076251346571. doi: 10.1177/20552076251346571. eCollection 2025 Jan-Dec.

References

A deep learning system for differential diagnosis of skin diseases.
Nat Med. 2020 Jun;26(6):900-908. doi: 10.1038/s41591-020-0842-3. Epub 2020 May 18.
