Sawamura Shogo, Kohiyama Kengo, Takenaka Takahiro, Sera Tatsuya, Inoue Tadatoshi, Nagai Takashi
Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN.
Cureus. 2025 Jan 6;17(1):e76989. doi: 10.7759/cureus.76989. eCollection 2025 Jan.
Background and objective: Recent advancements in large language models (LLMs) have expanded their applications in medical and healthcare settings. LLMs have demonstrated high performance in various national examinations for healthcare professionals. Open Artificial Intelligence Model Version 1 (OpenAI-o1) attained remarkable accuracy in the Japanese National Examination for Medical Practitioners, whereas Generative Pre-trained Transformer Model Version 4 (GPT-4o) has excelled in image-based tasks, suggesting a complementary relationship between the two models. However, their performance in the field of physical therapy, particularly on the Japanese national examination, remains poorly understood. This study aimed to assess the performance of OpenAI-o1 and GPT-4o on the 59th Japanese National Examination for Physical Therapists (JNEPT), administered in 2024.

Methods: A total of 168 text-only questions were administered to OpenAI-o1, and 23 image-based questions were given to GPT-4o, in a zero-shot prompting format. Accuracy was evaluated by comparing the model outputs with the official correct answers released by the Ministry of Health, Labor, and Welfare. Two faculty members specializing in the National Examination for Physical Therapists reviewed all generated explanations for accuracy.

Results: OpenAI-o1 achieved a correctness rate of 97.0% (163/168 questions) and an explanation accuracy of 86.4% (146/168). In contrast, GPT-4o attained a correctness rate of 56.5% (13/23 questions) and an explanation accuracy of 52.2% (12/23). OpenAI-o1's primary explanatory errors involved outdated or incorrect knowledge (13 questions), overly simplified discussions (six questions), and misinterpretation of question intent (three questions). GPT-4o's most common error type was misinterpretation of a question's intent due to difficulties in image analysis (eight questions), along with three instances of knowledge-level inaccuracies.

Conclusions: OpenAI-o1 exhibited high accuracy and solid explanatory quality, indicating strong adaptability to both general and specialized content in physical therapy, and showed potential utility in medical education and remote healthcare support. GPT-4o, while showing enhanced multimodal capabilities compared with previous models, requires further optimization in image-based reasoning and domain-specific training. These findings underscore the promising role of LLMs in healthcare and medical education while highlighting the importance of ongoing refinement to meet the rigorous demands of clinical and educational environments.
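The methods describe a straightforward zero-shot evaluation protocol: each question is sent to the model without worked examples, the returned option is scored against the official answer key, and the accompanying explanation is reviewed by faculty. The sketch below illustrates such a pipeline for the text-only questions only, assuming the OpenAI Python client and a hypothetical question/answer data structure; the paper does not report its exact prompts or tooling, and the image-based GPT-4o questions would additionally require image inputs, which this sketch omits.

```python
# Minimal sketch of a zero-shot exam-evaluation pipeline (an assumed setup, not the authors' code).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical data structure: each item holds the question text and the official answer key.
questions = [
    {"id": 1,
     "text": "Which muscle is the primary hip flexor? 1. ... 2. ... 3. ... 4. ... 5. ...",
     "official_answer": "3"},
    # ... remaining text-only questions ...
]

def ask_zero_shot(question_text: str, model: str = "o1") -> str:
    """Send a single question with no examples (zero-shot) and return the raw model reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Answer the following multiple-choice question with the option number "
                        "only, then briefly explain your reasoning.\n\n" + question_text),
        }],
    )
    return response.choices[0].message.content

correct = 0
for q in questions:
    output = ask_zero_shot(q["text"])
    # Crude automatic scoring: does the reply start with the official option number?
    # In the study, correctness was judged against the official key, and explanation
    # accuracy was assessed manually by two faculty members.
    if output.strip().startswith(q["official_answer"]):
        correct += 1

print(f"Correctness rate: {correct / len(questions):.1%}")
```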