Ishikawa Yu, Higashi Akitaka, Arai Nozomu, Ozo Daisuke, Hasegawa Wataru, Imamura Tetsuya, Matsumoto Zenbei, Nambo Hidetaka, Karashima Shigehiro
Department of Health Promotion and Medicine of the Future, Kanazawa University Graduate School of Medical Sciences, Kanazawa, Japan.
Emerging Media Initiative, Kanazawa University, Kanazawa, Japan.
Endocr J. 2025 Sep 11. doi: 10.1507/endocrj.EJ25-0201.
GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models (GPT-4o, GPT-4o-mini, and GPT-4o-RAG) were assessed on 599 publicly available multiple-choice questions from the 2022-2024 national examinations. Each model answered every question five times, and the evaluation was based on these repeated outputs to assess response variability and robustness. For GPT-4o-RAG, a custom pipeline was implemented to retrieve guideline-based documents and integrate them with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term frequency-inverse document frequency (TF-IDF) analysis was conducted to compare word characteristics of correctly and incorrectly answered questions. All three models achieved accuracy rates above the 60% passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), it exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed the other models in the applied and clinical nutrition categories but showed limited performance on numerical questions. TF-IDF analysis suggested that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.
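The abstract describes a repeated-query evaluation in which each model answers every multiple-choice question five times, and accuracy and response consistency are computed over those runs. The sketch below is a minimal illustration of that protocol, not the authors' code: it assumes the OpenAI chat-completions client, a simple question/choices/answer record format, and an all-five-runs-agree definition of consistency; the names `ask_model` and `evaluate` are hypothetical.

```python
# Minimal sketch of a repeated-query evaluation loop (assumed implementation,
# not the study's pipeline). Each question is answered `runs` times per model,
# then mean accuracy across runs and response consistency are reported.

from collections import Counter
from openai import OpenAI  # assumed client; any chat-completion API would work

client = OpenAI()

def ask_model(model: str, question: str, choices: list[str]) -> str:
    """Ask one multiple-choice question and return the model's chosen option."""
    prompt = question + "\n" + "\n".join(
        f"{i + 1}. {c}" for i, c in enumerate(choices)
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with the option number only."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

def evaluate(model: str, questions: list[dict], runs: int = 5):
    """Return (mean accuracy over runs, response consistency).

    Consistency is taken here as the fraction of questions for which all
    `runs` answers agree -- one plausible reading of the metric in the abstract.
    """
    per_run_correct = [0] * runs
    consistent_questions = 0
    for q in questions:
        answers = [ask_model(model, q["question"], q["choices"]) for _ in range(runs)]
        for r, a in enumerate(answers):
            per_run_correct[r] += int(a == q["answer"])
        consistent_questions += int(len(Counter(answers)) == 1)
    accuracies = [c / len(questions) for c in per_run_correct]
    return sum(accuracies) / runs, consistent_questions / len(questions)
```

Under this reading, the reported "accuracy ± variation" corresponds to the spread of per-run accuracies, while consistency captures agreement across the five generations independently of correctness.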