Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.

Affiliations

Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Publication information

Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.

DOI: 10.1016/j.ijmedinf.2024.105673
PMID: 39471700
Abstract

OBJECTIVES

This study evaluates the accuracy of the medical knowledge of the most advanced LLMs as of 2024 (GPT-4o, GPT-4, Gemini 1.5 Pro, and Claude 3 Opus). According to the authors, it is the first to evaluate these LLMs using a non-English medical licensing exam. The findings are intended to guide educators, policymakers, and technical experts in the effective use of AI in medical education and clinical diagnosis.

METHOD

The authors entered 790 questions from the Japanese National Medical Examination into the chat windows of the LLMs to obtain responses. Two authors independently assessed the correctness of each response. The authors analyzed the overall accuracy rates of the LLMs and compared their performance on image versus non-image questions, questions of varying difficulty, general versus clinical questions, and questions from different medical specialties. They also examined the correlation between the number of publications and the LLMs' performance in different medical specialties.
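The accuracy comparison described in the method can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the record layout, the model and category names, and the helper `accuracy_by` are all hypothetical, and the toy values are not data from the study.

```python
from collections import defaultdict

# Hypothetical graded records: (model, category, is_correct).
# Toy values for illustration only; they are NOT data from the study.
graded = [
    ("model_a", "image", True),
    ("model_a", "image", False),
    ("model_a", "non_image", True),
    ("model_a", "non_image", True),
    ("model_b", "image", False),
    ("model_b", "image", False),
    ("model_b", "non_image", True),
    ("model_b", "non_image", False),
]


def accuracy_by(records, key):
    """Return {group: fraction correct}, grouping each record by key(record)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        correct[group] += int(record[2])  # True counts as 1, False as 0
    return {group: correct[group] / totals[group] for group in totals}


# Overall accuracy per model, then accuracy per (model, category) pair.
overall = accuracy_by(graded, key=lambda r: r[0])
per_category = accuracy_by(graded, key=lambda r: (r[0], r[1]))

for model, acc in sorted(overall.items()):
    print(f"{model}: overall accuracy {acc:.1%}")
```

The same helper could group by difficulty level or specialty by changing only the `key` function, which is why grouping is passed in rather than hard-coded.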

RESULTS

GPT-4o achieved the highest accuracy rate, 89.2%, and outperformed the other LLMs both overall and in every specific category. All four LLMs performed better on non-image questions than on image questions, with a 10% accuracy gap. They also performed better on easy questions than on normal and difficult ones. GPT-4o achieved a 95.0% accuracy rate on easy questions, marking it as an effective knowledge source for medical education. All four LLMs performed worst in the "Gastroenterology and Hepatology" specialty. The number of publications correlated positively with LLM performance across specialties.
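A correlation between publication counts and per-specialty accuracy could be computed along the following lines. The abstract does not say which coefficient the authors used, so the Pearson coefficient here is an assumption, and the function `pearson` and the toy numbers are hypothetical rather than taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-specialty values (NOT data from the study):
# publication counts and one model's accuracy in the same specialties.
publications = [100, 200, 300, 400]
accuracy = [0.70, 0.75, 0.85, 0.90]

r = pearson(publications, accuracy)
print(f"Pearson r = {r:.3f}")
```

A value of `r` near 1 would indicate that specialties with more publications tend to show higher accuracy. With only a handful of specialties, a rank-based coefficient such as Spearman's may be the more robust choice.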

CONCLUSIONS

GPT-4o achieved an overall accuracy rate close to 90%, and 95.0% on easy questions, significantly outperforming the other LLMs. This indicates GPT-4o's potential as a knowledge source for easy questions. Both the presence of images and question difficulty significantly affect LLM accuracy. "Gastroenterology and Hepatology" was the specialty with the lowest performance. The LLMs' performance across medical specialties correlates positively with the number of related publications.
