Evaluating the effectiveness of advanced large language models in medical knowledge: A comparative study using Japanese national medical examination.

Author information

Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Publication information

Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.

DOI:10.1016/j.ijmedinf.2024.105673

PMID:39471700

Abstract

OBJECTIVES

This study evaluates the accuracy of the medical knowledge of the most advanced large language models (LLMs) as of 2024: GPT-4o, GPT-4, Gemini 1.5 Pro, and Claude 3 Opus. It is the first to evaluate these LLMs using a non-English medical licensing exam. The findings are intended to guide educators, policymakers, and technical experts in the effective use of AI in medical education and clinical diagnosis.

METHOD

The authors entered 790 questions from the Japanese National Medical Examination into the chat windows of the LLMs to obtain responses. Two authors independently assessed the correctness of each response. The authors analyzed the overall accuracy rates of the LLMs and compared their performance on image and non-image questions, questions of varying difficulty levels, general and clinical questions, and questions from different medical specialties. They also examined the correlation between the number of publications in each medical specialty and the LLMs' performance in that specialty.

RESULTS

GPT-4o achieved the highest accuracy rate, 89.2%, and outperformed the other LLMs both overall and in each specific category. All four LLMs performed better on non-image questions than on image questions, with an accuracy gap of 10%. They also performed better on easy questions than on normal and difficult ones. GPT-4o achieved a 95.0% accuracy rate on easy questions, marking it as an effective knowledge source for medical education. All four LLMs performed worst in the "Gastroenterology and Hepatology" specialty. There was a positive correlation between the number of publications and LLM performance across specialties.

CONCLUSIONS

GPT-4o achieved an overall accuracy rate close to 90%, with 95.0% on easy questions, significantly outperforming the other LLMs. This indicates GPT-4o's potential as a knowledge source for easy questions. Image-based questions and question difficulty significantly impact LLM accuracy. "Gastroenterology and Hepatology" is the specialty with the lowest performance. The LLMs' performance across medical specialties correlates positively with the number of related publications.
