Zong Hui, Wu Rongrong, Cha Jiaxue, Wang Jiao, Wu Erman, Li Jiakun, Zhou Yi, Zhang Chi, Feng Weizhe, Shen Bairong
Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
Shanghai Key Laboratory of Signaling and Disease Research, School of Life Sciences and Technology, Tongji University, Shanghai, China.
J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.
Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored.
This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education.
A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data processing information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling visualization and analysis of LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement.
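As a rough illustration of the kind of standardized record such a curation workflow could produce, the minimal Python sketch below defines one exam-evaluation entry. The field names (exam_name, country, language, model, accuracy, pass_mark, source_pmid) and the sample values are illustrative assumptions, not the platform's actual schema or data.

```python
# Illustrative sketch of a single curated LLM-on-exam record.
# Field names and values are assumptions for demonstration only.
from dataclasses import dataclass

@dataclass
class ExamRecord:
    exam_name: str    # name of the medical exam as reported in the publication
    country: str      # country where the exam is administered
    language: str     # language of the exam items
    exam_year: int    # year the exam was held
    model: str        # LLM evaluated in the source publication
    accuracy: float   # fraction of items answered correctly (0-1)
    pass_mark: float  # passing threshold reported for the exam (0-1)
    source_pmid: str  # PubMed ID of the curated publication

    @property
    def passed(self) -> bool:
        """Whether the model's accuracy met or exceeded the exam's pass mark."""
        return self.accuracy >= self.pass_mark

# Example of a standardized record (placeholder values only):
record = ExamRecord(
    exam_name="Example Licensing Exam",
    country="United States",
    language="English",
    exam_year=2022,
    model="GPT-4",
    accuracy=0.86,
    pass_mark=0.60,
    source_pmid="00000000",
)
print(record.passed)  # True
```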
A total of 193 articles were included in the final analysis. MedExamLLM comprised information on 16 LLMs across 198 medical exams conducted in 28 countries and 15 languages from 2009 to 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than those of other LLMs. The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts.
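The pass-rate comparison can be thought of as a simple aggregation over curated records: group evaluations by model and compute the share of exams each model passed. The helper below is a hedged sketch of that idea using the ExamRecord sketch above; it is not the platform's code.

```python
# Sketch of aggregating pass rates by model from curated exam records.
from collections import defaultdict

def pass_rate_by_model(records):
    """Return {model: fraction of curated exam evaluations the model passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r.model] += 1
        if r.accuracy >= r.pass_mark:
            passed[r.model] += 1
    return {m: passed[m] / total[m] for m in total}

# Usage (with a list of ExamRecord instances named curated_records):
# rates = pass_rate_by_model(curated_records)
# ranking = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```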
MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence-based knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.