Zong Hui, Wu Rongrong, Cha Jiaxue, Wang Jiao, Wu Erman, Li Jiakun, Zhou Yi, Zhang Chi, Feng Weizhe, Shen Bairong
Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
Shanghai Key Laboratory of Signaling and Disease Research, School of Life Sciences and Technology, Tongji University, Shanghai, China.
J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.
Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored.
This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education.
A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data processing information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling visualization and analysis of LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement.
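As a rough illustration of the kind of standardized record such a curation workflow could produce, the minimal Python sketch below defines one exam-evaluation entry. The field names (exam_name, country, language, model, accuracy, pass_mark, source_pmid) and the sample values are illustrative assumptions, not the platform's actual schema or data.

```python
# Illustrative sketch of a single curated LLM-on-exam record.
# Field names and values are assumptions for demonstration only.
from dataclasses import dataclass

@dataclass
class ExamRecord:
    exam_name: str    # name of the medical exam as reported in the publication
    country: str      # country where the exam is administered
    language: str     # language of the exam items
    exam_year: int    # year the exam was held
    model: str        # LLM evaluated in the source publication
    accuracy: float   # fraction of items answered correctly (0-1)
    pass_mark: float  # passing threshold reported for the exam (0-1)
    source_pmid: str  # PubMed ID of the curated publication

    @property
    def passed(self) -> bool:
        """Whether the model's accuracy met or exceeded the exam's pass mark."""
        return self.accuracy >= self.pass_mark

# Example of a standardized record (placeholder values only):
record = ExamRecord(
    exam_name="Example Licensing Exam",
    country="United States",
    language="English",
    exam_year=2022,
    model="GPT-4",
    accuracy=0.86,
    pass_mark=0.60,
    source_pmid="00000000",
)
print(record.passed)  # True
```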
A total of 193 articles were included in the final analysis. MedExamLLM comprised information on 16 LLMs across 198 medical exams conducted in 28 countries and 15 languages from 2009 to 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than those of other LLMs. The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts.
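The pass-rate comparison can be thought of as a simple aggregation over curated records: group evaluations by model and compute the share of exams each model passed. The helper below is a hedged sketch of that idea using the ExamRecord sketch above; it is not the platform's code.

```python
# Sketch of aggregating pass rates by model from curated exam records.
from collections import defaultdict

def pass_rate_by_model(records):
    """Return {model: fraction of curated exam evaluations the model passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r.model] += 1
        if r.accuracy >= r.pass_mark:
            passed[r.model] += 1
    return {m: passed[m] / total[m] for m in total}

# Usage (with a list of ExamRecord instances named curated_records):
# rates = pass_rate_by_model(curated_records)
# ranking = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```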
MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence-based knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.