
Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study.

Authors

Kaewboonlert Naritsaret, Poontananggul Jiraphon, Pongsuwan Natthipong, Bhakdisongkhram Gun

Affiliation

Institute of Medicine, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima 30000, Thailand.

Publication

JMIR Med Educ. 2025 Jan 13;11:e58898. doi: 10.2196/58898.


DOI: 10.2196/58898
PMID: 39846415
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11745146/
Abstract

BACKGROUND: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The content a model validates and the answers it produces depend on its training datasets and on how each model is optimized. The accuracy of large language models (LLMs) in basic medical examinations, and the factors related to that accuracy, have also been explored.

OBJECTIVE: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations.

METHODS: We used questions closely aligned with the content and topic distribution of Thailand's Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then input simultaneously into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression, which assessed the effect of each factor on model accuracy, with results reported as odds ratios (ORs).

RESULTS: GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%-92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%-87.80%), GPT-3.5 at 67.02% (95% CI 61.20%-72.48%), and Google Bard at 63.83% (95% CI 57.92%-69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models; the exception was Google Bard, which showed varying correlations.

CONCLUSIONS: GPT-4 and Microsoft Bing demonstrated comparable accuracy, superior to GPT-3.5 and Google Bard, in the domain of basic medical science. The accuracy of these models was significantly influenced by the item's difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts.
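For readers unfamiliar with the item-analysis variables used as predictors, the standard definitions are sketched below. These are the conventional formulas; the paper's exact operationalization is not reproduced here. The difficulty index of an item is the proportion of examinees who answer it correctly,

$$p = \frac{n_{\text{correct}}}{n_{\text{total}}},$$

so a higher $p$ means an easier item. The discrimination index compares the proportion correct in the top- and bottom-scoring groups of examinees,

$$D = p_{\text{upper}} - p_{\text{lower}}.$$

The reported odds ratios come from a logistic model of the form

$$\log\frac{P(\text{correct})}{1 - P(\text{correct})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad \mathrm{OR}_i = e^{\beta_i},$$

so an OR above 1 for the difficulty index means a model is more likely to answer easier (higher-$p$) items correctly.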
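The per-model accuracies are reported with 95% CIs. The paper does not state the interval method here, but a binomial interval such as Wilson's produces this kind of estimate. The sketch below uses a hypothetical question count (n=300), purely for illustration, not the study's data.

```python
# Hedged illustration: Wilson 95% CI for a model's accuracy on n questions.
# The counts below are hypothetical placeholders, not the study's dataset.
from statsmodels.stats.proportion import proportion_confint

correct, total = 267, 300  # hypothetical: ~89% accuracy
low, high = proportion_confint(count=correct, nobs=total,
                               alpha=0.05, method="wilson")
print(f"accuracy={correct/total:.2%}, 95% CI {low:.2%}-{high:.2%}")
```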

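To make the Methods concrete, here is a minimal sketch of a multivariable logistic regression with the covariates named in the abstract, fit with statsmodels on simulated data. Every variable name and value is a hypothetical placeholder, not the study's dataset; the sketch only illustrates how fitted coefficients are turned into the reported ORs.

```python
# Minimal sketch of the multivariable logistic regression described in the
# Methods. All data and variable names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300  # hypothetical number of question-response records

# One row per question; predictors mirror the covariates the abstract names.
df = pd.DataFrame({
    "difficulty_index": rng.uniform(0.2, 0.95, n),  # proportion correct among students
    "discrimination_index": rng.uniform(0.0, 0.6, n),
    "question_length": rng.integers(20, 200, n),    # e.g., word count
    "negative_wording": rng.integers(0, 2, n),      # 1 = negatively worded item
    "clinical_scenario": rng.integers(0, 2, n),     # 1 = vignette-style item
})

# Simulate a binary outcome (LLM answered correctly) that improves with
# easier items, just so the example runs end to end.
logit = -2.0 + 4.0 * df["difficulty_index"]
df["correct"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

fit = smf.logit(
    "correct ~ difficulty_index + discrimination_index + question_length"
    " + negative_wording + clinical_scenario",
    data=df,
).fit(disp=False)

# Odds ratios with 95% CIs: exponentiate coefficients and CI bounds.
ors = pd.DataFrame({
    "OR": np.exp(fit.params),
    "CI_low": np.exp(fit.conf_int()[0]),
    "CI_high": np.exp(fit.conf_int()[1]),
})
print(ors.round(2))
```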

Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5792/11745146/ad618416b3a7/mededu-v11-e58898-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5792/11745146/51c5c13eaaa6/mededu-v11-e58898-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5792/11745146/df7ce79293e4/mededu-v11-e58898-g003.jpg

Similar Articles

[1]
Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study.

JMIR Med Educ. 2025-1-13

[2]
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.

JMIR Form Res. 2024-12-17

[3]
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024-4-13

[4]
Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study.

BMC Med Educ. 2024-11-26

[5]
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.

JMIR Med Educ. 2024-2-21

[6]
Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.

Cureus. 2023-8-4

[7]
Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing.

Cureus. 2023-8-21

[8]
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.

J Med Internet Res. 2024-7-25

[9]
Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023-12-28

[10]
Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination.

J Orthop Surg (Hong Kong). 2025

Cited By

[1]
A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.

Sci Rep. 2025-7-2

[2]
Comparative analysis of language models in addressing syphilis-related queries.

Med Oral Patol Oral Cir Bucal. 2025-7-1

[3]
Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.

Cureus. 2025-4-8
