Kaewboonlert Naritsaret, Poontananggul Jiraphon, Pongsuwan Natthipong, Bhakdisongkhram Gun
Institute of Medicine, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand, 66 44223956.
JMIR Med Educ. 2025 Jan 13;11:e58898. doi: 10.2196/58898.
BACKGROUND: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of AI-generated content and answers depends on the training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and the factors related to their accuracy have also been explored.
OBJECTIVE: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations.
METHODS: We used questions that were closely aligned with the content and topic distribution of Thailand's Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs).
RESULTS: GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%-92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%-87.80%), GPT-3.5 at 67.02% (95% CI 61.20%-72.48%), and Google Bard at 63.83% (95% CI 57.92%-69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations.
CONCLUSIONS: GPT-4 and Microsoft Bing demonstrated comparable accuracy to each other and superior accuracy to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item's difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts.
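The analysis described in METHODS can be sketched in code. The following is a minimal, hypothetical illustration, not the authors' analysis script: it fits a multivariable logistic regression of whether an LLM answered an item correctly on item-level predictors (difficulty index, discrimination index, question length, negative wording, clinical scenario) and reports exponentiated coefficients as odds ratios. The variable names and the synthetic data are assumptions made only for illustration.

# Hypothetical sketch of the item-level logistic regression with odds ratios (not the study's code).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_items = 300

# Synthetic item-level predictors analogous to the variables collected in the study.
items = pd.DataFrame({
    "difficulty_index": rng.uniform(0.2, 0.95, n_items),       # proportion of examinees answering correctly
    "discrimination_index": rng.uniform(-0.1, 0.6, n_items),   # upper-lower group discrimination
    "question_length": rng.integers(20, 200, n_items),         # stem length in words
    "negative_wording": rng.integers(0, 2, n_items),            # 1 = negatively worded stem
    "clinical_scenario": rng.integers(0, 2, n_items),           # 1 = vignette-style item
})

# Simulate LLM correctness driven mainly by item difficulty (easier items answered correctly more often).
logit = -2.0 + 4.0 * items["difficulty_index"]
items["llm_correct"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Multivariable logistic regression; exponentiated coefficients are odds ratios (ORs).
model = smf.logit(
    "llm_correct ~ difficulty_index + discrimination_index + question_length"
    " + negative_wording + clinical_scenario",
    data=items,
).fit(disp=False)

odds_ratios = pd.DataFrame({
    "OR": np.exp(model.params),
    "CI_lower": np.exp(model.conf_int()[0]),
    "CI_upper": np.exp(model.conf_int()[1]),
    "p_value": model.pvalues,
})
print(odds_ratios.round(3))

In this framing, an OR above 1 for difficulty_index would mean that easier items (higher difficulty index, i.e., more examinees answering correctly) are associated with higher odds of a correct LLM response, which matches the direction of the reported findings.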