Suppr超能文献

Performance of large language models on family medicine licensing exams.

作者信息

Omar Mahmud, Hijazi Kareem, Omar Mohammad, Nadkarni Girish N, Klang Eyal

机构信息

The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, United States.

Maccabi Healthcare Services, Tel-Aviv, Israel.

出版信息

Fam Pract. 2025 Jun 4;42(4). doi: 10.1093/fampra/cmaf035.

Abstract

BACKGROUND AND AIM

Large language models (LLMs) have shown promise in specialized medical exams but remain less explored in family medicine and primary care. This study evaluated eight state-of-the-art LLMs on the official Israeli primary care licensing exam, focusing on prompt design and explanation quality.

METHODS

Two hundred multiple-choice questions were tested using simple and few-shot Chain-of-Thought prompts (prompts that include examples which illustrate reasoning). Performance differences were assessed with Cochran's Q and pairwise McNemar tests. A stress test of the top performer (openAI's o1-preview) examined 30 selected questions, with two physicians scoring explanations for accuracy, logic, and hallucinations (extra or fabricated information not supported by the question).

RESULTS

Five models exceeded the 65% passing threshold under simple prompts; seven did so with few-shot prompts. o1-preview reached 85.5%. In the stress test, explanations were generally coherent and accurate, with 5 of 120 flagged for hallucinations. Inter-rater agreement on explanation scoring was high (weighted kappa 0.773; Intraclass Correlation Coefficient (ICC) 0.776).

CONCLUSIONS

Most tested models performed well on an official family medicine exam, especially with structured prompts. Nonetheless, multiple-choice formats cannot address broader clinical competencies such as physical exams and patient rapport. Future efforts should refine these models to eliminate hallucinations, test for socio-demographic biases, and ensure alignment with real-world demands.

摘要

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验