Suppr 超能文献

Performance of single-agent and multi-agent language models in Spanish language medical competency exams.

Authors

Altermatt Fernando R, Neyem Andres, Sumonte Nicolas, Mendoza Marcelo, Villagran Ignacio, Lacassie Hector J

Affiliations

Division of Anesthesiology, School of Medicine, Pontificia Universidad Católica de Chile, Marcoleta 377, 8320000, Santiago, RM, Chile.

Department of Computer Science, School of Engineering, Pontificia Universidad Católica de Chile, Vicuña Mackenna 6840, 7820436, Santiago, RM, Chile.

Publication

BMC Med Educ. 2025 May 7;25(1):666. doi: 10.1186/s12909-025-07250-3.

DOI: 10.1186/s12909-025-07250-3
PMID: 40336004
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12057199/
Abstract

BACKGROUND

Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties.

METHODS

GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal-Wallis and Mann-Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed.
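The multi-agent aggregation schemes named above (Voting, Weighted Voting, Borda Count) are standard voting rules over the agents' answer preferences. As an illustration only, a minimal Borda count aggregator over hypothetical agent rankings might look like this (the option labels and rankings are invented for the example, not taken from the study):

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate agent rankings with the Borda count rule.

    rankings: list of lists; each inner list orders the answer options
    from most to least preferred by one agent.
    Returns the option with the highest total Borda score.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            # The top choice earns n-1 points, the last choice 0.
            scores[option] += n - 1 - position
    return max(scores, key=scores.get)

# Three hypothetical agents rank the answer options A-D:
agents = [
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "C", "A", "D"],
]
print(borda_count(agents))  # → B (total score 8 vs. A's 6)
```

Plain majority voting only counts each agent's top pick; Borda count also credits second and lower preferences, which can break ties and reward broadly acceptable options.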

RESULTS

MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F2,54 = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks.
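The pairwise strategy comparisons behind the reported p-values use the Mann-Whitney U test named in the methods. As a sketch of what that statistic measures, here is a minimal pure-Python version applied to made-up accuracy samples (the numbers are illustrative, not the study's data):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for two independent samples.

    Counts, over all (x, y) pairs, how often x exceeds y;
    ties contribute 0.5. Large or small U (relative to
    len(xs) * len(ys) / 2) indicates the samples differ.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical per-run accuracies for two strategies:
strategy_a = [89.2, 90.1, 89.7]
strategy_b = [87.5, 88.3, 88.0]
print(mann_whitney_u(strategy_a, strategy_b))  # → 9.0 (every pair favors A)
```

In practice one would use a library routine (e.g. a statistics package's implementation) to also obtain the p-value; the loop above only shows how the statistic itself is defined.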

CONCLUSIONS AND RELEVANCE

Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o's performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.

Figures (PMC):
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/b388b37a2fe5/12909_2025_7250_Fig1_HTML.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/a01298083a7e/12909_2025_7250_Fig2_HTML.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/ef98ea7cdc49/12909_2025_7250_Fig3_HTML.jpg

Similar Articles

1
Performance of single-agent and multi-agent language models in Spanish language medical competency exams.
BMC Med Educ. 2025 May 7;25(1):666. doi: 10.1186/s12909-025-07250-3.
2
Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.
Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.
3
ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.
JMIR Med Educ. 2024 Nov 6;10:e63430. doi: 10.2196/63430.
4
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
5
Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment.
Acad Radiol. 2024 Nov;31(11):4365-4371. doi: 10.1016/j.acra.2024.09.005. Epub 2024 Sep 18.
6
GPT-4o's competency in answering the simulated written European Board of Interventional Radiology exam compared to a medical student and experts in Germany and its ability to generate exam items on interventional radiology: a descriptive study.
J Educ Eval Health Prof. 2024;21:21. doi: 10.3352/jeehp.2024.21.21. Epub 2024 Aug 20.
7
Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
8
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
9
An Evaluation of the Performance of OpenAI-o1 and GPT-4o in the Japanese National Examination for Physical Therapists.
Cureus. 2025 Jan 6;17(1):e76989. doi: 10.7759/cureus.76989. eCollection 2025 Jan.
10
Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
J Nucl Cardiol. 2025 Mar;45:102089. doi: 10.1016/j.nuclcard.2024.102089. Epub 2024 Nov 29.

Cited By

1
AI Agents in Clinical Medicine: A Systematic Review.
medRxiv. 2025 Aug 26:2025.08.22.25334232. doi: 10.1101/2025.08.22.25334232.

References

1
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.
Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:22315-22339. doi: 10.18653/v1/2024.emnlp-main.1245.
2
Artificial Intelligence for Language Translation: The Equity Is in the Details.
JAMA. 2024 Nov 5;332(17):1427-1428. doi: 10.1001/jama.2024.15296.
3
Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
4
Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students.
JMIR Med Educ. 2024 Apr 10;10:e52483. doi: 10.2196/52483.
5
Large language models for generating medical examinations: systematic review.
BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.
6
Can large language models reason about medical questions?
Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.
7
Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data.
iScience. 2024 Jan 24;27(2):109022. doi: 10.1016/j.isci.2024.109022. eCollection 2024 Feb 16.
8
Comparison of large language models in management advice for melanoma: Google's AI BARD, BingAI and ChatGPT.
Skin Health Dis. 2023 Nov 28;4(1):e313. doi: 10.1002/ski2.313. eCollection 2024 Feb.
9
Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine.
Clin Pract. 2023 Nov 20;13(6):1460-1487. doi: 10.3390/clinpract13060130.
10
Enhancing Medical Spanish Education and Proficiency to Bridge Healthcare Disparities: A Comprehensive Assessment and Call to Action.
Cureus. 2023 Nov 8;15(11):e48512. doi: 10.7759/cureus.48512. eCollection 2023 Nov.