Suppr 超能文献

Performance of single-agent and multi-agent language models in Spanish language medical competency exams.

Authors

Altermatt Fernando R, Neyem Andres, Sumonte Nicolas, Mendoza Marcelo, Villagran Ignacio, Lacassie Hector J

Affiliations

Division of Anesthesiology, School of Medicine, Pontificia Universidad Católica de Chile, Marcoleta 377, 8320000, Santiago, RM, Chile.

Department of Computer Science, School of Engineering, Pontificia Universidad Católica de Chile, Vicuña Mackenna 6840, 7820436, Santiago, RM, Chile.

Publication

BMC Med Educ. 2025 May 7;25(1):666. doi: 10.1186/s12909-025-07250-3.

DOI: 10.1186/s12909-025-07250-3
PMID: 40336004
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12057199/
Abstract

BACKGROUND

Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties.

METHODS

GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal-Wallis and Mann-Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed.
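The multi-agent aggregation schemes named above (Voting, Weighted Voting, Borda Count) are standard voting rules over the agents' answer preferences. As an illustration only, a minimal Borda count aggregator over hypothetical agent rankings might look like this (the option labels and rankings are invented for the example, not taken from the study):

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate agent rankings with the Borda count rule.

    rankings: list of lists; each inner list orders the answer options
    from most to least preferred by one agent.
    Returns the option with the highest total Borda score.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            # The top choice earns n-1 points, the last choice 0.
            scores[option] += n - 1 - position
    return max(scores, key=scores.get)

# Three hypothetical agents rank the answer options A-D:
agents = [
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "C", "A", "D"],
]
print(borda_count(agents))  # → B (total score 8 vs. A's 6)
```

Plain majority voting only counts each agent's top pick; Borda count also credits second and lower preferences, which can break ties and reward broadly acceptable options.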

RESULTS

MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F2,54 = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks.
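The pairwise strategy comparisons behind the reported p-values use the Mann-Whitney U test named in the methods. As a sketch of what that statistic measures, here is a minimal pure-Python version applied to made-up accuracy samples (the numbers are illustrative, not the study's data):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for two independent samples.

    Counts, over all (x, y) pairs, how often x exceeds y;
    ties contribute 0.5. Large or small U (relative to
    len(xs) * len(ys) / 2) indicates the samples differ.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical per-run accuracies for two strategies:
strategy_a = [89.2, 90.1, 89.7]
strategy_b = [87.5, 88.3, 88.0]
print(mann_whitney_u(strategy_a, strategy_b))  # → 9.0 (every pair favors A)
```

In practice one would use a library routine (e.g. a statistics package's implementation) to also obtain the p-value; the loop above only shows how the statistic itself is defined.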

CONCLUSIONS AND RELEVANCE

Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o's performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.

Figures (PMC):
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/b388b37a2fe5/12909_2025_7250_Fig1_HTML.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/a01298083a7e/12909_2025_7250_Fig2_HTML.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d109/12057199/ef98ea7cdc49/12909_2025_7250_Fig3_HTML.jpg

Similar Articles

1
Performance of single-agent and multi-agent language models in Spanish language medical competency exams.
BMC Med Educ. 2025 May 7;25(1):666. doi: 10.1186/s12909-025-07250-3.
2
Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.
Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.
3
ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.
JMIR Med Educ. 2024 Nov 6;10:e63430. doi: 10.2196/63430.
4
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
5
Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment.
Acad Radiol. 2024 Nov;31(11):4365-4371. doi: 10.1016/j.acra.2024.09.005. Epub 2024 Sep 18.
6
GPT-4o's competency in answering the simulated written European Board of Interventional Radiology exam compared to a medical student and experts in Germany and its ability to generate exam items on interventional radiology: a descriptive study.
J Educ Eval Health Prof. 2024;21:21. doi: 10.3352/jeehp.2024.21.21. Epub 2024 Aug 20.
7
Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
8
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
9
An Evaluation of the Performance of OpenAI-o1 and GPT-4o in the Japanese National Examination for Physical Therapists.
Cureus. 2025 Jan 6;17(1):e76989. doi: 10.7759/cureus.76989. eCollection 2025 Jan.
10
Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
J Nucl Cardiol. 2025 Mar;45:102089. doi: 10.1016/j.nuclcard.2024.102089. Epub 2024 Nov 29.

Cited By

1
AI Agents in Clinical Medicine: A Systematic Review.
medRxiv. 2025 Aug 26:2025.08.22.25334232. doi: 10.1101/2025.08.22.25334232.

References

1
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.
Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:22315-22339. doi: 10.18653/v1/2024.emnlp-main.1245.
2
Artificial Intelligence for Language Translation: The Equity Is in the Details.
JAMA. 2024 Nov 5;332(17):1427-1428. doi: 10.1001/jama.2024.15296.
3
Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
4
Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students.
JMIR Med Educ. 2024 Apr 10;10:e52483. doi: 10.2196/52483.
5
Large language models for generating medical examinations: systematic review.
BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.
6
Can large language models reason about medical questions?
Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.
7
Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data.
iScience. 2024 Jan 24;27(2):109022. doi: 10.1016/j.isci.2024.109022. eCollection 2024 Feb 16.
8
Comparison of large language models in management advice for melanoma: Google's AI BARD, BingAI and ChatGPT.
Skin Health Dis. 2023 Nov 28;4(1):e313. doi: 10.1002/ski2.313. eCollection 2024 Feb.
9
Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine.
Clin Pract. 2023 Nov 20;13(6):1460-1487. doi: 10.3390/clinpract13060130.
10
Enhancing Medical Spanish Education and Proficiency to Bridge Healthcare Disparities: A Comprehensive Assessment and Call to Action.
Cureus. 2023 Nov 8;15(11):e48512. doi: 10.7759/cureus.48512. eCollection 2023 Nov.