AI Agents in Clinical Medicine: A Systematic Review.
Author information
Gorenshtein Alon, Omar Mahmud, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal
Affiliations
The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA.
The Hasso Plattner Institute for Digital Health at Mount Sinai, Mount Sinai Health System, NY, USA.
Publication information
medRxiv. 2025 Aug 26:2025.08.22.25334232. doi: 10.1101/2025.08.22.25334232.
BACKGROUND
AI agents built on large language models (LLMs) can plan tasks, use external tools, and coordinate with other agents. Unlike standard LLMs, agents can execute multi-step processes, access real-time clinical information, and integrate multiple data sources. There is interest in using such agents for clinical and administrative tasks. However, little is known about their performance or about whether multi-agent systems outperform a single agent on healthcare tasks.
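The plan-then-call-tools pattern described above can be illustrated with a minimal sketch. It is not taken from any study in the review: the planner is a hard-coded stand-in for an LLM, and the names `weight_based_dose`, `stub_planner`, and `run_agent`, along with the input numbers, are hypothetical placeholders chosen only for illustration.

```python
from typing import Callable, Dict, List

def weight_based_dose(weight_kg: float, mg_per_kg: float) -> float:
    """Hypothetical tool: deterministic weight-based dose in mg."""
    return round(weight_kg * mg_per_kg, 1)

# Registry of tools the agent is allowed to call (assumed, illustrative only).
TOOLS: Dict[str, Callable[..., float]] = {"weight_based_dose": weight_based_dose}

def stub_planner(task: dict) -> List[dict]:
    """Stand-in for an LLM planner: maps a task to a list of tool calls."""
    return [{
        "tool": "weight_based_dose",
        "args": {"weight_kg": task["weight_kg"], "mg_per_kg": task["mg_per_kg"]},
    }]

def run_agent(task: dict) -> Dict[str, float]:
    """Execute each planned step by dispatching to the named tool."""
    results: Dict[str, float] = {}
    for step in stub_planner(task):
        tool = TOOLS[step["tool"]]
        results[step["tool"]] = tool(**step["args"])
    return results

# Arbitrary placeholder inputs, not clinical values.
result = run_agent({"weight_kg": 70.0, "mg_per_kg": 15.0})
print(result)
```

The point of the sketch is the division of labor: the model decides which tool to call and with what arguments, while the arithmetic itself is delegated to deterministic code.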
PURPOSE
To evaluate the performance of AI agents in healthcare, compare AI agent systems with standard LLMs, and catalog the tools used for task completion.
DATA SOURCES
PubMed, Web of Science, and Scopus from October 1, 2022, through August 5, 2025.
STUDY SELECTION
Peer-reviewed studies implementing AI agents for clinical tasks with quantitative performance comparisons.
DATA EXTRACTION
Two reviewers (A.G., M.O.) independently extracted data on architectures, performance metrics, and clinical applications. Discrepancies were resolved by discussion, with a third reviewer (E.K.) consulted when consensus could not be reached.
DATA SYNTHESIS
Twenty studies met inclusion criteria. Across studies, all agent systems outperformed their baseline LLMs in accuracy. Improvements ranged from small gains to increases of over 60 percentage points, with a median improvement of 53 percentage points in single-agent tool-calling studies. These systems were particularly effective for discrete tasks such as medication dosing and evidence retrieval. Multi-agent systems performed best with up to 5 agents, and their advantage was most pronounced on highly complex tasks. The largest performance gains occurred when the complexity of the AI agent framework matched that of the task.
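As a reading aid, the sketch below shows how a percentage-point improvement and its median are computed, and how they differ from a relative improvement. The accuracy pairs are hypothetical and are not data from the review.

```python
from statistics import median

# Hypothetical (baseline_accuracy, agent_accuracy) pairs in percent.
# These are illustrative values only, NOT results extracted from the review.
pairs = [(40.0, 92.0), (55.0, 60.0), (30.0, 91.0)]

# Percentage-point gain: simple difference between the two accuracies.
gains = [agent - base for base, agent in pairs]

# Relative gain: the same difference expressed as a percent of the baseline.
relative_gains = [100.0 * (agent - base) / base for base, agent in pairs]

median_gain = median(gains)
print(gains, median_gain)
```

With these placeholder values, a move from 40% to 92% accuracy is a gain of 52 percentage points but a relative gain of 130%, which is why the two measures should not be used interchangeably.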
LIMITATIONS
Heterogeneous outcomes precluded quantitative meta-analysis. Several studies relied on synthetic data, limiting generalizability.
CONCLUSIONS
AI agents consistently improve the clinical task performance of base LLMs when the architecture matches task complexity. Our analysis indicates a step-change over base LLMs, with AI agents opening previously inaccessible domains. Future work should rest on prospective, multi-center trials using real-world data to determine safety, task matching, and cost-effectiveness.
PRIMARY FUNDING SOURCE
This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
REGISTRATION
PROSPERO CRD420251120318.