AI Agents in Clinical Medicine: A Systematic Review.

Authors

Gorenshtein Alon, Omar Mahmud, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal

Affiliations

The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA.

The Hasso Plattner Institute for Digital Health at Mount Sinai, Mount Sinai Health System, NY, USA.

Publication information

medRxiv. 2025 Aug 26:2025.08.22.25334232. doi: 10.1101/2025.08.22.25334232.

DOI:10.1101/2025.08.22.25334232
PMID:40909853
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12407621/
Abstract

BACKGROUND

AI agents built on large language models (LLMs) can plan tasks, use external tools, and coordinate with other agents. Unlike standard LLMs, agents can execute multi-step processes, access real-time clinical information, and integrate multiple data sources. There has been interest in using such agents for clinical and administrative tasks; however, little is known about their performance or about whether multi-agent systems function better than a single agent for healthcare tasks.

PURPOSE

To evaluate the performance of AI agents in healthcare, compare AI agent systems with standard LLMs, and catalog the tools used for task completion.

DATA SOURCES

PubMed, Web of Science, and Scopus from October 1, 2022, through August 5, 2025.

STUDY SELECTION

Peer-reviewed studies implementing AI agents for clinical tasks with quantitative performance comparisons.

DATA EXTRACTION

Two reviewers (A.G., M.O.) independently extracted data on architectures, performance metrics, and clinical applications. Discrepancies were resolved by discussion, with a third reviewer (E.K.) consulted when consensus could not be reached.

DATA SYNTHESIS

Twenty studies met inclusion criteria. Across studies, all agent systems outperformed their baseline LLMs in accuracy. Improvements ranged from small gains to increases of over 60 percentage points, with a median improvement of 53 percentage points in single-agent tool-calling studies. These systems were particularly effective for discrete tasks such as medication dosing and evidence retrieval. Multi-agent systems showed optimal performance with up to 5 agents, and their effectiveness was particularly pronounced for highly complex tasks. The highest performance boost occurred when the complexity of the AI agent framework aligned with that of the task.
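
To make the unit concrete: a gain in percentage points is the absolute difference between two accuracies, not a relative change. The sketch below illustrates that arithmetic and the median across studies. The accuracy pairs and the helper names `percentage_point_gain` and `relative_gain_percent` are hypothetical and are not taken from the review.

```python
from statistics import median


def percentage_point_gain(baseline_acc: float, agent_acc: float) -> float:
    """Absolute accuracy difference in percentage points (inputs given in %)."""
    return agent_acc - baseline_acc


def relative_gain_percent(baseline_acc: float, agent_acc: float) -> float:
    """Relative change versus baseline, in percent (shown for contrast only)."""
    return (agent_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical (baseline %, agent %) pairs; NOT data from the included studies.
hypothetical_studies = [(30.0, 85.0), (40.0, 60.0), (62.0, 70.0)]

gains = [percentage_point_gain(b, a) for b, a in hypothetical_studies]
median_gain = median(gains)

print("percentage-point gains:", gains)
print("median gain (pp):", median_gain)
print("relative gain for 40% -> 60%:", relative_gain_percent(40.0, 60.0), "%")
```

In this made-up example, moving from 40% to 60% accuracy is a 20 percentage-point gain but a 50% relative improvement, which is why the two units should not be read interchangeably.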

LIMITATIONS

Heterogeneous outcomes precluded quantitative meta-analysis. Several studies relied on synthetic data, limiting generalizability.

CONCLUSIONS

AI agents consistently improve the clinical task performance of base LLMs when architecture matches task complexity. Our analysis indicates a step-change over base LLMs, with AI agents opening previously inaccessible domains. Future efforts should be based on prospective, multi-center trials using real-world data to determine safety, task matching, and cost-effectiveness.

PRIMARY FUNDING SOURCE

This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

REGISTRATION

PROSPERO CRD420251120318.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba75/12407621/98145371aa14/nihpp-2025.08.22.25334232v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba75/12407621/bfacd25ee8e7/nihpp-2025.08.22.25334232v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba75/12407621/8a6912127c3b/nihpp-2025.08.22.25334232v1-f0003.jpg