Evaluating large language models as agents in the clinic.

Authors

Mehandru Nikita, Miao Brenda Y, Almaraz Eduardo Rodriguez, Sushil Madhumita, Butte Atul J, Alaa Ahmed

Affiliations

University of California, Berkeley, 2195 Hearst Ave, Warren Hall Suite, 120C, Berkeley, CA, USA.

Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA.

Publication information

NPJ Digit Med. 2024 Apr 3;7(1):84. doi: 10.1038/s41746-024-01083-y.

Abstract

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.
