Mehandru Nikita, Miao Brenda Y, Almaraz Eduardo Rodriguez, Sushil Madhumita, Butte Atul J, Alaa Ahmed
University of California, Berkeley, 2195 Hearst Ave, Warren Hall Suite 120C, Berkeley, CA, USA.
Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA.
NPJ Digit Med. 2024 Apr 3;7(1):84. doi: 10.1038/s41746-024-01083-y.
Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies in which machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in medical settings.