Klang Eyal, Omar Mahmud, Raut Ganesh, Agbareia Reem, Timsina Prem, Freeman Robert, Gavin Nicholas, Stump Lisa, Charney Alexander W, Glicksberg Benjamin S, Nadkarni Girish N
The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA.
The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
medRxiv. 2025 Aug 24:2025.08.22.25334049. doi: 10.1101/2025.08.22.25334049.
We tested state-of-the-art large language models (LLMs) in two configurations for clinical-scale workloads: a single agent handling heterogeneous tasks versus an orchestrated multi-agent system that assigns each task to a dedicated worker. Across retrieval, extraction, and dosing-calculation tasks, we varied batch sizes from 5 to 80 to simulate clinical traffic. Multi-agent runs maintained higher accuracy under load (pooled accuracy 90.6% at 5 tasks, 65.3% at 80), while single-agent accuracy fell sharply (73.1% to 16.6%), with significant differences beyond 10 tasks (FDR-adjusted p < 0.01). Multi-agent execution reduced token usage by up to 65-fold and limited latency growth compared with single-agent runs. The design's isolation of tasks prevented context interference and preserved performance across four diverse LLM checkpoints. This is the first evaluation of LLM agent architectures under sustained, mixed-task clinical workloads, showing that lightweight orchestration can deliver accuracy, efficiency, and auditability at operational scale.
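The contrast between the two configurations can be illustrated with a minimal sketch. It is not the authors' implementation: the `Task` type, the `LLMCall` alias, and both function names are hypothetical, and any real model call is replaced by a plain Python callable. It only shows the structural difference the abstract describes: a single agent receives every task in one shared context, while an orchestrator hands each task to a dedicated worker that sees nothing else.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical task record; the field names are assumptions."""
    task_id: int
    kind: str      # e.g. "retrieval", "extraction", or "dosing"
    payload: str   # the task text handed to the model


# Hypothetical stand-in for an LLM call: prompt in, text out.
LLMCall = Callable[[str], str]


def build_single_agent_prompt(tasks: List[Task]) -> str:
    """Single-agent configuration: all tasks share one context."""
    lines = [f"[{t.task_id}] ({t.kind}) {t.payload}" for t in tasks]
    return "Solve all tasks:\n" + "\n".join(lines)


def run_multi_agent(tasks: List[Task],
                    workers: Dict[str, LLMCall]) -> Dict[int, str]:
    """Orchestrated configuration: one isolated call per task.

    Each worker prompt contains only its own task, so the context
    size per call does not grow with the batch size.
    """
    results: Dict[int, str] = {}
    for t in tasks:
        worker = workers[t.kind]  # route by task type
        results[t.task_id] = worker(f"({t.kind}) {t.payload}")
    return results
```

In this sketch the single-agent prompt grows linearly with the batch, whereas each worker prompt stays the size of one task. That is one plausible reading of how task isolation avoids context interference; the reported accuracy and token figures come from the study itself, not from this code.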