
A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

Authors

Gu Zishan, Liu Fenglin, Chen Jiayuan, Yin Changchang, Zhang Ping

Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.

Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.

Publication

Adv Intell Syst. 2025 Aug;7(8):2400840. doi: 10.1002/aisy.202400840. Epub 2025 Feb 5.

Abstract

The adoption of large language models (LLMs) in healthcare has garnered significant research interest, yet their performance remains limited due to a lack of domain-specific knowledge, medical reasoning skills, and their unimodal nature, which restricts them to text-only inputs. To address these limitations, we propose MultiMedRes, a multimodal medical collaborative reasoning framework that simulates human physicians' communication by incorporating a learner agent that proactively acquires information from domain-specific expert models. MultiMedRes addresses medical multimodal reasoning problems through three steps: i) Inquire: the learner agent decomposes a complex medical reasoning problem into multiple domain-specific sub-problems; ii) Interact: the agent engages in iterative "ask-answer" interactions with expert models to obtain domain-specific knowledge; and iii) Integrate: the agent integrates all the acquired domain-specific knowledge to address the medical reasoning problem (e.g., identifying differences in disease levels and abnormality sizes between medical images). We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments show that our zero-shot prediction achieves state-of-the-art performance, surpassing fully supervised methods. This demonstrates that MultiMedRes can offer trustworthy and interpretable assistance to physicians in monitoring the treatment progression of patients, paving the way for effective human-AI interaction and collaboration.
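The three-step Inquire/Interact/Integrate loop described above can be sketched in code. This is a minimal, hypothetical illustration only: the class names, the hard-coded question decomposition, and the canned expert answers are all assumptions, not the authors' actual API (a real system would prompt an LLM for decomposition and query trained expert models such as a medical VQA model).

```python
# Hypothetical sketch of the MultiMedRes three-step loop from the abstract.
# All names (ExpertModel, LearnerAgent, etc.) are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ExpertModel:
    """Stand-in for a domain-specific expert (e.g., an X-ray VQA model)."""
    domain: str
    knowledge: dict  # canned answers keyed by sub-question (for the demo)

    def answer(self, question: str) -> str:
        return self.knowledge.get(question, "unknown")

@dataclass
class LearnerAgent:
    experts: list
    transcript: list = field(default_factory=list)

    def inquire(self, problem: str) -> list:
        # Step i) Inquire: decompose the problem into domain sub-questions.
        # A real system would prompt an LLM; here the split is hard-coded.
        return [f"[{e.domain}] {problem}" for e in self.experts]

    def interact(self, sub_questions: list) -> None:
        # Step ii) Interact: iterative "ask-answer" with each expert,
        # logging every exchange for interpretability.
        for expert, q in zip(self.experts, sub_questions):
            self.transcript.append((q, expert.answer(q)))

    def integrate(self) -> str:
        # Step iii) Integrate: combine the acquired knowledge into one answer.
        facts = "; ".join(a for _, a in self.transcript)
        return f"Answer based on expert findings: {facts}"

# Toy usage: comparing abnormality size between two chest X-rays.
experts = [
    ExpertModel("abnormality", {"[abnormality] size change?": "opacity larger"}),
    ExpertModel("disease", {"[disease] size change?": "edema worsened"}),
]
agent = LearnerAgent(experts)
agent.interact(agent.inquire("size change?"))
print(agent.integrate())
```

The logged transcript of ask-answer exchanges is what makes the final answer inspectable, mirroring the interpretability claim in the abstract.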


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8fa2/12370165/405fb18b2c30/AISY-7-0-g004.jpg
