Wu Da, Wang Zhanliang, Nguyen Quan, Xu Zhuoran, Wang Kai
Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.
Applied Mathematics and Computational Science Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA.
ArXiv. 2025 May 9:arXiv:2505.05736v1.
The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from high-quality multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learned from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate MINT's effectiveness through two key applications: (1) rare genetic disease prediction from text, where MINT uses a multimodal encoder model trained on facial photos and clinical notes to generate a preference dataset for aligning a lightweight text-only decoder LLM (Llama 3.2-3B-Instruct). Despite relying on text input only, the MINT-derived model outperforms models trained with Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), or Direct Preference Optimization (DPO), and even outperforms a much larger foundation model (Llama 3.1-405B-Instruct). (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, leveraging knowledge learned from both text and histopathological images to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification.
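The ORPO backbone named above combines a standard supervised (NLL) term on the preferred response with a log-odds-ratio penalty that pushes the model away from the dispreferred one. As an illustration only (not the authors' code), a minimal sketch of a per-example ORPO-style loss, assuming length-normalized sequence log-probabilities and a hypothetical weight `lam`:

```python
import math

def seq_logprob(token_logprobs):
    # Length-normalized log-probability of a response, as used in ORPO.
    return sum(token_logprobs) / len(token_logprobs)

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of the odds-ratio preference loss.

    chosen_logps / rejected_logps: per-token log-probs under the policy
    for the preferred and dispreferred responses (a preference pair,
    e.g. one generated by the upstream multimodal model).
    lam: hypothetical weight on the odds-ratio term vs. the NLL term.
    """
    logp_w = seq_logprob(chosen_logps)
    logp_l = seq_logprob(rejected_logps)
    # log odds(y) = log p - log(1 - p); log1p keeps this numerically stable.
    log_odds_w = logp_w - math.log1p(-math.exp(logp_w))
    log_odds_l = logp_l - math.log1p(-math.exp(logp_l))
    log_odds_ratio = log_odds_w - log_odds_l
    # -log sigmoid(ratio): small when the chosen answer is far more likely.
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
    l_sft = -logp_w  # ordinary NLL on the preferred response
    return l_sft + lam * l_or
```

With `lam=0` this reduces to plain supervised fine-tuning on the chosen response; the odds-ratio term is what transfers the upstream model's preferences into the unimodal LLM.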
In summary, MINT provides an effective strategy for aligning unimodal LLMs with high-quality multimodal expertise through preference optimization. Our study also highlights a hybrid strategy that grafts the strengths of encoder models on classification tasks onto large decoder models to enhance reasoning, improve predictive performance, and reduce hallucination in biomedical applications.