EIM: An effective solution for improving multi-modal large language models.

Authors

Bai Yuting, Su Tonghua, Bai Zixing

Affiliations

School of Software, Harbin Institute of Technology, Harbin, China.

School of Computer Science and Technology, Fudan University, Shanghai, China.

Publication

PLoS One. 2025 Aug 11;20(8):e0329590. doi: 10.1371/journal.pone.0329590. eCollection 2025.

DOI: 10.1371/journal.pone.0329590
PMID: 40788941
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12338808/
Abstract

Enabling large language models (LLMs) to acquire multi-modal capabilities, such as vision-language learning, has become a research hotspot and the next milestone in LLM development since the advent of models like GPT-4. The basic structure of current multi-modal LLMs usually includes three parts: an image encoder for extracting visual features, a semantic space transformation network (ST) for aligning the multi-modal semantic spaces, and an LLM for generating text. Current work on multi-modal LLMs primarily focuses on enhancing performance by using larger image encoders and LLMs and by designing more complex fine-tuning methods and STs, which escalates the number of model parameters. In this paper, we propose EIM, a novel and effective solution for improving the performance of multi-modal large language models from the perspective of the training process, which reduces the need to introduce new parameters or modify the model structure and which remains overlooked and under-explored in current research. EIM includes corresponding improvement measures for the image encoder, the ST, and the LLM. To validate EIM, we first apply it to ClipCap and conduct experiments on the COCO Caption dataset. Second, we extend EIM to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset. Finally, we conduct multi-modal chatbot experiments with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. On the COCO Caption dataset, [Formula: see text], a model that applies EIM to [Formula: see text], shows a 1.75% performance improvement over [Formula: see text], which has 3.13 times as many parameters as [Formula: see text]. The experimental results on the ScienceQA dataset and the MME benchmark show that EIM achieves competitive performance with 7B model parameters when compared to 13B multi-modal LLMs, confirming the effective performance improvement that EIM brings to multi-modal LLMs.
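The three-part structure the abstract describes (image encoder → semantic space transformation network ST → LLM) can be sketched as a minimal data-flow example. This is an illustrative sketch only, not the paper's implementation: the dimensions, the random stand-in encoder, and the single linear layer used as the ST are all hypothetical choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: visual feature width, LLM embedding width,
# number of visual tokens, number of text-prompt tokens.
D_VIS, D_LLM, N_PATCH, N_TEXT = 512, 4096, 16, 8

def image_encoder(image):
    """Stand-in for a frozen vision backbone: maps an image to
    N_PATCH visual feature vectors of width D_VIS."""
    return rng.standard_normal((N_PATCH, D_VIS))

def st_projection(vis_feats, W):
    """The ST sketched as one linear map that aligns visual features
    (width D_VIS) with the LLM's token-embedding space (width D_LLM)."""
    return vis_feats @ W  # shape (N_PATCH, D_LLM)

W_st = rng.standard_normal((D_VIS, D_LLM)) / np.sqrt(D_VIS)
vis_tokens = st_projection(image_encoder(None), W_st)   # (16, 4096)
text_tokens = rng.standard_normal((N_TEXT, D_LLM))      # (8, 4096)

# The LLM then consumes the aligned visual tokens concatenated with
# the text-prompt embeddings as one input sequence.
llm_input = np.concatenate([vis_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (24, 4096)
```

The point of the sketch is that the ST is the only glue between modalities; EIM's claim is that gains can come from how these three parts are trained, rather than from enlarging any of them.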


Figures (pone.0329590.g001-g007):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/5dc458204076/pone.0329590.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/94c8bf7146d1/pone.0329590.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/5d47c304ad36/pone.0329590.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/f50fdb585221/pone.0329590.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/7ecafbd36a86/pone.0329590.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/6fbe5e9f7f0a/pone.0329590.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/bf978c6c2bc7/pone.0329590.g007.jpg

Similar Articles

1. EIM: An effective solution for improving multi-modal large language models. PLoS One. 2025 Aug 11;20(8):e0329590. doi: 10.1371/journal.pone.0329590. eCollection 2025.
2. A dataset and benchmark for hospital course summarization with adapted large language models. J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
3. Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline. J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916.
4. Short-Term Memory Impairment
5. Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study. JMIR Med Inform. 2025 Jun 20;13:e75103. doi: 10.2196/75103.
6. Can open source large language models be used for tumor documentation in Germany? An evaluation on urological doctors' notes. BioData Min. 2025 Jul 24;18(1):48. doi: 10.1186/s13040-025-00463-8.
7. Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study. JMIR Form Res. 2024 Oct 24;8:e58418. doi: 10.2196/58418.
8. Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers. J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
9. Improving Large Language Models' Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation. JMIR Med Inform. 2025 Jul 24;13:e66476. doi: 10.2196/66476.
10. Sexual Harassment and Prevention Training

References Cited

1. Otter: A Multi-Modal Model With In-Context Instruction Tuning. IEEE Trans Pattern Anal Mach Intell. 2025 Sep;47(9):7543-7557. doi: 10.1109/TPAMI.2025.3571946.
2. Biomedical Visual Instruction Tuning with Clinician Preference Alignment. Adv Neural Inf Process Syst. 2024 Dec;37:96449-96467.
3. A survey on multimodal large language models. Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.
4. Neural Prompt Search. IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5268-5280. doi: 10.1109/TPAMI.2024.3435939.
5. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans Pattern Anal Mach Intell. 2017 Apr;39(4):664-676. doi: 10.1109/TPAMI.2016.2598339. Epub 2016 Aug 5.