EIM: An effective solution for improving multi-modal large language models.

Author Information

Bai Yuting, Su Tonghua, Bai Zixing

Institution Information

School of Software, Harbin Institute of Technology, Harbin, China.

School of Computer Science and Technology, Fudan University, Shanghai, China.

Publication Information

PLoS One. 2025 Aug 11;20(8):e0329590. doi: 10.1371/journal.pone.0329590. eCollection 2025.

Abstract

With the advent of models such as GPT-4, endowing large language models (LLMs) with multi-modal capabilities, such as vision-language learning, has become a research hotspot and the next milestone in LLM development. Current multi-modal LLMs typically comprise three parts: an image encoder that extracts visual features, a semantic space transformation network (ST) that aligns the multi-modal semantic spaces, and an LLM that generates text. Most existing work improves performance by adopting larger image encoders and LLMs and by designing more complex fine-tuning methods and STs, which steadily inflates the parameter count. In this paper, we propose EIM, an effective solution that improves multi-modal LLMs from the perspective of the training process, an angle that is largely overlooked in current research, and that avoids introducing new parameters or modifying the model structure. EIM applies corresponding improvement measures to each of the three parts: the image encoder, the ST, and the LLM. To validate EIM, we first apply it to ClipCap and run experiments on the COCO Caption dataset; we then extend it to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset; finally, we build a multi-modal chatbot with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. On COCO Caption, [Formula: see text], the model obtained by applying EIM to [Formula: see text], achieves a 1.75% performance improvement over [Formula: see text], a baseline with 3.13 times as many parameters as [Formula: see text]. On the ScienceQA dataset and the MME benchmark, EIM-enhanced models with 7B parameters achieve performance competitive with 13B multi-modal LLMs, confirming that EIM effectively improves multi-modal LLMs.
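For orientation, here is a minimal PyTorch sketch of the three-part structure the abstract describes (image encoder, ST, LLM). Every name here (`MultiModalLLM`, the MLP-style ST, the stand-in encoder and LM head, all dimensions) is an illustrative assumption, not the paper's implementation; the abstract does not disclose EIM's specific training-process measures, so only the generic architecture is shown.

```python
import torch
import torch.nn as nn

class MultiModalLLM(nn.Module):
    """Minimal three-part multi-modal LLM: image encoder -> ST -> LLM."""

    def __init__(self, image_encoder, llm, visual_dim, llm_dim, num_visual_tokens=4):
        super().__init__()
        self.image_encoder = image_encoder  # visual backbone (e.g. a CLIP ViT), usually frozen
        self.llm = llm                      # text generator, frozen or adapter-tuned
        # ST: a small mapping network projecting one visual feature vector into
        # a short sequence of "visual tokens" in the LLM's embedding space.
        self.st = nn.Sequential(
            nn.Linear(visual_dim, llm_dim * num_visual_tokens),
            nn.GELU(),
            nn.Linear(llm_dim * num_visual_tokens, llm_dim * num_visual_tokens),
        )
        self.num_visual_tokens = num_visual_tokens
        self.llm_dim = llm_dim

    def forward(self, images, text_token_embeds):
        # 1) Image encoder: extract visual features (kept frozen here).
        with torch.no_grad():
            visual_feats = self.image_encoder(images)             # (B, visual_dim)
        # 2) ST: align semantic spaces by mapping features to visual tokens.
        prefix = self.st(visual_feats).view(
            -1, self.num_visual_tokens, self.llm_dim)             # (B, k, llm_dim)
        # 3) LLM: generate text conditioned on the visual prefix plus the text
        #    embeddings (a real LLM would consume these as inputs_embeds).
        return self.llm(torch.cat([prefix, text_token_embeds], dim=1))

# Toy usage with stand-in components (shapes only, not real models):
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
lm_head = nn.Linear(64, 1000)
model = MultiModalLLM(encoder, lm_head, visual_dim=512, llm_dim=64)
logits = model(torch.randn(2, 3, 32, 32), torch.randn(2, 16, 64))
print(logits.shape)  # torch.Size([2, 20, 1000])
```

EIM's contribution, per the abstract, is to improve how this pipeline is trained rather than to enlarge any of these three components, which is why the sketch contains no EIM-specific module.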

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f61/12338808/5dc458204076/pone.0329590.g001.jpg
