A survey on multimodal large language models.

Author information

Yin Shukang, Fu Chaoyou, Zhao Sirui, Li Ke, Sun Xing, Xu Tong, Chen Enhong

Affiliations

School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei 230026, China.

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.

Publication information

Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.

Abstract

Recently, multimodal large language models (MLLMs), represented by GPT-4V, have become a new research hotspot; they use powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and optical-character-recognition-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, and evaluation. Then, we introduce research topics on how MLLMs can be extended to support finer granularity and more modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.
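The basic formulation discussed in the survey couples a pre-trained visual encoder to an LLM backbone through a lightweight connector that maps visual features into the language model's token space. The toy PyTorch sketch below illustrates only that composition; the module sizes, the class name ToyMLLM and the tiny transformer standing in for the LLM are illustrative assumptions, not code from the paper.

# A toy sketch of the encoder-connector-LLM composition that the survey
# describes as the typical MLLM architecture. Dimensions, names, and the
# tiny transformer standing in for the LLM are illustrative assumptions.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=256, vocab_size=1000):
        super().__init__()
        # Stand-in for a frozen pre-trained vision encoder (e.g. a ViT).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Connector projecting visual features into the LLM embedding space.
        self.connector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the LLM backbone (a decoder-only transformer in practice).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (batch, n_patches, vision_dim); text_ids: (batch, seq_len)
        visual_tokens = self.connector(self.vision_encoder(image_feats))
        text_tokens = self.text_embed(text_ids)
        # Visual tokens are prepended to the text tokens and decoded jointly.
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(tokens))

model = ToyMLLM()
logits = model(torch.randn(2, 16, 768), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 1000])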

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/958316e59c0d/nwae403fig1.jpg
