A survey on multimodal large language models.

Authors

Yin Shukang, Fu Chaoyou, Zhao Sirui, Li Ke, Sun Xing, Xu Tong, Chen Enhong

Affiliations

School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei 230026, China.

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.

Publication information

Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.

DOI: 10.1093/nsr/nwae403
PMID: 39679213
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11645129/
Abstract

Recently, the multimodal large language model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/cc9babf1288c/nwae403sch4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/958316e59c0d/nwae403fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/fc46ece7db8c/nwae403fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/5fa9fd463f24/nwae403sch1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/b94a7a4aadf5/nwae403fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/68734c4ceb3b/nwae403sch2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d65/11645129/4fb75cfac5ef/nwae403sch3.jpg

Similar articles

1
A survey on multimodal large language models.
Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.
2
Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine.
Mol Ther Nucleic Acids. 2024 Jun 15;35(3):102255. doi: 10.1016/j.omtn.2024.102255. eCollection 2024 Sep 10.
3
Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation.
Neural Netw. 2025 Apr;184:107059. doi: 10.1016/j.neunet.2024.107059. Epub 2024 Dec 31.
4
An empirical study of LLaMA3 quantization: from LLMs to MLLMs.
Vis Intell. 2024;2(1):36. doi: 10.1007/s44267-024-00070-x. Epub 2024 Dec 30.
5
Automated electrosynthesis reaction mining with multimodal large language models (MLLMs).
Chem Sci. 2024 Oct 9;15(43):17881-91. doi: 10.1039/d4sc04630g.
6
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):525-535. doi: 10.1109/TVCG.2024.3456159. Epub 2024 Nov 25.
7
Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study.
JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
8
GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology.
Cureus. 2024 Aug 31;16(8):e68298. doi: 10.7759/cureus.68298. eCollection 2024 Aug.
9
Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones.
JMIR Dermatol. 2024 Mar 13;7:e55508. doi: 10.2196/55508.
10
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research.
ArXiv. 2025 Mar 17:arXiv:2503.13399v1.

Cited by

1
From large language models to multimodal AI: a scoping review on the potential of generative AI in medicine.
Biomed Eng Lett. 2025 Aug 22;15(5):845-863. doi: 10.1007/s13534-025-00497-1. eCollection 2025 Sep.
2
[Proposal for Responsible Use of Generative Artificial Intelligence in Medical Practice].
Rev Neurol. 2025 Aug 27;80(7):37503. doi: 10.31083/RN37503.
3
An intelligent agent for sentence completion test: creation and application in depression assessment.
Front Psychol. 2025 Aug 12;16:1649905. doi: 10.3389/fpsyg.2025.1649905. eCollection 2025.
4
Assessing Large Multimodal Models for One-Shot Learning and Interpretability in Biomedical Image Classification.
Adv Intell Syst. 2025 Apr 6. doi: 10.1002/aisy.202400947.
5
Accurate Prediction of Protein Tertiary and Quaternary Stability Using Fine-Tuned Protein Language Models and Free Energy Perturbation.
Int J Mol Sci. 2025 Jul 24;26(15):7125. doi: 10.3390/ijms26157125.
6
EIM: An effective solution for improving multi-modal large language models.
PLoS One. 2025 Aug 11;20(8):e0329590. doi: 10.1371/journal.pone.0329590. eCollection 2025.
7
Multimodal Alzheimer's disease recognition from image, text and audio.
Sci Rep. 2025 Aug 8;15(1):29038. doi: 10.1038/s41598-025-14998-7.
8
Leveraging multimodal large language model for multimodal sequential recommendation.
Sci Rep. 2025 Aug 7;15(1):28960. doi: 10.1038/s41598-025-14251-1.
9
Reporting guideline for chatbot health advice studies: the Chatbot Assessment Reporting Tool (CHART) statement.
BMJ Med. 2025 Aug 1;4(1):e001632. doi: 10.1136/bmjmed-2025-001632. eCollection 2025.
10
Reporting guideline for chatbot health advice studies: the Chatbot Assessment Reporting Tool (CHART) statement.
Br J Surg. 2025 Aug 1;112(8). doi: 10.1093/bjs/znaf142.

References

1
A panel discussion on AI for science: the opportunities, challenges and reflections.
Natl Sci Rev. 2024 Mar 26;11(8):nwae119. doi: 10.1093/nsr/nwae119. eCollection 2024 Aug.
2
Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering.
Natl Sci Rev. 2023 Dec 28;10(12):nwad331. doi: 10.1093/nsr/nwad331. eCollection 2023 Dec.
3
Large language models and brain-inspired general intelligence.
Natl Sci Rev. 2023 Nov 3;10(10):nwad267. doi: 10.1093/nsr/nwad267. eCollection 2023 Oct.
4
iEarth: an interdisciplinary framework in the era of big data and AI for sustainable development.
Natl Sci Rev. 2023 Jun 24;10(8):nwad178. doi: 10.1093/nsr/nwad178. eCollection 2023 Aug.
5
Deep Visual-Semantic Alignments for Generating Image Descriptions.
IEEE Trans Pattern Anal Mach Intell. 2017 Apr;39(4):664-676. doi: 10.1109/TPAMI.2016.2598339. Epub 2016 Aug 5.