
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.

Authors

Zeng Xingchen, Lin Haichuan, Ye Yilin, Zeng Wei

Publication

IEEE Trans Vis Comput Graph. 2025 Jan;31(1):525-535. doi: 10.1109/TVCG.2024.3456159. Epub 2024 Nov 25.

DOI: 10.1109/TVCG.2024.3456159
PMID: 39255172
Abstract

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in unbalanced data distribution divergent from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill the gap, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source codes and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.
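The abstract's data-engine step, filtering existing CQA datasets toward a balanced distribution over fine-grained visual encodings and QA tasks, can be illustrated with a minimal per-category cap. This is a hedged sketch only: the function name `rebalance` and the record fields `chart_type` and `qa_task` are illustrative assumptions, not the authors' actual pipeline (see the linked repository for that).

```python
from collections import defaultdict

def rebalance(examples, cap_per_bucket):
    """Cap the number of examples kept per (chart_type, qa_task) bucket so
    over-represented categories no longer dominate the training mix.
    Field names are hypothetical; real CQA records will differ."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex["chart_type"], ex["qa_task"])].append(ex)
    balanced = []
    for items in buckets.values():
        balanced.extend(items[:cap_per_bucket])
    return balanced

# Toy corpus in which one category dominates, mirroring the imbalance
# the paper reports for existing synthetic datasets.
data = (
    [{"chart_type": "bar", "qa_task": "retrieval"}] * 8
    + [{"chart_type": "line", "qa_task": "extrema"}] * 2
)
print(len(rebalance(data, cap_per_bucket=2)))  # 4
```

In the paper's actual engine this filtering is followed by LLM-based refinement and augmentation of the retained QA pairs; the cap here stands in only for the balancing idea.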


Similar Articles

1. Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):525-535. doi: 10.1109/TVCG.2024.3456159. Epub 2024 Nov 25.
2. Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs.
IEEE Trans Pattern Anal Mach Intell. 2024 Aug 21;PP. doi: 10.1109/TPAMI.2024.3445770.
3. MedChatZH: A tuning LLM for traditional Chinese medicine consultations.
Comput Biol Med. 2024 Apr;172:108290. doi: 10.1016/j.compbiomed.2024.108290. Epub 2024 Mar 13.
4. Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.
JMIR Med Inform. 2025 Jan 16;13:e65047. doi: 10.2196/65047.
5. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
6. Advancing surgical VQA with scene graph knowledge.
Int J Comput Assist Radiol Surg. 2024 Jul;19(7):1409-1417. doi: 10.1007/s11548-024-03141-y. Epub 2024 May 23.
7. An empirical study of LLaMA3 quantization: from LLMs to MLLMs.
Vis Intell. 2024;2(1):36. doi: 10.1007/s44267-024-00070-x. Epub 2024 Dec 30.
8. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1865-1874. doi: 10.1093/jamia/ocae037.
9. Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.
J Biomed Inform. 2024 Dec;160:104748. doi: 10.1016/j.jbi.2024.104748. Epub 2024 Nov 12.
10. Prophet: Prompting Large Language Models With Complementary Answer Heuristics for Knowledge-Based Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2025 Aug;47(8):6797-6808. doi: 10.1109/TPAMI.2025.3562422.