
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.

Authors

Zeng Xingchen, Lin Haichuan, Ye Yilin, Zeng Wei

Publication

IEEE Trans Vis Comput Graph. 2025 Jan;31(1):525-535. doi: 10.1109/TVCG.2024.3456159. Epub 2024 Nov 25.

DOI: 10.1109/TVCG.2024.3456159
PMID: 39255172
Abstract

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in unbalanced data distribution divergent from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill the gap, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source codes and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.
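The abstract's data-engine step, filtering existing CQA datasets toward a balanced distribution over fine-grained visual encodings and QA tasks, can be illustrated with a minimal per-category cap. This is a hedged sketch only: the function name `rebalance` and the record fields `chart_type` and `qa_task` are illustrative assumptions, not the authors' actual pipeline (see the linked repository for that).

```python
from collections import defaultdict

def rebalance(examples, cap_per_bucket):
    """Cap the number of examples kept per (chart_type, qa_task) bucket so
    over-represented categories no longer dominate the training mix.
    Field names are hypothetical; real CQA records will differ."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex["chart_type"], ex["qa_task"])].append(ex)
    balanced = []
    for items in buckets.values():
        balanced.extend(items[:cap_per_bucket])
    return balanced

# Toy corpus in which one category dominates, mirroring the imbalance
# the paper reports for existing synthetic datasets.
data = (
    [{"chart_type": "bar", "qa_task": "retrieval"}] * 8
    + [{"chart_type": "line", "qa_task": "extrema"}] * 2
)
print(len(rebalance(data, cap_per_bucket=2)))  # 4
```

In the paper's actual engine this filtering is followed by LLM-based refinement and augmentation of the retained QA pairs; the cap here stands in only for the balancing idea.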


Similar Articles

1. Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):525-535. doi: 10.1109/TVCG.2024.3456159. Epub 2024 Nov 25.
2. Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs.
IEEE Trans Pattern Anal Mach Intell. 2024 Aug 21;PP. doi: 10.1109/TPAMI.2024.3445770.
3. MedChatZH: A tuning LLM for traditional Chinese medicine consultations.
Comput Biol Med. 2024 Apr;172:108290. doi: 10.1016/j.compbiomed.2024.108290. Epub 2024 Mar 13.
4. Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.
JMIR Med Inform. 2025 Jan 16;13:e65047. doi: 10.2196/65047.
5. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
6. Advancing surgical VQA with scene graph knowledge.
Int J Comput Assist Radiol Surg. 2024 Jul;19(7):1409-1417. doi: 10.1007/s11548-024-03141-y. Epub 2024 May 23.
7. An empirical study of LLaMA3 quantization: from LLMs to MLLMs.
Vis Intell. 2024;2(1):36. doi: 10.1007/s44267-024-00070-x. Epub 2024 Dec 30.
8. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1865-1874. doi: 10.1093/jamia/ocae037.
9. Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.
J Biomed Inform. 2024 Dec;160:104748. doi: 10.1016/j.jbi.2024.104748. Epub 2024 Nov 12.
10. Prophet: Prompting Large Language Models With Complementary Answer Heuristics for Knowledge-Based Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2025 Aug;47(8):6797-6808. doi: 10.1109/TPAMI.2025.3562422.