Guo Zihan, Shen Xiang, Chen Chongqing
Department of Computer Science, Changzhi University, Changzhi, Shanxi, China.
College of Information Engineering, Shanghai Maritime University, Pudong, Shanghai, China.
PLoS One. 2025 Jun 10;20(6):e0325543. doi: 10.1371/journal.pone.0325543. eCollection 2025.
Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignment between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignment, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignment, and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model's ability to reduce interference while enhancing semantic alignment underscores its potential for advancing multi-modal learning. Extensive experiments across four widely used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.