TBKIN: Threshold-based explicit selection for enhanced cross-modal semantic alignments.

Authors

Guo Zihan, Shen Xiang, Chen Chongqing

Affiliations

Department of Computer Science, Changzhi University, Changzhi, Shanxi, China.

College of Information Engineering, Shanghai Maritime University, Pudong, Shanghai, China.

Publication

PLoS One. 2025 Jun 10;20(6):e0325543. doi: 10.1371/journal.pone.0325543. eCollection 2025.

DOI: 10.1371/journal.pone.0325543
PMID: 40493640
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12151420/

Abstract

Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignments and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model's ability to reduce interference while enhancing semantic alignments underscores its potential for advancing multi-modal learning. Extensive experiments across four widely-used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.

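The mechanism the title refers to, explicit threshold-based selection that discards weak cross-modal alignments as noise, can be illustrated with a minimal sketch. The code below is an illustrative toy, not the authors' implementation: the function name threshold_select, the cutoff tau=0.1, and the cosine-similarity alignment are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def threshold_select(region_feats: torch.Tensor,
                     token_feats: torch.Tensor,
                     tau: float = 0.1) -> torch.Tensor:
    """Toy threshold-based selection over cross-modal alignment weights.

    region_feats: (n_regions, d) image-region embeddings
    token_feats:  (n_tokens, d)  text-token embeddings
    tau:          hypothetical cutoff; weights below it are treated
                  as noise and discarded (not a value from the paper)
    """
    # Cosine similarity between every image region and text token.
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(token_feats, dim=-1).T

    # Soft alignment weights over the text tokens for each region.
    attn = sim.softmax(dim=-1)                      # (n_regions, n_tokens)

    # Explicit selection: zero out weak, likely irrelevant alignments ...
    attn = torch.where(attn >= tau, attn, torch.zeros_like(attn))

    # ... and renormalize the surviving weights to sum to 1 per region.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Aggregate text context for each region from kept alignments only.
    return attn @ token_feats                       # (n_regions, d)

# Toy usage: 5 detected regions, 8 text tokens, 64-dim embeddings.
context = threshold_select(torch.randn(5, 64), torch.randn(8, 64))
print(context.shape)  # torch.Size([5, 64])
```

The key design point is that the selection is hard rather than soft: instead of merely down-weighting irrelevant pairs through attention, alignments below the threshold are removed entirely before aggregation, so they cannot contribute interference downstream.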

Figures (PMC):
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/cb2705ff78d7/pone.0325543.g001.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/330891623372/pone.0325543.g002.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/f63326b351f5/pone.0325543.g003.jpg
Fig 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/a6e527d1d77c/pone.0325543.g004.jpg
Fig 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/91d251e6e9aa/pone.0325543.g005.jpg
Fig 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c651/12151420/67dfc5535504/pone.0325543.g006.jpg

Similar Articles

1. TBKIN: Threshold-based explicit selection for enhanced cross-modal semantic alignments.
   PLoS One. 2025 Jun 10;20(6):e0325543. doi: 10.1371/journal.pone.0325543. eCollection 2025.
2. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
   Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
3. DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.
   Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.
4. Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
   Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.
5. An effective spatial relational reasoning networks for visual question answering.
   PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693. eCollection 2022.
6. Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling.
   Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.
7. Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering.
   Sensors (Basel). 2020 Nov 26;20(23):6758. doi: 10.3390/s20236758.
8. Hybrid Attention Network for Language-Based Person Search.
   Sensors (Basel). 2020 Sep 15;20(18):5279. doi: 10.3390/s20185279.
9. Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.
   J Biomed Inform. 2024 Dec;160:104748. doi: 10.1016/j.jbi.2024.104748. Epub 2024 Nov 12.
10. AMVLM: Alignment-Multiplicity Aware Vision-Language Model for Semi-Supervised Medical Image Segmentation.
    IEEE Trans Med Imaging. 2025 May 23;PP. doi: 10.1109/TMI.2025.3573018.
