

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective.

Authors

Tao Rui, Zhu Meng, Cao Haiyan, Ren Honge

Affiliations

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.

College of Artificial Intelligence and Big Data, Hulunbuir University, Hulunbuir 021008, China.

Publication Information

Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.

Abstract

Fine-grained representation is fundamental to deep-learning-based species classification, and in this context cross-modal contrastive learning is an effective method. The diversity of species, coupled with the inherent contextual ambiguity of natural language, poses a primary challenge for cross-modal representation alignment in conservation-area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment grounded in contextual understanding. However, during contrastive learning, apart from learning the differences in the data itself, a pair of encoders inevitably learns differences caused by encoder fluctuations. The latter leads to convergence shortcuts, degrading representation quality so that the shared feature space no longer accurately reflects the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network that enhances consistency during momentum updates of the cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task cross-modal momentum encoder outperforms comparable models by up to 8% on leaderboards for standard image classification and image-text cross-modal retrieval tasks on public datasets, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation-area image-text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation across 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation-area datasets.
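For readers unfamiliar with the mechanism, the momentum-updated encoders and encoding queues described above follow the general pattern of MoCo-style contrastive learning. Below is a minimal PyTorch sketch of that pattern; the class name, dimensions, hyperparameters, and InfoNCE formulation are illustrative assumptions, not the paper's actual implementation (in the cross-modal setting, encoder_q and encoder_k would correspond to, e.g., the image and text towers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumContrast(nn.Module):
    """Illustrative MoCo-style momentum encoder with a negative-sample queue.

    A generic sketch of the mechanism the abstract describes (momentum
    updates + encoding queues), not the paper's implementation.
    """

    def __init__(self, encoder_q, encoder_k, dim=256, queue_size=4096,
                 m=0.999, tau=0.07):
        super().__init__()
        self.encoder_q = encoder_q  # query encoder, trained by backprop
        self.encoder_k = encoder_k  # key encoder, updated by momentum (EMA)
        self.m, self.tau = m, tau
        # Initialize the key encoder as a frozen copy of the query encoder
        # (assumes both encoders share the same architecture).
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data.copy_(pq.data)
            pk.requires_grad = False
        # FIFO queue of past key embeddings that serve as negatives.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # k <- m * k + (1 - m) * q : a slow EMA dampens encoder fluctuations.
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        # Overwrite the oldest slots (assumes queue_size % batch_size == 0).
        n, ptr = keys.shape[0], int(self.ptr)
        self.queue[ptr:ptr + n] = keys
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, x_q, x_k):
        q = F.normalize(self.encoder_q(x_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(x_k), dim=1)
        # InfoNCE: one positive logit per pair, negatives drawn from the queue.
        l_pos = (q * k).sum(dim=1, keepdim=True)       # (B, 1)
        l_neg = q @ self.queue.t()                     # (B, queue_size)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.tau
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)
        self._enqueue(k)
        return loss
```

The slow exponential moving average (m close to 1) is what keeps the key encoder consistent between updates, which relates to the encoder-fluctuation issue the abstract's residual attention network aims to mitigate.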


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/87a9/11125332/9143a108fa1c/sensors-24-03130-g0A1.jpg
