

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective.

Authors

Tao Rui, Zhu Meng, Cao Haiyan, Ren Honge

Affiliations

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.

College of Artificial Intelligence and Big Data, Hulunbuir University, Hulunbuir 021008, China.

Publication Information

Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.

Abstract

Fine-grained representation is fundamental to deep-learning-based species classification, and in this context cross-modal contrastive learning is an effective method. The diversity of species, coupled with the inherent contextual ambiguity of natural language, poses a primary challenge for cross-modal representation alignment in conservation-area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment grounded in contextual understanding. However, during contrastive learning, apart from learning the differences in the data itself, a pair of encoders inevitably learns differences caused by encoder fluctuations. The latter leads to convergence shortcuts, degrading representation quality so that the shared feature space no longer accurately reflects the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network that enhances consistency during momentum updates of the cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task cross-modal momentum encoder outperforms comparable models by up to 8% on leaderboards for standard image classification and image-text cross-modal retrieval tasks on public datasets, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation-area image-text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation across 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation-area datasets.
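For readers unfamiliar with the mechanism, the momentum-updated encoders and encoding queues described above follow the general pattern of MoCo-style contrastive learning. Below is a minimal PyTorch sketch of that pattern; the class name, dimensions, hyperparameters, and InfoNCE formulation are illustrative assumptions, not the paper's actual implementation (in the cross-modal setting, encoder_q and encoder_k would correspond to, e.g., the image and text towers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumContrast(nn.Module):
    """Illustrative MoCo-style momentum encoder with a negative-sample queue.

    A generic sketch of the mechanism the abstract describes (momentum
    updates + encoding queues), not the paper's implementation.
    """

    def __init__(self, encoder_q, encoder_k, dim=256, queue_size=4096,
                 m=0.999, tau=0.07):
        super().__init__()
        self.encoder_q = encoder_q  # query encoder, trained by backprop
        self.encoder_k = encoder_k  # key encoder, updated by momentum (EMA)
        self.m, self.tau = m, tau
        # Initialize the key encoder as a frozen copy of the query encoder
        # (assumes both encoders share the same architecture).
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data.copy_(pq.data)
            pk.requires_grad = False
        # FIFO queue of past key embeddings that serve as negatives.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # k <- m * k + (1 - m) * q : a slow EMA dampens encoder fluctuations.
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        # Overwrite the oldest slots (assumes queue_size % batch_size == 0).
        n, ptr = keys.shape[0], int(self.ptr)
        self.queue[ptr:ptr + n] = keys
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, x_q, x_k):
        q = F.normalize(self.encoder_q(x_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(x_k), dim=1)
        # InfoNCE: one positive logit per pair, negatives drawn from the queue.
        l_pos = (q * k).sum(dim=1, keepdim=True)       # (B, 1)
        l_neg = q @ self.queue.t()                     # (B, queue_size)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.tau
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)
        self._enqueue(k)
        return loss
```

The slow exponential moving average (m close to 1) is what keeps the key encoder consistent between updates, which relates to the encoder-fluctuation issue the abstract's residual attention network aims to mitigate.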


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/87a9/11125332/9143a108fa1c/sensors-24-03130-g0A1.jpg
