Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.

Publication Information

IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.

DOI: 10.1109/TCYB.2020.3015084
PMID: 32931438
Abstract

With the rapid growth of multimedia data on the Internet, there has been a rapid rise in demand for visual-textual cross-media retrieval between images and sentences. However, the heterogeneous nature of visual and textual data makes it highly challenging to measure cross-media similarity for retrieval. Although existing methods have achieved great progress thanks to the strong learning ability of deep neural networks, they rely heavily on large-scale training data with manual annotation, that is, either pairwise image-sentence annotation or category annotation as supervised information for visual-textual correlation learning, which is extremely labor- and time-consuming to collect. Without any pairwise or category annotation, it is highly challenging to construct a correlation between images and sentences because of their inconsistent distributions and representations. However, people can naturally understand the correlation between visual and textual data at a high semantic level, and images and sentences containing the same group of semantic concepts are easily matched in the human brain. Inspired by this human cognitive process, this article proposes an unsupervised visual-textual correlation learning (UVCL) approach that constructs correlations without any manual annotation. The contributions are summarized as follows: 1) unsupervised semantic-guided cross-media correlation mining is proposed to bridge the heterogeneity gap between visual and textual data. We measure the semantic matching degree between images and sentences and generate descriptive sentences from the concepts extracted from images to further augment the training data in an unsupervised manner. The approach can therefore exploit the semantic knowledge within both visual and textual data to reduce the gap between them for further correlation learning; and 2) unsupervised visual-textual fine-grained semantic alignment is proposed to learn cross-media correlation by aligning fine-grained visual local patches and textual keywords with fine-grained soft attention as well as semantic-guided hard attention, whose results effectively highlight the fine-grained semantic information within both images and sentences to boost visual-textual alignment. Extensive experiments on visual-textual cross-media retrieval in an unsupervised setting, without any manual annotation, are conducted on two widely used datasets, Flickr-30K and MS-COCO, and verify the effectiveness of the proposed UVCL approach.
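
To make the alignment idea in the abstract concrete, the sketch below shows one way fine-grained soft attention between image patch features and textual keyword features can be turned into an image-sentence similarity score. This is a minimal illustration written in PyTorch; the feature dimensions, the temperature value, and the averaging step are assumptions made for this sketch and are not the authors' published implementation.

```python
# Minimal sketch: fine-grained soft attention between image patches and
# textual keywords, aggregated into an image-sentence similarity score.
# All shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F


def soft_attention_similarity(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Score one image against one sentence.

    regions: (R, D) features for R local image patches.
    words:   (W, D) features for W textual keywords/tokens.
    Returns a scalar image-sentence similarity.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Cosine similarity between every keyword and every patch: (W, R).
    sim = words @ regions.t()

    # Soft attention: each keyword attends over the image patches.
    attn = F.softmax(sim / 0.1, dim=-1)   # temperature 0.1 is an assumption
    attended = attn @ regions             # (W, D): one patch mixture per keyword

    # Keyword-level alignment scores, averaged into a sentence-level score.
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # (W,)
    return word_scores.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    patches = torch.randn(36, 256)    # e.g. 36 local patches, 256-d features
    keywords = torch.randn(12, 256)   # e.g. 12 keyword embeddings
    print(float(soft_attention_similarity(patches, keywords)))
```

The semantic-guided hard attention mentioned in the abstract would additionally restrict each keyword to the patches whose extracted concepts match it; that selection step is omitted from this sketch.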

Similar Articles

1. Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.
   IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
2. MAVA: Multi-level Adaptive Visual-textual Alignment by Cross-media Bi-attention Mechanism.
   IEEE Trans Image Process. 2019 Nov 22. doi: 10.1109/TIP.2019.2952085.
3. Structure-aware contrastive hashing for unsupervised cross-modal retrieval.
   Neural Netw. 2024 Jun;174:106211. doi: 10.1016/j.neunet.2024.106211. Epub 2024 Feb 27.
4. VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search.
   IEEE Trans Image Process. 2024;33:163-176. doi: 10.1109/TIP.2023.3337653. Epub 2023 Dec 8.
5. Centralized contrastive loss with weakly supervised progressive feature extraction for fine-grained common thorax disease retrieval in chest x-ray.
   Med Phys. 2023 Jun;50(6):3560-3572. doi: 10.1002/mp.16144. Epub 2023 Jan 11.
6. Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching.
   IEEE Trans Image Process. 2024;33:1326-1337. doi: 10.1109/TIP.2022.3197972. Epub 2024 Feb 13.
7. Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image-Text Matching.
   IEEE Trans Cybern. 2024 Feb;54(2):948-961. doi: 10.1109/TCYB.2022.3179020. Epub 2024 Jan 17.
8. Quantifying semantic similarity of clinical evidence in the biomedical literature to facilitate related evidence synthesis.
   J Biomed Inform. 2019 Dec;100:103321. doi: 10.1016/j.jbi.2019.103321. Epub 2019 Oct 30.
9. Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval.
   IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2194-2207. doi: 10.1109/TNNLS.2022.3188569. Epub 2024 Feb 5.
10. Deep Unsupervised Hashing for Large-Scale Cross-Modal Retrieval Using Knowledge Distillation Model.
   Comput Intell Neurosci. 2021 Jul 17;2021:5107034. doi: 10.1155/2021/5107034. eCollection 2021.