Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals.

Authors

Lu Jin, Zechao Li, Jinhui Tang

Publication

IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1838-1851. doi: 10.1109/TNNLS.2020.2997020. Epub 2023 Apr 4.

Abstract

Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages a 2-D convolutional neural network (CNN) as the backbone to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture both spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are learned jointly by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, the two network streams are trained jointly to learn the hash functions by embedding the semantic labels into the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact, high-quality hash codes by simultaneously exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The DSMHN is thus a generic and scalable deep hashing framework for both image-text and video-text retrieval that can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks over four widely used multimodal retrieval data sets. Experimental results on both image-text and video-text retrieval tasks demonstrate that the DSMHN significantly outperforms state-of-the-art methods.
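The abstract's core design, two modality-specific networks trained jointly under an intermodality similarity-preserving loss and a semantic label-preserving classification loss, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not the paper's implementation: the toy 2-D CNN image stream (a 3-D CNN would take its place for video), the bag-of-words text stream, the contrastive form of the similarity term, and the weight `alpha` are all assumptions, since the abstract does not specify the exact loss functions.

```python
# Minimal sketch of a two-stream multimodal hashing network in PyTorch.
# Layer sizes, backbones, and loss forms are illustrative assumptions;
# the paper's exact architecture and losses are not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageHashNet(nn.Module):
    """Visual stream: a toy 2-D CNN; a 3-D CNN would replace it for video."""
    def __init__(self, hash_bits=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.hash_layer = nn.Linear(32, hash_bits)        # continuous relaxation
        self.classifier = nn.Linear(hash_bits, num_classes)

    def forward(self, x):
        h = torch.tanh(self.hash_layer(self.features(x).flatten(1)))
        return h, self.classifier(h)                      # codes, class logits

class TextHashNet(nn.Module):
    """Text stream: an MLP over bag-of-words vectors (an assumption)."""
    def __init__(self, vocab_size=1000, hash_bits=64, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, 256), nn.ReLU())
        self.hash_layer = nn.Linear(256, hash_bits)
        self.classifier = nn.Linear(hash_bits, num_classes)

    def forward(self, x):
        h = torch.tanh(self.hash_layer(self.encoder(x)))
        return h, self.classifier(h)

def joint_loss(h_img, h_txt, logits_img, logits_txt, sim, labels, alpha=1.0):
    """Intermodality similarity preservation plus label preservation.
    `sim` is 1 for matching image-text pairs, 0 for mismatched ones."""
    dist = F.pairwise_distance(h_img, h_txt)
    # Contrastive term: pull matching codes together, push others apart.
    sim_loss = (sim * dist.pow(2) + (1 - sim) * F.relu(2.0 - dist).pow(2)).mean()
    # Classification term: hash codes should be optimal for classification.
    cls_loss = F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
    return sim_loss + alpha * cls_loss

# Toy joint training step on random data.
img_net, txt_net = ImageHashNet(), TextHashNet()
h_i, logit_i = img_net(torch.randn(8, 3, 32, 32))
h_t, logit_t = txt_net(torch.randn(8, 1000))
loss = joint_loss(h_i, h_t, logit_i, logit_t,
                  sim=torch.ones(8), labels=torch.randint(0, 10, (8,)))
loss.backward()
binary_codes = torch.sign(h_i)   # binarize for Hamming-distance retrieval
```

At retrieval time the tanh outputs are binarized with sign, so similarity search reduces to Hamming-distance comparisons over compact binary codes, which is the source of the storage and computation efficiency the abstract claims.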

Similar Articles

Deep Semantic-Preserving Ordinal Hashing for Cross-Modal Similarity Search.
IEEE Trans Neural Netw Learn Syst. 2019 May;30(5):1429-1440. doi: 10.1109/TNNLS.2018.2869601. Epub 2018 Oct 1.

Deep Ordinal Hashing With Spatial Attention.
IEEE Trans Image Process. 2019 May;28(5):2173-2186. doi: 10.1109/TIP.2018.2883522. Epub 2018 Nov 28.

Semantic Neighbor Graph Hashing for Multimodal Retrieval.
IEEE Trans Image Process. 2018 Mar;27(3):1405-1417. doi: 10.1109/TIP.2017.2776745. Epub 2017 Nov 22.

Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval.
IEEE Trans Image Process. 2016 Oct;25(10):4540-4554. doi: 10.1109/TIP.2016.2592800. Epub 2016 Jul 18.

Unsupervised Semantic-Preserving Adversarial Hashing for Image Search.
IEEE Trans Image Process. 2019 Aug;28(8):4032-4044. doi: 10.1109/TIP.2019.2903661. Epub 2019 Mar 13.

Discrete Semantic Alignment Hashing for Cross-Media Retrieval.
IEEE Trans Cybern. 2020 Dec;50(12):4896-4907. doi: 10.1109/TCYB.2019.2912644. Epub 2020 Dec 3.
