Jin Lu, Li Zechao, Tang Jinhui
IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1838-1851. doi: 10.1109/TNNLS.2020.2997020. Epub 2023 Apr 4.
Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages a 2-D convolutional neural network (CNN) as the backbone to capture the spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture the spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, the two stream networks are jointly trained to learn the hash functions by embedding the semantic labels into the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The proposed DSMHN is a generic and scalable deep hashing framework for both image-text and video-text retrieval, and it can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks on four widely used multimodal retrieval datasets. Experimental results on both image-text and video-text retrieval tasks demonstrate that DSMHN significantly outperforms state-of-the-art methods.
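To make the joint objective described in the abstract concrete, the sketch below is a minimal, hypothetical PyTorch-style illustration rather than the authors' implementation. It assumes an already-extracted feature vector per modality (e.g., from a 2-D/3-D CNN or a text encoder), a tanh relaxation of the binary codes, a pairwise inter-modality similarity-preserving loss, and a classification head on the codes for semantic label preservation; all names, layer sizes, and loss weights are illustrative assumptions.

```python
# Conceptual sketch only (not the released DSMHN code): a two-stream hashing
# setup with (1) an inter-modality similarity-preserving loss on the relaxed
# hash codes and (2) a semantic label-preserving classification loss on the codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Maps a modality-specific feature vector to a K-bit (relaxed) hash code."""
    def __init__(self, feat_dim, code_len, num_classes):
        super().__init__()
        self.hash_layer = nn.Linear(feat_dim, code_len)
        # Classifier on top of the code keeps the code discriminative for the
        # semantic labels (label-preserving learning).
        self.classifier = nn.Linear(code_len, num_classes)

    def forward(self, feat):
        code = torch.tanh(self.hash_layer(feat))  # continuous relaxation of sign(.)
        logits = self.classifier(code)
        return code, logits

def joint_hashing_loss(code_v, code_t, logits_v, logits_t, labels, sim,
                       alpha=1.0, beta=1.0):
    """Combine inter-modality similarity preservation with label preservation.

    sim[i, j] = 1 if visual sample i and text sample j share a label, else 0.
    alpha and beta are assumed trade-off weights.
    """
    # Inter-modality similarity-preserving term: negative log-likelihood of the
    # pairwise similarity under an inner-product model of the two code sets.
    inner = code_v @ code_t.t() / 2.0
    sim_loss = (F.softplus(inner) - sim * inner).mean()
    # Semantic label-preserving term on both streams (multi-label setting).
    cls_loss = F.binary_cross_entropy_with_logits(logits_v, labels) \
             + F.binary_cross_entropy_with_logits(logits_t, labels)
    return alpha * sim_loss + beta * cls_loss

# At retrieval time, the relaxed codes would be binarized, e.g. b = torch.sign(code).
```

In this reading, the visual and text heads share the same loss and are optimized jointly, so the binary codes of semantically related image/video and text pairs are pushed close in Hamming space while remaining predictive of their class labels.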