School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China.
School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China.
Sensors (Basel). 2023 Mar 24;23(7):3439. doi: 10.3390/s23073439.
With the proliferation of multi-modal data generated by various sensors, unsupervised multi-modal hashing retrieval has been extensively studied for its advantages in storage, retrieval efficiency, and label independence. However, existing unsupervised methods still face two obstacles: (1) they cannot fully capture the complementary and co-occurrence information in multi-modal data, which leads to inaccurate similarity measures; and (2) they suffer from unbalanced multi-modal learning, and the semantic structure of the data is corrupted during hash code binarization. To address these obstacles, we devise an effective CLIP-based Adaptive Graph Attention Network (CAGAN) for large-scale unsupervised multi-modal hashing retrieval. First, we use the multi-modal model CLIP to extract fine-grained semantic features, mine similarity information from different perspectives of the multi-modal data, and perform similarity fusion and enhancement. We then propose an adaptive graph attention network to assist hash code learning: an attention mechanism learns adaptive graph similarity across modalities, and a graph convolutional network aggregates the intrinsic neighborhood information of neighboring data nodes to generate more discriminative hash codes. Finally, we employ an iterative approximate optimization strategy to mitigate the information loss incurred during binarization. Extensive experiments on three benchmark datasets demonstrate that the proposed method significantly outperforms several representative hashing methods on unsupervised multi-modal retrieval tasks.
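To make the similarity fusion and enhancement step concrete, the following is a minimal PyTorch sketch, not the authors' code: per-modality cosine similarities are computed from pre-extracted CLIP image and text features, combined, and then enhanced with second-order neighborhood information. The weights alpha/beta and the enhancement mixing coefficients are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn.functional as F

def fused_similarity(img_feat: torch.Tensor,
                     txt_feat: torch.Tensor,
                     alpha: float = 0.5,
                     beta: float = 0.5) -> torch.Tensor:
    """img_feat, txt_feat: (n, d) CLIP image/text features for n samples."""
    img = F.normalize(img_feat, dim=1)   # unit-length rows
    txt = F.normalize(txt_feat, dim=1)
    s_img = img @ img.T                  # intra-modal cosine similarity, (n, n)
    s_txt = txt @ txt.T
    s = alpha * s_img + beta * s_txt     # similarity fusion across modalities
    # enhancement: mix in second-order (neighbor-of-neighbor) similarity
    s_enh = 0.9 * s + 0.1 * (s @ s) / s.shape[0]
    return s_enh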
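The adaptive graph attention idea can likewise be sketched under stated assumptions: pairwise attention scores produce a learned adjacency, which a graph-convolution-style step uses to aggregate neighborhood information before a tanh relaxation of the hash codes. The single-head attention form and layer sizes here are illustrative, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphAttentionHash(nn.Module):
    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_len)
        self.attn = nn.Linear(2 * code_len, 1)  # scores a pair of nodes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                         # (n, code_len)
        n = h.size(0)
        # pairwise attention logits e_ij = a([h_i || h_j])
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)
        adj = F.softmax(F.leaky_relu(e), dim=1)  # adaptive graph, row-stochastic
        h_agg = adj @ h                          # aggregate neighbor information
        return torch.tanh(h_agg)                 # relaxed hash codes in (-1, 1)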
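Finally, a hedged illustration of iterative approximate optimization for binarization: the binary codes B are refreshed in closed form as sign(H) between gradient steps on the continuous codes H, so the quantization gap ||B - H||^2 shrinks gradually instead of binarizing in one lossy step. This is a generic alternating scheme with an assumed quantization weight gamma, not the paper's exact update rule.

import torch

def train_step(model, x, sim_target, optimizer, gamma: float = 0.3):
    h = model(x)                          # continuous codes, (n, code_len)
    b = torch.sign(h).detach()            # closed-form binary update
    sim_pred = (h @ h.T) / h.size(1)      # code-space similarity
    loss = ((sim_pred - sim_target) ** 2).mean() \
         + gamma * ((b - h) ** 2).mean()  # quantization penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()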