School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan 250014, China.
Shandong Provincial Key Laboratory of Digital Media Technology, Jinan 250014, China.
Sensors (Basel). 2023 Feb 25;23(5):2559. doi: 10.3390/s23052559.
Image-text retrieval aims to find, in one modality, the results most relevant to a query from the other modality. As a fundamental problem in cross-modal retrieval, image-text retrieval remains challenging owing to the complementary and imbalanced relationships between modalities (i.e., image and text) and between granularities (i.e., global-level and local-level). However, existing works have not fully considered how to effectively mine and fuse the complementarities between images and texts at different granularities. Therefore, in this paper, we propose a hierarchical adaptive alignment network, whose contributions are as follows: (1) We propose a multi-level alignment network that simultaneously mines global-level and local-level data, thereby strengthening the semantic association between images and texts. (2) We propose an adaptive weighted loss that flexibly optimizes image-text similarity in two stages within a unified framework. (3) We conduct extensive experiments on three public benchmark datasets (Corel 5K, Pascal Sentence, and Wiki) and compare our method against eleven state-of-the-art methods. The experimental results thoroughly verify the effectiveness of the proposed method.
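The abstract's central idea of fusing global-level and local-level alignment can be illustrated with a minimal sketch. This is not the paper's actual model: the function names, the fixed fusion weight `alpha`, and the max-over-words pooling for local alignment are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hierarchical_similarity(img_global, txt_global, img_regions, txt_words, alpha=0.5):
    """Fuse global-level and local-level image-text similarity.

    A hypothetical sketch: global features are compared directly, while each
    image region is matched to its best-aligned word and the region scores are
    averaged. `alpha` is an assumed fixed fusion weight; the paper instead
    learns the weighting adaptively.
    """
    s_global = cosine(img_global, txt_global)
    s_local = np.mean([max(cosine(r, w) for w in txt_words) for r in img_regions])
    return alpha * s_global + (1 - alpha) * s_local
```

A perfectly aligned pair (identical global features and region/word features) scores 1.0 under this fusion, while mismatched pairs score lower, which is the ranking behavior a retrieval loss would then optimize.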