Huang Zhao, Hu Haowu, Su Miao
Key Laboratory of Modern Teaching Technology, Ministry of Education, Xi'an 710062, China.
School of Computer Science, Shaanxi Normal University, Xi'an 710119, China.
Entropy (Basel). 2023 Aug 16;25(8):1216. doi: 10.3390/e25081216.
Information retrieval across multiple modalities has attracted much attention from academics and practitioners. One key challenge of cross-modal retrieval is bridging the heterogeneity gap between different modalities. Most existing methods address this by jointly constructing a common subspace. However, little attention has been paid to the varying importance of different fine-grained regions within each modality, an oversight that limits how effectively the extracted multimodal information is utilized. Therefore, this study proposes a novel text-image cross-modal retrieval approach that constructs a dual attention network and an enhanced relation network (DAER). More specifically, the dual attention network precisely extracts fine-grained weight information from text and images, while the enhanced relation network enlarges the differences between data from different categories, thereby improving the accuracy of similarity computation. Comprehensive experimental results on three widely used datasets (i.e., Wikipedia, Pascal Sentence, and XMediaNet) show that the proposed approach is effective and superior to existing cross-modal retrieval methods.
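The abstract names two components but gives no implementation. As a rough illustration only, the following is a minimal PyTorch sketch of the general idea: a soft attention module that weights fine-grained parts (image regions or words) before pooling them into a common-space embedding, and a relation module that learns a similarity score for an embedding pair. All module names, dimensions, and layer choices here are hypothetical assumptions, not the paper's actual DAER architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAttention(nn.Module):
    """Soft attention over fine-grained parts (image regions or words).

    Hypothetical sketch: scores each part, softmax-normalizes the scores,
    and pools the parts into a single weighted embedding.
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, parts):                      # parts: (batch, n_parts, dim)
        w = F.softmax(self.score(parts), dim=1)    # per-part weights, sum to 1
        return (w * parts).sum(dim=1)              # (batch, dim)

class RelationNet(nn.Module):
    """Learned similarity between two common-space embeddings.

    Concatenates the pair and maps it through a small MLP to a score in (0, 1),
    in place of a fixed metric such as cosine similarity.
    """
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, a, b):                       # a, b: (batch, dim)
        return self.mlp(torch.cat([a, b], dim=-1)).squeeze(-1)

# Toy usage: 4 image-text pairs, 36 image regions / 20 words, 512-d features.
img_att, txt_att = PartAttention(512), PartAttention(512)
relation = RelationNet(512)
img_emb = img_att(torch.randn(4, 36, 512))
txt_emb = txt_att(torch.randn(4, 20, 512))
print(relation(img_emb, txt_emb).shape)            # torch.Size([4])
```

Using a trainable relation module rather than a fixed distance is a known technique from relation-network metric learning; whether DAER's "enhanced" variant matches this simple MLP form is not stated in the abstract.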