Zeng Yawen, Wang Yiru, Liao Dongliang, Li Gongfu, Huang Weijie, Xu Jin, Cao Da, Man Hong
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):10528-10537. doi: 10.1109/TNNLS.2022.3168431. Epub 2023 Nov 30.
Cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, the keyword-based queries that users issue are usually short and semantically broad. In such user-oriented services, semantic diversity is therefore as important as retrieval accuracy, because diverse results improve the user experience. However, most cross-modal image retrieval methods rely on a single-point query embedding, which inevitably yields low semantic diversity, while existing diverse retrieval approaches frequently suffer from low accuracy because they lack cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets verify the effectiveness of the proposed solution in both retrieval accuracy and semantic diversity.
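The idea of replacing a single-point embedding with several instances drawn from a continuous semantic space can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a diagonal Gaussian latent space whose mean `mu` and log-variance `log_var` are already given (in VMIG they would come from the query-guided variational autoencoder), uses the standard reparameterization trick for sampling, and swaps in a simple max-over-pairs cosine score as a stand-in for the instance graph and fusion steps, which the abstract does not specify. All function names, dimensions, and values are hypothetical, and which modality each instance set represents is left open.

```python
import numpy as np


def sample_instances(mu, log_var, num_instances, rng):
    """Draw instances z = mu + sigma * eps from a diagonal Gaussian.

    Assumption: mu and log_var are 1-D arrays of equal length that would
    normally be produced by an encoder; here they are simply passed in.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((num_instances, mu.shape[0]))
    return mu + sigma * eps


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def mil_score(instances_a, instances_b):
    """Score two bags of instances by their best-matching pair.

    Assumption: max pooling over pairwise cosine similarities, a common
    multiple instance learning choice, not necessarily the one used in VMIG.
    """
    a = l2_normalize(instances_a)
    b = l2_normalize(instances_b)
    return float(np.max(a @ b.T))


rng = np.random.default_rng(0)
dim = 8                                   # hypothetical embedding size
mu = rng.standard_normal(dim)             # placeholder for an encoder mean
log_var = np.full(dim, -2.0)              # placeholder for an encoder log-variance

sampled_instances = sample_instances(mu, log_var, num_instances=5, rng=rng)
other_instances = rng.standard_normal((4, dim))  # placeholder for the other modality
score = mil_score(sampled_instances, other_instances)
```

Because each sampled instance lands at a slightly different point around `mu`, different instances can match different candidates, which is the intuition behind the diversity gain described above. A larger `log_var` spreads the instances further apart.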