Peng Shu-Juan, He Yi, Liu Xin, Cheung Yiu-Ming, Xu Xing, Cui Zhen
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2194-2207. doi: 10.1109/TNNLS.2022.3188569. Epub 2024 Feb 5.
Fine-grained image-text retrieval has been a hot research topic in bridging vision and language, and its main challenge is learning the semantic correspondence across different modalities. Existing methods mainly focus on learning global semantic correspondence or intramodal relation correspondence in separate data representations, but rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model that explicitly learns fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be well utilized to guide the feature correspondence learning process. More specifically, we first build semantic-embedded graphs to explore both fine-grained objects and their relations in each media type, aiming not only to characterize object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a cross-graph relation encoder is newly designed to explore intermodal relations across modalities, which can mutually boost cross-modal correlations to learn more precise intermodal dependencies. Besides, a feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize node-level semantic correspondence, whereby relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image-text retrieval tasks with high retrieval performance. Extensive experiments on benchmark datasets quantitatively and qualitatively verify the advantages of the proposed framework for fine-grained image-text retrieval and show its competitive performance against the state of the art.
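To make the cross-graph relation encoder described above more concrete, here is a minimal PyTorch sketch of cross-modal attention between the graph nodes of the two modalities. The class name, single-head attention design, and dimensions are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphRelationEncoder(nn.Module):
    """Illustrative cross-graph attention: nodes of one modality attend to
    nodes of the other and aggregate intermodal relation cues.
    (Hypothetical sketch, not the RACG authors' code.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, src_nodes: torch.Tensor, tgt_nodes: torch.Tensor) -> torch.Tensor:
        # src_nodes: (B, N_src, D), e.g., image-region graph nodes
        # tgt_nodes: (B, N_tgt, D), e.g., text-word graph nodes
        q = self.query(src_nodes)
        k = self.key(tgt_nodes)
        v = self.value(tgt_nodes)
        attn = F.softmax(torch.bmm(q, k.transpose(1, 2)) * self.scale, dim=-1)
        # Aggregate intermodal context and fuse it back into the source nodes.
        return src_nodes + torch.bmm(attn, v)


# Toy usage: 36 image regions vs. 20 words, 256-d node embeddings.
img_nodes = torch.randn(2, 36, 256)
txt_nodes = torch.randn(2, 20, 256)
encoder = CrossGraphRelationEncoder(256)
img_enhanced = encoder(img_nodes, txt_nodes)   # image nodes enriched with text relations
txt_enhanced = encoder(txt_nodes, img_nodes)   # text nodes enriched with image relations
print(img_enhanced.shape, txt_enhanced.shape)
```

In this reading, applying the encoder in both directions lets the two modalities mutually refine each other's node representations before node-level similarity alignment; the symmetric use of one shared module is an assumption of the sketch.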