Xing Xu, Kaiyi Lin, Yang Yang, Alan Hanjalic, Heng Tao Shen
IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):3030-3047. doi: 10.1109/TPAMI.2020.3045530. Epub 2022 May 5.
Recently, generative adversarial networks (GANs) have shown a strong ability to model data distributions via adversarial learning. Cross-modal GANs, which attempt to leverage the power of GANs to model the cross-modal joint distribution and learn compatible cross-modal features, have become a research hotspot. However, existing cross-modal GAN approaches typically 1) require labeled multimodal data, whose annotation incurs massive labor cost, to establish cross-modal correlation; 2) use the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit the adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome these three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of the two modalities to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve semantic compatibility and enable knowledge transfer in the common embedding space for both the real and synthetic cross-modal features. These components not only help learn a more effective common embedding space that captures the cross-modal correlation but also facilitate transferring knowledge to multimodal data of new classes. Extensive experiments on four widely used cross-modal datasets, with comparisons against more than ten state-of-the-art approaches, show that JFSE achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.
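To make the conditional Wasserstein GAN module described above concrete, the following is a minimal PyTorch sketch, not the authors' code: a generator synthesizes a modality-specific feature from noise concatenated with a class-label word embedding, and a critic scores (feature, embedding) pairs with a WGAN-GP objective. All dimensions, layer sizes, and the penalty weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed dimensions: CNN image features, word2vec-style label embeddings, noise prior.
FEAT_DIM, EMB_DIM, NOISE_DIM = 4096, 300, 100

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + EMB_DIM, 2048), nn.LeakyReLU(0.2),
            nn.Linear(2048, FEAT_DIM), nn.ReLU())  # synthetic modality feature

    def forward(self, noise, label_emb):
        # Condition the synthesis on the class-label word embedding.
        return self.net(torch.cat([noise, label_emb], dim=1))

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + EMB_DIM, 2048), nn.LeakyReLU(0.2),
            nn.Linear(2048, 1))  # unbounded Wasserstein score

    def forward(self, feat, label_emb):
        return self.net(torch.cat([feat, label_emb], dim=1))

def gradient_penalty(critic, real, fake, label_emb):
    # WGAN-GP term: push the critic's gradient norm toward 1 on interpolates,
    # which is what stabilizes training relative to the vanilla GAN.
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp, label_emb).sum(),
                                interp, create_graph=True)
    return ((grad.norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(critic, gen, real_feat, label_emb):
    noise = torch.randn(real_feat.size(0), NOISE_DIM, device=real_feat.device)
    fake_feat = gen(noise, label_emb).detach()
    return (critic(fake_feat, label_emb).mean()
            - critic(real_feat, label_emb).mean()
            + 10.0 * gradient_penalty(critic, real_feat, fake_feat, label_emb))
```

JFSE couples two such modules, one per modality; conditioning both on the same label embedding is what keeps the synthesized image and text features correlated.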
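The distribution alignment with cycle-consistency constraints can likewise be sketched in a hedged form. The exact alignment schemes in JFSE differ in detail; the snippet below only illustrates one plausible shape: modality-specific encoders map features into a common space, decoders close the cycle back to the feature space, and paired codes are pulled together. Both real and generator-synthesized features can be fed through the same loss, which is how synthetic data of unseen classes can shape the common space for zero-shot retrieval.

```python
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM, TXT_DIM, COMMON_DIM = 4096, 300, 512  # illustrative dimensions

img_enc = nn.Linear(IMG_DIM, COMMON_DIM)   # image feature -> common space
txt_enc = nn.Linear(TXT_DIM, COMMON_DIM)   # text feature  -> common space
img_dec = nn.Linear(COMMON_DIM, IMG_DIM)   # common space  -> image feature (cycle)
txt_dec = nn.Linear(COMMON_DIM, TXT_DIM)   # common space  -> text feature  (cycle)

def alignment_loss(img_feat, txt_feat):
    z_img, z_txt = img_enc(img_feat), txt_enc(txt_feat)
    # Cycle-consistency: each feature should be reconstructable from its code.
    cycle = (F.l1_loss(img_dec(z_img), img_feat)
             + F.l1_loss(txt_dec(z_txt), txt_feat))
    # Alignment: paired image/text codes should coincide in the common space.
    align = F.mse_loss(z_img, z_txt)
    return cycle + align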