IEEE Trans Pattern Anal Mach Intell. 2021 Mar;43(3):858-872. doi: 10.1109/TPAMI.2019.2942028. Epub 2021 Feb 4.
Multimodal learning aims to discover the relationships between multiple modalities. It has become an important research topic due to extensive multimodal applications such as cross-modal retrieval. This paper attempts to address the modality heterogeneity problem using Gaussian process latent variable models (GPLVMs) to represent multimodal data in a common space. Previous multimodal GPLVM extensions generally adopt separate learning schemes for latent representations and kernel hyperparameters, which ignore their intrinsic relationship. To exploit the strong complementarity among different modalities and GPLVM components, we develop a novel learning scheme called harmonization, in which latent representations and kernel hyperparameters are jointly learned from each other. Beyond the correlation-fitting and intra-modal structure-preservation paradigms widely used in existing studies, the harmonization is derived in a model-driven manner to encourage agreement between the modality-specific GP kernels and the similarity of the latent representations. We present a range of multimodal learning models by incorporating the harmonization mechanism into several representative GPLVM-based approaches. Experimental results on four benchmark datasets show that the proposed models outperform strong baselines on cross-modal retrieval tasks, and that the harmonized multimodal learning method is superior in discovering semantically consistent latent representations.
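The abstract's core idea, encouraging agreement between each modality-specific GP kernel and the similarity of the shared latent representations, can be illustrated with a minimal numpy sketch. This is not the authors' objective; the RBF kernel, the linear latent-similarity matrix, and the Frobenius-norm penalty are all illustrative assumptions standing in for the model-driven derivation in the paper.

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance):
    # Squared-exponential (RBF) GP kernel over latent points X (n x d).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def harmonization_penalty(X, hypers):
    # Toy "harmonization" term (assumed form): penalize disagreement
    # between each modality-specific kernel matrix K_m (built from the
    # shared latent X with that modality's hyperparameters) and a
    # similarity matrix S of the latent representations.
    # hypers: one (lengthscale, variance) pair per modality.
    S = X @ X.T  # linear similarity of latent points (an assumption)
    return sum(
        np.linalg.norm(rbf_kernel(X, ls, var) - S, "fro")
        for ls, var in hypers
    )

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # shared latent representations
hypers = [(1.0, 1.0), (2.0, 0.5)]    # two modalities' kernel hyperparameters
penalty = harmonization_penalty(X, hypers)
```

In a full model this penalty would be minimized jointly over both the latent points X and the per-modality hyperparameters, which is the "learned from each other" coupling the abstract describes, rather than optimizing each in isolation.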