Kaya Semih, Vural Elif
IEEE Trans Image Process. 2021;30:4384-4394. doi: 10.1109/TIP.2021.3071688. Epub 2021 Apr 21.
While many approaches exist in the literature to learn low-dimensional representations for data collections in multiple modalities, the generalizability of multi-modal nonlinear embeddings to previously unseen data is a rather overlooked subject. In this work, we first present a theoretical analysis of learning multi-modal nonlinear embeddings in a supervised setting. Our performance bounds indicate that for successful generalization in multi-modal classification and retrieval problems, the regularity of the interpolation functions extending the embedding to the whole data space is as important as the between-class separation and cross-modal alignment criteria. We then propose a multi-modal nonlinear representation learning algorithm that is motivated by these theoretical findings, where the embeddings of the training samples are optimized jointly with the Lipschitz regularity of the interpolators. Experimental comparison to recent multi-modal and single-modal learning algorithms suggests that the proposed method yields promising performance in multi-modal image classification and cross-modal image-text retrieval applications.
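The abstract names three criteria that the proposed objective balances: between-class separation, cross-modal alignment of paired samples, and Lipschitz regularity of the interpolation functions that extend the embedding beyond the training set. The following is a minimal NumPy sketch of how such a joint objective could look on toy data; all feature dimensions, weights, and loss terms are hypothetical illustrations, not the authors' actual algorithm or bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired data: 20 samples observed in two modalities, two classes.
# Dimensions and class offsets are hypothetical stand-ins, not the paper's datasets.
n, d1, d2 = 20, 10, 8
labels = np.repeat([0, 1], n // 2)
X1 = rng.normal(size=(n, d1)) + 3.0 * labels[:, None]   # e.g. image features
X2 = rng.normal(size=(n, d2)) + 3.0 * labels[:, None]   # e.g. text features

def pairwise_sq_dists(A):
    diff = A[:, None, :] - A[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def lipschitz_proxy(Y, X):
    """Largest ||y_i - y_j|| / ||x_i - x_j|| over distinct sample pairs: an
    empirical lower bound on the Lipschitz constant of any interpolator that
    maps each training point x_i to its embedding y_i."""
    dY = np.sqrt(pairwise_sq_dists(Y))
    dX = np.sqrt(pairwise_sq_dists(X))
    off_diag = ~np.eye(len(X), dtype=bool)
    return np.max(dY[off_diag] / dX[off_diag])

def objective(Y1, Y2, X1, X2, labels, lam_align=1.0, lam_lip=0.1):
    """Illustrative loss combining the abstract's three criteria:
    between-class separation + cross-modal alignment + Lipschitz regularity."""
    same = (labels[:, None] == labels[None, :]).astype(float)
    D = pairwise_sq_dists(Y1) + pairwise_sq_dists(Y2)
    # Pull same-class embeddings together, push different classes apart
    # (clipped so the repulsion term saturates at a margin).
    separation = np.sum(same * D) - np.sum((1.0 - same) * np.minimum(D, 4.0))
    # Paired samples from the two modalities should embed nearby.
    alignment = np.sum((Y1 - Y2) ** 2)
    # Penalize embeddings that force a steep (large-Lipschitz) interpolator.
    regularity = lipschitz_proxy(Y1, X1) + lipschitz_proxy(Y2, X2)
    return separation + lam_align * alignment + lam_lip * regularity

# A class-separated, cross-modally aligned embedding scores lower
# than a random, unaligned one under this toy objective.
Y_good = np.column_stack([labels.astype(float), np.zeros(n)])
Y_rand1 = rng.normal(size=(n, 2))
Y_rand2 = rng.normal(size=(n, 2))
print(objective(Y_good, Y_good, X1, X2, labels))
print(objective(Y_rand1, Y_rand2, X1, X2, labels))
```

In the paper this trade-off is optimized jointly over the training embeddings; the sketch only evaluates a fixed embedding, and the max-ratio Lipschitz proxy stands in for whatever regularity measure the authors' interpolators use.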