IEEE Trans Cybern. 2022 Oct;52(10):11156-11171. doi: 10.1109/TCYB.2021.3070881. Epub 2022 Sep 19.
For multimodal representation learning, traditional black-box approaches often fail to extract interpretable multilayer hidden structures, which help visualize the connections between different modalities at multiple semantic levels. To extract interpretable multimodal latent representations and visualize the hierarchical semantic relationships between modalities, we build on deep topic models and develop a novel multimodal Poisson gamma belief network (mPGBN) that tightly couples the observations of different modalities by imposing sparse connections between their modality-specific hidden layers. To avoid the time-consuming Gibbs sampling that traditional topic models require at test time, we construct a Weibull-based variational inference network (encoder) that maps observations directly to their latent representations, and combine it with the mPGBN (decoder). The result is a novel multimodal Weibull variational autoencoder (MWVAE), which is fast in out-of-sample prediction and can handle large-scale multimodal datasets. Qualitative evaluations on bimodal data consisting of image-text pairs show that MWVAE successfully extracts expressive multimodal latent representations for downstream tasks such as missing-modality imputation and multimodal retrieval. Extensive quantitative results further demonstrate that both MWVAE and its supervised extension, sMWVAE, achieve state-of-the-art performance on various multimodal benchmarks.
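The abstract does not spell out how the Weibull-based encoder produces latent samples. As a minimal sketch, assuming the encoder outputs a shape `k` and a scale `lam` per latent unit, the standard inverse-CDF reparameterization of the Weibull distribution lets a sample be written as a deterministic function of uniform noise, which is what makes gradient-based training of such an encoder feasible. The function names `sample_weibull` and `weibull_mean` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import numpy as np


def sample_weibull(k, lam, rng):
    """Draw Weibull(k, lam) samples through the inverse-CDF transform.

    k   : shape parameter(s), must be > 0 (assumed encoder output)
    lam : scale parameter(s), must be > 0 (assumed encoder output)
    rng : a numpy.random.Generator supplying the uniform noise
    """
    k = np.asarray(k, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # Uniform noise in [0, 1); all randomness is isolated in u.
    u = rng.uniform(size=np.broadcast(k, lam).shape)
    # Inverse CDF of the Weibull: z = lam * (-log(1 - u))^(1/k).
    return lam * (-np.log1p(-u)) ** (1.0 / k)


def weibull_mean(k, lam):
    """Analytical mean of Weibull(k, lam): lam * Gamma(1 + 1/k)."""
    return lam * math.gamma(1.0 + 1.0 / k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = sample_weibull(np.full(200_000, 2.0), 1.5, rng)
    # The empirical mean should be close to the analytical mean.
    print(z.mean(), weibull_mean(2.0, 1.5))
```

Because the sample is a smooth function of `k` and `lam` for fixed noise `u`, an autodiff framework could propagate gradients through it. The NumPy version here only demonstrates the transform itself.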