CentraleSupélec IETR UMR CNRS 6164, France.
Neural Netw. 2024 Apr;172:106120. doi: 10.1016/j.neunet.2024.106120. Epub 2024 Jan 11.
High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representations of the VQ-VAEs, before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
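To make the latent structure described in the abstract concrete, the following is a minimal, self-contained PyTorch sketch of one way an MDVAE-style factorization could be laid out: a sequence-level static latent w, a per-frame latent z_av shared by both modalities, and per-frame modality-specific latents z_a and z_v, all operating on pre-quantization VQ-VAE features. All module names, layer choices, and dimensions below are illustrative assumptions, not the authors' implementation; the training objective (reconstruction plus KL terms) and the temporal priors are omitted.

```python
# Illustrative sketch only: a toy factorization with a static latent (w), a
# dynamical latent shared across modalities (z_av), and modality-specific
# dynamical latents (z_a, z_v). Layer sizes and simple Gaussian encoders /
# decoders are assumptions for illustration, not the paper's architecture.
# Inputs are assumed to be pre-quantization VQ-VAE features of shape
# (batch, time, feature_dim) for each modality.

import torch
import torch.nn as nn


def reparameterize(mean, logvar):
    """Sample z ~ N(mean, exp(logvar)) with the reparameterization trick."""
    return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)


class ToyMDVAE(nn.Module):
    def __init__(self, dim_a=64, dim_v=64, dim_w=16, dim_zav=16, dim_zm=8):
        super().__init__()
        # Static latent w: encoded once per sequence from both modalities.
        self.enc_w = nn.Linear(dim_a + dim_v, 2 * dim_w)
        # Shared dynamical latent z_av(t): encoded per frame from both modalities.
        self.enc_zav = nn.Linear(dim_a + dim_v, 2 * dim_zav)
        # Modality-specific dynamical latents z_a(t) and z_v(t).
        self.enc_za = nn.Linear(dim_a, 2 * dim_zm)
        self.enc_zv = nn.Linear(dim_v, 2 * dim_zm)
        # Each modality is reconstructed from (w, z_av, its own specific latent).
        self.dec_a = nn.Linear(dim_w + dim_zav + dim_zm, dim_a)
        self.dec_v = nn.Linear(dim_w + dim_zav + dim_zm, dim_v)

    def forward(self, x_a, x_v):
        # x_a, x_v: (batch, time, dim_a / dim_v) pre-quantization VQ-VAE features.
        B, T, _ = x_a.shape
        x_av = torch.cat([x_a, x_v], dim=-1)

        # Static latent: one sample per sequence (mean-pooled over time).
        w_mean, w_logvar = self.enc_w(x_av.mean(dim=1)).chunk(2, dim=-1)
        w = reparameterize(w_mean, w_logvar)                      # (B, dim_w)

        # Dynamical latents: one sample per frame.
        zav_mean, zav_logvar = self.enc_zav(x_av).chunk(2, dim=-1)
        z_av = reparameterize(zav_mean, zav_logvar)               # (B, T, dim_zav)
        za_mean, za_logvar = self.enc_za(x_a).chunk(2, dim=-1)
        z_a = reparameterize(za_mean, za_logvar)                  # (B, T, dim_zm)
        zv_mean, zv_logvar = self.enc_zv(x_v).chunk(2, dim=-1)
        z_v = reparameterize(zv_mean, zv_logvar)                  # (B, T, dim_zm)

        # Broadcast the static latent over time and decode each modality.
        w_t = w.unsqueeze(1).expand(B, T, -1)
        recon_a = self.dec_a(torch.cat([w_t, z_av, z_a], dim=-1))
        recon_v = self.dec_v(torch.cat([w_t, z_av, z_v], dim=-1))
        return recon_a, recon_v


if __name__ == "__main__":
    model = ToyMDVAE()
    x_a = torch.randn(2, 10, 64)  # dummy audio features
    x_v = torch.randn(2, 10, 64)  # dummy visual features
    recon_a, recon_v = model(x_a, x_v)
    print(recon_a.shape, recon_v.shape)
```

In this layout, downstream tasks such as emotion recognition would read out the sequence-level latent w, while editing a single frame's audio or visual content would act on z_a or z_v; this is a structural illustration of the disentanglement described in the abstract, not a reproduction of the reported experiments.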