Department of Applied and Cognitive Informatics, Graduate School of Science and Engineering, Chiba University, Chiba 263-8522, Japan.
Graduate School of Engineering, Chiba University, Chiba 263-8522, Japan.
Sensors (Basel). 2023 Feb 24;23(5):2515. doi: 10.3390/s23052515.
In this paper, we propose a sequential variational autoencoder for video disentanglement, a representation learning method that separately extracts static and dynamic features from videos. Building sequential variational autoencoders with a two-stream architecture induces an inductive bias for video disentanglement. However, our preliminary experiment demonstrated that the two-stream architecture alone is insufficient for video disentanglement because the static features frequently contain dynamic features. Additionally, we found that the dynamic features are not discriminative in the latent space. To address these problems, we introduce an adversarial classifier trained with supervised learning into the two-stream architecture. The strong inductive bias provided by supervision separates the dynamic features from the static features and yields discriminative representations of the dynamic features. Through a comparison with other sequential variational autoencoders, we qualitatively and quantitatively demonstrate the effectiveness of the proposed method on the Sprites and MUG datasets.
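The abstract describes a two-stream latent structure (a static latent per clip, a dynamic latent per frame) with a supervised adversarial classifier on top. The following is a minimal NumPy sketch of that idea only, not the paper's model: the linear encoders, toy dimensions, and single-classifier adversarial term are all illustrative assumptions (the actual method uses recurrent sequential VAE encoders and a full ELBO objective).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_static(frames, W_s):
    # Static stream: pool over time so one latent summarizes the whole clip
    # (stand-in for the paper's static encoder)
    return frames.mean(axis=0) @ W_s

def encode_dynamic(frames, W_d):
    # Dynamic stream: one latent per frame (a per-frame linear map here;
    # the paper's model would use a recurrent encoder)
    return frames @ W_d

def classifier_logits(z, W_c):
    # Shared label classifier applied to a latent vector
    return z @ W_c

def cross_entropy(logits, label):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label] + 1e-12)

# Toy sizes: T frames of D pixels, latent dims Zs/Zd, K action classes
T, D, Zs, Zd, K = 8, 16, 4, 4, 3
frames = rng.normal(size=(T, D))
W_s = rng.normal(size=(D, Zs))
W_d = rng.normal(size=(D, Zd))
W_c = rng.normal(size=(Zd, K))
label = 1  # ground-truth action label of this clip

z_static = encode_static(frames, W_s)    # shape (Zs,)
z_dynamic = encode_dynamic(frames, W_d)  # shape (T, Zd)

# Supervised term: the classifier on the dynamic latents is trained to
# predict the action label, making z_dynamic discriminative.  In the
# adversarial variant described in the abstract, an analogous classifier
# on z_static would be fooled so the static latent carries no action
# information; that min-max game is omitted from this forward-pass sketch.
cls_loss = np.mean([cross_entropy(classifier_logits(z, W_c), label)
                    for z in z_dynamic])
```

Training would minimize this classification loss on the dynamic stream jointly with the sequential VAE's reconstruction and KL terms, while the static stream is updated adversarially against its own classifier.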