Yao Zhiyu, Wang Yunbo, Wu Haixu, Wang Jianmin, Long Mingsheng
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13281-13296. doi: 10.1109/TPAMI.2023.3293145. Epub 2023 Oct 3.
Learning predictive models for unlabeled spatiotemporal data is challenging, in part because visual dynamics can be highly entangled, especially in real scenes. In this paper, we refer to the multi-modal output distribution of predictive learning as spatiotemporal modes. We observe an experimental phenomenon, which we name spatiotemporal mode collapse (STMC), in most existing video prediction models: features collapse into invalid representation subspaces due to an ambiguous understanding of mixed physical processes. We propose to quantify STMC and explore its solution for the first time in the context of unsupervised predictive learning. To this end, we present ModeRNN, a decoupling-aggregation framework with a strong inductive bias for discovering the compositional structures of spatiotemporal modes between recurrent states. We first leverage a set of dynamic slots with independent parameters to extract individual building components of spatiotemporal modes. We then perform a weighted fusion of slot features to adaptively aggregate them into a unified hidden representation for recurrent updates. Through a series of experiments, we show a high correlation between STMC and the fuzzy prediction of future video frames. Furthermore, ModeRNN is shown to better mitigate STMC and achieves state-of-the-art results on five video prediction datasets.
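The decoupling-aggregation idea in the abstract can be sketched as a single recurrent update: each slot extracts its own feature with independent parameters, and the slot features are then combined by a learned softmax weighting into one hidden state. The following is a minimal NumPy sketch under assumptions, not the paper's actual implementation; the function name `modernn_step`, the `tanh` slot nonlinearity, and the dot-product scoring vector `fusion_w` are all illustrative choices.

```python
import numpy as np

def modernn_step(x, h, slot_params, fusion_w):
    """One hypothetical decoupling-aggregation recurrent step.

    x           -- input feature vector at the current time step
    h           -- previous hidden state
    slot_params -- list of (W_x, W_h) pairs, one per slot, with
                   independent parameters per slot
    fusion_w    -- scoring vector used to weight slots for fusion
    """
    # Decoupling: each slot extracts its own component of the dynamics
    # using its own parameters.
    slot_feats = np.stack(
        [np.tanh(W_x @ x + W_h @ h) for W_x, W_h in slot_params]
    )  # shape: (num_slots, hidden_dim)

    # Aggregation: softmax-weighted fusion of slot features into a
    # single unified hidden representation for the recurrent update.
    scores = slot_feats @ fusion_w                  # (num_slots,)
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    h_next = (alphas[:, None] * slot_feats).sum(axis=0)
    return h_next, alphas
```

The fusion weights `alphas` are adaptive: they depend on the slot features themselves, so different inputs can emphasize different slots, which is one way to keep mixed physical processes from collapsing into a single entangled representation.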