Massague Armand Comas, Fernandez-Lopez Christian, Ghimire Sandesh, Li Haolin, Sznaier Mario, Camps Octavia
Northeastern University.
Proc Mach Learn Res. 2023;5:745-769.
One of the long-term objectives of Machine Learning is to endow machines with the capacity of structuring and interpreting the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts to learn parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in nonlinear system identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such a decomposition can be used, for instance, to perform video analytics, predict future frames or generate synthetic video. We test our framework on a variety of datasets that encompass different dynamic scenarios, while illustrating the novel features that emerge from our dynamic modes decomposition: video dynamics interpretation and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.
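To make the Koopman view concrete, the sketch below shows the standard finite-dimensional approximation of a Koopman operator fitted by least squares (Dynamic Mode Decomposition), whose eigenvalues and eigenvectors give the dynamic modes discussed in the abstract. This is a minimal illustration under the assumption that latent object trajectories `Z` are already available as a matrix; the paper learns such latents end-to-end from pixels, which is not reproduced here.

```python
import numpy as np

# Minimal DMD-style sketch (illustrative only): fit a finite-dimensional
# Koopman approximation K to latent trajectories via least squares, then
# inspect its eigenvalues/eigenvectors as dynamic modes.
rng = np.random.default_rng(0)

T, d = 100, 8                      # time steps, latent dimension (assumed)
Z = rng.standard_normal((T, d))    # placeholder latent trajectory (T x d)

X, Y = Z[:-1].T, Z[1:].T           # snapshot pairs: Y ≈ K @ X
K = Y @ np.linalg.pinv(X)          # least-squares Koopman matrix (d x d)

eigvals, eigvecs = np.linalg.eig(K)   # dynamic modes of the learned operator
growth = np.abs(eigvals)              # growth/decay rate per mode
freqs = np.angle(eigvals)             # oscillation frequency per mode

def forecast(z0, steps):
    """Forecast future latents by iterating the linear operator."""
    preds, z = [], z0
    for _ in range(steps):
        z = K @ z
        preds.append(z)
    return np.stack(preds)

future = forecast(Z[-1], steps=10)
print(growth.round(3), freqs.round(3))
```

Because the dynamics are captured by a linear operator in the latent space, forecasting reduces to repeated matrix multiplication, and editing individual eigenvalues (e.g. damping a mode) offers one way to interpret or manipulate the dynamics at test time.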