Dai Kuai, Li Xutao, Luo Chuyao, Chen Wuqiao, Ye Yunming, Feng Shanshan
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, Guangdong, China.
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, Guangdong, China.
Neural Netw. 2023 Nov;168:256-271. doi: 10.1016/j.neunet.2023.09.024. Epub 2023 Sep 21.
As a pixel-wise dense forecast task, video prediction is challenging due to its high computation complexity, dramatic future uncertainty, and extremely complicated spatial-temporal patterns. Many deep learning methods are proposed for the task, which bring up significant improvements. However, they focus on modeling short-term spatial-temporal dynamics and fail to sufficiently exploit long-term ones. As a result, the methods tend to deliver unsatisfactory performance for a long-term forecast requirement. In this article, we propose a novel unified memory network (UNIMEMnet) for long-term video prediction, which can effectively exploit long-term motion-appearance dynamics and unify the short-term spatial-temporal dynamics and long-term ones in an architecture. In the UNIMEMnet, a dual branch multi-scale memory module is carefully designed to extract and preserve long-term spatial-temporal patterns. In addition, a short-term spatial-temporal dynamics module and an alignment and fusion module are devised to capture and coordinate short-term motion-appearance dynamics with long-term ones from our designed memory module. Extensive experiments on five video prediction datasets from both synthetic and real-world scenarios are conducted, which validate the effectiveness and superiority of our proposed method UNIMEMnet over state-of-the-art methods.
作为一项逐像素的密集预测任务,视频预测具有挑战性,因为其计算复杂度高、未来不确定性大,且时空模式极其复杂。针对该任务提出了许多深度学习方法,这些方法带来了显著的改进。然而,它们专注于对短期时空动态进行建模,未能充分利用长期动态。因此,对于长期预测需求,这些方法往往表现不佳。在本文中,我们提出了一种用于长期视频预测的新型统一记忆网络(UNIMEMnet),它可以有效地利用长期运动外观动态,并在一个架构中统一短期时空动态和长期动态。在UNIMEMnet中,精心设计了一个双分支多尺度记忆模块来提取和保留长期时空模式。此外,还设计了一个短期时空动态模块以及一个对齐与融合模块,以捕捉短期运动外观动态并将其与我们设计的记忆模块中的长期动态进行协调。我们在来自合成和现实世界场景的五个视频预测数据集上进行了广泛的实验,验证了我们提出的方法UNIMEMnet相对于现有方法的有效性和优越性。