Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China.
School of Computing Sciences, University of East Anglia, Norwich, U.K.
IEEE Trans Image Process. 2018;27(1):38-49. doi: 10.1109/TIP.2017.2754941.
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: 1) deep video saliency model training in the absence of sufficiently large, pixel-wise annotated video data and 2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image data sets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, producing accurate spatiotemporal saliency estimates. We advance the state of the art on the densely annotated video segmentation data set (MAE of 0.06) and the Freiburg-Berkeley Motion Segmentation data set (MAE of 0.07), and do so with much improved speed (2 fps with all steps).
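As a rough illustration of the two-module design described in the abstract, the sketch below (PyTorch) wires a static (spatial) saliency branch to a dynamic (temporal) branch that consumes a pair of consecutive frames together with the static saliency estimate, so no optical flow is computed. The layer widths, module names, and input resolution are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch of the two-module idea: a static (spatial) saliency branch
# and a dynamic (temporal) branch that takes a frame pair plus the static
# saliency estimate. All layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU; a stand-in for real FCN stages."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class StaticSaliencyNet(nn.Module):
    """Spatial module: single RGB frame -> per-pixel saliency map."""

    def __init__(self):
        super().__init__()
        self.features = conv_block(3, 32)
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, frame):
        return torch.sigmoid(self.head(self.features(frame)))


class DynamicSaliencyNet(nn.Module):
    """Temporal module: two consecutive frames plus the static saliency
    estimate -> spatiotemporal saliency map (no optical flow needed)."""

    def __init__(self):
        super().__init__()
        # 3 + 3 channels for the frame pair, +1 for the static saliency prior.
        self.features = conv_block(7, 32)
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, frame_t, frame_t1, static_saliency):
        x = torch.cat([frame_t, frame_t1, static_saliency], dim=1)
        return torch.sigmoid(self.head(self.features(x)))


if __name__ == "__main__":
    static_net, dynamic_net = StaticSaliencyNet(), DynamicSaliencyNet()
    f_t = torch.randn(1, 3, 224, 224)   # frame at time t
    f_t1 = torch.randn(1, 3, 224, 224)  # frame at time t+1
    s = static_net(f_t)                 # spatial saliency prior
    out = dynamic_net(f_t, f_t1, s)     # spatiotemporal saliency estimate
    print(out.shape)                    # torch.Size([1, 1, 224, 224])
```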
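The video-simulation augmentation can likewise be sketched in a simplified form: starting from a still image and its pixel-wise saliency annotation, a second "frame" is synthesized by displacing the annotated object slightly, so the resulting pair mimics inter-frame object motion. The random-shift scheme, the function name simulate_frame_pair, and its parameters below are assumptions for illustration; the paper's synthesis procedure is more elaborate.

```python
# Simplified sketch of simulating a video frame pair from one annotated image:
# shift the salient object (and its mask) by a small random offset to mimic
# object motion. The shifting scheme here is an illustrative assumption only.
import numpy as np


def simulate_frame_pair(image, mask, max_shift=10, rng=None):
    """Return (frame_t, mask_t, frame_t1, mask_t1) from one annotated image.

    image: (H, W, 3) uint8 array; mask: (H, W) binary array.
    """
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)

    # Shift the salient object and its mask by (dy, dx); the original image
    # provides the background where the object has moved away.
    shifted_mask = np.roll(mask, shift=(dy, dx), axis=(0, 1))
    shifted_obj = np.roll(image, shift=(dy, dx), axis=(0, 1))
    frame_t1 = np.where(shifted_mask[..., None].astype(bool), shifted_obj, image)
    return image, mask, frame_t1, shifted_mask


if __name__ == "__main__":
    img = np.zeros((64, 64, 3), dtype=np.uint8)
    msk = np.zeros((64, 64), dtype=np.uint8)
    img[20:40, 20:40] = 255  # a white square as the "salient object"
    msk[20:40, 20:40] = 1
    _, _, f1, m1 = simulate_frame_pair(img, msk)
    print(f1.shape, m1.sum())  # (64, 64, 3) and the shifted object's area
```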