IEEE Trans Image Process. 2018 Oct;27(10):5002-5015. doi: 10.1109/TIP.2018.2849860.
This paper presents a method for detecting salient objects in videos, where temporal information in addition to spatial information is fully taken into account. Following recent reports on the advantage of deep features over conventional handcrafted features, we propose a new set of spatiotemporal deep (STD) features that utilize local and global contexts over frames. We also propose a new spatiotemporal conditional random field (STCRF) to compute saliency from STD features. STCRF is our extension of CRF to the temporal domain and describes the relationships among neighboring regions both within a frame and over frames. STCRF yields temporally consistent saliency maps over frames, contributing to accurate detection of salient object boundaries and to noise reduction during detection. Our method first segments an input video at multiple scale levels and then computes a saliency map at each scale using STD features with STCRF. The final saliency map is computed by fusing the saliency maps at the different scale levels. Experiments on publicly available benchmark datasets confirm that the proposed method significantly outperforms state-of-the-art methods. We also applied our saliency computation to the video object segmentation task, showing that our method outperforms existing video object segmentation methods.
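To make the STCRF idea concrete, the energy below is a minimal sketch of the general form such a spatiotemporal CRF takes: a unary term driven by the STD features plus pairwise terms over spatial and temporal neighborhoods. The weights $\lambda_s$, $\lambda_t$ and the potential definitions are illustrative assumptions for exposition, not the paper's exact formulation.

$$
E(\mathbf{s}) \;=\; \sum_{i} \psi_u\!\left(s_i \mid \mathbf{f}_i\right)
\;+\; \lambda_s \!\!\sum_{(i,j)\,\in\,\mathcal{N}_{\mathrm{spatial}}} \!\!\psi_p(s_i, s_j)
\;+\; \lambda_t \!\!\sum_{(i,k)\,\in\,\mathcal{N}_{\mathrm{temporal}}} \!\!\psi_p(s_i, s_k)
$$

Here $s_i$ is the saliency label of region $i$, $\mathbf{f}_i$ its STD feature, $\mathcal{N}_{\mathrm{spatial}}$ connects neighboring regions within a frame, and $\mathcal{N}_{\mathrm{temporal}}$ connects corresponding regions across adjacent frames. Minimizing $E$ encourages neighboring regions, both within a frame and over frames, to receive consistent saliency, which is what produces the temporally consistent saliency maps described in the abstract.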