IEEE Trans Image Process. 2016 Apr;25(4):1779-92. doi: 10.1109/TIP.2016.2531283. Epub 2016 Feb 18.
Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training on a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which are combined with a set of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation that also preserves the inner geometric layout of the target. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to de-noise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.
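The pipeline sketched in the abstract (normalized first-frame patches as fixed filters, similarity feature maps in later frames, soft-shrinkage de-noising) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the patch size, stride, and median-based adaptive threshold below are illustrative assumptions, and the adaptive contextual filters are omitted for brevity.

```python
import numpy as np

def extract_patch_filters(target, patch_size=6, stride=3):
    # Sample patches from the first-frame target region and normalize each
    # (zero mean, unit L2 norm) to serve as fixed convolutional filters.
    H, W = target.shape
    filters = []
    for i in range(0, H - patch_size + 1, stride):
        for j in range(0, W - patch_size + 1, stride):
            p = target[i:i + patch_size, j:j + patch_size].astype(float)
            p = p - p.mean()
            n = np.linalg.norm(p)
            if n > 1e-8:  # skip (near-)constant patches with no structure
                filters.append(p / n)
    return filters

def feature_maps(image, filters):
    # Valid cross-correlation of each filter with the image: one similarity
    # map per filter, encoding local structural information of the target.
    H, W = image.shape
    k = filters[0].shape[0]
    maps = np.empty((len(filters), H - k + 1, W - k + 1))
    for f_idx, f in enumerate(filters):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                maps[f_idx, i, j] = np.sum(image[i:i + k, j:j + k] * f)
    return maps

def soft_shrink(x, lam):
    # Soft shrinkage: zero out responses below the threshold lam and
    # shrink the remaining values toward zero, de-noising the maps.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

Stacking all the shrunken maps then gives the global representation; an adaptive threshold could, for instance, be set as `lam = np.median(np.abs(maps))`, though the paper's specific rule may differ.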