IEEE Trans Pattern Anal Mach Intell. 2019 Nov;41(11):2709-2723. doi: 10.1109/TPAMI.2018.2865311. Epub 2018 Aug 13.
Visual tracking is challenging because target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter, and occlusion. In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of multiple convolutional layers, which encode target appearance at different levels of abstraction. For example, the outputs of the last convolutional layers encode the semantic information of targets, and such representations are invariant to significant appearance variations; however, their spatial resolution is too coarse to localize targets precisely. In contrast, features from earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchical features of convolutional layers as a nonlinear counterpart of an image pyramid representation and explicitly exploit these multiple levels of abstraction to represent target objects. Specifically, we learn adaptive correlation filters on the outputs of each convolutional layer to encode the target appearance, and we infer the maximum response of each layer to locate targets in a coarse-to-fine manner. To further handle scale estimation and the re-detection of targets after tracking failures caused by heavy occlusion or out-of-view movement, we conservatively learn another correlation filter, which maintains a long-term memory of target appearance, as a discriminative classifier. We apply this classifier to two types of object proposals: (1) proposals generated with a small step size and tightly around the estimated location, for scale estimation; and (2) proposals generated with a large step size and across the whole image, for target re-detection. Extensive experimental results on large-scale benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art tracking methods.
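As a rough illustration of the core idea, the sketch below (Python/NumPy) learns one ridge-regression correlation filter per convolutional layer in the Fourier domain and fuses the per-layer response maps from the coarse (semantic) layer down to a fine (spatial) one before taking the argmax. The feature shapes, the weighted-sum fusion (the paper instead constrains each finer search by the coarser maximum), and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired filter response: a 2-D Gaussian, circularly shifted so its
    peak sits at the origin (standard correlation-filter training target)."""
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2.0 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def learn_filter(feature, label, lam=1e-4):
    """Ridge regression in the Fourier domain: one filter per feature
    channel, with the regularizer lam shared across channels."""
    X = np.fft.fft2(feature, axes=(0, 1))     # (H, W, C)
    Y = np.fft.fft2(label)[..., None]         # (H, W, 1)
    return (np.conj(X) * Y) / (np.sum(X * np.conj(X), axis=2, keepdims=True) + lam)

def response(filt, feature):
    """Spatial response map of a learned filter on new features."""
    X = np.fft.fft2(feature, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(filt * X, axis=2)))

def locate(filters, features, weights):
    """Fuse per-layer responses, ordered from the last (coarse, semantic)
    layer to an early (fine, spatial) one, and take the argmax. A weighted
    sum is a simplified stand-in for the paper's coarse-to-fine search."""
    fused = sum(w * response(f, x) for f, x, w in zip(filters, features, weights))
    return np.unravel_index(np.argmax(fused), fused.shape)

# Toy usage: three fake "layers", all resized to a common 32x32 grid.
feats = [np.random.randn(32, 32, c) for c in (512, 512, 256)]
y = gaussian_label(32, 32)
filters = [learn_filter(x, y) for x in feats]
print(locate(filters, feats, weights=(1.0, 0.5, 0.25)))
```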
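The long-term component can be sketched on top of the same helpers. Below, a conservatively updated filter acts as a discriminative classifier over proposals: a confidence drop at the estimated location triggers re-detection over image-wide, large-step proposals, and otherwise tight, small-step proposals refine the scale. The thresholds, the learning rate, and the (box, feature) proposal interface are illustrative assumptions rather than the paper's exact settings.

```python
class LongTermFilter:
    """Correlation filter kept as a long-term memory of target appearance.
    Reuses learn_filter/response from the sketch above."""

    def __init__(self, feature, label, lr=0.01):
        self.label, self.lr = label, lr
        self.H = learn_filter(feature, label)

    def score(self, feature):
        # peak response as confidence that `feature` covers the target
        return float(response(self.H, feature).max())

    def update(self, feature):
        # slow interpolation: a conservative update preserves early appearance
        self.H = (1 - self.lr) * self.H + self.lr * learn_filter(feature, self.label)

def long_term_step(ltf, est_feat, local_candidates, global_candidates,
                   redetect_thr=0.15, update_thr=0.40):
    """One frame of the long-term logic. Candidates are (box, feature)
    pairs from a hypothetical proposal stage: `local_candidates` use a
    small step size around the estimate (scale), `global_candidates` a
    large step size over the whole image (re-detection)."""
    if ltf.score(est_feat) < redetect_thr:   # likely occlusion / out of view
        candidates = global_candidates
    else:
        candidates = local_candidates
    best_box, best_feat = max(candidates, key=lambda bf: ltf.score(bf[1]))
    if ltf.score(best_feat) > update_thr:    # adapt only when confident
        ltf.update(best_feat)
    return best_box
```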