IEEE Trans Pattern Anal Mach Intell. 2019 Sep;41(9):2146-2160. doi: 10.1109/TPAMI.2018.2849374. Epub 2018 Jun 21.
In this work, we propose a tracker that differs from most existing multi-target trackers in two major ways. First, our tracker does not rely on a pre-trained object detector to obtain the initial object hypotheses. Second, our tracker's final output is the fine contours of the targets rather than traditional bounding boxes. Our tracker therefore solves three main problems simultaneously: detection, data association, and segmentation. This is especially important because the outputs of these three problems are highly correlated, and the solution of one can greatly help improve the others. The proposed algorithm consists of two main components: structured learning and Lagrange dual decomposition. Our structured-learning-based tracker learns a model for each target and infers the best locations of all targets simultaneously in a video clip. Inference in our structured learning is achieved through a new Target Identity-aware Network Flow (TINF), in which each node in the network encodes the probability of each target identity belonging to that node. The probabilities are obtained by training target-specific models using a global structured learning technique. This is followed by a proposed Lagrangian relaxation optimization that finds a high-quality solution to the network. This forms the first component of our tracker. The second component is Lagrange dual decomposition, which combines the structured learning tracker with a segmentation algorithm. For segmentation, a multi-label Conditional Random Field (CRF) is applied to a superpixel-based spatio-temporal graph over a segment of video in order to assign a background or target label to every superpixel. We show how the multi-label CRF is combined with the structured learning tracker through our dual decomposition formulation. This leads to more accurate segmentation results and also helps resolve typical difficulties in multiple target tracking, such as occlusion, identity switches, and track drift. Experiments on diverse and challenging sequences show that our method achieves superior results compared to competitive approaches for detection, multiple target tracking, and segmentation.
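The idea of coupling two solvers through dual decomposition can be illustrated with a toy sketch. This is not the formulation of the paper: the "tracker" subproblem is replaced here by a one-hot choice among candidate locations, the "segmentation" subproblem by independent binary labels, and the names `dual_decomposition`, `a`, `b`, and `lam` are hypothetical. The two subproblems are solved separately, and Lagrange multipliers are updated by subgradient ascent until the two solutions agree.

```python
def dual_decomposition(a, b, max_iters=100):
    """Toy problem (an assumption, not the paper's model):
    minimise sum(a[i]*x[i]) + sum(b[i]*y[i]) subject to x == y,
    where x must be one-hot and y is any binary vector."""
    n = len(a)
    lam = [0.0] * n  # one multiplier per agreement constraint x[i] == y[i]
    x = [0] * n
    for t in range(max_iters):
        # Subproblem 1 (stand-in for the tracker):
        # choose the one-hot x minimising (a + lam) . x
        k = min(range(n), key=lambda i: a[i] + lam[i])
        x = [1 if i == k else 0 for i in range(n)]
        # Subproblem 2 (stand-in for the segmentation):
        # choose each binary y[i] independently to minimise (b - lam) . y
        y = [1 if b[i] - lam[i] < 0 else 0 for i in range(n)]
        if x == y:
            return x, True  # both subproblems agree
        # Subgradient ascent on the dual with a diminishing step size
        step = 1.0 / (1 + t)
        lam = [lam[i] + step * (x[i] - y[i]) for i in range(n)]
    return x, False  # no agreement within the iteration budget


# Hypothetical costs for three candidate locations
a = [2.0, 0.5, 1.0]
b = [-1.0, 0.2, -0.8]
labels, agreed = dual_decomposition(a, b)
```

Each subproblem keeps its own specialised solver, and only the multipliers are exchanged between them. Subgradient updates are not guaranteed to reach agreement on every problem, which is why the sketch also returns a flag.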