Wang Xiao, Chen Zhe, Jiang Bo, Tang Jin, Luo Bin, Tao Dacheng
IEEE Trans Image Process. 2022;31:6239-6254. doi: 10.1109/TIP.2022.3208437. Epub 2022 Sep 30.
To track the target in a video, current visual trackers usually adopt a greedy search for target localization in each frame; that is, the candidate region with the maximum response score is selected as the tracking result for that frame. However, we found that this may not be an optimal choice, especially in challenging tracking scenarios such as heavy occlusion and fast motion. In particular, once a tracker drifts, errors accumulate and further make the response scores it estimates in future frames unreliable. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. We formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which picks out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines which actions should be taken to update the related information. More specifically, using a classification-based tracker as the baseline, we first adopt a bi-GRU to encode the target feature, proposal features, and their response scores into a unified state representation. The state feature and the greedy search result are then fed into the first agent for independent action selection. Afterwards, the output action and state features are fed into the subsequent agents to predict diverse results. Once all frames are processed, we select the trajectory with the maximum accumulated score as the final tracking result.
Extensive experiments on seven popular tracking benchmark datasets validated the effectiveness of the proposed algorithm.
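The core selection idea above (keep several candidate trajectories per frame instead of only the greedy best, then return the trajectory with the maximum accumulated score) can be sketched as plain beam search over per-frame response scores. This is only an illustrative toy, not the paper's method: it omits the reinforcement-learning agents and the bi-GRU state encoder, and the proposal lists and scoring values are hypothetical stand-ins for a real tracker's outputs.

```python
def beam_search_track(per_frame_proposals, beam_width=3):
    """Select a trajectory by beam search instead of greedy per-frame argmax.

    per_frame_proposals: list over frames; each frame is a list of
    (candidate_id, response_score) pairs produced by a base tracker.
    Returns (accumulated_score, list_of_chosen_candidate_ids).
    """
    # Each beam entry: (accumulated response score, chosen candidates so far).
    beams = [(0.0, [])]
    for proposals in per_frame_proposals:
        expanded = []
        for acc_score, path in beams:
            for cand_id, score in proposals:
                expanded.append((acc_score + score, path + [cand_id]))
        # Keep only the top-K partial trajectories by accumulated score.
        expanded.sort(key=lambda entry: entry[0], reverse=True)
        beams = expanded[:beam_width]
    # When all frames are processed, return the highest-scoring trajectory.
    return max(beams, key=lambda beam: beam[0])


if __name__ == "__main__":
    # Hypothetical response scores for two candidates over three frames.
    # In frame 2 the greedy choice flips to candidate "b" (e.g. a distractor
    # under occlusion), but the beam keeps both hypotheses alive.
    frames = [
        [("a", 0.9), ("b", 0.4)],
        [("a", 0.2), ("b", 0.8)],
        [("a", 0.9), ("b", 0.1)],
    ]
    score, path = beam_search_track(frames, beam_width=2)
    print(score, path)
```

A greedy tracker commits to one candidate per frame and cannot recover once the response scores are corrupted; the beam defers that commitment, which is why a trajectory with fewer accumulated errors can still win at the end.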