Gurkan Filiz, Cerkezi Llukman, Cirakman Ozgun, Gunsel Bilge
IEEE Trans Image Process. 2021;30:7938-7951. doi: 10.1109/TIP.2021.3112010. Epub 2021 Sep 22.
Recent tracking-by-detection approaches use deep object detectors as the target detection baseline because of their high performance on still images. For effective video object tracking, object detection is integrated with a data association step performed either by a custom-designed inference architecture or by end-to-end joint training for tracking. In this work, we adopt the former approach and use the pre-trained Mask R-CNN deep object detector as the baseline. We introduce a novel inference architecture placed on top of the FPN-ResNet101 backbone of Mask R-CNN to jointly perform detection and tracking, without requiring additional training for tracking. The proposed single-object tracker, TDIOT, applies appearance-similarity-based temporal matching for data association. To tackle tracking discontinuities, we incorporate a local search and matching module, which exploits SiamFC, into the inference head layer. Moreover, to improve robustness to scale changes, we introduce a scale-adaptive region proposal network that searches for the target in an adaptively enlarged spatial neighborhood specified by the target's trace. To meet long-term tracking requirements, a low-cost verification layer is incorporated into the inference architecture to monitor the presence of the target based on its LBP histogram model. Performance evaluation on videos from the VOT2016, VOT2018, and VOT-LT2018 datasets demonstrates that TDIOT achieves higher accuracy than state-of-the-art short-term trackers while providing comparable performance in long-term tracking. We also evaluate our tracker on the LaSOT dataset, where we observe that TDIOT performs comparably to methods trained on LaSOT. The source code and TDIOT output videos are accessible at https://github.com/msprITU/TDIOT.
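The verification layer described above monitors target presence by comparing a candidate patch against a stored LBP histogram model. The abstract does not specify the exact LBP variant or similarity measure, so the following is only a minimal sketch, assuming a standard 8-neighbour LBP with a 256-bin normalized histogram and histogram intersection as the similarity score; the function names and the 0.6 threshold are illustrative, not from the paper.

```python
import numpy as np

def lbp_image(gray):
    """8-neighbour local binary pattern codes for a grayscale patch.

    gray: 2-D uint8 array. Returns LBP codes (0..255) for all pixels
    that have a complete 3x3 neighbourhood.
    """
    center = gray[1:-1, 1:-1]
    # Clockwise neighbour offsets; each contributes one bit of the code.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.uint8)
    h, w = gray.shape
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set the bit where the neighbour is >= the centre pixel.
        codes |= ((neigh >= center) << bit).astype(np.uint8)
    return codes

def lbp_histogram(gray):
    """Normalized 256-bin histogram of LBP codes over a target patch."""
    hist = np.bincount(lbp_image(gray).ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def verify_target(model_hist, candidate_patch, threshold=0.6):
    """Low-cost presence check via histogram intersection.

    Returns (similarity, present); similarity is in [0, 1] and equals 1
    only when the candidate's histogram matches the model exactly.
    The threshold value here is a hypothetical choice.
    """
    sim = np.minimum(model_hist, lbp_histogram(candidate_patch)).sum()
    return sim, sim >= threshold
```

A verification step like this is cheap relative to running the detector: it touches only one patch per frame and uses integer comparisons plus a 256-bin histogram, which is why it suits long-term tracking where the target must be continuously monitored for disappearance.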