Geng Gu, Zhou Sida, Tang Jianing, Zhang Xinming, Liu Qiao, Yuan Di
Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China.
School of Electrical and Information Engineering, Yunnan Minzu University, Kunming 650504, China.
Sensors (Basel). 2025 Jul 25;25(15):4621. doi: 10.3390/s25154621.
With the widespread use of sensors in applications such as autonomous driving and intelligent security, stable and efficient target tracking from diverse sensor data has become increasingly important. Self-supervised visual tracking has drawn growing attention due to its potential to eliminate reliance on costly manual annotations; however, existing methods often train on incomplete object representations, resulting in inaccurate localization during inference. In addition, current methods typically struggle to scale to deeper networks. To address these limitations, we propose a novel self-supervised tracking framework based on image synthesis and domain adversarial learning. We first construct a large-scale database of real-world target objects, then synthesize training video pairs by randomly inserting these targets into background frames while applying geometric and appearance transformations to simulate realistic variations. To reduce the domain shift introduced by synthetic content, we incorporate a domain classification branch after feature extraction and adopt domain adversarial training to encourage feature alignment between the real and synthetic domains. Experimental results on five standard tracking benchmarks demonstrate that our method significantly enhances tracking accuracy compared to existing self-supervised approaches without introducing any additional labeling cost. The proposed framework not only ensures complete target coverage during training but also shows strong scalability to deeper network architectures, offering a practical and effective solution for real-world tracking applications.
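The abstract describes synthesizing training video pairs by pasting real target crops into background frames under geometric and appearance transformations, which yields exact pseudo-labels for free. The sketch below is a minimal illustration of that idea, not the paper's actual code: the function names (`jitter`, `synthesize_pair`) and the specific transformation ranges are assumptions for demonstration.

```python
# Hypothetical sketch of synthetic training-pair generation: one target crop is
# pasted into two background frames with independent geometric and appearance
# jitter, yielding a (template, search) pair plus exact box pseudo-labels.
import random
from PIL import Image, ImageEnhance

def jitter(target: Image.Image) -> Image.Image:
    """Apply random geometric and appearance transformations to a target crop."""
    # Geometric: random scale and in-plane rotation (illustrative ranges).
    scale = random.uniform(0.8, 1.2)
    w, h = target.size
    target = target.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    target = target.rotate(random.uniform(-15, 15), expand=True)
    # Appearance: random brightness and color shifts.
    target = ImageEnhance.Brightness(target).enhance(random.uniform(0.7, 1.3))
    target = ImageEnhance.Color(target).enhance(random.uniform(0.7, 1.3))
    return target

def synthesize_pair(target: Image.Image, bg_a: Image.Image, bg_b: Image.Image):
    """Paste independently jittered copies of one target into two backgrounds.

    Returns the two composites and ground-truth boxes (x, y, w, h), which are
    known exactly because the paste location is chosen by the generator.
    """
    pair, boxes = [], []
    for bg in (bg_a, bg_b):
        obj = jitter(target)
        frame = bg.copy()
        x = random.randint(0, max(0, frame.width - obj.width))
        y = random.randint(0, max(0, frame.height - obj.height))
        frame.paste(obj, (x, y), obj if obj.mode == "RGBA" else None)
        pair.append(frame)
        boxes.append((x, y, obj.width, obj.height))
    return pair, boxes
```

Because the generator controls every paste, the resulting labels cover the full target extent, which is the property the abstract contrasts with prior self-supervised methods that train on incomplete object representations.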
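The domain adversarial component is described only at a high level (a domain classification branch after feature extraction, trained adversarially to align real and synthetic features). One common realization of this pattern is a gradient reversal layer in the style of DANN; the PyTorch sketch below assumes that realization, and `GradReverse`, `DomainClassifier`, and the feature dimensions are hypothetical rather than taken from the paper.

```python
# Sketch of domain adversarial feature alignment via a gradient reversal layer,
# one common way to implement a domain-classification branch (the paper may
# realize it differently). PyTorch is assumed.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the
    backward pass, so minimizing the domain-classification loss *maximizes*
    domain confusion in the shared feature extractor."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    """Small head that predicts real (0) vs. synthetic (1) from pooled features."""
    def __init__(self, feat_dim: int = 256, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True), nn.Linear(128, 2)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> global average pool -> (B, C) -> domain logits.
        pooled = feats.mean(dim=(2, 3))
        return self.head(GradReverse.apply(pooled, self.lam))

# Schematic training objective: the tracking loss pulls the features toward the
# task, while the reversed domain loss pushes them to be domain-invariant, e.g.
#   loss = tracking_loss + cross_entropy(domain_classifier(features), domain_labels)
```

Under this setup, the domain classifier learns to separate real from synthetic features, while the reversed gradient drives the backbone toward representations the classifier cannot separate, which is the feature-alignment effect the abstract attributes to the adversarial branch.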