Du Guocai, Zhou Peiyong, Yadikar Nurbiya, Aysa Alimjan, Ubul Kurban
School of Computer Science and Technology, Xinjiang University, Urumqi, 830046, China.
Key Laboratory of Xinjiang Multilingual Information Technology, Xinjiang University, Urumqi, 830046, China.
Sci Rep. 2025 Apr 26;15(1):14603. doi: 10.1038/s41598-025-85335-1.
The goal of natural-language-specified unmanned aerial vehicle (UAV) tracking is to automatically and continuously track a target in subsequent frames given a natural language description. Existing tracking methods typically handle this problem in two separate steps: visual grounding and object tracking. However, treating the steps independently ignores the relationship between visual grounding and object tracking (for example, natural language can provide semantic information about the target) and prevents end-to-end training. We therefore propose a natural-language-specified framework that integrates visual grounding and object tracking, redefining them as a unified task, so that the object can be tracked from a given natural language reference. First, the proposed triangular integration establishes the relationship between natural language and the images (template image and search image). Then, to achieve multi-scale learning and a global receptive field, and to improve the method's flexibility in handling the visual characteristics of the tracked target, we design a new lightweight concentrated multi-scale linear attention. Additionally, to reduce computational complexity, we introduce residuals. Finally, experiments on six UAV tracking datasets show that our tracker achieves an accuracy, success rate, and average speed of 0.819, 0.654, and 61 FPS, respectively, outperforming other state-of-the-art trackers.
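For readers unfamiliar with the term, the following is a minimal sketch of generic ReLU-kernel linear attention, the general family to which the abstract's "linear attention" presumably belongs. It is not the authors' concentrated multi-scale module: the multi-scale aggregation, the residual connections, and the language-image integration are all omitted, and the function names, shapes, and the `eps` parameter are assumptions made for illustration. The sketch only shows why the cost can be linear in the number of tokens: the key-value summary is computed once instead of forming an n x n score matrix.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Generic ReLU-kernel linear attention (illustrative, not the paper's module).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Returns an array of shape (n, d_v).
    """
    q = np.maximum(q, 0.0)          # ReLU feature map on queries
    k = np.maximum(k, 0.0)          # ReLU feature map on keys
    kv = k.T @ v                    # (d, d_v) summary, computed once
    k_sum = k.sum(axis=0)           # (d,) normaliser summary
    num = q @ kv                    # (n, d_v)
    den = q @ k_sum + eps           # (n,)
    return num / den[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation via the explicit (n, n) score matrix, for comparison."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    scores = q @ k.T                # (n, n), quadratic in n
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)
```

Both functions compute the same result; the first reorders the matrix products so that no n x n intermediate is needed, which is the property that makes this kind of attention attractive for lightweight, real-time trackers.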