High-Precision Multi-Object Tracking in Satellite Videos via Pixel-Wise Adaptive Feature Enhancement

Author information

Wan Gang, Su Zhijuan, Wu Yitian, Guo Ningbo, Cong Dianwei, Wei Zhanji, Liu Wei, Wang Guoping

Affiliations

School of Space Information, Space Engineering University, Beijing 101407, China.

State Key Laboratory of Geo-Information Engineering, Xi'an 710054, China.

Publication information

Sensors (Basel). 2024 Oct 9;24(19):6489. doi: 10.3390/s24196489.

DOI:10.3390/s24196489

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11479360/

Abstract

In this paper, we address multi-object tracking (MOT) in satellite videos. To achieve efficient and accurate tracking, we propose an end-to-end joint detection and tracking (JDT) method based on transformer distillation. Specifically: (1) because targets in satellite videos are usually small and captured from a bird's-eye view, we propose a pixel-wise transformer-based feature distillation module, in which useful object representations are learned through pixel-wise distillation from a strong teacher detection network; (2) because targets in satellite videos, such as airplanes, ships, and vehicles, usually have similar appearances, we propose a temperature-controllable key feature learning objective function that emphasizes the learning of similar features during distillation, further improving tracking accuracy for such objects; (3) the proposed method is built on an end-to-end network that learns simultaneously from a highly precise teacher network and a tracking head during training, so that distillation improves tracking accuracy without compromising efficiency. Experimental results on three recently released, publicly available datasets demonstrate the superior performance of the proposed method on satellite videos. The proposed method achieved over 90% overall tracking performance on the AIR-MOT dataset.
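The abstract does not give the form of the distillation objective. As a rough illustration of what a pixel-wise, temperature-controlled distillation loss can look like, the sketch below uses the generic temperature-scaled KL divergence from standard knowledge distillation, applied independently at each pixel over the channel dimension. It is not the paper's objective function: the function names, the `[H][W][C]` nested-list layout, the default temperature, and the `T^2` scaling are all assumptions made for this example.

```python
import math


def log_softmax(values, temperature):
    """Temperature-scaled log-softmax over one pixel's channel vector."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def pixelwise_distill_loss(student, teacher, temperature=2.0):
    """Mean per-pixel KL(teacher || student) over channel distributions.

    student, teacher: feature maps as nested lists shaped [H][W][C]
    (a hypothetical layout chosen for this sketch).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total, count = 0.0, 0
    for s_row, t_row in zip(student, teacher):
        for s_pix, t_pix in zip(s_row, t_row):
            log_p = log_softmax(t_pix, temperature)  # teacher distribution
            log_q = log_softmax(s_pix, temperature)  # student distribution
            total += sum(math.exp(lp) * (lp - lq)
                         for lp, lq in zip(log_p, log_q))
            count += 1
    if count == 0:
        raise ValueError("feature maps must contain at least one pixel")
    # The T^2 factor is the conventional rescaling in temperature
    # distillation; it is an assumption here, not taken from the paper.
    return temperature ** 2 * total / count


# Toy 1x2 feature maps with 2 channels per pixel.
teacher_map = [[[2.0, 0.0], [0.0, 1.0]]]
student_map = [[[0.0, 2.0], [0.0, 1.0]]]
loss = pixelwise_distill_loss(student_map, teacher_map, temperature=2.0)
```

In this sketch a higher temperature flattens the per-pixel distributions, so smaller differences between channels contribute more evenly to the loss. The loss is zero when the student and teacher maps are identical and positive otherwise.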

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f06/11479360/2363d5bb05b9/sensors-24-06489-g001.jpg