Guo Weilong, Li Shengyang, Chen Feixiang, Sun Yuhan, Gu Yanfeng
IEEE Trans Image Process. 2024;33:2238-2251. doi: 10.1109/TIP.2024.3374100. Epub 2024 Mar 21.
Satellite video multi-label scene classification predicts the semantic labels of multiple ground contents to describe a given satellite observation video, and it plays an important role in applications such as ocean observation and smart cities. However, the lack of a high-quality, large-scale dataset has prevented further progress on the task, and existing methods designed for general videos struggle to represent the local details of ground contents when applied directly to satellite videos. In this paper, our contributions are twofold: (1) we develop the first publicly available, large-scale satellite video multi-label scene classification dataset, consisting of 18 classes of static and dynamic ground contents, 3,549 videos, and 141,960 frames; (2) we propose a baseline method built on a novel Spatial and Temporal Feature Cooperative Encoding (STFCE). STFCE exploits the relations between local spatial and temporal features and models the long-term motion information hidden in inter-frame variations. In this way, it enhances the features of local details and obtains a powerful video-scene-level feature representation, which effectively improves classification performance. Experimental results show that the proposed STFCE outperforms 13 state-of-the-art methods, achieving a global average precision (GAP) of 0.8106, and that careful fusion and joint learning of spatial, temporal, and motion features yield a more robust and accurate model. Moreover, benchmarking results show that the proposed dataset is highly challenging, and we hope it will promote further development of the satellite video multi-label scene classification task.
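The abstract reports results in terms of global average precision (GAP). The paper does not spell out the formula here, but GAP for multi-label video classification is commonly computed as in the YouTube-8M benchmark: pool the top-k predictions from every video into one list, sort by confidence, and accumulate precision weighted by the change in recall. A minimal sketch under that assumption (the function name, `top_k` default, and array shapes are illustrative, not from the paper):

```python
import numpy as np

def global_average_precision(scores, labels, top_k=20):
    """GAP over a batch of videos (YouTube-8M-style pooled AP).

    scores: (num_videos, num_classes) predicted confidences
    labels: (num_videos, num_classes) binary ground-truth labels
    top_k:  number of predictions kept per video before pooling
    """
    pooled_conf, pooled_hit = [], []
    total_positives = labels.sum()  # recall denominator over the whole set

    # Keep only each video's top-k most confident predictions.
    for s, y in zip(scores, labels):
        top = np.argsort(s)[::-1][:top_k]
        pooled_conf.extend(s[top])
        pooled_hit.extend(y[top])

    # Sort the pooled predictions globally by confidence.
    order = np.argsort(pooled_conf)[::-1]
    hits = np.asarray(pooled_hit, dtype=float)[order]

    # Precision at each rank, weighted by the recall gained at that rank.
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    delta_recall = hits / total_positives
    return float(np.sum(precision * delta_recall))
```

A perfectly ordered prediction set gives a GAP of 1.0; the reported 0.8106 would then mean the pooled ranking places true labels near the top most of the time.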