Cen Zhigang, Guo Ningyan, Xu Wenjing, Feng Zhiyong, Huang Danlan
Beijing University of Posts and Telecommunications, Beijing, 100876, China.
Neural Netw. 2025 Aug 7;192:107953. doi: 10.1016/j.neunet.2025.107953.
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping, autonomous driving, and surveillance. Its core challenge is how to leverage temporal information to achieve better segmentation. Previous efforts have primarily focused on pixel-level static-dynamic context matching, using techniques such as optical flow and attention mechanisms. Instead, this paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency (SD-CPC) framework. Within this framework, we propose a multivariate class prototype with contrastive learning and a static-dynamic semantic alignment module. The former imposes class-level constraints on the model, yielding distinctive inter-class features and diversified intra-class features. The latter first establishes intra-frame spatial multi-scale and multi-level correlations to achieve static semantic alignment; then, based on cross-frame static perceptual differences, it performs two-stage cross-frame selective aggregation to achieve dynamic semantic alignment. In addition, we propose a novel window-based attention-map calculation method that leverages the sparsity of cross-frame attention points and the Hadamard product, reducing the computational cost of cross-frame attention aggregation. Notably, the proposed method achieves 51.1 mIoU on the VSPW dataset using MiT-B5, and 81.6 mIoU and 78.2 mIoU on the Cityscapes and CamVid datasets, respectively, using ResNet101, surpassing existing state-of-the-art methods. Our implementation will be open-sourced on GitHub.
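The abstract does not give the exact formulation of the window-based attention computation, so the following is only a minimal NumPy sketch of the general idea it describes: restricting cross-frame attention to local windows and applying a sparsity mask via an element-wise (Hadamard) product so that only strong cross-frame attention points contribute to aggregation. All function names, the windowing scheme, and the thresholding rule are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def windowed_cross_frame_attention(q, k, v, window=4):
    """Illustrative window-based cross-frame attention (NOT the paper's exact method).

    q: (H, W, C) query features from the current frame
    k, v: (H, W, C) key/value features from a reference frame
    Attention is computed only within each window x window block, reducing the
    cost from O((HW)^2) to O(HW * window^2). A Hadamard-product mask then
    sparsifies the attention map, keeping only above-average attention points.
    H and W are assumed divisible by `window` for simplicity.
    """
    H, W, C = q.shape
    out = np.zeros_like(q)
    for i in range(0, H, window):
        for j in range(0, W, window):
            qw = q[i:i + window, j:j + window].reshape(-1, C)
            kw = k[i:i + window, j:j + window].reshape(-1, C)
            vw = v[i:i + window, j:j + window].reshape(-1, C)
            # Scaled dot-product attention within the window.
            attn = qw @ kw.T / np.sqrt(C)
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)
            # Sparsity: zero out weak cross-frame attention points with an
            # element-wise (Hadamard) product against a binary mask.
            mask = (attn > attn.mean(axis=1, keepdims=True)).astype(attn.dtype)
            attn = attn * mask
            attn /= attn.sum(axis=1, keepdims=True) + 1e-8
            out[i:i + window, j:j + window] = (attn @ vw).reshape(window, window, C)
    return out
```

The point of the sketch is the cost structure: each query attends to at most `window**2` keys rather than all `H*W` positions, and the Hadamard mask further limits aggregation to the sparse set of strong attention points, which is the efficiency argument the abstract makes.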