Zhuge Yunzhi, Gu Hongyu, Zhang Lu, Qi Jinqing, Lu Huchuan
IEEE Trans Neural Netw Learn Syst. 2025 May;36(5):9084-9097. doi: 10.1109/TNNLS.2024.3418980. Epub 2025 May 2.
In this article, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious interframe interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that exploits both temporal and cross-modality knowledge to robustly and accurately localize and track the primary object in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in UVOS but also delivers competitive results in video salient object detection (VSOD). These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. The source code is available at https://github.com/hy0523/MTNet.
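The abstract describes three components: encoders that fuse appearance (RGB) and motion (optical flow) features, a temporal transformer that lets frames of a clip interact, and a cascade of decoders that progressively refine the masks. The following is a minimal PyTorch sketch of that pipeline shape only; all layer sizes, module names, and the token layout are illustrative assumptions, not the paper's actual architecture (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTNetSketch(nn.Module):
    """Illustrative sketch of the fuse -> temporal-attend -> decode pipeline.

    All dimensions and layer choices here are assumptions for demonstration.
    """

    def __init__(self, dim: int = 64):
        super().__init__()
        # Appearance and motion encoders (stand-ins for the paper's backbones);
        # optical flow inputs have 2 channels (horizontal/vertical displacement).
        self.app_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)
        self.mot_enc = nn.Conv2d(2, dim, 3, stride=2, padding=1)
        # Merge the two modalities inside the encoder stage.
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        # Temporal transformer: tokens from every frame of the clip attend
        # to each other, modeling long-range interframe context.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        # A (two-stage) decoder cascade refines features into a mask.
        self.dec1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.dec2 = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (T, 3, H, W) and flow: (T, 2, H, W) for a single clip of T frames.
        f = self.fuse(torch.cat([self.app_enc(rgb), self.mot_enc(flow)], dim=1))
        t, c, h, w = f.shape
        # Flatten all frames' spatial grids into one token sequence so the
        # transformer performs interframe (not just intraframe) interaction.
        tokens = f.flatten(2).permute(0, 2, 1).reshape(1, t * h * w, c)
        tokens = self.temporal(tokens)
        f = tokens.reshape(t, h * w, c).permute(0, 2, 1).reshape(t, c, h, w)
        # Cascaded decoding: coarse features, then a per-pixel mask probability.
        masks = torch.sigmoid(self.dec2(torch.relu(self.dec1(f))))
        # Upsample back to the input resolution.
        return F.interpolate(masks, scale_factor=2, mode="bilinear")
```

A usage pass on a 4-frame clip of 32x32 frames produces one mask per frame, with values in [0, 1].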