Zhang Zhe, Wu Gaochang, Zhang Jing, Zhu Xiatian, Tao Dacheng, Chai Tianyou
IEEE Trans Pattern Anal Mach Intell. 2025 Aug;47(8):6731-6748. doi: 10.1109/TPAMI.2025.3562999.
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled, shifted target domain. Most existing UDA-SS works consider images, while recent attempts extend to videos by modeling the temporal dimension. Although the two lines of research share the same core challenge of overcoming the underlying domain distribution shift, they have been studied largely independently. This causes several issues: (1) The insights gained from each line of research remain fragmented, preventing a holistic understanding of the problem and its potential solutions. (2) The lack of unified methods and best practices across the two scenarios (images and videos) leads to redundant effort and missed opportunities for cross-pollination of ideas. (3) Without a unified approach, knowledge and advances made in one scenario may not transfer effectively to the other, resulting in suboptimal performance and slower progress. Motivated by these observations, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advances, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general domain-augmentation perspective, which serves as a unifying framework, improves generalization, and creates potential for cross-pollination, ultimately contributing to practical impact and overall progress. Specifically, we propose a Quad-directional Mixup (QuadMix) method that tackles intra-domain discontinuity, fragmented gap bridging, and feature inconsistencies through four directional paths for intra- and inter-domain mixing in an explicit feature space.
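The abstract does not specify how the four mixing paths are constructed; the sketch below only illustrates the general idea of four-directional feature-space mixing (two intra-domain and two inter-domain convex combinations). All function names and the choice of plain convex-combination mixup are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def mixup(feat_a, feat_b, lam):
    """Generic mixup: convex combination of two feature maps."""
    return lam * feat_a + (1.0 - lam) * feat_b

def quad_directional_mix(src_a, src_b, tgt_a, tgt_b, lam=0.5):
    """Schematic four-path mixing over feature maps from two domains.

    Two intra-domain paths (source-source, target-target) address
    intra-domain discontinuity; two inter-domain paths (source-target,
    target-source) bridge the domain gap. The real QuadMix paths are
    defined in the paper; this is only a hypothetical sketch.
    """
    return {
        "intra_source": mixup(src_a, src_b, lam),
        "intra_target": mixup(tgt_a, tgt_b, lam),
        "source_to_target": mixup(src_a, tgt_a, lam),
        "target_to_source": mixup(tgt_a, src_a, lam),
    }
```

With `lam=0.5` each path is a simple midpoint of its two inputs; in practice the mixing ratio (or mask) would be sampled per batch.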
To deal with temporal shifts within videos, we incorporate optical-flow-guided feature aggregation across the spatial and temporal dimensions for fine-grained domain alignment, and this component also extends to image scenarios. Extensive experiments show that QuadMix outperforms state-of-the-art methods by large margins on four challenging UDA-SS benchmarks.
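As a rough illustration of flow-guided temporal aggregation, the sketch below warps previous-frame features into the current frame with an optical-flow field and blends the two. The nearest-neighbor warp, the fixed blend weight `alpha`, and the flow sign convention (flow points from the current frame back to the previous one) are all simplifying assumptions; the paper's aggregation scheme may differ.

```python
import numpy as np

def warp_by_flow(feat, flow):
    """Warp a (H, W, C) feature map with a (H, W, 2) flow field.

    flow[..., 0] is the x displacement, flow[..., 1] the y displacement.
    Nearest-neighbor sampling is used as a simple stand-in for the
    bilinear warping typically used in practice.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    return feat[src_y, src_x]

def aggregate(feat_t, feat_prev, flow, alpha=0.5):
    """Blend current-frame features with flow-aligned previous-frame
    features; a hypothetical fixed-weight stand-in for learned weights."""
    return alpha * feat_t + (1.0 - alpha) * warp_by_flow(feat_prev, flow)
```

With zero flow the warp is the identity, so `aggregate` degenerates to a plain average of the two frames' features.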