Montalvo Javier, García-Martín Álvaro, Carballeira Pablo, SanMiguel Juan C
Video Processing and Understanding Lab, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain.
J Imaging. 2025 May 22;11(6):172. doi: 10.3390/jimaging11060172.
Semantic segmentation is a computer vision task where classification is performed at the pixel level. Because of this, labeling images for semantic segmentation is time-consuming and expensive. To mitigate this cost, there has been a surge in the use of synthetically generated data, usually created with simulators or video games, which, in combination with domain adaptation methods, allows models to effectively learn to segment real data. Still, these datasets have a particular limitation: due to their closed-set nature, novel classes cannot be included without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high-quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and the Segment Anything Model (SAM) to generate class examples with an associated segmentation mask, together with a method to integrate the generated cutouts for novel classes into semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifying the underlying algorithms. With our methods, we show that models not only effectively learn to segment novel classes, with an average performance of 51% intersection over union (IoU), but also reduce errors on other, already existing classes, reaching higher overall performance.
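To make the described pipeline concrete, the following is a minimal sketch of its three stages (generate an example with Stable Diffusion, extract its mask with SAM, paste the cutout into a training image and its label map). It assumes the Hugging Face diffusers and Meta's segment-anything packages; the model ID, prompt, checkpoint path, and class-ID convention are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the generate/segment/paste pipeline; names and parameters
# below are assumptions, not the paper's exact setup.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# 1) Generate a candidate example of the novel class from a text prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
generated = pipe("a photo of a single traffic cone on a plain background").images[0]
img = np.array(generated)  # HxWx3 uint8 RGB

# 2) Extract a segmentation mask for the generated object with SAM,
#    keeping the most prominent region as the object cutout.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # hypothetical local path
masks = SamAutomaticMaskGenerator(sam).generate(img)
obj_mask = max(masks, key=lambda m: m["area"])["segmentation"]  # boolean HxW array

# 3) Paste the cutout into a dataset image and update its label map, so that
#    an unmodified domain adaptation method sees the novel class in training.
def paste_cutout(scene, labels, cutout_img, cutout_mask, top, left, class_id):
    h, w = cutout_mask.shape
    region = scene[top:top + h, left:left + w]
    region[cutout_mask] = cutout_img[cutout_mask]
    labels[top:top + h, left:left + w][cutout_mask] = class_id
    return scene, labels
```

In this sketch the pasted pixels overwrite both the image and the pixel-level annotation at the chosen location, which is what lets existing training code pick up the novel class without any algorithmic change.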