Mao Yuxin, Zhang Jing, Xiang Mochu, Lv Yunqiu, Li Dong, Zhong Yiran, Dai Yuchao
IEEE Trans Image Process. 2025;34:4108-4119. doi: 10.1109/TIP.2025.3580269.
Audio-visual segmentation (AVS) can be conceptualized as a conditional generation task in which audio serves as the conditional variable for segmenting the sound producer(s). Under this view, the audio should be exploited as fully as possible to maximize its contribution to the final segmentation. We propose a contrastive conditional latent diffusion model for AVS that thoroughly investigates the impact of audio by explicitly modeling the correlation between the audio and the final segmentation map. To achieve semantically correlated representation learning, our framework incorporates a latent diffusion model that learns the conditional generation process of the ground-truth segmentation map, yielding ground-truth-aware inference during the denoising process at test time. Because our model is conditional, it is vital to ensure that the conditional variable actually contributes to the model output. We therefore explicitly model the contribution of the audio signal by minimizing the density ratio between the conditional probability given the multimodal data, i.e., conditioned on the audio-visual input, and that given the unimodal data, i.e., conditioned on the audio alone. In this way, our latent diffusion model with density-ratio optimization explicitly maximizes the contribution of audio for AVS. The density-ratio optimization is realized through contrastive learning as a constraint: the diffusion part serves as the main objective for maximum likelihood estimation, while the density-ratio part imposes the constraint. By adopting this contrastively trained latent diffusion model, we effectively enhance the contribution of audio for AVS. Experimental results on the benchmark dataset validate the effectiveness of our solution. Code and results are available via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
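To make the combined objective concrete, the following is a minimal sketch, assuming the diffusion term is a standard DDPM-style denoising loss and the density-ratio constraint is realized as an InfoNCE-style contrastive term. The module names (eps_net, critic), the toy noise schedule, and the weight lam are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch (not the authors' code): a conditional latent-diffusion
# loss paired with a contrastive term estimating the density ratio
# between multimodal- and unimodal-conditioned predictions.
import torch
import torch.nn.functional as F

def training_loss(eps_net, critic, z0, c_av, c_a, lam=0.1, T=1000):
    """One training step of the sketched objective.

    eps_net : denoiser eps_theta(z_t, t, cond) -> predicted noise (B, D)
    critic  : scores a (latent, condition) pair -> (B,); its softmax over
              conditions serves as a contrastive density-ratio estimate
    z0      : clean latent of the ground-truth segmentation map (B, D)
    c_av    : audio-visual condition embedding (B, D)
    c_a     : audio-only condition embedding (B, D)
    """
    B = z0.size(0)
    # --- conditional latent diffusion term (simplified DDPM loss) ---
    t = torch.randint(0, T, (B,), device=z0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T).pow(2)  # toy schedule
    noise = torch.randn_like(z0)
    z_t = (alpha_bar.sqrt().unsqueeze(-1) * z0
           + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise)
    diff_loss = F.mse_loss(eps_net(z_t, t, c_av), noise)

    # --- contrastive density-ratio term ---
    # The multimodal condition is the positive and the audio-only
    # condition the contrastive alternative for each latent.
    logits = torch.stack([critic(z0, c_av), critic(z0, c_a)], dim=1)  # (B, 2)
    target = torch.zeros(B, dtype=torch.long, device=z0.device)      # positive = c_av
    nce_loss = F.cross_entropy(logits, target)

    return diff_loss + lam * nce_loss
```

Pairing the two conditions in a contrastive classification gives the critic an estimate of the density ratio between the multimodal- and unimodal-conditioned distributions, which is the constraint the abstract describes; the diffusion term remains the main likelihood objective.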