Wang Wei-Cheng, De Coninck Sander, Leroux Sam, Simoens Pieter
IDLab, Ghent University-imec, Ghent, Belgium.
Front Robot AI. 2025 Jan 13;11:1490718. doi: 10.3389/frobt.2024.1490718. eCollection 2024.
Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data that improve the safety and comfort of citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs can then be used to alleviate the minimal sufficient information bottleneck, together with a new loss function that handles multiple positives. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach achieves performance comparable to that of state-of-the-art modality- and task-specific approaches.
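The core idea of the abstract, forming contrastive pairs from cross-modal embedding distance rather than temporal alignment alone, and scoring them with a loss that admits multiple positives, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `multi_positive_nce`, the cosine-similarity threshold, and the temperature value are all assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def multi_positive_nce(audio, video, temperature=0.1, pos_threshold=0.9):
    """Illustrative multi-positive contrastive loss (not the paper's exact method).

    Besides the temporally aligned (diagonal) audio-video pair, any
    cross-modal pair whose cosine similarity exceeds `pos_threshold`
    is also treated as a positive, approximating the idea of selecting
    semantically synchronized pairs by embedding distance to avoid
    false negatives from recurring events.
    """
    a = l2_normalize(audio)
    v = l2_normalize(video)
    sim = a @ v.T                              # (N, N) cosine similarities
    logits = sim / temperature
    # Positive mask: temporal pairs (diagonal) plus semantically close pairs.
    pos_mask = (sim > pos_threshold) | np.eye(len(a), dtype=bool)
    # Log-softmax over each row (all candidate video clips for one audio clip).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the log-likelihood over all positives of each anchor.
    loss = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return float(loss.mean())
```

Averaging the log-probability over every positive of an anchor (rather than using a single positive, as in standard InfoNCE) is what lets semantically matching but temporally misaligned pairs stop acting as negatives.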