Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):123-136. doi: 10.1109/TPAMI.2022.3155643. Epub 2022 Dec 5.
Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines can already do the same with visual data, far less work has been done with sounds. This work develops an approach for scene understanding based purely on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision 'teacher' methods and a sound 'student' method: the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose a novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-task network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial and training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released on the project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.html.
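The core mechanism described above, vision 'teachers' producing pseudo-labels from the image that supervise a binaural-sound 'student', combined with the Spatial Sound Super-Resolution auxiliary loss, can be sketched roughly as follows. This is a minimal PyTorch-style illustration under assumed shapes and loss choices; BinauralStudent, distillation_step, the layer sizes, and the loss weights are hypothetical placeholders for exposition, not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinauralStudent(nn.Module):
    """Hypothetical sound 'student': maps binaural spectrograms to per-pixel
    semantics, depth, and motion, plus the held-out microphone channels (S3R)."""
    def __init__(self, in_channels=2, num_classes=3, num_extra_mics=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # One head per task (output shapes are placeholders, not the paper's design).
        self.semantic_head = nn.Conv2d(64, num_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.motion_head = nn.Conv2d(64, 2, 1)            # 2-D motion vector per pixel
        self.s3r_head = nn.Conv2d(64, num_extra_mics, 1)  # missing microphone channels

    def forward(self, spectrogram):
        feat = self.encoder(spectrogram)
        return (self.semantic_head(feat), self.depth_head(feat),
                self.motion_head(feat), self.s3r_head(feat))

def distillation_step(student, spectrogram, teacher_sem, teacher_depth,
                      teacher_motion, extra_channels, weights=(1.0, 1.0, 1.0, 1.0)):
    """One multi-task training step: the sound student mimics the vision teachers'
    pseudo-labels and reconstructs held-out microphone channels (S3R)."""
    sem, depth, motion, extra = student(spectrogram)
    return (weights[0] * F.cross_entropy(sem, teacher_sem)      # semantic pseudo-labels
            + weights[1] * F.l1_loss(depth, teacher_depth)      # depth pseudo-labels
            + weights[2] * F.l1_loss(motion, teacher_motion)    # motion pseudo-labels
            + weights[3] * F.l1_loss(extra, extra_channels))    # S3R auxiliary task

if __name__ == "__main__":
    student = BinauralStudent()
    x = torch.randn(4, 2, 64, 64)  # binaural spectrograms: (batch, left/right, freq, time)
    loss = distillation_step(student, x,
                             teacher_sem=torch.randint(0, 3, (4, 64, 64)),
                             teacher_depth=torch.randn(4, 1, 64, 64),
                             teacher_motion=torch.randn(4, 2, 64, 64),
                             extra_channels=torch.randn(4, 6, 64, 64))
    loss.backward()
```

The key point of the sketch is that no human annotations enter the loss: the semantic, depth, and motion targets come from pretrained vision models applied to the synchronized 360° frames, while the S3R target is simply the recorded signal of microphones withheld from the input.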