Jiang Xilin, Han Cong, Li Yinghao Aaron, Mesgarani Nima
Department of Electrical Engineering, Columbia University, USA.
Proc IEEE Int Conf Acoust Speech Signal Process. 2024 Apr;2024:1281-1285. doi: 10.1109/icassp48485.2024.10447391. Epub 2024 Mar 18.
In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode the 'what' and 'where' of spatial audio. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods that randomly swap the order of the microphones and mask Mel and GCC channels. Using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We also perform a comprehensive analysis of the effect of each augmentation method and compare fine-tuning performance using different amounts of labeled data.
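The channel-wise augmentations the abstract describes (randomly permuting microphone order and masking whole Mel/GCC feature channels) can be sketched as follows. This is a minimal illustration under assumed shapes and function names, not the authors' implementation; the mask probability and the `(mics, bins, frames)` layout are assumptions.

```python
import numpy as np

def swap_channels(features: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly permute the microphone (channel) axis, axis 0."""
    perm = rng.permutation(features.shape[0])
    return features[perm]

def mask_channels(features: np.ndarray, rng: np.random.Generator,
                  mask_prob: float = 0.25) -> np.ndarray:
    """Zero out each channel independently with probability mask_prob."""
    out = features.copy()
    drop = rng.random(features.shape[0]) < mask_prob
    out[drop] = 0.0
    return out

# Example: augment a hypothetical 4-mic Mel feature array (mics, mel bins, frames).
rng = np.random.default_rng(0)
mel = rng.standard_normal((4, 64, 100))
aug = mask_channels(swap_channels(mel, rng), rng)
print(aug.shape)
```

Applying the same random permutation and mask to the paired Mel and GCC features would keep the spectral and spatial views consistent within each contrastive pair.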