Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks.

Author information

Weihai Beiyang Electrical Group Co., Ltd, Weihai, Shandong, China.

School of Mechanical, Electrical, and Information Engineering, Shandong University, Jinan, China.

Publication information

Sci Rep. 2022 Aug 12;12(1):13730. doi: 10.1038/s41598-022-17863-z.

DOI:10.1038/s41598-022-17863-z

PMID:35962021

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9374676/

Abstract

As an effective approach to perceiving environments, acoustic scene classification (ASC) has received considerable attention in the past few years. ASC is generally considered a challenging task because the differences between classes of environmental sounds are subtle. In this paper, we propose a novel approach that performs accurate classification by aggregating spatial-temporal features extracted from a multi-branch three-dimensional (3D) convolutional neural network (CNN) model. The novelties of this paper are as follows. First, we form multiple frequency-domain representations of signals by fully utilizing expert knowledge on acoustics and discrete wavelet transformations (DWT). Second, we propose a novel 3D CNN architecture featuring residual connections and squeeze-and-excitation attention (3D-SE-ResNet) to effectively capture both the long-term and short-term correlations inherent in environmental sounds. Third, an auxiliary supervised branch based on the chromatogram of the original signal is incorporated into the proposed architecture to alleviate the risk of overfitting by providing supplementary information to the model. The performance of the proposed multi-input multi-feature 3D-CNN architecture is numerically evaluated on a typical large-scale dataset from the 2019 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019) and is shown to obtain noticeable performance gains over state-of-the-art methods in the literature.
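To make the DWT step more concrete, the following is a minimal sketch of a multi-level wavelet decomposition that splits a signal into several sub-bands. It is an illustration only: the abstract does not state which wavelet or how many levels the authors use, so the Haar wavelet, the two-level depth in the usage note, and the function names `haar_dwt` and `multilevel_haar` are all assumptions, and the step that turns sub-bands into network inputs is not shown.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT (assumed wavelet, not confirmed by the paper).

    Returns (approximation, detail) coefficient lists, each half as long
    as the input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Pairwise sums give the low-frequency (approximation) band.
    approx = [(signal[i] + signal[i + 1]) * scale
              for i in range(0, len(signal), 2)]
    # Pairwise differences give the high-frequency (detail) band.
    detail = [(signal[i] - signal[i + 1]) * scale
              for i in range(0, len(signal), 2)]
    return approx, detail


def multilevel_haar(signal, levels):
    """Repeatedly decompose the approximation band.

    Returns [cA_n, cD_n, ..., cD_1]: the coarsest approximation first,
    followed by detail bands from coarsest to finest.
    """
    bands = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt(current)
        bands.insert(0, detail)
    bands.insert(0, current)
    return bands
```

For example, `multilevel_haar(samples, 2)` on a list of 8 samples returns three bands of lengths 2, 2, and 4. Because this Haar transform is orthonormal, the total energy (sum of squares) of the bands equals that of the input signal.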
