Sound Source Localization Using a Convolutional Neural Network and Regression Model.

Affiliation

Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan.

Publication information

Sensors (Basel). 2021 Dec 1;21(23):8031. doi: 10.3390/s21238031.

DOI:10.3390/s21238031

PMID:34884042

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8659937/

Abstract

This research introduces a novel sound source localization model that integrates a convolutional neural network with a regression model (CNN-R) to estimate the angle and distance of a sound source from the interaural phase difference (IPD). The IPD features of the sound signal are first extracted in the time-frequency domain using the short-time Fourier transform (STFT). The resulting IPD feature map is then fed to the CNN-R model as an image for sound source localization. The Pyroomacoustics platform and the multichannel impulse response database (MIRD) are used to build the simulated and real room impulse response (RIR) datasets, respectively. In the simulation scenario at SNR = 30 dB and RT60 = 0.16 s, the proposed CNN-R achieves average accuracies of 98.96% and 98.31% for angle and distance estimation, respectively. In the real environment at SNR = 30 dB and RT60 = 0.16 s, the average accuracies of angle and distance estimation are 99.85% and 99.38%, respectively. The performance obtained in both scenarios is superior to that of existing models, indicating the potential of the proposed CNN-R model for real-life applications.
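The IPD extraction step described in the abstract can be illustrated with a minimal sketch. It assumes a two-microphone recording and computes the per-bin phase difference between the two STFTs. The function names `stft` and `ipd_map`, the Hann window, the frame length of 256, the hop of 128, and the 16 kHz test tone are illustrative assumptions; the abstract does not state the parameters actually used by the authors.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Complex STFT of a 1-D signal, shape (num_frames, frame_len // 2 + 1).

    Hann window and frame sizes are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def ipd_map(left, right, frame_len=256, hop=128):
    """Phase of the left channel minus phase of the right channel per
    time-frequency bin, wrapped to [-pi, pi]."""
    spec_left = stft(left, frame_len, hop)
    spec_right = stft(right, frame_len, hop)
    # angle(L * conj(R)) equals angle(L) - angle(R), wrapped automatically.
    return np.angle(spec_left * np.conj(spec_right))


# Synthetic check: a 1 kHz tone whose right channel lags by 2 samples.
fs = 16000                      # sampling rate in Hz (assumed)
tone_hz = 1000.0
delay_s = 2 / fs
t = np.arange(2048) / fs
left = np.sin(2 * np.pi * tone_hz * t)
right = np.sin(2 * np.pi * tone_hz * (t - delay_s))

ipd = ipd_map(left, right)
tone_bin = int(round(tone_hz * 256 / fs))   # bin index of the tone
expected_ipd = 2 * np.pi * tone_hz * delay_s
print(ipd.shape, ipd[:, tone_bin].mean(), expected_ipd)
```

The resulting two-dimensional array (time frames by frequency bins) is the kind of feature map that could be passed to a CNN as a single-channel image. For the synthetic tone, the IPD at the tone's frequency bin should be close to 2πfτ, where τ is the inter-channel delay.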
