Department of Electronics, University of Alcalá, Campus Universitario s/n, Alcalá de Henares, 28805 Madrid, Spain.
Sensors (Basel). 2018 Oct 12;18(10):3418. doi: 10.3390/s18103418.
This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source, using the raw audio signal as input and avoiding hand-crafted audio features. Given the limited amount of available localization data, we propose a two-step training strategy. We first train our network on semi-synthetic data generated from close-talk speech recordings, simulating the time delays and distortion that the signals undergo as they propagate from the source to the microphone array. We then fine-tune this network using a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach produces networks that significantly outperform existing localization methods based on SRP-PHAT strategies, as well as very recent proposals based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the performance of our CNN method does not depend significantly on the speaker's gender or on the size of the signal window used.
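The semi-synthetic data step described in the abstract (simulating the delays that a close-talk recording would undergo on its way to each microphone) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the free-field 1/r attenuation model, the optional additive Gaussian noise, and the 343 m/s speed of sound are all assumptions made here for illustration, and room reverberation is ignored entirely.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def propagation_delays(source_xyz, mic_xyz, c=SPEED_OF_SOUND):
    """Return (distances in m, times of flight in s) from the source to each microphone."""
    mics = np.asarray(mic_xyz, dtype=float)
    src = np.asarray(source_xyz, dtype=float)
    dists = np.linalg.norm(mics - src, axis=1)
    return dists, dists / c


def simulate_array_signals(clean, fs, source_xyz, mic_xyz, noise_std=0.0, rng=None):
    """Hypothetical helper: delay and attenuate a clean signal for each microphone.

    Delays are applied as linear phase shifts in the frequency domain, so
    fractional-sample delays are supported. Attenuation is a simple 1/r
    free-field model (an assumption, not taken from the paper).
    """
    clean = np.asarray(clean, dtype=float)
    dists, delays = propagation_delays(source_xyz, mic_xyz)

    # Zero-pad so the longest delay does not wrap around circularly.
    pad = int(np.ceil(delays.max() * fs)) + 1
    n_fft = len(clean) + pad

    spectrum = np.fft.rfft(clean, n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # One row per microphone: multiply by exp(-j 2 pi f tau) to delay by tau.
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    delayed = np.fft.irfft(spectrum[None, :] * phase, n_fft, axis=1)

    # 1/r attenuation; clamp the distance to avoid division by zero.
    out = delayed / np.maximum(dists, 1e-3)[:, None]

    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        out = out + rng.normal(0.0, noise_std, out.shape)
    return out
```

Calling `simulate_array_signals(speech, fs, source_position, mic_positions)` for many randomly drawn source positions would yield multichannel windows paired with known 3D coordinates, which is the kind of labeled data a position-regression network needs. A more faithful simulation would also model the room response and microphone characteristics.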