Department of Power Mechanical Engineering, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu, Taiwan 30013.
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, China.
J Acoust Soc Am. 2020 Jan;147(1):11. doi: 10.1121/10.0000492.
A household sound event classification system is presented that cascades an audio localization and enhancement front-end with an intelligent classification back-end. The front-end is composed of a sparsely deployed microphone array and a preprocessing unit that localizes the source and extracts the associated signal. For localization, a two-stage method and a direct method are compared. The two-stage method uses a subspace algorithm to estimate the time difference of arrival, followed by a constrained least squares algorithm to determine the source location. The direct localization methods, namely the delay-and-sum beamformer, the minimum power distortionless response beamformer, and the multiple signal classification algorithm, are compared in terms of localization performance for a sparse array configuration. A modified particle swarm optimization algorithm enables an efficient grid search. A minimum variance distortionless response beamformer in conjunction with a minimum-mean-square-error postfilter is used to extract the source signals for the sound event classification tasks that follow. The back-end of the system is a sound event classifier based on convolutional neural networks (CNNs) and convolutional long short-term memory networks. Mel-spectrograms are used as the input features to the CNNs. Simulations and experiments conducted in a live room demonstrate the strengths and weaknesses of the direct and two-stage methods. Signal quality enhancement using the array-based front-end improves classification accuracy over a single microphone.
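The minimum variance distortionless response beamformer named in the abstract can be illustrated with a minimal sketch. This is the textbook narrowband formulation, w = R^{-1} d / (d^H R^{-1} d), and not necessarily the authors' implementation. The function name `mvdr_weights`, the diagonal loading term, the microphone delays, and the random covariance data are all assumptions made for illustration.

```python
import numpy as np

def mvdr_weights(R, d, loading=1e-6):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    M = R.shape[0]
    # Diagonal loading (an assumption, not from the abstract) keeps the solve well conditioned.
    R_loaded = R + loading * np.trace(R).real / M * np.eye(M)
    Rinv_d = np.linalg.solve(R_loaded, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Hypothetical setup: 4 microphones, a single 1 kHz bin.
rng = np.random.default_rng(0)
M = 4
delays = np.array([0.0, 0.4e-3, 1.1e-3, 1.7e-3])  # assumed source-to-microphone delays (s)
freq = 1000.0
d = np.exp(-2j * np.pi * freq * delays)           # steering vector toward the source

# Sample covariance from synthetic complex snapshots (placeholder data).
X = rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200))
R = X @ X.conj().T / X.shape[1]

w = mvdr_weights(R, d)
response = w.conj() @ d  # gain toward the source; the constraint fixes it at 1
```

The weights pass the signal from the steered location with unit gain while minimizing total output power, which is why such a beamformer is suited to extracting the source signal once its location has been estimated.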