Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China.
ICT Cluster, Singapore Institute of Technology, 10 Dover Drive, Singapore.
Neural Netw. 2021 Jul;139:201-211. doi: 10.1016/j.neunet.2021.03.014. Epub 2021 Mar 18.
Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which focus on useful time frames, frequency bands, regions or channels during feature extraction. However, these traditional attention methods do not fully explore complex feature information or multi-scale long-range interactions among speech features, both of which can benefit SR/LR tasks. To address these issues, this paper first proposes mixed-order attention (MOA) for low frame-level speech features, to capture the finest-grained multi-order information at high resolution. We then combine MOA with a non-local attention (NLA) mechanism and a dilated residual structure, balancing fine-grained local detail against convolution over multi-scale, long-range time/frequency regions of the feature space. The proposed dilated mixed-order non-local attention network (D-MONA) exploits the detail available from first- and second-order feature attention, but does so over a much wider context than purely local attention. Experiments are conducted on three datasets: two SR tasks, Voxceleb and CN-celeb, and one LR task, NIST LRE 07. For SR, D-MONA improves on ResNet-34 results by at least 29% on Voxceleb1 and 15% on CN-celeb. For LR, it achieves large improvements over ResNet-34 of 21% for the challenging 3 s utterance condition, 59% for the 10 s condition and 67% for the 30 s condition, and it outperforms the state-of-the-art deep bottleneck feature-DNN (DBF-DNN) x-vector system at all utterance durations.
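To make the mixed-order attention idea concrete, the sketch below weights channels using both first-order (mean) and second-order (covariance) statistics of the feature map. This is a minimal PyTorch sketch under our own assumptions; the module names, reduction ratio, covariance summary, and the additive fusion of the two branches are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of mixed-order attention (MOA). Module names, the
# reduction ratio, and the sum fusion are assumptions for illustration,
# not the paper's published code.
import torch
import torch.nn as nn


class FirstOrderAttention(nn.Module):
    """Channel attention from first-order (mean) statistics, SE-style."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))        # global average pooling -> (b, c)
        return x * w.view(b, c, 1, 1)


class SecondOrderAttention(nn.Module):
    """Channel attention from second-order (covariance) statistics."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat = x.view(b, c, h * w)
        feat = feat - feat.mean(dim=2, keepdim=True)
        cov = torch.bmm(feat, feat.transpose(1, 2)) / (h * w - 1)  # (b, c, c)
        stat = cov.mean(dim=2)   # summarise each channel by its covariance row
        return x * self.fc(stat).view(b, c, 1, 1)


class MixedOrderAttention(nn.Module):
    """Fuses first- and second-order channel attention (simple sum here)."""
    def __init__(self, channels: int):
        super().__init__()
        self.first = FirstOrderAttention(channels)
        self.second = SecondOrderAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.first(x) + self.second(x)
```

The two branches share the abstract's distinction between first- and second-order feature attention: the first reacts to average channel energy, the second to how channels co-vary across time/frequency positions.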
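The wider-than-local context comes from the non-local attention mechanism and the dilated residual structure. The following is a minimal PyTorch sketch of generic versions of both components (an embedded-Gaussian non-local block in the style of Wang et al., 2018, and a residual block with one dilated convolution); the embedding width, dilation rate, and placement relative to MOA are assumptions, not the paper's exact configuration.

```python
# Sketch of a non-local attention (NLA) block and a dilated residual
# block. Hyperparameters (inner width, dilation=2) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalAttention(nn.Module):
    """Embedded-Gaussian non-local block: every time/frequency position
    attends to every other position, giving long-range interactions."""
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).view(b, -1, h * w)                        # (b, c/2, hw)
        k = self.phi(x).view(b, -1, h * w)
        v = self.g(x).view(b, -1, h * w)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # (b, hw, hw)
        y = torch.bmm(v, attn.transpose(1, 2)).view(b, -1, h, w)
        return x + self.out(y)                                      # residual add


class DilatedResidualBlock(nn.Module):
    """3x3 residual block whose second convolution is dilated, enlarging
    the time/frequency receptive field without reducing resolution."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3,
                               padding=dilation, dilation=dilation)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)
```

Dilation widens the receptive field multiplicatively while keeping the frame-level resolution that MOA depends on, which is the balance between local detail and long-range context that the abstract describes.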