Audio, Speech and Language Processing Group (ASLP@NPU), ASGO, School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
China Mobile Research Institute, China.
Neural Netw. 2022 Jun;150:28-42. doi: 10.1016/j.neunet.2022.03.003. Epub 2022 Mar 10.
A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords in audio streams, with a small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method that combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor with a two-stage keyword detection and localization module. The MDTC feature extractor efficiently learns multi-scale feature representations with dilated depthwise temporal convolutions, modeling both temporal context and speech-rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions that all take the current frame as the end position but have different start positions. These time regions (formally, anchors) indicate rough location candidates for a keyword. With frame-level features from the MDTC feature extractor as inputs, the RPN learns to propose keyword regions based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we introduce a hard example mining algorithm that selects effective negative anchors during RPN training. The keyword region proposals from the first-stage RPN contain keyword location information, which is subsequently used to explicitly extract keyword-related sequential features to train the second-stage KWS. The second-stage system learns to classify region proposals into keyword IDs and to regress them toward the ground-truth keyword regions. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset.
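The anchor design described above (multiple candidate regions per frame, all ending at the current frame but starting at different offsets) can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper; the width set and the inclusive-interval IoU convention are assumptions for the example.

```python
def make_anchors(t, widths):
    """Candidate keyword regions at frame t: every anchor ends at t,
    and each width gives a different start position (clipped at 0)."""
    return [(max(0, t - w + 1), t) for w in widths]

def temporal_iou(a, b):
    """Intersection-over-union of two inclusive frame intervals (s, e)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

# Example: anchors ending at frame 100 with two assumed widths,
# scored against a hypothetical ground-truth keyword region (45, 100).
anchors = make_anchors(100, [40, 60])           # [(61, 100), (41, 100)]
scores = [temporal_iou(a, (45, 100)) for a in anchors]
```

During RPN training, anchors whose IoU with a ground-truth keyword region exceeds a threshold would typically be treated as positives, and the rest as negatives.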
Compared to a strong baseline, our proposed two-stage method achieves a 27-32% relative reduction in false rejection rate at one false alarm per hour, while for keyword localization it achieves a mean intersection-over-union ratio above 0.95, clearly better than the one-stage RPN method.
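The hard example mining step mentioned in the abstract, which counters the keyword/non-keyword imbalance by keeping only the most informative negative anchors, can be sketched generically as follows. This is a common online hard negative mining pattern, not the paper's exact algorithm; the 3:1 negative-to-positive ratio is an assumed default.

```python
import numpy as np

def hard_negative_mining(neg_losses, num_pos, ratio=3):
    """Select the ratio * num_pos negative anchors with the highest
    classification loss, so easy negatives do not dominate training.
    Returns the selected indices in ascending order."""
    k = min(len(neg_losses), ratio * max(num_pos, 1))
    hardest = np.argsort(neg_losses)[::-1][:k]  # highest-loss negatives
    return np.sort(hardest)

# Example: with 1 positive anchor and ratio 2, keep the 2 hardest negatives.
losses = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
kept = hard_negative_mining(losses, num_pos=1, ratio=2)  # indices 1 and 3
```

Only the kept negatives (plus all positives) would then contribute to the RPN classification loss.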