Beijing National Research Center for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.
Neural Netw. 2021 Jul;139:326-334. doi: 10.1016/j.neunet.2021.04.002. Epub 2021 Apr 10.
Keyword search (KWS) is the task of finding user-specified keywords in continuous speech. Conventional KWS systems are built on automatic speech recognition (ASR): the input speech must first be processed by an ASR system before keywords can be searched. Over the past decade, as deep learning and deep neural networks (DNNs) have become increasingly popular, it has also become possible to train KWS systems in an end-to-end (E2E) manner. The main advantage of E2E KWS is that it requires no speech recognition, which makes training and search much more straightforward than in conventional systems. This article proposes an E2E KWS model that consists of four parts: a speech encoder-decoder, a query encoder-decoder, an attention mechanism, and an energy scorer. First, the proposed model outperforms the baseline model. Second, we find that, depending on the supervision used (character or phoneme sequences), the speech or query encoders can extract the corresponding information, which leads to different performance. Moreover, we introduce an attention mechanism and a novel energy scorer: the former helps locate keywords, and the latter makes the final decision by considering speech embeddings, query embeddings, and attention weights in parallel. We evaluate the model under low-resource conditions, with about 10 hours of training data for four different languages. The experimental results show that the proposed model works well under low-resource conditions.
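The interplay between the query embedding, the speech embeddings, and the attention weights can be illustrated with a minimal sketch. It is not the model from the article: the abstract does not specify the encoders or the form of the energy scorer, so the dot-product attention, the cosine-based score, and the names `attend` and `energy_score` are all assumptions made for illustration. The speech frames and the query are taken to be already encoded as fixed-size vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(frames, query):
    """Dot-product attention of one query embedding over speech frame embeddings.

    Assumption: plain dot-product attention stands in for the article's
    attention mechanism. Returns one weight per frame and the weighted
    average of the frames (the context vector).
    """
    weights = softmax([dot(frame, query) for frame in frames])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(len(query))
    ]
    return weights, context


def energy_score(frames, query):
    """Hypothetical scorer: cosine similarity between the attended speech
    context and the query, plus the index of the most attended frame.

    This is an illustrative stand-in, not the article's energy scorer.
    """
    weights, context = attend(frames, query)
    norm = math.sqrt(dot(context, context)) * math.sqrt(dot(query, query))
    score = dot(context, query) / norm if norm > 0 else 0.0
    located_frame = max(range(len(weights)), key=lambda i: weights[i])
    return score, located_frame


# Toy speech embeddings: three frames, two dimensions each.
frames = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
match_score, match_frame = energy_score(frames, [1.0, 0.0])
other_score, other_frame = energy_score(frames, [0.0, 1.0])
```

In this toy setup the attention weights indicate where in the utterance the query matches best, and the score would be compared against a threshold to decide whether the keyword is present. A trained system would learn both the embeddings and the scoring function rather than fixing them by hand.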