Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK.
Neural Netw. 2021 Oct;142:329-339. doi: 10.1016/j.neunet.2021.05.024. Epub 2021 May 25.
In this paper, a hierarchical attention network is proposed to generate robust utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may contribute differently to speaker identity, the hierarchical structure aims to learn speaker-related information both locally and globally. In the proposed approach, a frame-level encoder and attention mechanism are applied to segments of an input utterance to generate individual segment vectors. Segment-level attention is then applied to these segment vectors to construct an utterance representation. To evaluate the quality of the learned utterance-level speaker embeddings on speaker identification and verification, the proposed approach is tested on several benchmark datasets: NIST SRE2008 Part1, Switchboard Cellular (Part1), CallHome American English Speech, Voxceleb1, and Voxceleb2. In comparison with several strong baselines, the obtained results show that H-vectors achieve better identification and verification performance under various acoustic conditions.
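The two-level pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring vectors `w_frame` and `w_seg`, the fixed segment length, and the simple dot-product attention are all assumptions standing in for the paper's learned frame-level encoder and attention layers.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(vectors, w):
    # vectors: (N, D) stack of frame or segment vectors.
    # w: (D,) scoring vector, a stand-in for learned attention parameters.
    scores = vectors @ w           # (N,) one scalar score per vector
    alpha = softmax(scores)        # attention weights summing to 1
    return alpha @ vectors         # (D,) attention-weighted average

def h_vector(utterance, seg_len, w_frame, w_seg):
    # 1) Split the utterance into fixed-length segments.
    # 2) Frame-level attention pools the frames of each segment.
    # 3) Segment-level attention pools the segment vectors into
    #    a single utterance-level embedding (the "H-vector").
    segs = [utterance[i:i + seg_len]
            for i in range(0, len(utterance), seg_len)]
    seg_vecs = np.stack([attentive_pool(s, w_frame) for s in segs])
    return attentive_pool(seg_vecs, w_seg)

rng = np.random.default_rng(0)
utt = rng.standard_normal((40, 8))   # 40 frames of 8-dim features (toy sizes)
w_frame = rng.standard_normal(8)
w_seg = rng.standard_normal(8)
emb = h_vector(utt, seg_len=10, w_frame=w_frame, w_seg=w_seg)
print(emb.shape)  # (8,) — one embedding per utterance
```

In the paper the pooling weights come from trained attention networks rather than fixed vectors, and the frame features pass through an encoder first; the sketch only shows how local (frame-level) and global (segment-level) attention compose into one utterance embedding.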