College of Intelligent Equipment, Shandong University of Science and Technology, Taian, 271019, Shandong, China.
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, 266590, Shandong, China.
Sci Rep. 2022 Oct 28;12(1):18171. doi: 10.1038/s41598-022-22977-5.
Most current state-of-the-art speaker verification models are ResNet-based deep models or attention-based models. These models share a common weakness: a large number of parameters and high hardware requirements. In addition, many deep architectures generate embedding features only from the output of the last frame-level layer, so shallow features and channel-related features are ignored. To address these problems, this paper proposes a shallow speaker verification model based on Lambda-vector, whose main structure consists of three Lambda-SE modules. Each module extracts long-distance dependencies between frame-level features and channel-related interaction information to enhance the feature representation. To make full use of the information in both deep and shallow features, the model also introduces multi-layer feature aggregation, which fuses the features of different frame-level layers. This adds detailed information to the deep features and improves the model's ability to represent complex information. Experimental results on the public datasets VoxCeleb1 and VoxCeleb2 show that the model has a more stable training speed, fewer parameters, and better identification performance than the baseline models.
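The two ideas named in the abstract, channel-wise reweighting in the squeeze-and-excitation (SE) style and multi-layer feature aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `se_gate` and `aggregate`, the weight matrices `w1` and `w2`, and the choice of channel-wise concatenation as the fusion step are all assumptions made for illustration. The Lambda layer that models long-distance dependencies between frames is not sketched here.

```python
import math


def se_gate(x, w1, w2):
    """Squeeze-and-excitation style channel reweighting (illustrative only).

    x  : list of C channels, each a list of T frame values
    w1 : (C // r) x C bottleneck weights (hypothetical)
    w2 : C x (C // r) expansion weights (hypothetical)
    """
    # Squeeze: average each channel over time.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
         for row in w2]
    # Scale: multiply every frame of a channel by that channel's gate.
    return [[s_c * v for v in ch] for s_c, ch in zip(s, x)]


def aggregate(layer_outputs):
    """Fuse frame-level feature maps by concatenating along the channel axis.

    Concatenation is an assumed fusion rule; every layer must have the
    same number of frames.
    """
    n_frames = len(layer_outputs[0][0])
    fused = []
    for feat in layer_outputs:
        if any(len(ch) != n_frames for ch in feat):
            raise ValueError("all layers must share the same number of frames")
        fused.extend(feat)
    return fused
```

In this sketch, `aggregate([shallow, middle, deep])` would hand a pooling layer a feature map that still contains the shallow channels, which is the intuition behind fusing different frame-level layers rather than using only the last one.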