Wang Weijie, Zhao Hong, Yang Yikun, Chang YouKang, You Haojie
School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China.
School of Information Science & Engineering, Lanzhou University, Lanzhou, China.
PeerJ Comput Sci. 2023 Apr 21;9:e1276. doi: 10.7717/peerj-cs.1276. eCollection 2023.
In practical applications, short-utterance speaker verification (SV) is the task of accepting or rejecting a speaker's identity claim based on a few enrollment utterances. Traditional methods use deep neural networks to extract speaker representations for verification. Recently, several meta-learning approaches have learned a deep distance metric to distinguish speakers within meta-tasks. Among them, a prototypical network learns a metric space in which the distance to each speaker's prototype center can be computed in order to classify speaker identity. We use emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) to implement the function that the prototypical network requires, namely a nonlinear mapping from the input space to the metric space for the few-shot SV task. In addition, optimizing only for the speakers in the given meta-tasks may not be sufficient to learn discriminative speaker features. Thus, we use an episodic training strategy in which the classes of the support and query sets correspond to the classes of the entire training set, further improving model performance. The proposed model outperforms the comparison models on the VoxCeleb1 dataset and has a wide range of practical applications.
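The prototype-distance idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D vectors stand in for ECAPA-TDNN embeddings, squared Euclidean distance is assumed as the metric, and the helper names (`prototype`, `class_probabilities`, `verify`) and the threshold value are hypothetical choices made only for this example.

```python
import math

def prototype(embeddings):
    """Mean of a speaker's support embeddings (the class prototype)."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def sq_euclidean(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probabilities(query, prototypes):
    """Softmax over negative squared distances to each prototype."""
    logits = {spk: -sq_euclidean(query, p) for spk, p in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {spk: math.exp(v - m) for spk, v in logits.items()}
    total = sum(exps.values())
    return {spk: v / total for spk, v in exps.items()}

def verify(query, enrolled_prototype, threshold):
    """Accept the identity claim if the query lies close enough to the prototype."""
    return sq_euclidean(query, enrolled_prototype) <= threshold

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
support = {
    "spk_a": [[1.0, 0.0], [0.8, 0.2]],
    "spk_b": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = {spk: prototype(embs) for spk, embs in support.items()}

query = [0.9, 0.1]
probs = class_probabilities(query, prototypes)
predicted = max(probs, key=probs.get)
accepted = verify(query, prototypes["spk_a"], threshold=0.5)
```

In an episodic setup, the negative log of the probability assigned to the true speaker would serve as the per-query loss; in verification, only the distance to the claimed speaker's prototype is compared against a threshold tuned on held-out data.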