Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea.
Intelligent Image Processing Research Center, Korea Electronics Technology Institute, Seongnam 13509, Republic of Korea.
Sensors (Basel). 2024 Sep 25;24(19):6213. doi: 10.3390/s24196213.
Speaker recognition is a technology that identifies the speaker of an input utterance by extracting speaker-discriminative features from the speech signal. Because speaker recognition is used for system security and authentication, extracting features unique to each speaker is crucial for achieving high recognition rates. Representative approaches for extracting such features either train a classifier or use contrastive learning to model the speaker relationship between representations, and then take embeddings from a specific layer of the model. This paper introduces a framework for developing robust speaker recognition models through contrastive learning. The approach aims to minimize the similarity to hard negative samples: genuine negatives whose features are extremely similar to those of the positives and can therefore cause misidentification. Specifically, our proposed method trains the model by estimating hard negative samples within each mini-batch during contrastive learning, and then uses a cross-attention mechanism to determine speaker agreement for pairs of utterances. To demonstrate the effectiveness of the proposed method, we compared the performance of a deep learning model trained with a conventional loss function used in speaker recognition against that of a deep learning model trained with our proposed method, measured by the equal error rate (EER), an objective performance metric. Our results indicate that, when trained on the VoxCeleb2 dataset, the proposed method achieved an EER of 0.98% on the VoxCeleb1-E evaluation set and 1.84% on the VoxCeleb1-H evaluation set.
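As a rough illustration of the idea only (not the authors' implementation, and omitting the cross-attention speaker-agreement module), the sketch below shows a supervised contrastive loss over a mini-batch of speaker embeddings in which negatives that are highly similar to the anchor, i.e., hard negatives, are up-weighted. The function name hard_negative_supcon_loss and the hyperparameters temperature and beta are assumptions made for this example.

import torch
import torch.nn.functional as F

def hard_negative_supcon_loss(embeddings, speaker_ids, temperature=0.07, beta=1.0):
    # embeddings: (N, D) speaker embeddings for one mini-batch
    # speaker_ids: (N,) integer speaker labels
    z = F.normalize(embeddings, dim=1)              # compare embeddings by cosine similarity
    sim = z @ z.t() / temperature                   # (N, N) scaled similarity matrix
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)

    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    pos_mask = same & ~eye                          # same speaker, excluding self-pairs
    neg_mask = ~same                                # different speakers

    exp_sim = torch.exp(sim)

    # Importance weights that emphasize hard negatives: negatives whose similarity
    # to the anchor is high receive weight > 1, easy negatives receive weight < 1.
    with torch.no_grad():
        w = torch.exp(beta * sim) * neg_mask
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)
        w = w * neg_mask.sum(dim=1, keepdim=True)   # rescale so weights average to 1 per anchor

    weighted_neg = (w * exp_sim).sum(dim=1, keepdim=True)          # (N, 1)

    # log-probability of each positive pair against the re-weighted negative partition
    log_prob = sim - torch.log(exp_sim + weighted_neg)

    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0                           # anchors with at least one in-batch positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()

# Toy usage: 32 utterances from 8 speakers with 192-dimensional embeddings.
emb = torch.randn(32, 192, requires_grad=True)
labels = torch.randint(0, 8, (32,))
loss = hard_negative_supcon_loss(emb, labels)
loss.backward()

Computing the weights under torch.no_grad() treats the hard-negative estimate as fixed for each step, so gradients flow only through the similarity terms themselves; other weighting schemes are possible and the paper's exact formulation may differ.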