Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea.
Intelligent Image Processing Research Center, Korea Electronics Technology Institute, Seongnam 13509, Republic of Korea.
Sensors (Basel). 2024 Sep 25;24(19):6213. doi: 10.3390/s24196213.
Speaker recognition is a technology that identifies the speaker of an input utterance by extracting speaker-discriminative features from the speech signal. Because speaker recognition is used for system security and authentication, extracting features unique to each speaker is crucial for achieving high recognition rates. Representative approaches for extracting such features either train a classifier or use contrastive learning to model the speaker relationship between representations, and then take embeddings from a specific layer of the model. This paper introduces a framework for developing robust speaker recognition models through contrastive learning. The approach aims to minimize the similarity to hard negative samples: genuine negatives whose features are extremely similar to those of the positives and can therefore cause misidentification. Specifically, our proposed method trains the model by estimating hard negative samples within each mini-batch during contrastive learning, and then uses a cross-attention mechanism to determine speaker agreement for pairs of utterances. To demonstrate the effectiveness of the proposed method, we compared the performance of a deep learning model trained with a conventional loss function used in speaker recognition against that of a deep learning model trained with our proposed method, measured by the equal error rate (EER), an objective performance metric. Our results indicate that, when trained on the VoxCeleb2 dataset, the proposed method achieved an EER of 0.98% on the VoxCeleb1-E evaluation set and 1.84% on the VoxCeleb1-H evaluation set.
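As a rough illustration of the idea only (not the authors' implementation, and omitting the cross-attention speaker-agreement module), the sketch below shows a supervised contrastive loss over a mini-batch of speaker embeddings in which negatives that are highly similar to the anchor, i.e., hard negatives, are up-weighted. The function name hard_negative_supcon_loss and the hyperparameters temperature and beta are assumptions made for this example.

import torch
import torch.nn.functional as F

def hard_negative_supcon_loss(embeddings, speaker_ids, temperature=0.07, beta=1.0):
    # embeddings: (N, D) speaker embeddings for one mini-batch
    # speaker_ids: (N,) integer speaker labels
    z = F.normalize(embeddings, dim=1)              # compare embeddings by cosine similarity
    sim = z @ z.t() / temperature                   # (N, N) scaled similarity matrix
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)

    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    pos_mask = same & ~eye                          # same speaker, excluding self-pairs
    neg_mask = ~same                                # different speakers

    exp_sim = torch.exp(sim)

    # Importance weights that emphasize hard negatives: negatives whose similarity
    # to the anchor is high receive weight > 1, easy negatives receive weight < 1.
    with torch.no_grad():
        w = torch.exp(beta * sim) * neg_mask
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)
        w = w * neg_mask.sum(dim=1, keepdim=True)   # rescale so weights average to 1 per anchor

    weighted_neg = (w * exp_sim).sum(dim=1, keepdim=True)          # (N, 1)

    # log-probability of each positive pair against the re-weighted negative partition
    log_prob = sim - torch.log(exp_sim + weighted_neg)

    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0                           # anchors with at least one in-batch positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()

# Toy usage: 32 utterances from 8 speakers with 192-dimensional embeddings.
emb = torch.randn(32, 192, requires_grad=True)
labels = torch.randint(0, 8, (32,))
loss = hard_negative_supcon_loss(emb, labels)
loss.backward()

Computing the weights under torch.no_grad() treats the hard-negative estimate as fixed for each step, so gradients flow only through the similarity terms themselves; other weighting schemes are possible and the paper's exact formulation may differ.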