Attention-Based Temporal-Frequency Aggregation for Speaker Verification.

Affiliations

National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China.

Publication

Sensors (Basel). 2022 Mar 10;22(6):2147. doi: 10.3390/s22062147.

DOI: 10.3390/s22062147
PMID: 35336315
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8953125/
Abstract

Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into a single utterance-level representation. However, most existing aggregation methods pool the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that contribute most to the utterance-level representation. In combination with existing temporal aggregation methods, two more effective temporal-frequency aggregation methods are then derived. These two methods capture the speaker-dependent information contained in both the time and frequency domains of the frame-level features, thereby improving the discriminability of the speaker embedding. In addition, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves a superior equal error rate of 5.96% on VoxCeleb compared with state-of-the-art baseline models.
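The pipeline the abstract describes (frame-level CNN features pooled first over frequency with attention weights, then over time) can be sketched as follows. This is an illustrative NumPy sketch under assumed tensor shapes and a toy linear scoring function, not the paper's actual implementation; the names `attention_frequency_aggregation`, `w`, and `b` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_frequency_aggregation(feats, w, b):
    """Attention pooling over the frequency axis.

    feats: (T, F, C) frame-level features (time, frequency bands, channels).
    w: (C,), b: scalar -- parameters of a toy per-band scoring function.
    Returns (T, C): features with the frequency axis attention-pooled,
    so informative frequency bands receive larger weights.
    """
    scores = feats @ w + b            # (T, F) relevance score per band
    alpha = softmax(scores, axis=1)   # attention weights over frequency
    return (alpha[..., None] * feats).sum(axis=1)  # weighted sum -> (T, C)

def temporal_average(feats_tc):
    """Simple temporal aggregation: mean over frames -> utterance embedding."""
    return feats_tc.mean(axis=0)      # (C,)

# Toy example with random "CNN frontend" output.
rng = np.random.default_rng(0)
T, F, C = 100, 40, 64
feats = rng.standard_normal((T, F, C))
w, b = rng.standard_normal(C), 0.0

emb = temporal_average(attention_frequency_aggregation(feats, w, b))
print(emb.shape)  # (64,)
```

A temporal-frequency variant in the spirit of the paper would replace `temporal_average` with an attention-weighted pool over time as well, so both axes are aggregated adaptively rather than uniformly.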


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/555c/8953125/62329e682bbb/sensors-22-02147-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/555c/8953125/a90915d8c9a8/sensors-22-02147-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/555c/8953125/0f79169729bf/sensors-22-02147-g003a.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/555c/8953125/d074dd2e3bdb/sensors-22-02147-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/555c/8953125/b414fc19c896/sensors-22-02147-g005.jpg

Similar Articles

1. Attention-Based Temporal-Frequency Aggregation for Speaker Verification.
Sensors (Basel). 2022 Mar 10;22(6):2147. doi: 10.3390/s22062147.
2. ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition.
Sensors (Basel). 2023 Jan 20;23(3):1203. doi: 10.3390/s23031203.
3. Bidirectional Attention for Text-Dependent Speaker Verification.
Sensors (Basel). 2020 Nov 27;20(23):6784. doi: 10.3390/s20236784.
4. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.
Neural Netw. 2021 Jul;139:201-211. doi: 10.1016/j.neunet.2021.03.014. Epub 2021 Mar 18.
5. Few-shot short utterance speaker verification using meta-learning.
PeerJ Comput Sci. 2023 Apr 21;9:e1276. doi: 10.7717/peerj-cs.1276. eCollection 2023.
6. H-VECTORS: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model.
Neural Netw. 2021 Oct;142:329-339. doi: 10.1016/j.neunet.2021.05.024. Epub 2021 May 25.
7. Lambda-vector modeling temporal and channel interactions for text-independent speaker verification.
Sci Rep. 2022 Oct 28;12(1):18171. doi: 10.1038/s41598-022-22977-5.
8. BPCNN: Bi-Point Input for Convolutional Neural Networks in Speaker Spoofing Detection.
Sensors (Basel). 2022 Jun 14;22(12):4483. doi: 10.3390/s22124483.
9. Which to select?: Analysis of speaker representation with graph attention networks.
J Acoust Soc Am. 2024 Oct 1;156(4):2701-2708. doi: 10.1121/10.0032393.
10. Phonetic variability constrained bottleneck features for joint speaker recognition and physical task stress detection.
J Acoust Soc Am. 2020 Nov;148(5):2912. doi: 10.1121/10.0002455.

Cited By

1. ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition.
Sensors (Basel). 2023 Jan 20;23(3):1203. doi: 10.3390/s23031203.

References

1. Evaluating the Performance of Speaker Recognition Solutions in E-Commerce Applications.
Sensors (Basel). 2021 Sep 17;21(18):6231. doi: 10.3390/s21186231.
2. A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation.
Sensors (Basel). 2021 Jul 28;21(15):5097. doi: 10.3390/s21155097.
3. Bidirectional Attention for Text-Dependent Speaker Verification.
Sensors (Basel). 2020 Nov 27;20(23):6784. doi: 10.3390/s20236784.
4. Forensic Speaker Verification Using Ordinary Least Squares.
Sensors (Basel). 2019 Oct 10;19(20):4385. doi: 10.3390/s19204385.