
Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.

Authors

Tsipas Nikolaos, Vrysis Lazaros, Konstantoudakis Konstantinos, Dimoulas Charalampos

Affiliation

Aristotle University of Thessaloniki, Thessaloniki, Greece.

Publication

J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.

DOI: 10.1121/10.0002924
PMID: 33379899
Abstract

In this paper, an audio-driven, multimodal approach for speaker diarization in multimedia content is introduced and evaluated. The proposed algorithm is based on semi-supervised clustering of audio-visual embeddings, generated using deep learning techniques. The two modes, audio and video, are separately addressed; a long short-term memory Siamese neural network is employed to produce embeddings from audio, whereas a pre-trained convolutional neural network is deployed to generate embeddings from two-dimensional blocks representing the faces of speakers detected in video frames. In both cases, the models are trained using cost functions that favor smaller spatial distances between samples from the same speaker and greater spatial distances between samples from different speakers. A fusion stage, based on hypotheses derived from the established practices in television content production, is deployed on top of the unimodal sub-components to improve speaker diarization performance. The proposed methodology is evaluated against VoxCeleb, a large-scale dataset with hundreds of available speakers and AVL-SD, a newly developed, publicly available dataset aiming at capturing the peculiarities of TV news content under different scenarios. In order to promote reproducible research and collaboration in the field, the implemented algorithm is provided as an open-source software package.

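The training objective described in the abstract, which favors smaller spatial distances between samples from the same speaker and greater distances between samples from different speakers, is a standard contrastive loss. A minimal sketch, assuming Euclidean distance and an illustrative `margin` hyperparameter (the function name and margin value are not taken from the paper):

```python
import math

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    # Euclidean distance between the two embedding vectors.
    d = math.dist(emb_a, emb_b)
    if same_speaker:
        # Same speaker: penalize large distances quadratically.
        return 0.5 * d ** 2
    # Different speakers: penalize only pairs closer than `margin`.
    return 0.5 * max(0.0, margin - d) ** 2

# Toy 2-D embeddings: a and b from one speaker, c from another.
a, b, c = (0.1, 0.2), (0.1, 0.25), (2.0, -1.0)
```

With these toy vectors, the same-speaker pair (a, b) incurs a near-zero loss, while treating the distant pair (a, c) as same-speaker incurs a large one; a Siamese network trained on many such pairs learns embeddings that cluster by speaker.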

Similar Articles

1. Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.
   J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
2. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
   Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3. Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library.
   Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
4. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.
   Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
5. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.
   IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
6. Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization.
   IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1204-1219. doi: 10.1109/taslp.2021.3061885. Epub 2021 Feb 26.
7. Multimodal Speaker Diarization.
   IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
8. End-to-end neural speaker diarization with an iterative adaptive attractor estimation.
   Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
9. The Impact of Speaker Diarization on DNN-based Autism Severity Estimation.
   Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:3414-3417. doi: 10.1109/EMBC48229.2022.9871523.
10. Evaluation of Deep Clustering for Diarization of Aphasic Speech.
    Stud Health Technol Inform. 2019;260:81-88.

Cited By

1. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.
   Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.