Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation.

Authors

Kulangareth Nikhil Valsan, Kaufman Jaycee, Oreskovic Jessica, Fossat Yan

Affiliation

Klick Labs, Toronto, ON, Canada.

Publication

JMIR Biomed Eng. 2024 Mar 21;9:e56245. doi: 10.2196/56245.

DOI:10.2196/56245
PMID:38875685
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11041410/
Abstract

BACKGROUND

The digital era has witnessed an escalating dependence on digital platforms for news and information, coupled with the advent of "deepfake" technology. Deepfakes, leveraging deep learning models on extensive data sets of voice recordings and images, pose substantial threats to media authenticity, potentially leading to unethical misuse such as impersonation and the dissemination of false information.

OBJECTIVE

To counteract this challenge, this study aims to introduce the concept of innate biological processes to discern between authentic human voices and cloned voices. We propose that the presence or absence of certain perceptual features, such as pauses in speech, can effectively distinguish between cloned and authentic audio.

METHODS

A total of 49 adult participants representing diverse ethnic backgrounds and accents were recruited. Each participant contributed voice samples for training up to 3 distinct voice cloning text-to-speech models and recorded 3 control paragraphs. The cloning models then generated synthetic versions of the control paragraphs, resulting in a data set of up to 9 cloned audio samples and 3 control samples per participant. We analyzed the speech pauses caused by biological actions such as respiration, swallowing, and cognitive processes. Five audio features corresponding to speech pause profiles were calculated. Differences between authentic and cloned audio for these features were assessed, and 5 classical machine learning algorithms were implemented using these features to create a prediction model. The generalization capability of the optimal model was evaluated by testing on unseen data, incorporating a model-naive generator, a model-naive paragraph, and model-naive participants.
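The abstract does not define the five pause features precisely, so the following Python sketch is only an illustration of how such a pause profile could be computed, not the authors' implementation. It assumes the audio has already been segmented into speech intervals by some voice activity detector, treats every gap between consecutive intervals as a pause, and uses a hypothetical 0.5 s cutoff (`macro_threshold`) to separate micropauses from macropauses. The function name, feature names, and cutoff are all assumptions.

```python
from statistics import mean, pstdev


def pause_features(segments, total_duration, macro_threshold=0.5):
    """Compute an illustrative pause profile.

    segments: time-ordered list of (start, end) speech intervals in seconds,
              assumed to come from a voice activity detector.
    total_duration: length of the recording in seconds.
    macro_threshold: hypothetical cutoff (seconds) between micro- and macropauses.
    """
    if not segments:
        raise ValueError("at least one speech segment is required")

    # Length of each uninterrupted stretch of speech.
    seg_lengths = [end - start for start, end in segments]
    # Silent gaps between consecutive speech segments are treated as pauses.
    pauses = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]

    micro = [p for p in pauses if p < macro_threshold]
    macro = [p for p in pauses if p >= macro_threshold]

    return {
        # Assumption: "time between pauses" is taken as mean speech segment length.
        "mean_time_between_pauses": mean(seg_lengths),
        "segment_length_sd": pstdev(seg_lengths),
        "proportion_speaking": sum(seg_lengths) / total_duration,
        "micropause_rate": len(micro) / total_duration,  # pauses per second
        "macropause_rate": len(macro) / total_duration,  # pauses per second
    }


# Made-up example: three speech segments in an 8-second recording.
features = pause_features([(0.0, 2.0), (2.25, 5.25), (6.25, 7.25)], 8.0)
print(features)
```

Under these assumptions, a cloned sample that pauses less often would show longer speech segments, a higher proportion of time speaking, and lower pause rates, which is the direction of the differences reported in the Results.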

RESULTS

Cloned audio exhibited significantly increased time between pauses (P<.001), decreased variation in speech segment length (P=.003), increased overall proportion of time speaking (P=.04), and decreased rates of micro- and macropauses in speech (both P=.01). Five machine learning models were implemented using these features, with the AdaBoost model demonstrating the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 (SD 0.05). The other models were support vector machine (balanced accuracy 0.79, SD 0.03), random forest (balanced accuracy 0.78, SD 0.04), logistic regression (balanced accuracy 0.76, SD 0.10), and decision tree (balanced accuracy 0.72, SD 0.06). On unseen data, the optimal AdaBoost model achieved an overall test accuracy of 0.79.
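Balanced accuracy is commonly defined as the mean of the per-class recalls, which matters here because each participant can have up to 9 cloned samples but only 3 authentic ones. The sketch below uses that standard definition with made-up labels and predictions. It is not the study's data or code, and the abstract does not state how the metric was computed.

```python
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (standard definition of balanced accuracy)."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return mean(recalls)


# Made-up example with a 9:3 class imbalance (1 = cloned, 0 = authentic).
y_true = [1] * 9 + [0] * 3
# Hypothetical classifier: 8 of 9 cloned and 2 of 3 authentic samples correct.
y_pred = [1] * 8 + [0] + [0, 0, 1]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy:    {plain:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
```

In this toy case plain accuracy is 10/12, while balanced accuracy is (8/9 + 2/3)/2, which is lower because errors on the smaller authentic class carry equal weight.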

CONCLUSIONS

The incorporation of perceptual, biological features into machine learning models demonstrates promising results in distinguishing between authentic human voices and cloned audio.
