A Pre-Training Framework Based on Multi-Order Acoustic Simulation for Replay Voice Spoofing Detection.

Authors

Go Changhwan, Park Nam In, Jeon Oc-Yeub, Chun Chanjun

Affiliations

Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea.

Digital Analysis Division, National Forensic Service, Wonju 26460, Republic of Korea.

Publication information

Sensors (Basel). 2023 Aug 20;23(16):7280. doi: 10.3390/s23167280.

Abstract

Voice spoofing attempts to break into an automatic speaker verification (ASV) system by forging a user's voice; it can be carried out through text-to-speech (TTS), voice conversion (VC), or replay attacks. Deep learning-based countermeasures against voice spoofing have recently been developed. For replay attacks, however, large datasets are difficult to construct because each sample requires a physical recording process. To overcome this problem, this study proposes a pre-training framework based on multi-order acoustic simulation for replay voice spoofing detection. Multi-order acoustic simulation uses existing clean-signal and room impulse response (RIR) datasets to generate audio that simulates the various acoustic configurations of original and replayed audio. Here, the acoustic configuration refers to factors such as microphone type, reverberation, time delay, and noise that may arise between the speaker and the microphone during recording. We assume that a deep learning model trained on audio simulating these acoustic configurations can distinguish the acoustic configurations of original and replayed audio well. To validate this, we pre-trained a model to classify the audio generated by multi-order acoustic simulation into three classes: clean signal, audio simulating the acoustic configuration of original audio, and audio simulating the acoustic configuration of replayed audio. We then used the weights of the pre-trained model as the initial weights of the replay voice spoofing detection model and fine-tuned it on an existing replay voice spoofing dataset. To validate the effectiveness of the proposed method, we compared the conventional method without pre-training against the proposed method using two objective metrics, accuracy and F1-score. The conventional method achieved an accuracy of 92.94% and an F1-score of 86.92%, whereas the proposed method achieved an accuracy of 98.16% and an F1-score of 95.08%.
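To make the three pre-training classes concrete, the following is a minimal sketch of how multi-order simulation might be set up. It is not the authors' implementation. It assumes that "order" means the number of times a clean signal passes through a room-and-microphone channel (once for an original recording, twice for a replayed one), and it substitutes a toy synthetic RIR for a measured RIR dataset. The function names `synthetic_rir`, `apply_acoustics`, and `multi_order_examples`, along with the decay rate and SNR values, are hypothetical.

```python
import numpy as np


def synthetic_rir(length, decay, rng):
    """Toy RIR (assumption): exponentially decaying noise with a direct path.

    A real pipeline would draw a measured RIR from a dataset instead.
    """
    t = np.arange(length)
    rir = rng.standard_normal(length) * np.exp(-decay * t)
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))


def apply_acoustics(signal, rir, snr_db, rng):
    """Apply one acoustic pass: reverberation via convolution, then additive noise."""
    wet = np.convolve(signal, rir)[: len(signal)]  # keep the original length
    noise_power = np.mean(wet ** 2) / (10 ** (snr_db / 10))
    noise = rng.standard_normal(len(wet)) * np.sqrt(noise_power)
    return wet + noise


def multi_order_examples(clean, rng):
    """Return (audio, label) pairs for the three pre-training classes.

    Labels (assumed encoding): 0 = clean, 1 = simulated original, 2 = simulated replay.
    """
    first = apply_acoustics(clean, synthetic_rir(800, 0.01, rng), 30.0, rng)
    # Assumption: a replay is the first-order audio passed through a second channel.
    second = apply_acoustics(first, synthetic_rir(800, 0.01, rng), 30.0, rng)
    return [(clean, 0), (first, 1), (second, 2)]


rng = np.random.default_rng(0)
sample_rate = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)  # 1 s test tone
examples = multi_order_examples(clean, rng)
```

Each pair in `examples` could then feed a three-class classifier whose weights initialize the replay detector before fine-tuning, as the abstract describes.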


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/adf7/10458210/01d0723400fc/sensors-23-07280-g001.jpg
