Electrical Engineering Department, Prince Mohammad bin Fahd University, P.O. Box 1664, Al Khobar 31952, Saudi Arabia.
Department of Computer Engineering, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia.
Sensors (Basel). 2023 Jul 24;23(14):6637. doi: 10.3390/s23146637.
Voice-controlled devices are in demand because they allow hands-free operation. However, using them in sensitive scenarios such as smartphone applications and financial transactions requires protection against fraudulent attacks known as "speech spoofing". The algorithms used in spoofing attacks are generally unknown in practice; hence, further analysis and development of spoof-detection models are required to improve spoof classification. A study of the spoofed-speech spectrum suggests that high-frequency features discriminate well between genuine and spoofed speech. Typically, linear or triangular filter banks are used to obtain high-frequency features. However, a Gaussian filter can extract more global information than a triangular filter. In addition, MFCC features are preferred over other speech features because of their lower covariance. Therefore, this study proposes using a Gaussian filter to extract inverted MFCC (iMFCC) features, which capture high-frequency information. Complementary features are combined with iMFCC to strengthen the features that help discriminate spoofed speech. Deep learning has proven effective in classification applications, but the choice of hyper-parameters and architecture is crucial and directly affects performance. A Bayesian algorithm is therefore used to optimize the BiLSTM network. In this study, we build a high-frequency-based optimized BiLSTM network to classify spoofed-speech signals, and we present an extensive investigation using the ASVspoof 2017 dataset. The optimized BiLSTM model trained successfully in the fewest epochs and achieved a validation accuracy of 99.58%. The proposed algorithm achieved a 6.58% EER on the evaluation dataset, a relative improvement of 78% over a baseline spoof-identification system.
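The abstract does not spell out how the Gaussian filter bank is constructed, so the following is only a minimal sketch of one common way to build such a bank for inverted MFCC features. Everything specific here is an assumption made for illustration: the mel formula `2595 * log10(1 + f / 700)`, mirroring the mel centre frequencies about the middle of the band to obtain the inverted scale, the hypothetical width parameter `alpha`, and the choice of 20 filters over 257 spectral bins up to 8 kHz. None of these values is taken from the paper.

```python
import math


def hz_to_mel(f_hz):
    # Widely used mel-scale formula (assumed here, not taken from the paper).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def inverted_mel_centers(n_filters, f_max):
    """Centre frequencies (Hz) of an inverted mel filter bank on [0, f_max]."""
    mel_max = hz_to_mel(f_max)
    # Interior points equally spaced on the mel scale, converted back to Hz.
    mel_centers = [
        mel_to_hz(mel_max * k / (n_filters + 1)) for k in range(1, n_filters + 1)
    ]
    # Mirror about f_max / 2 so the dense spacing moves to high frequencies.
    return sorted(f_max - c for c in mel_centers)


def gaussian_filter_bank(n_filters, n_bins, f_max, alpha=2.0):
    """Return (centers, bank); bank[i][b] is the weight of filter i at bin b.

    `alpha` is a hypothetical parameter: larger values give narrower filters.
    """
    centers = inverted_mel_centers(n_filters, f_max)
    edges = [0.0] + centers + [f_max]
    bin_freqs = [f_max * b / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for i, c in enumerate(centers, start=1):
        # Assumed width rule: the wider of the two neighbour gaps over alpha.
        sigma = max(c - edges[i - 1], edges[i + 1] - c) / alpha
        bank.append(
            [math.exp(-((f - c) ** 2) / (2.0 * sigma ** 2)) for f in bin_freqs]
        )
    return centers, bank


if __name__ == "__main__":
    centers, bank = gaussian_filter_bank(n_filters=20, n_bins=257, f_max=8000.0)
    print([round(c, 1) for c in centers])
```

Unlike a triangular filter, each Gaussian filter assigns a non-zero (if small) weight to bins well outside its two neighbouring centres, which is one way to read the claim that it captures more global information. Under the usual MFCC recipe, the cepstral features would then be obtained by applying these weights to the power spectrum of each frame, taking the logarithm of the filter energies, and applying a DCT.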
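The reported results are an equal error rate (EER) and a relative improvement over a baseline. As a reminder of what these figures measure, here is a small sketch that computes both from detection scores. The function names, the convention that a higher score means "more likely genuine", and the toy scores are all assumptions for illustration; they are not the paper's evaluation code or data.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'genuine'.

    The EER is taken at the candidate threshold where the false-acceptance
    rate and the false-rejection rate are closest to each other.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


def relative_improvement(baseline_eer, new_eer):
    """Fraction by which new_eer is lower than baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    genuine = [0.9, 0.8, 0.7, 0.3]
    spoof = [0.1, 0.2, 0.4, 0.75]
    eer, threshold = equal_error_rate(genuine, spoof)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold; evaluation toolkits often interpolate instead.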