Speech and Dialogue Research Laboratory, University "Politehnica" of Bucharest, 060042 Bucharest, Romania.
Sensors (Basel). 2022 Feb 6;22(3):1228. doi: 10.3390/s22031228.
In this work, we first propose a deep neural network (DNN) system for voice activity detection (VAD), i.e., the automatic detection of speech in audio signals. Several DNN types were investigated, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), with CNNs yielding the best performance. Additional postprocessing techniques, namely hysteretic thresholding, minimum duration filtering, and bilateral extension, were employed to boost performance. The systems were trained and tested on several data subsets of the CENSREC-1-C database with different simulated ambient noise conditions. Additional testing was performed on a separate CENSREC-1-C data subset containing real ambient noise, as well as on a subset of the TIMIT database. An accuracy of up to 99.13% was obtained on the CENSREC-1-C datasets, and 97.60% on the TIMIT dataset. We then show how the final VAD system can be adapted and employed within an utterance-level deceptive speech detection (DSD) processing pipeline. The best DSD performance is achieved by a novel hybrid CNN-MLP network that fuses algorithmically and automatically extracted speech features, reaching an unweighted accuracy (UA) of 63.7% on the RLDD database and 62.4% on the RODeCAR database.
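The three postprocessing steps named in the abstract can be illustrated with a minimal Python sketch operating on per-frame speech probabilities. The abstract does not give the exact definitions or parameter values, so everything below is an assumption for illustration only: the function names, the thresholds `high` and `low`, the minimum segment length `min_frames`, and the extension width `ext` are hypothetical, and the step order (thresholding, then duration filtering, then extension) is one plausible choice rather than the authors' confirmed pipeline.

```python
from typing import List, Tuple


def hysteresis_threshold(probs: List[float], high: float = 0.6,
                         low: float = 0.4) -> List[bool]:
    """Switch to 'speech' when p >= high; switch back only when p < low.

    The two thresholds are assumed values, not taken from the paper.
    """
    labels = []
    active = False
    for p in probs:
        if not active and p >= high:
            active = True
        elif active and p < low:
            active = False
        labels.append(active)
    return labels


def speech_segments(labels: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges of consecutive speech frames."""
    segs = []
    start = None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


def min_duration_filter(labels: List[bool], min_frames: int = 3) -> List[bool]:
    """Drop speech segments shorter than min_frames (assumed interpretation)."""
    out = [False] * len(labels)
    for s, e in speech_segments(labels):
        if e - s >= min_frames:
            for i in range(s, e):
                out[i] = True
    return out


def bilateral_extension(labels: List[bool], ext: int = 1) -> List[bool]:
    """Extend every speech segment by ext frames on both sides."""
    out = list(labels)
    n = len(labels)
    for s, e in speech_segments(labels):
        for i in range(max(0, s - ext), min(n, e + ext)):
            out[i] = True
    return out


def postprocess(probs: List[float], high: float = 0.6, low: float = 0.4,
                min_frames: int = 3, ext: int = 1) -> List[bool]:
    """Chain the three steps in one assumed order."""
    labels = hysteresis_threshold(probs, high, low)
    labels = min_duration_filter(labels, min_frames)
    return bilateral_extension(labels, ext)
```

With these assumed settings, an isolated one-frame spike is discarded by the duration filter, while a longer run that dips between the two thresholds is kept as a single segment and then padded by one frame on each side.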
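For readers unfamiliar with the UA metric reported for the DSD experiments, the following Python sketch shows how it is commonly computed, assuming the usual definition of unweighted accuracy as the mean of per-class recalls (the abstract itself does not spell out the formula). The function name and the example labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_accuracy(y_true: Sequence[Hashable],
                        y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls, so every class counts equally.

    Assumes the common definition of UA; classes are taken from y_true.
    """
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced example: 8 truthful and 2 deceptive utterances.
y_true = ["truth"] * 8 + ["lie"] * 2
y_pred = ["truth"] * 8 + ["truth", "lie"]
ua = unweighted_accuracy(y_true, y_pred)
```

In this toy example, plain accuracy would be 9/10, while the per-class recalls are 1.0 and 0.5, so UA penalizes the missed minority-class utterance more heavily. This is why UA is often preferred for imbalanced two-class problems such as deception detection.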