

On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition.

Affiliations

Institute of Multimedia Information and Communication Technologies, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia.

Institute of Robotics and Cybernetics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia.

Publication Information

Sensors (Basel). 2021 Mar 8;21(5):1888. doi: 10.3390/s21051888.

DOI: 10.3390/s21051888
PMID: 33800348
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7962835/
Abstract

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, to what extent, etc. This study aims to extend the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification region lengths and overlaps), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in a 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0-8 kHz frequency range. Spectrograms carrying vocal tract and excitation information also score well. It was found that even basic processing like pre-emphasis, segmentation, magnitude modifications, etc., can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
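The time-domain front end named in the abstract (pre-emphasis, segmentation into overlapping windowed frames) can be sketched as below. This is a minimal illustration, not the paper's pipeline: the 0.97 pre-emphasis coefficient, 25 ms frame length, 50% overlap, and Hamming window are common textbook defaults, and the input is a synthetic sine rather than speech.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=25.0, overlap=0.5):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    hop = max(1, int(frame_len * (1.0 - overlap)))
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return frames * np.hamming(frame_len)

# Synthetic 1-second, 16 kHz sine as a stand-in for a speech utterance
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)

emphasized = preemphasize(x)
frames = frame_signal(emphasized, sr, frame_ms=25.0, overlap=0.5)
print(frames.shape)  # 25 ms at 16 kHz -> 400-sample frames, 200-sample hop
```

Per the study's findings, choices made at exactly this stage (window type, frame length, overlap) are among the settings that can measurably shift recognition accuracy.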

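The vocal tract / excitation split described in the abstract (LPC modeling of the vocal tract, inverse LPC filtering to recover the excitation) can be sketched as follows. This is an assumed, simplified illustration: the two-sinusoid "voiced" frame is synthetic, and LPC order 4 is chosen only because it suffices for two sinusoids; real speech front ends typically use higher orders.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Solve the Yule-Walker (autocorrelation) equations for LPC coefficients."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^{-k}

def inverse_lpc_filter(frame, A):
    """FIR-filter the frame with A(z) to expose the excitation (residual)."""
    return np.convolve(frame, A, mode="full")[: len(frame)]

# Synthetic 30 ms "voiced" frame at 8 kHz: two harmonics plus a trace of noise
sr = 8000
rng = np.random.default_rng(0)
t = np.arange(int(0.03 * sr)) / sr
frame = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 450 * t)
frame = frame + 1e-3 * rng.standard_normal(len(t))

A = lpc_coefficients(frame, order=4)
residual = inverse_lpc_filter(frame, A)
ratio = residual @ residual / (frame @ frame)
print(f"residual/frame energy ratio: {ratio:.4f}")
```

When the all-pole model fits the frame, nearly all of the energy is explained by A(z) and the residual (excitation estimate) is small; features can then be drawn separately from the model (vocal tract) and the residual (excitation), as the study does.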

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/3a92e09d3d47/sensors-21-01888-g021.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/80dc94fedb0e/sensors-21-01888-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/82a465e7acac/sensors-21-01888-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/cc322f79d679/sensors-21-01888-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/564fd6e534ff/sensors-21-01888-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/2b713621809d/sensors-21-01888-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/e90c387bf9a8/sensors-21-01888-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/e29192e32b52/sensors-21-01888-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/66c45807157d/sensors-21-01888-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/2f1401993478/sensors-21-01888-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/e287cee53d49/sensors-21-01888-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/6053e008e097/sensors-21-01888-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/aeb79415863c/sensors-21-01888-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/f28d6884d206/sensors-21-01888-g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/963fc8ecf162/sensors-21-01888-g014.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/e0d8a58f6c37/sensors-21-01888-g015.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/705d3243a8c8/sensors-21-01888-g016.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/1ac11031c85c/sensors-21-01888-g017.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/baa6ba6e21d4/sensors-21-01888-g018.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/b3fbc7a1334f/sensors-21-01888-g019.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c26/7962835/4fa81844ea5a/sensors-21-01888-g020.jpg

Similar Articles

1. On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition.
   Sensors (Basel). 2021 Mar 8;21(5):1888. doi: 10.3390/s21051888.
2. Multiscale Amplitude Feature and Significance of Enhanced Vocal Tract Information for Emotion Classification.
   IEEE Trans Cybern. 2018 Jan 8. doi: 10.1109/TCYB.2017.2787717.
3. Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders.
   Sensors (Basel). 2023 Jul 24;23(14):6640. doi: 10.3390/s23146640.
4. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN.
   Sensors (Basel). 2017 Jul 24;17(7):1694. doi: 10.3390/s17071694.
5. Detecting emotional valence using time-domain analysis of speech signals.
   Annu Int Conf IEEE Eng Med Biol Soc. 2019 Jul;2019:3605-3608. doi: 10.1109/EMBC.2019.8857691.
6. 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms.
   Entropy (Basel). 2019 May 8;21(5):479. doi: 10.3390/e21050479.
7. Time-frequency feature representation using multi-resolution texture analysis and acoustic activity detector for real-life speech emotion recognition.
   Sensors (Basel). 2015 Jan 14;15(1):1458-78. doi: 10.3390/s150101458.
8. Stressed Speech Emotion Recognition Using Teager Energy and Spectral Feature Fusion with Feature Optimization.
   Comput Intell Neurosci. 2023 Oct 11;2023:5765760. doi: 10.1155/2023/5765760. eCollection 2023.
9. Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications.
   Sensors (Basel). 2022 Aug 22;22(16):6304. doi: 10.3390/s22166304.
10. Intelligibility of emotional speech in younger and older adults.
    Ear Hear. 2014 Nov-Dec;35(6):695-707. doi: 10.1097/AUD.0000000000000082.

Cited By

1. Effectiveness of a Biofeedback Intervention Targeting Mental and Physical Health Among College Students Through Speech and Physiology as Biomarkers Using Machine Learning: A Randomized Controlled Trial.
   Appl Psychophysiol Biofeedback. 2024 Mar;49(1):71-83. doi: 10.1007/s10484-023-09612-3. Epub 2024 Jan 2.
2. End-to-End Model-Based Detection of Infants with Autism Spectrum Disorder Using a Pretrained Model.
   Sensors (Basel). 2022 Dec 25;23(1):202. doi: 10.3390/s23010202.
3. Speech Emotion Recognition Based on Modified ReliefF.
   Sensors (Basel). 2022 Oct 25;22(21):8152. doi: 10.3390/s22218152.
4. Improved Feature Parameter Extraction from Speech Signals Using Machine Learning Algorithm.
   Sensors (Basel). 2022 Oct 24;22(21):8122. doi: 10.3390/s22218122.
5. Global and local feature fusion long and short-term memory mechanism for dance emotion recognition in robot.
   Front Neurorobot. 2022 Aug 24;16:998568. doi: 10.3389/fnbot.2022.998568. eCollection 2022.
6. Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications.
   Sensors (Basel). 2022 Aug 22;22(16):6304. doi: 10.3390/s22166304.
7. The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning.
   Sensors (Basel). 2022 Mar 23;22(7):2461. doi: 10.3390/s22072461.
8. Human-Computer Interaction with Detection of Speaker Emotions Using Convolution Neural Networks.
   Comput Intell Neurosci. 2022 Mar 31;2022:7463091. doi: 10.1155/2022/7463091. eCollection 2022.

References

1. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features.
   Sensors (Basel). 2020 Sep 12;20(18):5212. doi: 10.3390/s20185212.
2. Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network.
   Sensors (Basel). 2019 Jun 18;19(12):2730. doi: 10.3390/s19122730.
3. Mental status assessment of disaster relief personnel by vocal affect display based on voice emotion recognition.
   Disaster Mil Med. 2017 Apr 8;3:4. doi: 10.1186/s40696-017-0032-0. eCollection 2017.
4. Evaluating deep learning architectures for Speech Emotion Recognition.
   Neural Netw. 2017 Aug;92:60-68. doi: 10.1016/j.neunet.2017.02.013. Epub 2017 Mar 21.
5. Deep learning in neural networks: an overview.
   Neural Netw. 2015 Jan;61:85-117. doi: 10.1016/j.neunet.2014.09.003. Epub 2014 Oct 13.
6. The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology.
   Dev Psychopathol. 2005 Summer;17(3):715-34. doi: 10.1017/S0954579405050340.