Zeng Chi, Li Jialing, Habibi Abbas
Xinyang Vocational and Technical College, Xinyang, 464000, Henan, China.
School of Artificial Intelligence, Chongqing Youth Vocational & Technical College, Chongqing, 401320, China.
Sci Rep. 2025 Jul 18;15(1):26158. doi: 10.1038/s41598-025-08703-x.
Effective speech emotion recognition (SER) is a significant challenge because human emotions are intricate and subjective. Accurately recognizing emotional states from speech signals has a broad range of practical applications, including healthcare, human-computer interaction, and social robotics. This study introduces an approach that combines deep learning with metaheuristic algorithms to improve the efficiency of SER systems. Specifically, a stacked autoencoder (SAE) serves as the primary model, and its performance is fine-tuned using a nature-inspired hybrid algorithm that combines particle swarm optimization (PSO) with Grass Fibrous Root Optimization (GFRO). The proposed model extracts spectral and pitch features from speech signals, including spectral crest, spectral entropy, spectral flux, and harmonic ratio, to capture emotional cues. Its performance is evaluated on a standard emotion recognition dataset and compared with several state-of-the-art models, including a Convolutional Neural Network (CNN), a Support Vector Machine (SVM), a Deep Learning (DL) model, a CNN combined with Iterative Neighborhood Component Analysis (CNN/INCA), and VGG-16. The proposed model achieves high accuracy in identifying various emotional states.
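As an illustration of three of the spectral features named above, the following is a minimal sketch that uses common textbook definitions of spectral crest, spectral entropy, and spectral flux. It is not the authors' implementation: the function names `frame_signal` and `spectral_features`, the frame length, the hop size, and the Hann window are all assumptions made for this example, and the harmonic ratio is omitted.

```python
import numpy as np


def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping frames (hypothetical helper)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def spectral_features(x, frame_len=512, hop=256, eps=1e-12):
    """Return one row per frame: [spectral crest, spectral entropy, spectral flux].

    Assumed definitions (common in the audio literature, not taken from the paper):
      crest   = max(|X|) / mean(|X|)
      entropy = Shannon entropy of the normalized power spectrum, scaled to [0, 1]
      flux    = Euclidean distance between consecutive magnitude spectra
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    frames = frames * np.hanning(frame_len)      # taper each frame
    mag = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrum per frame
    power = mag ** 2

    crest = mag.max(axis=1) / (mag.mean(axis=1) + eps)

    p = power / (power.sum(axis=1, keepdims=True) + eps)
    entropy = -(p * np.log2(p + eps)).sum(axis=1) / np.log2(p.shape[1])

    # The first frame has no predecessor, so its flux is set to 0.
    diffs = np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))
    flux = np.concatenate([[0.0], diffs])

    return np.column_stack([crest, entropy, flux])
```

With these definitions, a pure tone yields a high crest and low entropy because its energy sits in a few frequency bins, whereas white noise yields a low crest and an entropy near 1. Per-frame vectors of this kind are the sort of input a stacked autoencoder could consume.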