Md Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S M Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim, Hezerul Abdul Karim
Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh.
Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya 63100, Selangor, Malaysia.
J Imaging. 2025 Aug 14;11(8):273. doi: 10.3390/jimaging11080273.
Emotion recognition in speech is essential for enhancing human-computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, which struggle to remain robust and accurate on noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition that addresses the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model's robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time-frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into its emotional content. Next, we employ a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first uses 1D convolutional neural networks (1D CNNs), the second integrates a 1D CNN with Long Short-Term Memory (LSTM), and the third combines a 1D CNN with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. Each model produces predicted class scores, which we combine through ensemble learning with a soft voting technique to yield the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates that combining diverse learning models yields a more comprehensive solution than existing approaches. We assess performance on three primary datasets (SUBESCO, BanglaSER, and a merged version of both) and two external datasets (RAVDESS and EMODB). Our method achieves accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% on SUBESCO, BanglaSER, the merged SUBESCO and BanglaSER set, RAVDESS, and EMODB, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
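The abstract does not name the specific augmentation techniques used. A minimal sketch, assuming three common waveform-level augmentations (additive noise, time stretching, and pitch shifting) implemented with librosa; the noise level, stretch rate, and semitone shift are illustrative assumptions, not values from the paper:

```python
import numpy as np
import librosa

def augment(y, sr):
    """Yield augmented copies of a waveform y sampled at rate sr.

    These three augmentations are assumed for illustration; the paper's
    abstract only says 'various data augmentation techniques'.
    """
    yield y + 0.005 * np.random.randn(len(y))               # additive Gaussian noise
    yield librosa.effects.time_stretch(y, rate=1.1)         # mild speed-up
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # shift up two semitones
```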
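As a rough illustration of how the listed handcrafted features could be assembled into a single 1D numerical vector with librosa, frame-wise descriptors averaged over time; the sampling rate, MFCC count, and use of time-averaging are assumptions rather than details taken from the paper:

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Build a 1D feature vector covering the feature set named in the
    abstract: ZCR, chromagram, spectral centroid/roll-off/contrast/flatness,
    MFCCs, RMS energy, and Mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))
    feats = [
        np.mean(librosa.feature.zero_crossing_rate(y=y)),                    # ZCR
        *np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1),        # chromagram
        np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),              # centroid
        np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),               # roll-off
        *np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1),  # contrast
        np.mean(librosa.feature.spectral_flatness(y=y)),                     # flatness
        *np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1),   # MFCCs
        np.mean(librosa.feature.rms(y=y)),                                   # RMS energy
        *np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1),        # Mel-spectrogram
    ]
    return np.array(feats, dtype=np.float32)
```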
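A minimal Keras sketch of the three streams described above: a plain 1D CNN, a 1D CNN followed by an LSTM, and a 1D CNN followed by a Bi-LSTM. Layer counts, filter widths, and unit sizes are illustrative assumptions; the abstract does not give the exact topology:

```python
from tensorflow.keras import layers, models

def build_streams(input_len, n_classes):
    """Return the three stream models, each mapping a (input_len, 1)
    feature vector to a softmax distribution over n_classes emotions."""
    def conv_block(x):
        x = layers.Conv1D(64, 5, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(2)(x)
        x = layers.Conv1D(128, 5, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(2)(x)
        return x

    streams = []
    for head in ("cnn", "lstm", "bilstm"):
        inp = layers.Input(shape=(input_len, 1))
        x = conv_block(inp)
        if head == "cnn":
            x = layers.GlobalAveragePooling1D()(x)      # stream 1: 1D CNN only
        elif head == "lstm":
            x = layers.LSTM(64)(x)                      # stream 2: 1D CNN + LSTM
        else:
            x = layers.Bidirectional(layers.LSTM(64))(x)  # stream 3: 1D CNN + Bi-LSTM
        out = layers.Dense(n_classes, activation="softmax")(x)
        streams.append(models.Model(inp, out, name=f"stream_{head}"))
    return streams
```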
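The soft voting step itself reduces to averaging each stream's predicted class probabilities and taking the argmax, as in this sketch:

```python
import numpy as np

def soft_vote(prob_lists):
    """Soft voting over the streams' predicted scores.

    prob_lists: list of arrays, each (n_samples, n_classes), one per stream.
    Returns the class with the highest mean probability for each sample.
    """
    mean_probs = np.mean(np.stack(prob_lists), axis=0)
    return np.argmax(mean_probs, axis=1)

# Usage: final = soft_vote([m.predict(X) for m in streams])
```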