Liu Dong, Wang Zhiyong, Wang Lifeng, Chen Longxi
School of Information Engineering, Shandong Youth University of Political Science, Jinan, China.
Front Neurorobot. 2021 Jul 9;15:697634. doi: 10.3389/fnbot.2021.697634. eCollection 2021.
Redundant information and noise generated during single-modal feature extraction make it difficult for traditional learning algorithms to achieve ideal recognition performance. A multi-modal fusion emotion recognition method for speech and facial expressions based on deep learning is therefore proposed. First, a dedicated feature extraction method is set up for each single modality: speech features are extracted with a convolutional neural network-long short-term memory (CNN-LSTM) network, and facial expressions in the video are extracted with the Inception-ResNet-v2 network. Then, a long short-term memory (LSTM) network is used to capture the correlations both between and within modalities. After feature selection by the chi-square test, the single-modality features are concatenated to obtain a unified fusion feature. Finally, the fused features output by the LSTM are fed to the LIBSVM classifier to realize the final emotion recognition. Experimental results show that the recognition accuracies of the proposed method on the MOSI and MELD datasets are 87.56% and 90.06%, respectively, which are better than those of the comparison methods. This lays a theoretical foundation for the application of multimodal fusion in emotion recognition.
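The abstract does not give layer configurations, training details, or the exact fusion order, so the following is only a minimal, hypothetical sketch of the described pipeline: a CNN-LSTM stand-in for the speech branch, a small placeholder CNN in place of Inception-ResNet-v2 (to keep the example self-contained without pretrained weights), an LSTM over the stacked modality vectors for cross-modal correlation, chi-square feature selection, and a LIBSVM-backed classifier via scikit-learn's SVC. The random tensors stand in for MOSI/MELD samples; all dimensions are assumptions.

```python
# Hypothetical sketch of the multimodal fusion pipeline; not the authors' exact model.
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM


class AudioCNNLSTM(nn.Module):
    """Stand-in for the CNN-LSTM speech feature extractor."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)

    def forward(self, x):                      # x: (batch, n_mels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time', 64)
        _, (h_n, _) = self.lstm(h)
        return h_n[-1]                         # (batch, hidden)


class VisualBackbone(nn.Module):
    """Placeholder for Inception-ResNet-v2; a tiny CNN keeps the sketch runnable."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, out_dim),
        )

    def forward(self, x):                      # x: (batch, 3, H, W)
        return self.net(x)


# Toy batch: random tensors stand in for MOSI/MELD samples.
batch, n_classes = 32, 7
audio = torch.randn(batch, 64, 200)           # log-mel spectrogram clips (assumed input)
frames = torch.randn(batch, 3, 112, 112)      # one face crop per clip (assumed input)
labels = np.random.randint(0, n_classes, size=batch)

audio_net, visual_net = AudioCNNLSTM(), VisualBackbone()
fusion_lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)

with torch.no_grad():
    a_feat = audio_net(audio)                 # (batch, 128) speech features
    v_feat = visual_net(frames)               # (batch, 128) facial features
    # Treat the two modality vectors as a 2-step sequence so the LSTM can
    # model cross-modal correlation, then keep its final hidden state.
    seq = torch.stack([a_feat, v_feat], dim=1)    # (batch, 2, 128)
    _, (h_n, _) = fusion_lstm(seq)
    fused = torch.cat([h_n[-1], a_feat, v_feat], dim=1).numpy()

# Chi-square selection requires non-negative inputs, so rescale first.
fused = MinMaxScaler().fit_transform(fused)
selected = SelectKBest(chi2, k=64).fit_transform(fused, labels)

clf = SVC(kernel="rbf").fit(selected, labels)     # LIBSVM-backed classifier
print("train accuracy:", clf.score(selected, labels))
```

In practice the backbones would be pretrained and fine-tuned and the classifier evaluated on held-out MOSI/MELD splits; this sketch only illustrates how the extraction, LSTM fusion, chi-square selection, and SVM stages connect.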