Zhao Xiaoming, Liao Yuehui, Tang Zhiwei, Xu Yicheng, Tao Xin, Wang Dandan, Wang Guoyu, Lu Hongsheng
Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
School of Computer Science, Hangzhou Dianzi University, Hangzhou, China.
Front Neurosci. 2023 Jan 6;16:1107284. doi: 10.3389/fnins.2022.1107284. eCollection 2022.
Personality trait recognition, which aims to infer people's psychological characteristics from first-impression behavioral data, has recently become an interesting and active topic in psychology, affective neuroscience, and artificial intelligence. To effectively exploit the spatio-temporal cues in audio-visual modalities, this paper proposes a new multimodal personality trait recognition method that integrates audio and visual modalities within a hybrid deep learning framework comprising convolutional neural networks (CNNs), a bi-directional long short-term memory network (Bi-LSTM), and a Transformer network. In particular, a pre-trained deep audio CNN model learns high-level segment-level audio features, while a pre-trained deep face CNN model separately learns high-level frame-level global scene features and local face features from each frame of the dynamic video sequences. These extracted deep audio-visual features are then fed into a Bi-LSTM and a Transformer network to individually capture long-term temporal dependencies, producing the final global audio and visual features for downstream tasks. Finally, linear regression is employed for the single audio-based and visual-based personality trait recognition tasks, and a decision-level fusion strategy produces the final Big-Five personality scores and interview scores. Experimental results on the public ChaLearn First Impressions V2 personality dataset demonstrate the effectiveness of the proposed method, which outperforms the other methods evaluated.
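The decision-level fusion step described in the abstract can be sketched as a weighted average of the two modality-specific regression outputs. This is a minimal illustration, not the paper's implementation: the score values below are made up, and the equal fusion weight is an assumption (the paper does not state its fusion weights here).

```python
import numpy as np

# Hypothetical per-modality predictions in [0, 1], e.g. from the audio
# branch (audio CNN -> Bi-LSTM/Transformer -> linear regression) and the
# visual branch. Order of outputs (assumed): Openness, Conscientiousness,
# Extraversion, Agreeableness, Neuroticism, interview score.
audio_scores = np.array([0.62, 0.55, 0.48, 0.70, 0.51, 0.58])
visual_scores = np.array([0.58, 0.61, 0.52, 0.66, 0.49, 0.60])

def fuse_scores(audio, visual, w_audio=0.5):
    """Decision-level fusion: weighted average of the modality-specific
    regression outputs. w_audio=0.5 (equal weighting) is an assumption."""
    return w_audio * audio + (1.0 - w_audio) * visual

def mean_accuracy(pred, target):
    """Evaluation metric commonly used on ChaLearn First Impressions:
    mean accuracy = 1 - mean absolute error of the predicted scores."""
    return 1.0 - np.mean(np.abs(pred - target))

fused = fuse_scores(audio_scores, visual_scores)
```

With equal weights, each fused score is simply the mean of the two modality predictions, so the fused Openness score above is (0.62 + 0.58) / 2 = 0.60.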