Gu Yue, Li Xinyu, Huang Kaixiang, Fu Shiyu, Yang Kangning, Chen Shuhong, Zhou Moliang, Marsic Ivan
Rutgers University.
Amazon Inc., Rutgers University.
Proc ACM Int Conf Multimed. 2018 Oct;2018:537-545. doi: 10.1145/3240508.3240714.
Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data, which are then combined into conversation-level features. The corresponding hierarchical decoder can predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regression tasks on three datasets. It also outperformed previous approaches in generalization tests on two commonly used datasets. When predicting co-existing labels, the single proposed model achieved performance comparable to that of multiple individual models. In addition, the easily visualized modality and temporal attention demonstrated that the proposed attention mechanism helps feature selection and improves model interpretability.
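The abstract does not spell out how the modality attention is computed, so the following is only a minimal sketch of one common way to fuse per-word features with attention over modalities, not the authors' formulation. It assumes each modality has already been encoded into a vector of the same length, scores each modality with a dot product against a scoring vector, normalizes the scores with a softmax, and returns the weighted sum. The names `modality_attention_fusion`, `score_vector`, and `word_features`, along with all numeric values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_fusion(features, score_vector):
    """Fuse same-length modality vectors with attention weights.

    features: dict mapping a modality name to a list of floats.
    score_vector: list of floats used to score each modality
                  (hypothetical stand-in for learned parameters).
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(features)
    # One scalar relevance score per modality (dot product).
    scores = [
        sum(f * w for f, w in zip(features[name], score_vector))
        for name in names
    ]
    weights = softmax(scores)
    dim = len(score_vector)
    # Weighted sum of the modality vectors, element by element.
    fused = [
        sum(wt * features[name][i] for wt, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy features for a single word (illustrative values only).
word_features = {
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "video": [0.0, 0.0],
}
fused, weights = modality_attention_fusion(word_features, [1.0, 1.0])
```

Because the weights sum to one and are tied to named modalities, they can be inspected or plotted directly, which is the kind of interpretability the abstract attributes to its attention mechanism.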