Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder.

Authors

Gu Yue, Li Xinyu, Huang Kaixiang, Fu Shiyu, Yang Kangning, Chen Shuhong, Zhou Moliang, Marsic Ivan

Affiliations

Rutgers University.

Amazon Inc., Rutgers University.

Publication information

Proc ACM Int Conf Multimed. 2018 Oct;2018:537-545. doi: 10.1145/3240508.3240714.

Abstract

Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data and then aggregates them into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regression tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets. Using a single proposed model rather than multiple individual models, we achieved comparable performance in predicting co-existing labels. In addition, the easily visualized modality and temporal attention demonstrated that the proposed attention mechanism helps feature selection and improves model interpretability.
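The abstract does not spell out how the modality attention fusion is computed. As a rough illustration of the general idea only, the following minimal sketch fuses per-word video, audio, and text feature vectors with softmax attention weights. The function name `modality_attention_fusion`, the dot-product scoring with a single `score_vector`, and the feature dimension of 8 are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def modality_attention_fusion(features, score_vector):
    """Fuse per-modality feature vectors with attention weights.

    features:     array of shape (num_modalities, dim), one row per modality.
    score_vector: array of shape (dim,), a hypothetical learned scoring vector.
    Returns the fused vector of shape (dim,) and the attention weights.
    """
    scores = features @ score_vector   # one scalar score per modality
    weights = softmax(scores)          # non-negative, sums to 1
    fused = weights @ features         # attention-weighted sum of the rows
    return fused, weights


# Toy input: random stand-ins for video, audio, and text features of one word.
rng = np.random.default_rng(0)
word_features = rng.normal(size=(3, 8))
score_vector = rng.normal(size=8)

fused, weights = modality_attention_fusion(word_features, score_vector)
print("modality weights (video, audio, text):", weights)
print("fused feature shape:", fused.shape)
```

Because the weights form a distribution over the modalities, they can be plotted directly per word, which is one plausible reading of the "easily visualized" modality attention mentioned in the abstract.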
