
Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder.

Author Information

Gu Yue, Li Xinyu, Huang Kaixiang, Fu Shiyu, Yang Kangning, Chen Shuhong, Zhou Moliang, Marsic Ivan

Affiliations

Rutgers University.

Amazon Inc., Rutgers University.

Publication Information

Proc ACM Int Conf Multimed. 2018 Oct;2018:537-545. doi: 10.1145/3240508.3240714.

DOI: 10.1145/3240508.3240714
PMID: 32201865
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7085889/
Abstract

Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data that are then formulated into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regression tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets. We achieved comparable performance in predicting co-existing labels using the proposed model instead of multiple individual models. In addition, the easily-visualized modality and temporal attention demonstrated that the proposed attention mechanism helps feature selection and improves model interpretability.
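As a rough illustration of the modality-attention fusion the abstract describes, the sketch below weights per-modality feature vectors by a softmax over scalar relevance scores and sums them into one fused vector. The feature values and scores here are toy inputs invented for illustration; in the paper the relevance scores are learned by the network, not supplied by hand.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def modality_attention_fusion(features, scores):
    """Fuse equal-length per-modality feature vectors into one vector.

    features: one feature vector per modality (e.g. video, audio, text).
    scores:   one scalar relevance score per modality (hypothetical
              stand-ins for the learned attention scores).
    Returns (attention weights, fused feature vector).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return weights, fused

# Three toy modalities (video, audio, text) with 2-d features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scores give equal attention weights of 1/3 each.
weights, fused = modality_attention_fusion(feats, [0.0, 0.0, 0.0])
```

With equal scores the fusion reduces to an average; skewing one score toward a more informative modality shifts the fused vector toward that modality's features, which is what makes the weights easy to visualize for interpretability.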


Similar Articles

1. Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder.
Proc ACM Int Conf Multimed. 2018 Oct;2018:537-545. doi: 10.1145/3240508.3240714.
2. TEP2MP: A text-emotion prediction model oriented to multi-participant text-conversation scenario with hybrid attention enhancement.
Math Biosci Eng. 2022 Jan 10;19(3):2671-2699. doi: 10.3934/mbe.2022122.
3. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment.
Proc Conf Assoc Comput Linguist Meet. 2018 Jul;2018:2225-2235.
4. Feature-guided attention network for medical image segmentation.
Med Phys. 2023 Aug;50(8):4871-4886. doi: 10.1002/mp.16253. Epub 2023 Feb 16.
5. Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks.
Comput Intell Neurosci. 2022 Aug 9;2022:4767437. doi: 10.1155/2022/4767437. eCollection 2022.
6. Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition.
Proc ACM Int Conf Multimed. 2019 Oct;2019:157-166. doi: 10.1145/3343031.3351039.
7. Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism.
Heliyon. 2024 Feb 18;10(4):e26162. doi: 10.1016/j.heliyon.2024.e26162. eCollection 2024 Feb 29.
8. HiMul-LGG: A hierarchical decision fusion-based local-global graph neural network for multimodal emotion recognition in conversation.
Neural Netw. 2025 Jan;181:106764. doi: 10.1016/j.neunet.2024.106764. Epub 2024 Sep 28.
9. Toward attention-based learning to predict the risk of brain degeneration with multimodal medical data.
Front Neurosci. 2023 Jan 18;16:1043626. doi: 10.3389/fnins.2022.1043626. eCollection 2022.
10. Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.
IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.

Cited By

1. A plug and play fuzzy mask extraction module for single image deraining.
Sci Rep. 2025 Mar 25;15(1):10277. doi: 10.1038/s41598-025-94643-5.
2. Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition.
Proc ACM Int Conf Multimed. 2019 Oct;2019:157-166. doi: 10.1145/3343031.3351039.
3. Multimodal Attention Network for Trauma Activity Recognition from Spoken Language and Environmental Sound.
Proc (IEEE Int Conf Healthc Inform). 2019 Jun;2019. doi: 10.1109/ichi.2019.8904713. Epub 2019 Nov 21.

References

1. Multi-attention Recurrent Network for Human Communication Comprehension.
Proc AAAI Conf Artif Intell. 2018 Feb;2018:5642-5649.
2. Deep Multimodal Learning for Emotion Recognition in Spoken Language.
Proc IEEE Int Conf Acoust Speech Signal Process. 2018 Apr;2018:5079-5083. doi: 10.1109/ICASSP.2018.8462440. Epub 2018 Sep 13.
3. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment.
Proc Conf Assoc Comput Linguist Meet. 2018 Jul;2018:2225-2235.
4. Region-based Activity Recognition Using Conditional GAN.
Proc ACM Int Conf Multimed. 2017 Oct;2017:1059-1067. doi: 10.1145/3123266.3123365.
5. HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition.
IEEE Trans Pattern Anal Mach Intell. 2019 Jan;41(1):121-135. doi: 10.1109/TPAMI.2017.2781233. Epub 2017 Dec 8.
6. Combining Video, Audio and Lexical Indicators of Affect in Spontaneous Conversation via Particle Filtering.
Proc ACM Int Conf Multimodal Interact. 2012;2012:485-492. doi: 10.1145/2388676.2388781.
7. Hidden conditional random fields.
IEEE Trans Pattern Anal Mach Intell. 2007 Oct;29(10):1848-53. doi: 10.1109/TPAMI.2007.1124.