Zadeh Amir, Liang Paul Pu, Poria Soujanya, Vij Prateek, Cambria Erik, Morency Louis-Philippe
Carnegie Mellon University, USA.
NTU, Singapore.
Proc AAAI Conf Artif Intell. 2018 Feb;2018:5642-5649.
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality), and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication; however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape the communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition, and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
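To make the two components named in the abstract concrete, the following is a minimal PyTorch sketch of the general idea: a multi-attention block applies several attention distributions over the concatenated per-modality states to produce a cross-modal code, which is then fed back into a per-modality recurrent cell alongside its own input. All class names, dimensions, and the exact update equations here are illustrative assumptions, not the paper's formulation; MARN's actual MAB and LSTHM equations differ in detail.

```python
import torch
import torch.nn as nn

class MultiAttentionBlockSketch(nn.Module):
    """Illustrative sketch (not the paper's exact MAB): K attention
    distributions over the concatenated modality states yield a
    cross-modal code z_t."""
    def __init__(self, dim_total: int, num_attentions: int, code_dim: int):
        super().__init__()
        self.num_attentions = num_attentions
        self.attn = nn.Linear(dim_total, num_attentions * dim_total)
        self.reduce = nn.Linear(num_attentions * dim_total, code_dim)

    def forward(self, h_cat: torch.Tensor) -> torch.Tensor:
        # h_cat: (batch, dim_total) = concat of language/vision/acoustic states
        scores = self.attn(h_cat).view(-1, self.num_attentions, h_cat.size(-1))
        weights = torch.softmax(scores, dim=-1)          # K attention maps
        attended = weights * h_cat.unsqueeze(1)          # (batch, K, dim_total)
        return torch.tanh(self.reduce(attended.flatten(1)))  # cross-modal code z_t

class HybridMemoryCellSketch(nn.Module):
    """Illustrative per-modality recurrent cell (stand-in for LSTHM):
    a standard LSTM cell whose input is augmented with the shared
    cross-modal code z_t, so cross-modal information enters the memory."""
    def __init__(self, input_dim: int, hidden_dim: int, code_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + code_dim, hidden_dim)

    def forward(self, x_t, z_t, state):
        return self.cell(torch.cat([x_t, z_t], dim=-1), state)

# Toy usage with assumed dimensions: three modalities, hidden size 32 each.
if __name__ == "__main__":
    batch, hid, code = 4, 32, 16
    mab = MultiAttentionBlockSketch(dim_total=3 * hid, num_attentions=4, code_dim=code)
    cells = [HybridMemoryCellSketch(input_dim=8, hidden_dim=hid, code_dim=code)
             for _ in range(3)]
    states = [(torch.zeros(batch, hid), torch.zeros(batch, hid)) for _ in range(3)]
    x_t = [torch.randn(batch, 8) for _ in range(3)]       # one input per modality
    z_t = mab(torch.cat([h for h, _ in states], dim=-1))   # cross-modal code
    states = [cell(x, z_t, s) for cell, x, s in zip(cells, x_t, states)]
```

In this sketch the cross-modal code is recomputed at every time step and injected into each modality's memory update, mirroring (at a high level) how the abstract describes the MAB discovering interactions through time and the LSTHM storing them.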