Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
Department of Computer Science, Tsinghua University.
School of Computer Science, Carnegie Mellon University.
Proceedings of the AAAI Conference on Artificial Intelligence, 33(1):7216-7223, July 2019.
Humans convey their intentions through both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on nonverbal context, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to consider not only the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first build expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we capture the dynamic nature of nonverbal intents by shifting word representations according to the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN), which models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding the multimodal variation of word representations.
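The central idea in the abstract, perturbing a word's embedding by a gated combination of the accompanying acoustic and visual features, can be sketched in a few lines. This is a minimal illustration under assumed names (`shifted_word_embedding`, `Wa`, `Wv`, `Wg`, `alpha` are all hypothetical), not the authors' exact RAVEN formulation; the paper itself should be consulted for the actual attention gating and subword modeling.

```python
import math

def sigmoid(x):
    # standard logistic function
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # multiply a matrix (list of rows) by a vector
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shifted_word_embedding(h_word, h_acoustic, h_visual, Wa, Wv, Wg, alpha=0.1):
    """Shift a word embedding using its accompanying nonverbal features.

    Hypothetical sketch: a gate computed from the concatenated multimodal
    features modulates a projection of the acoustic and visual features,
    and the resulting shift vector is added to the word embedding.
    Parameter names and the scaling scheme are assumptions, not the
    authors' published equations.
    """
    concat = h_word + h_acoustic + h_visual
    g = [sigmoid(v) for v in matvec(Wg, concat)]          # attention-like gate
    shift = [gi * (ai + vi) for gi, ai, vi in
             zip(g, matvec(Wa, h_acoustic), matvec(Wv, h_visual))]
    # scale the shift so nonverbal cues perturb, rather than dominate,
    # the literal word meaning
    norm_w = math.sqrt(sum(v * v for v in h_word))
    norm_s = math.sqrt(sum(v * v for v in shift)) or 1.0
    return [w + alpha * (norm_w / norm_s) * s for w, s in zip(h_word, shift)]
```

With `alpha=0` the word embedding passes through unchanged, so the nonverbal shift is an additive, tunable perturbation on top of the literal word representation.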