Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy.
Department of Computer Science, Mathematics and Physics, University of Udine, Via delle Scienze 206, Udine 33100, Italy.
Int J Neural Syst. 2022 Oct;32(10):2250040. doi: 10.1142/S012906572250040X. Epub 2022 Jul 25.
Human feelings expressed through verbal (e.g. voice) and non-verbal (e.g. face or body) communication channels can influence both human actions and interactions. In the literature, most attention has been devoted to facial expressions for analyzing emotions conveyed through non-verbal behavior. However, psychology highlights that the body is an important indicator of a person's affective state during daily life activities. This paper therefore presents a novel method for affective action and interaction recognition from videos that exploits multi-view representation learning and only handcrafted full-body features selected according to psychological and proxemic studies. Specifically, 2D skeletal data are extracted from RGB video sequences to derive diverse low-level skeleton features, i.e. multi-views, which are modeled through a bag-of-visual-words clustering approach that generates a condition-related codebook. In this way, each affective action and interaction within a video can be represented as a frequency histogram of codewords. During the learning phase, the training samples of each affective class are used to compute a global histogram of codewords, which is stored in a database and later used for the recognition task. In the recognition phase, the frequency histogram of a video is matched against the database of class histograms and assigned to the closest affective class in terms of Euclidean distance. The effectiveness of the proposed system is evaluated on a specifically collected dataset containing 6 emotions for both actions and interactions, on which the system obtains 93.64% and 90.83% accuracy, respectively. In addition, the devised strategy achieves performance in line with other literature works based on deep learning when tested on a public collection containing 6 emotions plus a neutral state, demonstrating the effectiveness of the presented approach and confirming the findings of psychological and proxemic studies.
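To make the pipeline described in the abstract concrete, the sketch below illustrates a generic bag-of-visual-words recognition scheme of the kind outlined: per-frame skeleton descriptors are clustered into a codebook, each video is converted into a codeword frequency histogram, a global histogram per affective class is stored during training, and a test video is assigned to the class whose histogram is closest in Euclidean distance. This is a minimal illustration built on standard numpy/scikit-learn primitives; the function names, codebook size, and normalization choices are assumptions for exposition, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_frame_features, n_codewords=64, seed=0):
    """Cluster low-level skeleton descriptors (one row per frame) into a visual codebook."""
    return KMeans(n_clusters=n_codewords, random_state=seed, n_init=10).fit(all_frame_features)

def video_histogram(frame_features, codebook):
    """Represent a video as a normalized frequency histogram of codeword assignments."""
    words = codebook.predict(frame_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def class_histograms(train_videos, train_labels, codebook):
    """Aggregate training histograms into one global histogram per affective class."""
    database = {}
    for label in sorted(set(train_labels)):
        hists = [video_histogram(v, codebook)
                 for v, y in zip(train_videos, train_labels) if y == label]
        database[label] = np.mean(hists, axis=0)
    return database

def classify(test_frame_features, codebook, database):
    """Assign the affective class whose global histogram is closest in Euclidean distance."""
    h = video_histogram(test_frame_features, codebook)
    return min(database, key=lambda label: np.linalg.norm(h - database[label]))
```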