Wang Zhenhua, Ge Jinchao, Guo Dongyan, Zhang Jianhua, Lei Yanjing, Chen Shengyong
IEEE Trans Image Process. 2021;30:6240-6254. doi: 10.1109/TIP.2021.3093383. Epub 2021 Jul 12.
Human interaction understanding involves both recognizing the action of each individual in a scene and decoding the interaction relationships among people, which is useful for vision applications such as camera surveillance, video-based sports analysis, and event retrieval. This paper divides the task into two problems, grouping people into clusters and assigning labels to each of them, and presents an approach that solves both jointly. Our method does not assume that the number of groups is known beforehand, since such an assumption would substantially restrict its applicability. Observing that the two problems are highly correlated, we model the pairwise interacting relations among people with a complete graph and an associated energy function, so that labeling and grouping reduce to minimizing that energy. We implement this joint framework by fusing deep features with rich contextual cues, and we learn the fusion parameters from data. An alternating search algorithm is developed to solve the associated inference problem efficiently. Combining the grouping and labeling results yields a semantic-level understanding of human interactions. Extensive qualitative and quantitative experiments show that our approach outperforms state-of-the-art methods on several important benchmarks, and an ablation study verifies the effectiveness of the individual modules.
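The joint labeling and grouping idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the energy terms `U` (per-person label costs), `P` (label compatibility costs for interacting pairs), and `D` (pairwise context costs, for example derived from distance), the edge-selection rule, and all numbers are assumptions made for illustration, and the paper's learned fusion of deep features is replaced by hand-set values. Each pair of people in the complete graph carries a binary variable saying whether they interact; the search alternates between updating those variables with labels fixed and updating labels with the variables fixed. Groups are read off as connected components, so their number is never specified in advance.

```python
import numpy as np

def energy(y, z, U, P, D):
    """Toy energy: unary label costs plus pairwise costs on active edges."""
    n = len(y)
    e = sum(U[i, y[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if z[i, j]:
                e += P[y[i], y[j]] + D[i, j]
    return float(e)

def alternating_search(U, P, D, max_iters=20):
    """Alternate between edge updates and label updates until nothing changes."""
    n = U.shape[0]
    y = U.argmin(axis=1)                  # start from the best unary labels
    z = np.zeros((n, n), dtype=bool)      # start with no interactions
    for _ in range(max_iters):
        # Step 1: labels fixed. An edge is switched on only if it lowers the energy.
        upper = np.triu(P[np.ix_(y, y)] + D < 0, k=1)
        z_new = upper | upper.T
        # Step 2: edges fixed. Relabel one person at a time (coordinate descent).
        y_new = y.copy()
        for i in range(n):
            nbrs = np.flatnonzero(z_new[i])
            cost = U[i] + P[:, y_new[nbrs]].sum(axis=1)
            y_new[i] = cost.argmin()
        if np.array_equal(y_new, y) and np.array_equal(z_new, z):
            break
        y, z = y_new, z_new
    return y, z

def groups_from_edges(z):
    """Group ids are the connected components of the interaction graph."""
    n = z.shape[0]
    group = [-1] * n
    next_id = 0
    for start in range(n):
        if group[start] != -1:
            continue
        group[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(z[i]):
                if group[j] == -1:
                    group[j] = next_id
                    stack.append(j)
        next_id += 1
    return group

# Hypothetical toy scene: 4 people, labels 0 = "handshake", 1 = "no interaction".
U = np.array([[0.2, 0.5], [0.6, 0.4], [0.9, 0.1], [0.8, 0.2]])
P = np.array([[-1.0, 0.3], [0.3, 1.0]])   # symmetric label compatibility
D = np.full((4, 4), 0.5)                  # far-apart pairs are costly to link
D[0, 1] = D[1, 0] = -0.5                  # persons 0 and 1 stand close together
np.fill_diagonal(D, 0.0)

labels, edges = alternating_search(U, P, D)
print("labels:", labels.tolist())
print("groups:", groups_from_edges(edges))
print("energy:", energy(labels, edges, U, P, D))
```

In this toy scene, person 1 would be labeled "no interaction" on unary evidence alone, but the context from nearby person 0 flips the label once the pair is linked. Each step can only lower or keep the toy energy, so the loop terminates, though like any alternating scheme it may stop at a local minimum.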