IEEE Trans Neural Netw Learn Syst. 2021 Feb;32(2):663-674. doi: 10.1109/TNNLS.2020.2978942. Epub 2021 Feb 4.
This article addresses group activity recognition in multiple-person scenes. To model a group activity involving multiple persons, most long short-term memory (LSTM)-based methods first learn person-level action representations with several LSTMs and then feed all of these representations into a subsequent LSTM to learn the group-level activity representation. This two-stage strategy neglects the "host-parasite" relationship between the group-level activity ("host") and the person-level actions ("parasite") in spatiotemporal space. We therefore propose a novel graph LSTM-in-LSTM (GLIL) for group activity recognition that models the person-level actions and the group-level activity simultaneously. GLIL is a "host-parasite" architecture, which can be seen as several person LSTMs (P-LSTMs) in the local view or as a graph LSTM (G-LSTM) in the global view. Specifically, the P-LSTMs model the person-level actions based on the interactions among persons. Meanwhile, the G-LSTM models the group-level activity: the person-level motion information in the multiple P-LSTMs is selectively integrated and stored into the G-LSTM according to its contribution to inferring the group activity class. Furthermore, so that GLIL takes person-level temporal features rather than person-level static features as input, we introduce a residual LSTM, whose residual connection yields person-level residual features consisting of both temporal features and static features. Experimental results on two public data sets illustrate the effectiveness of the proposed GLIL compared with state-of-the-art methods.
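The residual LSTM described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes only that the residual feature at each time step is the sum of the static input feature and the LSTM hidden state (the temporal part), and that both share the same dimension. The function names (`init_lstm`, `lstm_step`, `residual_lstm`), the weight layout, and the feature sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(dim, rng):
    # Assumed layout: one matrix holding the input, forget, output and
    # candidate gate weights, applied to the concatenation [x, h].
    W = rng.normal(scale=0.1, size=(4 * dim, 2 * dim))
    b = np.zeros(4 * dim)
    return W, b

def lstm_step(x, h, c, W, b):
    # Standard LSTM cell update for a single time step.
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def residual_lstm(static_feats, W, b):
    # static_feats: array of shape (T, dim), one static feature per frame
    # for a single person. Returns residual features of the same shape.
    T, dim = static_feats.shape
    h = np.zeros(dim)
    c = np.zeros(dim)
    out = np.empty_like(static_feats)
    for t in range(T):
        h, c = lstm_step(static_feats[t], h, c, W, b)
        # Residual connection (assumed form): static feature + temporal feature.
        out[t] = static_feats[t] + h
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 8))   # hypothetical: 10 frames, 8-dim features
W, b = init_lstm(8, rng)
residual_feats = residual_lstm(feats, W, b)
```

In this sketch, `residual_feats - feats` recovers the temporal component alone, which shows how a residual connection lets downstream layers see both kinds of information in one vector.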