Graduate School of Artificial Intelligence, Kyungpook National University, Daegu, 41566, South Korea.
KNU-LG Electronics Convergence Research Center, AI Institute of Technology, Kyungpook National University, Daegu, 41566, South Korea.
Neural Netw. 2022 Sep;153:518-529. doi: 10.1016/j.neunet.2022.06.032. Epub 2022 Jun 30.
Temporal action proposal generation aims to produce temporal boundaries that contain action instances. In real-time applications such as surveillance cameras, autonomous driving, and traffic monitoring, the online localization and recognition of human activities occurring within short temporal intervals is an important area of research. Existing approaches to temporal action proposal generation operate offline and consider only frame-level feature aggregation along the temporal dimension. These offline methods also generate many redundant, irrelevant proposal regions as temporal boundaries, which leads to high computational cost and slow processing speed, making them unsuitable for online tasks. In this study, we propose a novel spatio-temporal attention network for online action proposal generation, in contrast to existing offline proposal generation methods. Our approach exploits the inter-dependency between the spatial and temporal context of each incoming video clip to generate more relevant online temporal action proposals. First, we propose a windowed spatial attention module to capture the inter-spatial relationships among the features of incoming frames. This windowed spatial network produces a more robust clip-level feature representation and efficiently handles noisy features such as occlusions or background scenes. Second, we introduce a temporal attention module that captures temporal dynamics complementary to the localized spatial information, modeling long-range inter-frame relationships, since most real-life online videos are untrimmed in nature. By applying these two attention modules sequentially, the proposed spatio-temporal network generates precise action boundaries at a particular instant of time.
In addition, the model generates fewer, more discriminative temporal action proposals while maintaining the low computational cost and high processing speed required for online settings.
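The sequential design described in the abstract, windowed spatial attention within each frame followed by temporal attention across frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the shapes, window size, and single-head attention here are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., L, D) -- plain scaled dot-product self-attention over axis L.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def windowed_spatial_attention(clip, window):
    # clip: (T, N, D) -- T frames, N spatial tokens per frame, D channels.
    # Tokens attend only within non-overlapping spatial windows,
    # localizing attention and suppressing distant background tokens.
    T, N, D = clip.shape
    assert N % window == 0, "window must divide the number of spatial tokens"
    x = clip.reshape(T, N // window, window, D)
    x = self_attention(x)          # attention restricted to each window
    return x.reshape(T, N, D)

def temporal_attention(clip):
    # Attend across frames at each spatial position: view as (N, T, D).
    x = np.swapaxes(clip, 0, 1)
    x = self_attention(x)          # long-range inter-frame attention
    return np.swapaxes(x, 0, 1)

def spatio_temporal_block(clip, window=4):
    # Sequential application: spatial first, then temporal, as in the abstract.
    return temporal_attention(windowed_spatial_attention(clip, window))
```

Applying the block to a clip tensor of shape `(T, N, D)` returns a tensor of the same shape, so it can feed a downstream proposal head directly.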