Sahu Abhimanyu, Chowdhury Ananda S
IEEE Trans Image Process. 2021;30:4330-4340. doi: 10.1109/TIP.2021.3070732. Epub 2021 Apr 16.
Analysis of egocentric video has recently drawn the attention of researchers in both the computer vision and multimedia communities. In this paper, we propose a weakly supervised superpixel-level joint framework for localization, recognition, and summarization of actions in an egocentric video. We first recognize and localize single as well as multiple action(s) in each frame of an egocentric video, and then construct a summary of these detected actions. The superpixel-level solution enables precise localization of actions in addition to improving recognition accuracy. Superpixels are extracted within the central regions of the egocentric video frames; these regions are determined through a previously developed center-surround model. A sparse spatio-temporal video representation graph is constructed in the deep feature space with the superpixels as nodes. A weakly supervised solution using random walks yields action labels for each superpixel. After determining the action label(s) for each frame from its constituent superpixels, we apply a fractional knapsack-type formulation to obtain a summary (of actions). Experimental comparisons on the publicly available ADL, GTEA, EGTEA Gaze+, EgoGesture, and EPIC-Kitchens datasets show the effectiveness of the proposed solution.
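The fractional knapsack step above can be illustrated with a minimal sketch. This is not the authors' implementation; the segment representation (label, importance score, duration) and the frame budget are assumptions made for illustration. The classic greedy strategy selects segments by score-per-frame ratio and may include the last segment only partially, which is what distinguishes the fractional variant from 0/1 knapsack.

```python
# Illustrative sketch (not the paper's code): fractional-knapsack selection
# of detected action segments for a video summary.
# Each segment is (label, importance_score, duration_in_frames); the budget
# caps the total summary length in frames.

def summarize(segments, budget):
    """Greedily pick segments by score-per-frame ratio; the last segment
    may be taken partially. Returns ([(label, frames_taken)], total_score)."""
    order = sorted(segments, key=lambda s: s[1] / s[2], reverse=True)
    chosen, total_score, remaining = [], 0.0, budget
    for label, score, duration in order:
        if remaining <= 0:
            break
        take = min(duration, remaining)          # fractional take if needed
        chosen.append((label, take))
        total_score += score * (take / duration)  # pro-rated score
        remaining -= take
    return chosen, total_score

# Hypothetical detected actions from an egocentric kitchen video:
demo = [("pour", 9.0, 30), ("stir", 4.0, 40), ("open fridge", 6.0, 20)]
print(summarize(demo, 60))
# → ([('pour', 30), ('open fridge', 20), ('stir', 10)], 16.0)
```

The greedy ratio ordering is provably optimal for the fractional variant, which makes it a natural fit for trimming detected actions to a fixed summary length.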