Soomro Khurram, Idrees Haroon, Shah Mubarak
IEEE Trans Pattern Anal Mach Intell. 2019 Feb;41(2):459-472. doi: 10.1109/TPAMI.2018.2797266. Epub 2018 Jan 23.
This paper proposes a person-centric, online approach to the challenging problem of localizing and predicting actions and interactions in videos. Typically, localization or recognition is performed offline, with all frames of the video processed together. This prevents timely localization and prediction of actions and interactions, an important consideration for many tasks including surveillance and human-machine interaction. In our approach, we estimate human poses at each frame and train discriminative appearance models using the superpixels inside the pose bounding boxes. Since per-frame pose estimation is inherently noisy, the conditional probability of pose hypotheses at the current time step (frame) is computed from the pose estimates in the current frame and their consistency with poses in previous frames. Next, both the superpixel- and pose-based foreground likelihoods are used to infer the location of actors at each time step through a Conditional Random Field that enforces spatio-temporal smoothness in color, optical flow, motion boundaries, and edges among superpixels. Visual drift is handled by updating the appearance models and refining poses using motion smoothness on joint locations, in an online manner. For online prediction of action/interaction confidences, we propose an approach based on Structural SVM that operates on short video segments and is trained with the objective that the confidence of an action or interaction increases as time passes in a positive training clip. Lastly, we quantify the performance of detection and prediction jointly, and analyze how prediction accuracy varies as a function of the observed portion of the action/interaction at different levels of detection performance. Our experiments on several datasets suggest that, despite using only a few frames to localize actions/interactions at each time instant, we obtain results competitive with state-of-the-art offline methods.
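The training objective described above, that an action's confidence should grow as more of a positive clip is observed, can be sketched in a minimal form. This is not the paper's Structural SVM solver; the linear cumulative-mean scoring model, the hinge margin, the subgradient update, and all function names here are illustrative assumptions:

```python
import numpy as np

def clip_scores(w, feats):
    """Confidence after observing segments 1..t (hypothetical linear model
    over the running mean of per-segment features)."""
    cum = np.cumsum(feats, axis=0) / np.arange(1, len(feats) + 1)[:, None]
    return cum @ w

def monotonicity_loss(w, feats, margin=0.1):
    """Hinge penalty whenever confidence fails to rise by `margin`
    between consecutive time steps of a positive clip."""
    s = clip_scores(w, feats)
    return np.maximum(0.0, margin - np.diff(s)).sum()

def train(pos_clips, dim, lr=0.05, epochs=200, margin=0.1, seed=0):
    """Subgradient descent on the monotonicity hinge (illustrative stand-in
    for the paper's Structural SVM training)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=dim)
    for _ in range(epochs):
        for feats in pos_clips:
            cum = np.cumsum(feats, axis=0) / np.arange(1, len(feats) + 1)[:, None]
            s = cum @ w
            viol = (margin - np.diff(s)) > 0        # violated step pairs
            grad = np.zeros_like(w)
            for t in np.nonzero(viol)[0]:
                grad -= cum[t + 1] - cum[t]         # push s[t+1] - s[t] upward
            w -= lr * (grad + 1e-3 * w)             # small L2 regularizer
    return w

# Synthetic positive clip: action evidence (first feature) grows over time.
feats = np.outer(np.linspace(0.0, 1.0, 5), np.array([1.0, 0.0, 0.0]))
w = train([feats], dim=3)
```

After training on such a clip, the per-step scores become (near-)monotonically increasing, mirroring the intended behavior that confidence accumulates as more of the action is observed.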