Animated pose templates for modeling and detecting human actions.

Affiliations

University of California, Los Angeles.

Microsoft Research, Redmond.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2014 Mar;36(3):436-52. doi: 10.1109/TPAMI.2013.144.

Abstract

This paper presents animated pose templates (APTs) for detecting short-term, long-term, and contextual actions from cluttered scenes in videos. Each pose template consists of two components: 1) a shape template with deformable parts represented in an And-node, whose appearances are encoded by Histogram of Oriented Gradient (HOG) features, and 2) a motion template specifying the motion of the parts by Histogram of Optical-Flows (HOF) features. A shape template may have more than one motion template, represented by an Or-node. Therefore, each action is defined as a mixture (Or-node) of pose templates in an And-Or tree structure. While this pose template is suitable for detecting short-term action snippets in two to five frames, we extend it in two ways: 1) for long-term actions, we animate the pose templates by adding temporal constraints in a Hidden Markov Model (HMM), and 2) for contextual actions, we treat contextual objects as additional parts of the pose templates and add constraints that encode spatial correlations between parts. To train the model, we manually annotate part locations on several keyframes of each video and cluster them into pose templates using EM. This leaves the unknown parameters for our learning algorithm in two groups: 1) latent variables for the unannotated frames, including pose-IDs and part locations, and 2) model parameters shared by all training samples, such as weights for HOG and HOF features, canonical part locations of each pose, and coefficients penalizing pose transitions and part deformations. To learn these parameters, we introduce a semi-supervised structural SVM algorithm that iterates between two steps: 1) learning (updating) model parameters using labeled data by solving a structural SVM optimization, and 2) imputing missing variables (i.e., detecting actions on unlabeled frames) with parameters learned from the previous step, progressively accepting high-score frames as newly labeled examples.
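The per-template score described above combines HOG appearance evidence, HOF motion evidence, and a penalty for parts deviating from their canonical locations. A minimal sketch of such a score follows; the function name, array shapes, and the simple quadratic deformation penalty are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def pose_template_score(hog_feats, hof_feats, w_hog, w_hof,
                        part_locs, canon_locs, deform_coef):
    """Illustrative unary score for one pose template (And-node).

    hog_feats, hof_feats  : (P, D) per-part HOG / HOF feature vectors
    w_hog, w_hof          : (P, D) learned appearance / motion weights
    part_locs, canon_locs : (P, 2) detected vs. canonical part positions
    deform_coef           : scalar coefficient penalizing part deformation
    """
    appearance = np.sum(hog_feats * w_hog)   # shape (HOG) evidence
    motion = np.sum(hof_feats * w_hof)       # motion (HOF) evidence
    deform = deform_coef * np.sum((part_locs - canon_locs) ** 2)
    return appearance + motion - deform
```

In a full model the deformation term would be part-specific and learned jointly with the feature weights inside the structural SVM; here a single shared coefficient keeps the sketch short.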
This algorithm belongs to a family of optimization methods known as the Concave-Convex Procedure (CCCP), which converge to a locally optimal solution. The inference algorithm consists of two components: 1) detecting top candidates for the pose templates, and 2) computing the sequence of pose templates. Both are done by dynamic programming or, more precisely, beam search. In experiments, we demonstrate that this method is capable of discovering salient poses of actions as well as interactions with contextual objects. We test our method on several public action data sets and a challenging outdoor contextual action data set that we collected ourselves. The results show that our model achieves performance comparable to or better than state-of-the-art methods.
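The second inference component, computing a sequence of pose templates under HMM-style transition penalties, can be sketched as a Viterbi-like dynamic program with beam pruning. The function name, score conventions, and matrix shapes below are hypothetical, chosen only to illustrate the idea:

```python
def best_pose_sequence(unary, transition, beam_width=3):
    """Beam-pruned dynamic programming over pose-template sequences.

    unary      : T x K list, unary[t][k] = detection score of pose k at frame t
    transition : K x K list, transition[i][j] = penalty for switching pose i -> j
    Returns (best_score, pose_id_sequence).
    """
    T, K = len(unary), len(unary[0])
    # Each beam entry is (accumulated score, pose-ID path so far).
    beam = sorted(((unary[0][k], [k]) for k in range(K)), reverse=True)[:beam_width]
    for t in range(1, T):
        candidates = []
        for score, path in beam:
            prev = path[-1]
            for k in range(K):
                candidates.append(
                    (score + unary[t][k] - transition[prev][k], path + [k]))
        # Keep only the top-scoring partial sequences (beam pruning).
        beam = sorted(candidates, reverse=True)[:beam_width]
    return beam[0]
```

With a beam width of at least K, this reduces to exact Viterbi decoding; a narrower beam trades optimality for speed, which is the trade-off the abstract alludes to with "dynamic programming or, more precisely, beam search."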
