Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, USA.
IEEE Trans Pattern Anal Mach Intell. 2013 Jul;35(7):1635-48. doi: 10.1109/TPAMI.2012.253.
This paper proposes a novel representation of articulated human actions, gestures, and facial expressions. The main goals of the proposed approach are: 1) to enable recognition using very few examples, i.e., one-shot or k-shot learning, and 2) to organize unlabeled datasets meaningfully through unsupervised clustering. The proposed representation is obtained by automatically discovering high-level subactions, or motion primitives, via hierarchical clustering of observed optical flow in a four-dimensional space of spatial position and motion flow. In contrast to state-of-the-art representations such as bag of video words, the proposed method is completely unsupervised yet yields a meaningful representation conducive to visual interpretation and textual labeling. Each primitive action depicts an atomic subaction, such as the directional motion of a limb or the torso, and is represented by a mixture of four-dimensional Gaussian distributions. For one-shot and k-shot learning, the sequence of primitive labels discovered in a test video is assigned using KL divergence; this sequence can then be represented as a string and matched against similar strings from training videos. The same sequence can also be collapsed into a histogram of primitives or used to learn a hidden Markov model representing each class. We have performed extensive experiments on recognition by one-shot and k-shot learning, as well as unsupervised action clustering, on six human action and gesture datasets, a composite dataset, and a facial expression database. These experiments confirm the validity and discriminative power of the proposed representation.
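The following is a minimal sketch, not the authors' implementation, of the pipeline the abstract describes: optical flow is turned into four-dimensional (x, y, u, v) points, primitives are modeled as a mixture of 4-D Gaussians (a flat GMM stands in here for the paper's hierarchical clustering), test frames are labeled by KL divergence against the primitive Gaussians, and a video becomes a label string or a primitive histogram. All function names, the choice of scikit-learn, and the parameter values are assumptions for illustration only.

```python
# Hypothetical sketch of the primitive-discovery and labeling pipeline;
# not the authors' code.
import numpy as np
from sklearn.mixture import GaussianMixture


def flow_to_features(flow, xs, ys):
    """Stack pixel positions and optical-flow vectors into 4-D points (x, y, u, v)."""
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    return np.stack([xs.ravel(), ys.ravel(), u, v], axis=1)


def discover_primitives(train_features, n_primitives=20, seed=0):
    """Fit a mixture of 4-D Gaussians to pooled training flow features.

    The paper uses hierarchical clustering; a flat GMM is used here for brevity.
    """
    gmm = GaussianMixture(n_components=n_primitives, covariance_type="full",
                          random_state=seed)
    gmm.fit(train_features)
    return gmm


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))


def label_frame(frame_features, gmm):
    """Assign the primitive whose Gaussian is closest in KL divergence to this frame's flow."""
    mu0 = frame_features.mean(axis=0)
    cov0 = np.cov(frame_features, rowvar=False) + 1e-6 * np.eye(frame_features.shape[1])
    kls = [gaussian_kl(mu0, cov0, gmm.means_[k], gmm.covariances_[k])
           for k in range(gmm.n_components)]
    return int(np.argmin(kls))


def video_to_string(frames_features, gmm):
    """A video becomes a string of primitive labels, usable for string matching."""
    return [label_frame(f, gmm) for f in frames_features]


def video_to_histogram(labels, n_primitives):
    """Collapse a label sequence into a normalized histogram of primitives."""
    hist = np.bincount(labels, minlength=n_primitives).astype(float)
    return hist / max(hist.sum(), 1.0)
```

For one-shot or k-shot recognition, a test video's label string (or histogram) would then be matched against those of the few labeled training videos, or the sequences could be used to train a hidden Markov model per class.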