IEEE Trans Image Process. 2016 Jun;25(6):2856-2865. doi: 10.1109/TIP.2016.2556940. Epub 2016 Apr 20.
This paper addresses the problem of recognizing human actions from RGB-D videos. A discriminative relational feature learning method is proposed for fusing the heterogeneous RGB and depth modalities and classifying the actions in RGB-D sequences. Our method factorizes the feature matrix of each modality and enforces the same semantics on the factors, so that shared features are learned from the multimodal data. This allows us to capture the complex correlations between the two modalities. To improve the discriminative power of the relational features, we introduce a hinge loss that penalizes misclassification when the features are used for recognition. This essentially performs supervised factorization and learns discriminative features optimized for classification. We formulate the recognition task within a maximum margin framework and solve it with a coordinate descent algorithm. The proposed method is extensively evaluated on two public RGB-D action data sets. We demonstrate that it learns extremely low-dimensional features with superior discriminative power and outperforms the state-of-the-art methods. It also achieves high performance when one modality is missing at training or test time.
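The pipeline the abstract describes, factorizing each modality's feature matrix against a shared low-dimensional representation, adding a hinge loss on that representation, and optimizing by block coordinate descent, can be sketched on toy data. This is a minimal illustration only: the dimensions, regularization weights, and update rules below are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for RGB and depth feature matrices (columns are videos).
n, d_rgb, d_dep, k = 40, 20, 15, 3            # samples, feature dims, shared dim
X_rgb = rng.normal(size=(d_rgb, n))
X_dep = rng.normal(size=(d_dep, n))
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)  # illustrative binary labels

W_rgb = rng.normal(scale=0.1, size=(d_rgb, k))  # per-modality bases
W_dep = rng.normal(scale=0.1, size=(d_dep, k))
H = rng.normal(scale=0.1, size=(k, n))          # shared low-dim features
w = np.zeros(k)                                 # max-margin classifier on H
lam, C, lr = 0.1, 1.0, 1e-3                     # regularizer, hinge weight, step
I = np.eye(k)

def objective(W_rgb, W_dep):
    """Reconstruction error of both modalities + regularized hinge loss."""
    rec = (np.linalg.norm(X_rgb - W_rgb @ H) ** 2
           + np.linalg.norm(X_dep - W_dep @ H) ** 2)
    hinge = np.maximum(0.0, 1.0 - y * (w @ H)).sum()
    return rec + lam * (w @ w) + C * hinge

for _ in range(50):
    # Block 1: per-modality bases, closed-form least squares given H.
    G = np.linalg.inv(H @ H.T + 1e-6 * I)
    W_rgb = X_rgb @ H.T @ G
    W_dep = X_dep @ H.T @ G
    # Block 2: shared features H -- ridge solution for the reconstruction
    # terms, then a subgradient step for the hinge term.
    A = W_rgb.T @ W_rgb + W_dep.T @ W_dep + 1e-6 * I
    H = np.linalg.solve(A, W_rgb.T @ X_rgb + W_dep.T @ X_dep)
    active = (y * (w @ H)) < 1
    H += lr * C * np.outer(w, y * active)   # push active samples past margin
    # Block 3: classifier w, subgradient step on the regularized hinge loss.
    active = (y * (w @ H)) < 1
    w -= lr * (2 * lam * w - C * (H * (y * active)).sum(axis=1))
```

The alternating structure mirrors the abstract's coordinate descent: each block (bases, shared features, classifier) is updated with the others held fixed, and the hinge term couples the factorization to the classifier so the shared features become discriminative rather than purely reconstructive.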