IEEE Trans Pattern Anal Mach Intell. 2022 Feb;44(2):648-665. doi: 10.1109/TPAMI.2021.3107160. Epub 2022 Jan 7.
Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations that compactly capture such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK⊕, that operates on subsequences to capture the local-global interplay of correlations, and which can incorporate multi-modal inputs, e.g., 3D skeleton body joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce linearizations of these kernels that lead to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships between features, they relate to co-occurrences for robust fine-grained recognition (Lin, 2017; Koniusz, 2018). We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which has long been speculated to perform spectral detection of higher-order occurrences (Koniusz, 2013; Koniusz, 2017), thus detecting fine-grained relationships between features rather than merely counting features in action sequences. We prove that a tensor of order r, built from Z-dimensional features and coupled with EPN, indeed detects whether at least one higher-order occurrence is 'projected' into one of its [Formula: see text] subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with [Formula: see text] such 'detectors'.
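To make the role of EPN concrete, here is a minimal NumPy sketch of the second-order case (r = 2): per-frame Z-dimensional features are aggregated into an autocorrelation matrix, and its eigenvalues are raised to a power 0 < γ ≤ 1, which flattens the spectrum so the descriptor behaves more like a detector of co-occurring feature directions than a counter of their frequency. The function name, parameters, and the choice of a simple autocorrelation aggregate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def epn_descriptor(features, gamma=0.5):
    """Illustrative second-order (r = 2) Eigenvalue Power Normalization.

    features : (N, Z) array of Z-dimensional per-frame features.
    gamma    : power in (0, 1]; smaller values flatten the spectrum more.
    Returns a (Z, Z) power-normalized second-order descriptor.
    """
    X = np.asarray(features, dtype=float)
    M = X.T @ X / len(X)                 # order-2 occurrence tensor (Z x Z)
    w, V = np.linalg.eigh(M)             # eigendecomposition of the symmetric M
    w = np.clip(w, 0.0, None) ** gamma   # power-normalize the eigenvalue spectrum
    return (V * w) @ V.T                 # reassemble the normalized descriptor
```

With gamma = 1 the function returns the plain autocorrelation matrix; decreasing gamma progressively equalizes the contribution of dominant and rare co-occurrence subspaces, which is the "spectral detection" behavior the abstract refers to. The paper's full construction extends this idea to tensors of order r > 2.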