Huang Peng, Yan Rui, Shu Xiangbo, Tu Zhewei, Dai Guangzhao, Tang Jinhui
IEEE Trans Image Process. 2024;33:297-309. doi: 10.1109/TIP.2023.3341297. Epub 2023 Dec 21.
Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-object" pairs between the training and testing sets. Previous works on CAR usually introduce extra information (e.g., bounding boxes) to enhance the dynamic cues of video features. However, these approaches do not essentially eliminate the inherent inductive bias in the video, which can be regarded as a stumbling block for model generalization, because video features are usually extracted from visually cluttered areas in which many objects cannot be removed or masked explicitly. To this end, this work attempts to implicitly accomplish semantic-level decoupling of "object-action" in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed DeFormer, which contains two insightful sub-modules: the Objects-Motion Decoupler (OMD) and the Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens that incorporate annotation priors to learn an instance-level representation, and then decouple it into an appearance feature and a motion feature in the high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association that constrains the decoupled appearance and motion features obtained in OMD. Extensive experiments verify the generalization ability of DeFormer. Specifically, compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while the corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Moreover, DeFormer achieves state-of-the-art results on both ground-truth and detected annotations.
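To make the "dual-contrastive association" in SDC concrete, here is a minimal, illustrative sketch of one plausible instantiation: a symmetric InfoNCE loss that pulls each video's appearance feature toward its object-text embedding and its motion feature toward its verb-text embedding, with the other samples in the batch serving as negatives. All function names, the temperature value, and the loss structure are assumptions for illustration, not the paper's exact formulation.

```python
import math

def normalize(v):
    """L2-normalize a feature vector (list of floats)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(xs, ys, temperature=0.07):
    """Symmetric InfoNCE over a batch: (xs[i], ys[i]) are positive pairs,
    all other cross-pairs in the batch are negatives."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    # Cosine-similarity logits scaled by temperature.
    logits = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
              for x in xs]

    def cross_entropy(rows):
        # Mean cross-entropy with the matched pair (diagonal) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # stabilize log-sum-exp
            lse = m + math.log(sum(math.exp(v - m) for v in row))
            total += lse - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-vision direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))

def dual_contrastive_loss(appearance, motion, obj_text, verb_text):
    """Hypothetical dual association: appearance <-> object text,
    motion <-> verb text."""
    return info_nce(appearance, obj_text) + info_nce(motion, verb_text)

# Toy usage: two videos with 2-dim features; aligned pairs give a low loss.
feats = [[1.0, 0.0], [0.0, 1.0]]
print(dual_contrastive_loss(feats, feats, feats, feats))
```

In this sketch the two InfoNCE terms play complementary roles: the object-text term anchors the appearance branch to static semantics, while the verb-text term anchors the motion branch to dynamics, which is one way a language-space constraint could encourage the decoupling described above.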