Tang Yutao, Béjar Benjamín, Vidal René
Johns Hopkins University.
Paul Scherrer Institut.
IEEE Winter Conf Appl Comput Vis. 2024 Jan;2024:6444-6454. doi: 10.1109/wacv57701.2024.00633. Epub 2024 Apr 9.
Recent work on action recognition leverages 3D features and textual information to achieve state-of-the-art performance. However, most current few-shot action recognition methods still rely on 2D frame-level representations, often require additional components to model temporal relations, and employ complex distance functions to achieve accurate alignment of these representations. In addition, existing methods struggle to effectively integrate textual semantics: some resort to concatenating or adding textual and visual features, while others use text merely as additional supervision, without truly achieving feature fusion and information transfer across modalities. In this work, we propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues. We show that directly leveraging a 3D feature extractor, combined with an effective feature-fusion scheme and a simple cosine similarity for classification, can yield better performance without the need for extra components for temporal modeling or complex distance functions. We introduce an innovative scheme that encodes textual semantics into the video representation, adaptively fusing features from text and video and encouraging the visual encoder to extract more semantically consistent features. Through this scheme, SAFSAR achieves alignment and fusion in a compact way. Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves upon the state of the art.
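To make the two ingredients named in the abstract concrete, the sketch below illustrates (a) one plausible form of adaptive text-video feature fusion (a learned gate) and (b) prototype-based few-shot classification with plain cosine similarity. All names, shapes, the gating mechanism, and the mean-pooled prototypes are illustrative assumptions for a generic few-shot pipeline; the abstract does not specify SAFSAR's actual fusion or classification details, so this should not be read as the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """One plausible form of adaptive text-video fusion (an assumption,
    not the paper's published architecture): a learned gate decides, per
    dimension, how much of the text feature to blend into the video feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Gate in (0, 1) computed from both modalities, then a convex blend.
        g = torch.sigmoid(self.gate(torch.cat([video_feat, text_feat], dim=-1)))
        return g * video_feat + (1.0 - g) * text_feat

def cosine_classify(support_feats: torch.Tensor, query_feats: torch.Tensor) -> torch.Tensor:
    """Few-shot classification by cosine similarity to class prototypes.

    support_feats: (n_way, k_shot, d) fused features for the support set
    query_feats:   (n_query, d)       fused features for the query videos
    """
    # Average k-shot examples into one prototype per class, then L2-normalize
    # so the dot product equals cosine similarity.
    prototypes = F.normalize(support_feats.mean(dim=1), dim=-1)  # (n_way, d)
    queries = F.normalize(query_feats, dim=-1)                   # (n_query, d)
    logits = queries @ prototypes.T                              # (n_query, n_way)
    return logits.argmax(dim=-1)                                 # predicted class ids

# Toy 5-way 1-shot episode with 512-d features (random stand-ins for
# 3D-backbone video features and text-encoder features).
fuse = GatedFusion(512)
support = fuse(torch.randn(5, 1, 512), torch.randn(5, 1, 512))
queries = fuse(torch.randn(8, 512), torch.randn(8, 512))
print(cosine_classify(support, queries))
```

The point of the sketch is the abstract's claim about simplicity: once text and video features are fused into a single representation, classification reduces to normalized dot products against class prototypes, with no separate temporal-alignment module or learned distance function.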