Suppr 超能文献


Semantic-aware Video Representation for Few-shot Action Recognition.

Authors

Tang Yutao, Béjar Benjamín, Vidal René

Affiliations

Johns Hopkins University.

Paul Scherrer Institut.

Publication

IEEE Winter Conf Appl Comput Vis. 2024 Jan;2024:6444-6454. doi: 10.1109/wacv57701.2024.00633. Epub 2024 Apr 9.

DOI: 10.1109/wacv57701.2024.00633
PMID: 39171198
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11337110/
Abstract

Recent work on action recognition leverages 3D features and textual information to achieve state-of-the-art performance. However, most current few-shot action recognition methods still rely on 2D frame-level representations, often require additional components to model temporal relations, and employ complex distance functions to achieve accurate alignment of these representations. In addition, existing methods struggle to effectively integrate textual semantics: some resort to concatenation or addition of textual and visual features, and some use text merely as additional supervision, without truly achieving feature fusion and information transfer across modalities. In this work, we propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues. We show that directly leveraging a 3D feature extractor combined with an effective feature-fusion scheme, and a simple cosine similarity for classification, can yield better performance without the need for extra components for temporal modeling or complex distance functions. We introduce an innovative scheme to encode textual semantics into the video representation, which adaptively fuses features from text and video and encourages the visual encoder to extract more semantically consistent features. In this scheme, SAFSAR achieves alignment and fusion in a compact way. Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves on state-of-the-art performance.
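The classification step the abstract describes, comparing a query video's fused feature to per-class prototypes with plain cosine similarity rather than a learned or complex distance function, can be sketched as follows. This is an illustrative sketch only: the tiny 2-D feature vectors, the helper names, and the mean-prototype construction are assumptions for demonstration, not the paper's actual implementation.

```python
import math

def cosine_similarity(a, b):
    # standard cosine similarity between two equal-length feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prototype(vectors):
    # element-wise mean of a list of equal-length feature vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(query, support, labels):
    # one prototype per class: the mean of that class's support features
    # (in SAFSAR these would be fused text+video embeddings)
    protos = {c: prototype([v for v, l in zip(support, labels) if l == c])
              for c in set(labels)}
    # nearest prototype under cosine similarity -- no learned distance function
    return max(protos, key=lambda c: cosine_similarity(query, protos[c]))
```

For example, with two support embeddings per class, a query close in direction to class 0's support vectors is assigned label 0 by the nearest-prototype rule.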


Similar articles

1. Semantic-aware Video Representation for Few-shot Action Recognition.
IEEE Winter Conf Appl Comput Vis. 2024 Jan;2024:6444-6454. doi: 10.1109/wacv57701.2024.00633. Epub 2024 Apr 9.
2. Prototype Adaption and Projection for Few- and Zero-Shot 3D Point Cloud Semantic Segmentation.
IEEE Trans Image Process. 2023;32:3199-3211. doi: 10.1109/TIP.2023.3279660. Epub 2023 Jun 7.
3. Improving few-shot relation extraction through semantics-guided learning.
Neural Netw. 2024 Jan;169:453-461. doi: 10.1016/j.neunet.2023.10.053. Epub 2023 Nov 3.
4. VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search.
IEEE Trans Image Process. 2024;33:163-176. doi: 10.1109/TIP.2023.3337653. Epub 2023 Dec 8.
5. KLSANet: Key local semantic alignment Network for few-shot image classification.
Neural Netw. 2024 Oct;178:106456. doi: 10.1016/j.neunet.2024.106456. Epub 2024 Jun 10.
6. Zero-Shot Human-Object Interaction Detection via Similarity Propagation.
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17805-17816. doi: 10.1109/TNNLS.2023.3309104. Epub 2024 Dec 2.
7. Learning to Compare Relation: Semantic Alignment for Few-Shot Learning.
IEEE Trans Image Process. 2022;31:1462-1474. doi: 10.1109/TIP.2022.3142530. Epub 2022 Jan 27.
8. Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection.
Neural Netw. 2024 Apr;172:106138. doi: 10.1016/j.neunet.2024.106138. Epub 2024 Jan 19.
9. Path-based knowledge reasoning with textual semantic information for medical knowledge graph completion.
BMC Med Inform Decis Mak. 2021 Nov 29;21(Suppl 9):335. doi: 10.1186/s12911-021-01622-7.
10. Few-shot human-object interaction video recognition with transformers.
Neural Netw. 2023 Jun;163:1-9. doi: 10.1016/j.neunet.2023.01.019. Epub 2023 Feb 10.
