Suppr 超能文献



Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition.

Author Information

Huang Peng, Yan Rui, Shu Xiangbo, Tu Zhewei, Dai Guangzhao, Tang Jinhui

Publication Information

IEEE Trans Image Process. 2024;33:297-309. doi: 10.1109/TIP.2023.3341297. Epub 2023 Dec 21.

DOI: 10.1109/TIP.2023.3341297
PMID: 38100340
Abstract

Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-object" pairs between the training and testing sets. Previous works on CAR usually introduce extra information (e.g., bounding boxes) to enhance the dynamic cues of video features. However, these approaches do not fundamentally eliminate the inherent inductive bias in the video, which can be regarded as a stumbling block for model generalization, because video features are usually extracted from visually cluttered areas in which many objects cannot be explicitly removed or masked. To this end, this work attempts to implicitly accomplish semantic-level decoupling of "object-action" in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed DeFormer, which contains two sub-modules: an Objects-Motion Decoupler (OMD) and a Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens that incorporate annotation priors to learn an instance-level representation, and then decouple it into an appearance feature and a motion feature in the high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association that constrains the decoupled appearance and motion features obtained in OMD. Extensive experiments verify the generalization ability of DeFormer. Compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while the corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Moreover, DeFormer achieves state-of-the-art results on both ground-truth and detected annotations.
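The "dual-contrastive association" in SDC pairs appearance features with noun (object) text embeddings and motion features with verb (action) text embeddings. The paper does not give the loss in this abstract, so the sketch below assumes a standard symmetric InfoNCE-style formulation; the function names, the temperature value, and the use of NumPy arrays in place of the model's actual features are all illustrative, not the authors' implementation.

```python
import numpy as np

def info_nce(features, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss between a batch of visual features and
    their paired text embeddings (row i of each array is a matched pair)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = f @ t.T / temperature  # (B, B) cosine-similarity matrix
    idx = np.arange(len(f))         # matched pairs sit on the diagonal

    def xent(l):
        # cross-entropy of the diagonal (positive) entries, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()

    # average the feature->text and text->feature directions
    return 0.5 * (xent(logits) + xent(logits.T))

def dual_contrastive_loss(appearance, motion, noun_emb, verb_emb):
    """Sketch of SDC's dual association: appearance features align with
    noun embeddings, motion features with verb embeddings."""
    return info_nce(appearance, noun_emb) + info_nce(motion, verb_emb)
```

Under this reading, minimizing the loss pushes each decoupled appearance feature toward its object's noun embedding and each motion feature toward its action's verb embedding, which is one plausible way the text space could constrain the visual decoupling.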


Similar Articles

1
Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition.
IEEE Trans Image Process. 2024;33:297-309. doi: 10.1109/TIP.2023.3341297. Epub 2023 Dec 21.
2
Progressive Instance-Aware Feature Learning for Compositional Action Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):10317-10330. doi: 10.1109/TPAMI.2023.3261659. Epub 2023 Jun 30.
3
Learning to Recognize Actions on Objects in Egocentric Video With Attention Dictionaries.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):6674-6687. doi: 10.1109/TPAMI.2021.3058649. Epub 2023 May 5.
4
Symbiotic Attention for Egocentric Action Recognition With Object-Centric Alignment.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):6605-6617. doi: 10.1109/TPAMI.2020.3015894. Epub 2023 May 5.
5
Learnable Feature Augmentation Framework for Temporal Action Localization.
IEEE Trans Image Process. 2024;33:4002-4015. doi: 10.1109/TIP.2024.3413599. Epub 2024 Jun 28.
6
Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning.
Comput Intell Neurosci. 2021 Jun 9;2021:9922697. doi: 10.1155/2021/9922697. eCollection 2021.
7
Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
JMIR Med Inform. 2021 May 26;9(5):e23099. doi: 10.2196/23099.
8
Zero-Shot Human-Object Interaction Detection via Similarity Propagation.
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17805-17816. doi: 10.1109/TNNLS.2023.3309104. Epub 2024 Dec 2.
9
TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12844-12861. doi: 10.1109/TPAMI.2022.3229526. Epub 2023 Oct 3.
10
Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection.
Int J Neural Syst. 2023 Jul;33(7):2350035. doi: 10.1142/S0129065723500351. Epub 2023 Jun 14.

Cited By

1
Linguistic-Driven Partial Semantic Relevance Learning for Skeleton-Based Action Recognition.
Sensors (Basel). 2024 Jul 26;24(15):4860. doi: 10.3390/s24154860.