Luo Hongchen, Zhai Wei, Zhang Jing, Cao Yang, Tao Dacheng
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16857-16871. doi: 10.1109/TNNLS.2023.3298638. Epub 2024 Oct 29.
Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which benefits many applications, such as robot grasping and action recognition. Prevailing methods predominantly depend on the appearance features of objects to segment each region of the image, which encounters the following two problems: 1) there are multiple possible regions in an object that people interact with and 2) there are multiple possible human interactions in the same object region. To address these problems, we propose a hand-aided affordance grounding network (HAG-Net) that leverages the aided clues provided by the position and action of the hand in demonstration videos to resolve these ambiguities and better locate the interaction regions in the object. Specifically, HAG-Net adopts a dual-branch structure to process the demonstration video and object image data. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use a long short-term memory (LSTM) network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) to make the network focus on different parts of the object according to the action classes, and we utilize a distillation loss to align the output features of the object branch with those of the video branch, transferring the knowledge from the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code is available at https://github.com/lhc1224/HAG-Net.
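The sketch below illustrates the dual-branch idea described in the abstract: a video branch that weights frame features around the hand and aggregates them with an LSTM, an object branch that modulates image features by action class (SEM-like), and a distillation loss aligning the two. All module names, feature sizes, and the attention/modulation formulations here are assumptions for illustration, not the authors' implementation; consult https://github.com/lhc1224/HAG-Net for the official code.

```python
# Minimal PyTorch sketch under assumed shapes and modules (not the official HAG-Net code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoBranch(nn.Module):
    """Encodes demonstration frames, emphasizes hand regions, aggregates with an LSTM."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.frame_encoder = nn.Sequential(  # stand-in for a CNN backbone
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(7),
        )
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames, hand_masks):
        # frames: (B, T, 3, H, W); hand_masks: (B, T, 1, H, W) soft maps around the hand
        B, T = frames.shape[:2]
        feats = []
        for t in range(T):
            f = self.frame_encoder(frames[:, t])                    # (B, C, 7, 7)
            m = F.interpolate(hand_masks[:, t], size=f.shape[-2:])  # hand-aided attention:
            f = f * (1.0 + m)                                       # boost features near the hand
            feats.append(f.flatten(2).mean(-1))                     # global pooling -> (B, C)
        seq = torch.stack(feats, dim=1)                             # (B, T, C)
        _, (h, _) = self.lstm(seq)                                  # aggregate action dynamics
        return h[-1]                                                # (B, C) video/action feature


class ObjectBranch(nn.Module):
    """Encodes the object image and modulates features by the action class (SEM-like)."""

    def __init__(self, feat_dim=512, num_actions=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.action_embed = nn.Embedding(num_actions, feat_dim)  # semantic enhancement
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)        # affordance heatmap

    def forward(self, image, action_id):
        f = self.encoder(image)                               # (B, C, h, w)
        a = self.action_embed(action_id)[:, :, None, None]    # (B, C, 1, 1)
        f = f * torch.sigmoid(a)                              # focus on action-relevant parts
        obj_feat = f.flatten(2).mean(-1)                      # (B, C) for distillation
        heatmap = self.head(f)                                # (B, 1, h, w) affordance prediction
        return heatmap, obj_feat


def distillation_loss(obj_feat, video_feat):
    # Align object-branch features with (detached) video-branch features
    # so that knowledge learned from demonstrations transfers to the object branch.
    return F.mse_loss(obj_feat, video_feat.detach())
```

In a training loop one would sum the affordance supervision on the heatmap with this distillation term; at inference only the object branch is needed, which is the practical point of transferring the video knowledge.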