Luo Hongchen, Zhai Wei, Zhang Jing, Cao Yang, Tao Dacheng
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16857-16871. doi: 10.1109/TNNLS.2023.3298638. Epub 2024 Oct 29.
Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which benefits many applications, such as robot grasping and action recognition. Prevailing methods predominantly depend on the appearance features of objects to segment each region of the image, which encounters the following two problems: 1) there are multiple possible regions in an object that people interact with and 2) there are multiple possible human interactions in the same object region. To address these problems, we propose a hand-aided affordance grounding network (HAG-Net) that leverages the aided clues provided by the position and action of the hand in demonstration videos to resolve these ambiguities and better locate the interaction regions in the object. Specifically, HAG-Net adopts a dual-branch structure to process the demonstration video and object image data. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use a long short-term memory (LSTM) network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) to make the network focus on different parts of the object according to the action classes, and we utilize a distillation loss to align the output features of the object branch with those of the video branch, transferring the knowledge from the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code is available at https://github.com/lhc1224/HAG-Net.
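The sketch below illustrates the dual-branch idea described in the abstract: a video branch that weights frame features around the hand and aggregates them with an LSTM, an object branch that modulates image features by action class (SEM-like), and a distillation loss aligning the two. All module names, feature sizes, and the attention/modulation formulations here are assumptions for illustration, not the authors' implementation; consult https://github.com/lhc1224/HAG-Net for the official code.

```python
# Minimal PyTorch sketch under assumed shapes and modules (not the official HAG-Net code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoBranch(nn.Module):
    """Encodes demonstration frames, emphasizes hand regions, aggregates with an LSTM."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.frame_encoder = nn.Sequential(  # stand-in for a CNN backbone
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(7),
        )
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames, hand_masks):
        # frames: (B, T, 3, H, W); hand_masks: (B, T, 1, H, W) soft maps around the hand
        B, T = frames.shape[:2]
        feats = []
        for t in range(T):
            f = self.frame_encoder(frames[:, t])                    # (B, C, 7, 7)
            m = F.interpolate(hand_masks[:, t], size=f.shape[-2:])  # hand-aided attention:
            f = f * (1.0 + m)                                       # boost features near the hand
            feats.append(f.flatten(2).mean(-1))                     # global pooling -> (B, C)
        seq = torch.stack(feats, dim=1)                             # (B, T, C)
        _, (h, _) = self.lstm(seq)                                  # aggregate action dynamics
        return h[-1]                                                # (B, C) video/action feature


class ObjectBranch(nn.Module):
    """Encodes the object image and modulates features by the action class (SEM-like)."""

    def __init__(self, feat_dim=512, num_actions=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.action_embed = nn.Embedding(num_actions, feat_dim)  # semantic enhancement
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)        # affordance heatmap

    def forward(self, image, action_id):
        f = self.encoder(image)                               # (B, C, h, w)
        a = self.action_embed(action_id)[:, :, None, None]    # (B, C, 1, 1)
        f = f * torch.sigmoid(a)                              # focus on action-relevant parts
        obj_feat = f.flatten(2).mean(-1)                      # (B, C) for distillation
        heatmap = self.head(f)                                # (B, 1, h, w) affordance prediction
        return heatmap, obj_feat


def distillation_loss(obj_feat, video_feat):
    # Align object-branch features with (detached) video-branch features
    # so that knowledge learned from demonstrations transfers to the object branch.
    return F.mse_loss(obj_feat, video_feat.detach())
```

In a training loop one would sum the affordance supervision on the heatmap with this distillation term; at inference only the object branch is needed, which is the practical point of transferring the video knowledge.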