Dalian University of Technology, Dalian, 116024, Liaoning, China.
Neural Netw. 2024 Feb;170:242-253. doi: 10.1016/j.neunet.2023.11.002. Epub 2023 Nov 13.
Recent two-stage detector-based methods, together with the successful application of transformers, show superiority in Human-Object Interaction (HOI) detection. However, these methods extract global contextual features only through instance-level attention, without considering the perspective of human-object interaction pairs, and the fusion enhancement of interaction-pair features remains underexplored. Compared with instance-guided global context extraction, guiding global context extraction with human-object interaction pairs more fully exploits the semantics between human-object pairs, which benefits HOI recognition. To this end, we propose a two-stage Global Context and Pairwise-level Fusion Features Integration Network (GFIN) for HOI detection. Specifically, the first stage employs an object detector for instance feature extraction. The second stage captures semantically rich visual information through three proposed modules: a Global Contextual Feature Extraction Encoder (GCE), a Pairwise Interaction Query Decoder (PID), and a Human-Object Pairwise-level Attention Fusion Module (HOF). The GCE module extracts the global context memory via the proposed crossover-residual mechanism and then integrates it with the local instance memory from the DETR object detector. HOF uses the proposed pairwise-level attention mechanism to fuse and enhance the multi-layer features from the first stage. PID takes the query sequence from HOF and the memory from GCE as input and outputs multi-label interaction recognition results. Finally, comprehensive experiments on the HICO-DET and V-COCO datasets demonstrate that the proposed GFIN significantly outperforms state-of-the-art methods. Code is available at https://github.com/ddwhzh/GFIN.
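To make the pairwise-versus-instance distinction concrete, the following is a minimal sketch (not the authors' implementation; see the linked repository for that) of the general idea behind pairwise-level attention: each human-object pair, rather than each instance, forms a query that attends over a global context memory. All function names, the pair-query construction (element-wise sum of the two instance features), and the toy dimensions are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pairwise_attention(human_feats, object_feats, memory):
    """Toy pairwise-level attention (illustrative, not the paper's HOF):
    each human-object pair forms a query by element-wise summing its two
    instance features, then attends over the global context memory, a
    list of vectors serving as both keys and values."""
    d = len(memory[0])
    outputs = []
    for h in human_feats:
        for o in object_feats:
            q = [hi + oi for hi, oi in zip(h, o)]          # pair-level query
            scores = [dot(q, m) / math.sqrt(d) for m in memory]
            w = softmax(scores)                             # attention weights
            fused = [sum(wi * m[k] for wi, m in zip(w, memory))
                     for k in range(d)]                     # weighted context
            outputs.append(fused)
    return outputs
```

The key contrast with instance-level attention is the query construction: queries are built per human-object pair (yielding one fused feature per pair), so the attended context is conditioned on the pair semantics rather than on individual detections.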