Quan Hutuo, Lai Huicheng, Gao Guxue, Ma Jun, Li Junkai, Chen Dongji
College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China.
Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China.
Entropy (Basel). 2024 Feb 27;26(3):205. doi: 10.3390/e26030205.
Human-object interaction (HOI) detection aims to localize humans and objects and recognize the relationships between them, which helps computers understand high-level semantics. Two-stage and one-stage HOI detection methods each have distinct advantages and disadvantages. Two-stage methods obtain high-quality human-object pair features from object detection but lack contextual information. One-stage transformer-based methods model global features well but cannot benefit from object detection. An ideal model would combine the advantages of both. We therefore propose the Pairwise Convolutional neural network (CNN)-Transformer (PCT), a simple and effective two-stage method that fully utilizes the object detector while retaining rich contextual information. Specifically, we obtain pairwise CNN features from the CNN backbone and fuse them with pairwise transformer features to enhance the pairwise representations. The enhanced representations outperform CNN or transformer features used individually. In addition, the global features of the transformer provide valuable contextual cues. We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. The experimental results show that the previously neglected CNN features still have a significant edge. Compared with state-of-the-art methods, our model achieves competitive results on the HICO-DET and V-COCO datasets.
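The pairing and fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the fusion operator, so plain concatenation is assumed here, and the function names (`build_pairs`, `pairwise_feature`, `fuse`), the toy feature vectors, and the "person" label convention are all hypothetical.

```python
from itertools import product


def build_pairs(labels, human_label="person"):
    """Pair every detected human with every other detection (assumed pairing rule)."""
    humans = [i for i, lab in enumerate(labels) if lab == human_label]
    return [(h, o) for h, o in product(humans, range(len(labels))) if h != o]


def pairwise_feature(feats, h, o):
    """Concatenate the human and object feature vectors of one pair."""
    return feats[h] + feats[o]


def fuse(cnn_feats, trans_feats, pairs):
    """Fuse pairwise CNN and pairwise transformer features.

    Concatenation is an assumption; the abstract only says the two
    kinds of pairwise features are fused.
    """
    return [
        pairwise_feature(cnn_feats, h, o) + pairwise_feature(trans_feats, h, o)
        for h, o in pairs
    ]


# Toy detections: two people and one object, with made-up per-instance features.
labels = ["person", "bicycle", "person"]
cnn_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # e.g. pooled backbone features
trans_feats = [[1.0], [2.0], [3.0]]                # e.g. transformer instance features

pairs = build_pairs(labels)
fused = fuse(cnn_feats, trans_feats, pairs)
```

In a real two-stage pipeline the fused vector for each human-object pair would feed an interaction classifier; the sketch stops at the fused representation because the abstract gives no further detail.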