Zong Daoming, Sun Shiliang
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17805-17816. doi: 10.1109/TNNLS.2023.3309104. Epub 2024 Dec 2.
Human-object interaction (HOI) detection involves identifying interactions represented as <human, action, object> triplets, which requires localizing human-object pairs and classifying their interactions within an image. This work focuses on the challenge of detecting HOIs with unseen objects using the prevalent Transformer architecture. Our empirical analysis reveals that the performance degradation on novel HOI instances arises primarily from misclassifying unseen objects as confusable seen objects. To address this issue, we propose a similarity propagation (SP) scheme that leverages cosine similarity to regulate the prediction margin between seen and unseen objects. In addition, we introduce pseudo-supervision for unseen objects based on class semantic similarities during training. Furthermore, we incorporate semantic-aware instance-level and interaction-level contrastive losses into the Transformer to enhance intraclass compactness and interclass separability, resulting in improved visual representations. Extensive experiments on two challenging benchmarks, V-COCO and HICO-DET, demonstrate the effectiveness of our model, which outperforms current state-of-the-art methods under various zero-shot settings.
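The abstract does not give the exact formulation of the pseudo-supervision step, so the following is only a minimal sketch of the general idea: using cosine similarity between class embeddings to build a soft target over unseen object classes for an instance labeled with a seen class. The function names, the toy 3-dimensional embeddings, and the `temperature` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_targets(seen_label, class_embeddings, unseen_classes, temperature=0.1):
    """Hypothetical helper: soft targets over unseen classes.

    Each unseen class is weighted by a softmax over its cosine similarity
    to the embedding of the seen ground-truth class.
    """
    anchor = class_embeddings[seen_label]
    scores = [
        cosine_similarity(anchor, class_embeddings[c]) / temperature
        for c in unseen_classes
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(unseen_classes, exps)}


# Toy embeddings (assumed values, standing in for real word embeddings).
class_embeddings = {
    "horse": [1.0, 0.1, 0.0],
    "zebra": [0.9, 0.2, 0.1],
    "laptop": [0.0, 0.1, 1.0],
}

targets = pseudo_targets("horse", class_embeddings, ["zebra", "laptop"])
```

In this toy setup, an instance labeled "horse" yields a soft target that places most of its weight on the semantically closer unseen class "zebra". A lower `temperature` sharpens that preference, while a higher one spreads the weight more evenly across unseen classes.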