IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5760-5773. doi: 10.1109/TNNLS.2021.3131154. Epub 2023 Sep 1.
Existing work on human-object interaction (HOI) detection usually relies on expensive, large-scale labeled image datasets. In real scenes, however, labeled data may be insufficient, and some rare HOI categories have only a few samples. This poses a significant challenge for deep-learning-based HOI detection models. Existing approaches address the problem by introducing compositional learning or word embeddings, but they still require large-scale labeled data or rely heavily on previously learned knowledge. In contrast, freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we propose a multitask learning (MTL) perspective that assists HOI detection by learning motion-relevant knowledge from unlabeled videos. Specifically, we design an appearance reconstruction loss (ARL) and a sequential motion mining module, both trained in a self-supervised manner, to learn more generalizable motion representations that improve the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to reduce the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and on the V-COCO dataset with minimal supervision demonstrate the effectiveness of the motion-aware knowledge contained in unlabeled videos for HOI detection.
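As a rough illustration of the multitask idea, the sketch below combines a supervised HOI detection loss with a self-supervised reconstruction term and a domain-classification term. It is a minimal sketch under stated assumptions, not the formulation used in the article: the function names, the use of mean squared error for reconstruction, the binary cross-entropy for the domain discriminator, and the weights `w_recon` and `w_domain` are all hypothetical. The sequential motion mining module is not modeled.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(hoi_loss, recon_pred, recon_target,
                   domain_probs, domain_labels,
                   w_recon=1.0, w_domain=0.1):
    """Weighted sum of three loss terms (all weights are assumed values).

    hoi_loss      -- supervised detection loss on labeled HOI images
    recon_*       -- predicted vs. true appearance features of video frames
    domain_probs  -- discriminator outputs, P(sample comes from the video domain)
    domain_labels -- 1 for video samples, 0 for HOI image samples
    """
    recon = mse(recon_pred, recon_target)
    domain = sum(binary_cross_entropy(p, y)
                 for p, y in zip(domain_probs, domain_labels)) / len(domain_probs)
    total = hoi_loss + w_recon * recon + w_domain * domain
    return {"hoi": hoi_loss, "recon": recon, "domain": domain, "total": total}


if __name__ == "__main__":
    losses = multitask_loss(
        hoi_loss=1.0,
        recon_pred=[1.0, 2.0], recon_target=[1.0, 4.0],
        domain_probs=[0.5, 0.5], domain_labels=[1, 0],
    )
    print(losses)
```

In an adversarial setup, the feature extractor would typically be trained to increase the domain term (for example through a gradient reversal layer) while the discriminator is trained to decrease it; this sketch only computes the scalar values and does not model that optimization.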