
Toward a Unified Transformer-Based Framework for Scene Graph Generation and Human-Object Interaction Detection.

Publication information

IEEE Trans Image Process. 2023;32:6274-6288. doi: 10.1109/TIP.2023.3330304. Epub 2023 Nov 20.

Abstract

Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat these as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we first employ a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another Transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that SG2HOI+, jointly trained on both SGG and HOI tasks in an end-to-end manner, yields substantial improvements for both tasks compared to individualized training paradigms.
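The abstract describes a two-stage pipeline: a relation Transformer decodes relation-triple embeddings from visual features, and a second Transformer decoder predicts human-object interactions from those triples. As a rough illustration only, the following minimal PyTorch sketch shows this general structure. All class names, query counts, and dimensions (RelationTransformer, HOIDecoder, num_triples=100, d_model=256, etc.) are illustrative assumptions, not the authors' implementation or released code.

```python
import torch
import torch.nn as nn

class RelationTransformer(nn.Module):
    """Stage 1 (sketch): decode a fixed set of relation-triple embeddings
    (subject-predicate-object slots) from flattened visual features."""
    def __init__(self, d_model=256, num_triples=100, nhead=8, num_layers=3):
        super().__init__()
        self.triple_queries = nn.Parameter(torch.randn(num_triples, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)

    def forward(self, visual_feats):
        # visual_feats: (B, N, d_model) flattened backbone features
        b = visual_feats.size(0)
        queries = self.triple_queries.unsqueeze(0).expand(b, -1, -1)
        # Output: (B, num_triples, d_model) relation-triple embeddings
        return self.transformer(visual_feats, queries)


class HOIDecoder(nn.Module):
    """Stage 2 (sketch): HOI queries cross-attend to the relation-triple
    embeddings and are mapped to interaction (verb) logits."""
    def __init__(self, d_model=256, num_hoi_queries=64, num_verbs=117,
                 nhead=8, num_layers=3):
        super().__init__()
        self.hoi_queries = nn.Parameter(torch.randn(num_hoi_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.verb_head = nn.Linear(d_model, num_verbs)

    def forward(self, triple_embeds):
        b = triple_embeds.size(0)
        queries = self.hoi_queries.unsqueeze(0).expand(b, -1, -1)
        hoi_embeds = self.decoder(queries, triple_embeds)
        return self.verb_head(hoi_embeds)  # (B, num_hoi_queries, num_verbs)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 256)        # dummy features for 2 images
    triples = RelationTransformer()(feats)  # stage 1: relation triples
    verb_logits = HOIDecoder()(triples)     # stage 2: HOI prediction from triples
    print(verb_logits.shape)                # torch.Size([2, 64, 117])
```

The value 117 here matches the number of verb classes in HICO-DET; in a real system the two stages would be trained jointly end-to-end, as the abstract reports, with additional heads for subject/object localisation that are omitted from this sketch.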

