IEEE Trans Cybern. 2022 Jul;52(7):5961-5972. doi: 10.1109/TCYB.2021.3052522. Epub 2022 Jul 4.
Scene graph generation (SGG) builds on detected objects to predict pairwise visual relations between objects, yielding an abstract description of image content. Existing works have shown that SGG performance improves significantly when the links between objects are given as prior knowledge. Inspired by this observation, in this article we propose a relation-regularized network (R2-Net), which predicts whether a relationship exists between two objects and encodes this relation into object feature refinement for better SGG. Specifically, we first construct an affinity matrix over the detected objects to represent the probability that a relationship exists between each pair. Graph convolution networks (GCNs) over this relation affinity matrix then serve as object encoders, producing relation-regularized object representations. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments on the Visual Genome dataset across three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection) demonstrate the effectiveness of the proposed method. Ablation studies further verify the key roles of the proposed components in the performance improvement.
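The core mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bilinear pairwise scoring (`W_rel`), the layer sizes, and the single symmetric-normalized GCN layer are all assumptions made for clarity.

```python
import numpy as np

def relation_affinity(feats, W_rel):
    """Hypothetical pairwise scorer: sigmoid of a bilinear score gives
    the probability that a relationship exists between two objects."""
    scores = feats @ W_rel @ feats.T          # (N, N) pairwise scores
    A = 1.0 / (1.0 + np.exp(-scores))         # sigmoid -> probabilities in [0, 1]
    np.fill_diagonal(A, 0.0)                  # no self-relations
    return A

def gcn_layer(feats, A, W):
    """One GCN layer over the affinity matrix: D^{-1/2}(A+I)D^{-1/2} X W
    with a ReLU, producing relation-regularized object features."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ feats @ W, 0.0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))           # 4 detected objects, 8-d features
W_rel = rng.standard_normal((8, 8))           # assumed bilinear relation weights
W = rng.standard_normal((8, 8))               # assumed GCN layer weights

A = relation_affinity(feats, W_rel)           # (4, 4) relation affinity matrix
refined = gcn_layer(feats, A, W)              # (4, 8) relation-regularized features
print(refined.shape)
```

The refined features would then feed object-label and predicate classifiers; the paper's actual architecture is richer than this single layer.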