Chen Vincent S, Varma Paroma, Krishna Ranjay, Bernstein Michael, Ré Christopher, Fei-Fei Li
Stanford University.
Proc IEEE Int Conf Comput Vis. 2019 Oct-Nov;2019:2580-2590. doi: 10.1109/iccv.2019.00267. Epub 2020 Feb 27.
Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited label setting, we define a complexity metric for relationships that serves as an indicator (R = 0.778) for conditions under which our method succeeds over transfer learning, the de facto approach for training with limited labels.
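The pipeline the abstract describes can be sketched as follows: image-agnostic heuristics vote on unlabeled (subject, object) box pairs, and their noisy, possibly abstaining votes are aggregated into a probabilistic label. This is a minimal sketch; an accuracy-weighted log-odds vote stands in for the paper's factor graph-based generative model, and the heuristic names, spatial features, and accuracy values are illustrative assumptions, not the authors' exact rules.

```python
# Weak-supervision sketch: noisy heuristics produce votes on a candidate
# relationship (e.g. "above"), which are combined into a probabilistic label.
import math
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)
ABSTAIN, NEG, POS = -1, 0, 1

def above_heuristic(sub: Box, obj: Box) -> int:
    # Fires POS if the subject's center lies above the object's center
    # (hypothetical spatial rule, for illustration only).
    return POS if sub[1] + sub[3] / 2 < obj[1] + obj[3] / 2 else NEG

def overlap_heuristic(sub: Box, obj: Box) -> int:
    # Abstains unless the boxes overlap horizontally.
    if sub[0] < obj[0] + obj[2] and obj[0] < sub[0] + sub[2]:
        return POS
    return ABSTAIN

def aggregate(heuristics: List[Callable[[Box, Box], int]],
              accuracies: List[float], sub: Box, obj: Box) -> float:
    """Accuracy-weighted log-odds vote over non-abstaining heuristics,
    squashed to a probability; a simplified stand-in for the generative model."""
    log_odds = 0.0
    for h, acc in zip(heuristics, accuracies):
        vote = h(sub, obj)
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))  # more accurate heuristics weigh more
        log_odds += weight if vote == POS else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))  # P(relationship holds)

heuristics = [above_heuristic, overlap_heuristic]
accuracies = [0.8, 0.7]  # in the paper, learned by the generative model
p = aggregate(heuristics, accuracies, (10, 0, 20, 20), (12, 40, 20, 20))
```

Probabilistic labels like `p` can then serve as soft training targets for any downstream scene graph model, which is how a handful of labeled examples gets amplified into a full training set.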