
Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions

Authors

Ruimao Zhang, Liang Lin, Guangrun Wang, Meng Wang, Wangmeng Zuo

Publication

IEEE Trans Pattern Anal Mach Intell. 2019 Mar;41(3):596-610. doi: 10.1109/TPAMI.2018.2799846. Epub 2018 Jan 30.

Abstract

This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) that extracts the image representation for pixel-wise object labeling, and ii) a recursive neural network (RsNN) that discovers the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and the RsNN are updated accordingly by backpropagation. The entire model is trained with an Expectation-Maximization (EM) method. Extensive experiments show that our model is capable of producing meaningful scene configurations and achieving more favorable scene labeling results on two benchmarks (i.e., PASCAL VOC 2012 and SYSU-Scenes) compared with other state-of-the-art weakly-supervised deep learning methods. In particular, SYSU-Scenes, a dataset we created to advance research on scene parsing, contains more than 5,000 scene images with semantic sentence descriptions.
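A minimal PyTorch sketch of the EM-style alternation the abstract describes may help make the training scheme concrete. All names here (`PixelCNN`, `RelationRsNN`, `em_train_step`, `tree_merges`) are hypothetical placeholders, not the authors' code, and the E-step is drastically simplified: in the paper the latent scene configuration is inferred by aligning image regions with the semantic tree parsed from the description, whereas here a plain argmax over the CNN's own predictions stands in for that inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelCNN(nn.Module):
    """Toy stand-in for the CNN branch: per-pixel class logits."""
    def __init__(self, num_classes):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),
        )

    def forward(self, image):
        return self.body(image)  # (B, num_classes, H, W)

class RelationRsNN(nn.Module):
    """Toy stand-in for the RsNN branch: compose two object embeddings
    into a parent node and classify their interaction relation."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)
        self.relate = nn.Linear(dim, num_relations)

    def forward(self, left, right):
        parent = torch.tanh(self.compose(torch.cat([left, right], dim=-1)))
        return parent, self.relate(parent)  # parent embedding, relation logits

def em_train_step(cnn, rsnn, optimizer, image, tree_merges):
    """One EM-style update. `tree_merges` is a list of
    (left_emb, right_emb, relation_id) triples read off the semantic
    tree parsed from the image's description (a simplification of the
    paper's latent-configuration inference)."""
    # E-step: freeze the networks and estimate the scene configuration.
    # Here a crude argmax stands in for the tree-guided label inference.
    with torch.no_grad():
        pseudo_labels = cnn(image).argmax(dim=1)  # (B, H, W)

    # M-step: with the configuration fixed, update both networks by backprop.
    seg_loss = F.cross_entropy(cnn(image), pseudo_labels)
    rel_loss = image.new_zeros(())
    for left, right, rel_id in tree_merges:
        _, rel_logits = rsnn(left, right)
        rel_loss = rel_loss + F.cross_entropy(
            rel_logits.unsqueeze(0), torch.tensor([rel_id]))
    loss = seg_loss + rel_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny smoke test with random data.
cnn, rsnn = PixelCNN(num_classes=21), RelationRsNN(dim=32, num_relations=9)
opt = torch.optim.SGD(list(cnn.parameters()) + list(rsnn.parameters()), lr=0.01)
img = torch.randn(1, 3, 64, 64)
merges = [(torch.randn(32), torch.randn(32), 2)]
print(em_train_step(cnn, rsnn, opt, img, merges))
```

In the actual method, both the pseudo pixel labels and the object merging order come from the sentence's semantic tree of nouns and verb phrases; the self-labeling shortcut above is only there to keep the sketch runnable.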

