Ruimao Zhang, Liang Lin, Guangrun Wang, Meng Wang, Wangmeng Zuo
IEEE Trans Pattern Anal Mach Intell. 2019 Mar;41(3):596-610. doi: 10.1109/TPAMI.2018.2799846. Epub 2018 Jan 30.
This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixel-wise object labeling and ii) a recursive neural network (RsNN) discovering the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly-supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RsNN are updated accordingly by backpropagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments show that our model is capable of producing meaningful scene configurations and achieving more favorable scene labeling results on two benchmarks (i.e., PASCAL VOC 2012 and SYSU-Scenes) compared with other state-of-the-art weakly-supervised deep learning methods. In particular, SYSU-Scenes contains more than 5,000 scene images with their semantic sentence descriptions, which we created to advance research on scene parsing.
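To make the alternating training scheme concrete, the sketch below shows a toy version of the EM loop the abstract describes: an E-step infers a latent pixel labeling consistent with the nouns in a sentence-derived semantic tree, and an M-step updates model parameters toward that inferred configuration. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: the CNN and RsNN are collapsed into a single linear scorer over toy features, the semantic parse is a fixed noun/verb-phrase pattern, and all function names are hypothetical.

```python
# Hypothetical sketch of EM-style weakly-supervised training from image
# descriptions. The real model uses a CNN for pixel labeling and an RsNN
# for the object hierarchy; here both are reduced to a linear scorer.
import numpy as np

rng = np.random.default_rng(0)

CLASSES = ["person", "horse", "background"]

def sentence_to_tree(sentence):
    """Toy decomposition of a description into (noun, verb phrase, noun);
    the paper uses a proper semantic parse of nouns and verb phrases."""
    words = sentence.lower().rstrip(".").split()
    return {"left": words[0], "relation": " ".join(words[1:-1]), "right": words[-1]}

def e_step(pixel_feats, tree, W):
    """E-step: pick the latent pixel labeling most consistent with the
    nouns mentioned in the sentence tree (a stand-in for inferring the
    full scene configuration)."""
    allowed = [CLASSES.index(tree["left"]), CLASSES.index(tree["right"]),
               CLASSES.index("background")]
    scores = pixel_feats @ W                     # (n_pixels, n_classes)
    masked = np.full_like(scores, -np.inf)       # forbid unmentioned classes
    masked[:, allowed] = scores[:, allowed]
    return masked.argmax(axis=1)                 # pseudo ground-truth labels

def m_step(pixel_feats, labels, W, lr=0.1):
    """M-step: one softmax cross-entropy gradient step toward the labels
    inferred in the E-step (the paper backpropagates through both the
    CNN and the RsNN here)."""
    logits = pixel_feats @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(len(CLASSES))[labels]
    grad = pixel_feats.T @ (probs - onehot) / len(labels)
    return W - lr * grad

# Toy data: 50 "pixels" with 8-dim features and one weak description.
pixel_feats = rng.normal(size=(50, 8))
tree = sentence_to_tree("person rides a horse.")

W = rng.normal(scale=0.01, size=(8, len(CLASSES)))
for _ in range(10):                              # alternate E- and M-steps
    labels = e_step(pixel_feats, tree, W)
    W = m_step(pixel_feats, labels, W)
```

In this reading, the sentence supplies only weak supervision: it constrains which object classes may appear, while the E-step resolves where they appear, which is why no manually labeled semantic maps are needed.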