IEEE Trans Pattern Anal Mach Intell. 2015 Dec;37(12):2478-91. doi: 10.1109/TPAMI.2015.2424880.
A typical scene category contains an enormous number of distinct scene configurations that are composed of objects and regions of varying shapes in different layouts. In this paper, we first propose a representation named hierarchical space tiling (HST) to quantize the huge and continuous scene configuration space. Then, we augment the HST with attributes (nouns and adjectives) to describe the semantics of the objects and regions inside a scene. We present a weakly supervised method for simultaneously learning the scene configurations and attributes from a collection of natural images associated with descriptive text. The precise locations of the attributes are unknown in the input and are mapped to the HST nodes through learning. Starting with a full HST, we iteratively estimate the HST model under a learning-by-parsing framework. Given a test image, we compute the most probable parse tree with the associated attributes by dynamic programming. We quantitatively analyze the representative efficiency of HST and show that the learned representation is less ambiguous and has semantically meaningful inner concepts. We apply our model to four tasks: scene classification, attribute recognition, attribute localization, and pixel-wise scene labeling, and show improved performance as well as higher efficiency.
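The dynamic-programming step mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' HST model; it only assumes a small grid of cells with per-cell label log-scores, rectangles that are either labeled as a single tile or cut in two, and a fixed cost per tile. The names `best_parse`, `tile_cost`, and the score layout are hypothetical and chosen only for illustration.

```python
from functools import lru_cache


def best_parse(scores, tile_cost=1.0):
    """Toy hierarchical tiling parse (illustrative only, not the paper's HST).

    scores[r][c] is a dict mapping label -> log-score for grid cell (r, c).
    Returns (total_score, tiles), where each tile is (r0, c0, r1, c1, label)
    and covers rows r0..r1-1 and columns c0..c1-1.
    """
    rows, cols = len(scores), len(scores[0])
    labels = list(scores[0][0])

    @lru_cache(maxsize=None)
    def solve(r0, c0, r1, c1):
        # Option 1: treat the rectangle as one terminal tile with a single label.
        def region_score(label):
            return sum(scores[r][c][label]
                       for r in range(r0, r1) for c in range(c0, c1))

        label = max(labels, key=region_score)
        best = (region_score(label) - tile_cost, ((r0, c0, r1, c1, label),))

        # Option 2: cut horizontally and parse the two halves independently.
        for r in range(r0 + 1, r1):
            top, bottom = solve(r0, c0, r, c1), solve(r, c0, r1, c1)
            cand = (top[0] + bottom[0], top[1] + bottom[1])
            if cand[0] > best[0]:
                best = cand

        # Option 3: cut vertically and parse the two halves independently.
        for c in range(c0 + 1, c1):
            left, right = solve(r0, c0, r1, c), solve(r0, c, r1, c1)
            cand = (left[0] + right[0], left[1] + right[1])
            if cand[0] > best[0]:
                best = cand

        return best

    total, tiles = solve(0, 0, rows, cols)
    return total, list(tiles)


if __name__ == "__main__":
    # Hypothetical 2x2 scene: top row looks like sky, bottom row like grass.
    sky = {"sky": 0.0, "grass": -5.0}
    grass = {"sky": -5.0, "grass": 0.0}
    total, tiles = best_parse([[sky, sky], [grass, grass]])
    print(total, tiles)
```

Memoizing on the rectangle coordinates means each sub-rectangle is parsed once, which is what makes the search over all nested cuts tractable. The per-tile cost is an assumed stand-in for a model prior that favors fewer, larger tiles.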