IEEE Trans Pattern Anal Mach Intell. 2017 Jul;39(7):1444-1454. doi: 10.1109/TPAMI.2016.2592911. Epub 2016 Jul 19.
We propose a novel approach to semantic scene labeling in urban scenarios that aims to combine strong recognition performance with high computational efficiency. To that end, we exploit efficient tree-structured models on two levels: pixels and superpixels. At the pixel level, we propose to unify pixel labeling and the extraction of semantic texton features within a single architecture, which we call encode-and-classify trees. At the superpixel level, we put forward a multi-cue segmentation tree that groups superpixels at multiple granularities. Through learning, the segmentation tree exploits and aggregates a wide range of complementary cues present in the data. A tree-structured CRF is then used to jointly infer the labels of all regions across the tree. Finally, we introduce a novel object-centric evaluation method that specifically addresses the urban setting, in which object scales vary strongly. Our experiments demonstrate labeling performance competitive with the state of the art, while achieving near real-time frame rates of up to 20 fps.
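The abstract does not spell out how labels are inferred jointly across the segmentation tree. As a minimal sketch of why a tree-structured CRF suits this setting, the following shows exact MAP inference on a tree by max-sum message passing, which needs only one upward and one downward pass. This is not the authors' implementation: the function name `tree_map_inference`, the single pairwise score table shared by all parent-child edges, and the node ordering (node 0 is the root and every parent has a smaller index than its children) are assumptions made for illustration.

```python
import numpy as np

def tree_map_inference(parent, unary, pairwise):
    """Exact MAP labeling of a tree-structured CRF via max-sum (illustrative sketch).

    parent   : (N,) int array; parent[i] is the parent of node i, parent[0] = -1.
               Assumption: parent[i] < i for all i > 0 (parents precede children).
    unary    : (N, K) array; unary[i, l] is the score of label l at node i.
    pairwise : (K, K) array; pairwise[p, c] is the score of parent label p
               with child label c (assumed shared by all edges).
    """
    n, k = unary.shape
    belief = unary.astype(float).copy()
    best_child_label = np.zeros((n, k), dtype=int)

    # Upward pass (leaves to root). Iterating in reverse index order visits
    # every child before its parent, so belief[i] is complete when used.
    for i in range(n - 1, 0, -1):
        scores = pairwise + belief[i][None, :]       # scores[p_label, c_label]
        best_child_label[i] = scores.argmax(axis=1)  # best child label per parent label
        belief[parent[i]] += scores.max(axis=1)      # message to the parent

    # Downward pass (root to leaves): fix the root, then read off each
    # node's best label given its parent's chosen label.
    labels = np.zeros(n, dtype=int)
    labels[0] = belief[0].argmax()
    for i in range(1, n):
        labels[i] = best_child_label[i, labels[parent[i]]]
    return labels

# Toy example: a root region with two child regions and two labels.
parent = np.array([-1, 0, 0])
unary = np.array([[0.0, 0.1],   # root weakly prefers label 1
                  [2.0, 0.0],   # both children strongly prefer label 0
                  [2.0, 0.0]])
agree = np.eye(2)               # reward of 1 when parent and child agree
print(tree_map_inference(parent, unary, agree))
```

In this toy example the root on its own would pick label 1, but joint inference assigns label 0 to all three nodes because agreeing with both children scores higher overall. The cost is linear in the number of nodes and quadratic in the number of labels, which is consistent with the efficiency goal stated in the abstract.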