IEEE Trans Pattern Anal Mach Intell. 2017 Jan;39(1):115-127. doi: 10.1109/TPAMI.2016.2537339. Epub 2016 Mar 2.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which integrates cross-layer context, global image-level context, semantic edge context, within-super-pixel context, and cross-super-pixel neighborhood context into a unified network. Given an input human image, Co-CNN produces pixel-wise category labels in an end-to-end way. First, the cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines global semantic information and local fine details across different convolutional layers. Second, the global image-level label prediction is used as an auxiliary objective in an intermediate layer of Co-CNN, and its outputs further guide feature learning in subsequent convolutional layers to leverage the global image-level context. Third, semantic edge context is incorporated into Co-CNN, where high-level semantic boundaries guide the pixel-wise labeling. Finally, to further exploit local super-pixel contexts, within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural sub-components of Co-CNN to achieve local label consistency during both training and testing. Comprehensive evaluations on two public datasets demonstrate the significant superiority of Co-CNN over other state-of-the-art methods for human parsing. In particular, Co-CNN reaches an F-1 score of 81.72 percent on the large dataset [1], significantly higher than the 62.81 percent and 64.38 percent achieved by the state-of-the-art algorithms M-CNN [2] and ATR [1], respectively. By training on our newly collected large dataset, Co-CNN achieves an F-1 score of 85.36 percent.
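The two super-pixel sub-components described above can be sketched in simplified form: within-super-pixel smoothing averages the per-pixel class confidences inside each super-pixel, and cross-super-pixel neighborhood voting blends each super-pixel's confidences with those of its neighbors. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the Gaussian appearance-similarity weighting, and the `sigma` parameter are assumptions introduced here.

```python
import numpy as np

def within_superpixel_smooth(conf, sp):
    """Within-super-pixel smoothing (illustrative sketch).
    conf: (H, W, C) per-pixel class confidences; sp: (H, W) super-pixel ids.
    Every pixel in a super-pixel receives that super-pixel's mean confidence."""
    out = conf.copy()
    for s in np.unique(sp):
        mask = sp == s
        out[mask] = conf[mask].mean(axis=0)  # one averaged C-vector per super-pixel
    return out

def cross_superpixel_vote(sp_conf, neighbors, feats, sigma=1.0):
    """Cross-super-pixel neighborhood voting (illustrative sketch).
    sp_conf: {id: (C,) confidence}; neighbors: {id: [neighbor ids]};
    feats: {id: appearance feature vector}. Neighbors with similar appearance
    contribute more strongly (Gaussian weighting is an assumption here)."""
    out = {}
    for s, conf in sp_conf.items():
        votes = conf.copy()
        wsum = 1.0  # the super-pixel's own vote has weight 1
        for n in neighbors.get(s, []):
            d2 = np.sum((feats[s] - feats[n]) ** 2)
            w = np.exp(-d2 / (2.0 * sigma ** 2))
            votes += w * sp_conf[n]
            wsum += w
        out[s] = votes / wsum
    return out
```

Because both steps are simple weighted averages of confidences, they are differentiable and can sit inside the network as sub-components during training, as the abstract describes, rather than being applied only as post-processing.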