IEEE Trans Pattern Anal Mach Intell. 2015 Dec;37(12):2402-14. doi: 10.1109/TPAMI.2015.2408360.
In this work, the human parsing task, namely decomposing a human image into semantic fashion/body regions, is formulated as an active template regression (ATR) problem: the normalized mask of each fashion/body item is expressed as a linear combination of learned mask templates and then morphed into a more precise mask by active shape parameters, i.e., the position, scale, and visibility of each semantic region. The mask template coefficients and the active shape parameters together generate the human parsing result and are thus called the structure outputs for human parsing. A deep convolutional neural network (CNN) is used to learn the end-to-end mapping from the input human image to these structure outputs. More specifically, the structure outputs are predicted by two separate networks. The first CNN, which includes max-pooling, is designed to predict the template coefficients for each label mask, while the second CNN omits max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters. For a new image, the structure outputs of the two networks are fused to generate the probability of each label at each pixel, and superpixel smoothing is finally applied to refine the human parsing result. Comprehensive evaluations on a large dataset demonstrate the significant superiority of the ATR framework over other state-of-the-art methods for human parsing. In particular, the F1-score of our ATR framework reaches 64.38 percent, significantly higher than the 44.76 percent achieved by the state-of-the-art algorithm [28].
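The core idea of the structure outputs can be sketched in a few lines of NumPy. The snippet below is an illustrative toy, not the paper's implementation: `reconstruct_mask` forms a label mask as a linear combination of learned templates (the first network's output), and `morph_mask` applies hypothetical active shape parameters (center position, scale, visibility) to place that mask in the image (the second network's output). Function names, the visibility threshold, and the nearest-neighbour resize are all assumptions made for illustration.

```python
import numpy as np

def reconstruct_mask(coeffs, templates):
    """Normalized mask as a linear combination of mask templates.

    coeffs: (K,) template coefficients; templates: (K, H, W).
    """
    return np.tensordot(coeffs, templates, axes=1)

def morph_mask(mask, out_shape, center, scale, visibility):
    """Morph the normalized mask with toy active shape parameters.

    center: (row, col) placement in the output image; scale: resize
    factor; visibility: if below 0.5 the region is treated as absent.
    (0.5 is an assumed threshold, not from the paper.)
    """
    H, W = out_shape
    out = np.zeros(out_shape)
    if visibility < 0.5:  # region predicted not visible
        return out
    h, w = mask.shape
    th = max(1, int(round(h * scale)))
    tw = max(1, int(round(w * scale)))
    # Nearest-neighbour resize of the normalized mask.
    ys = np.arange(th) * h // th
    xs = np.arange(tw) * w // tw
    resized = mask[np.ix_(ys, xs)]
    # Paste the resized mask centered at `center`, clipped to bounds.
    y0, x0 = center[0] - th // 2, center[1] - tw // 2
    ya, yb = max(0, y0), min(H, y0 + th)
    xa, xb = max(0, x0), min(W, x0 + tw)
    out[ya:yb, xa:xb] = resized[ya - y0:yb - y0, xa - x0:xb - x0]
    return out

# Toy usage: two 4x4 templates, equal coefficients, placed in a 10x10 image.
templates = np.stack([np.ones((4, 4)), np.zeros((4, 4))])
coeffs = np.array([0.5, 0.5])
mask = reconstruct_mask(coeffs, templates)            # (4, 4), all 0.5
placed = morph_mask(mask, (10, 10), (5, 5), 1.0, 1.0)
hidden = morph_mask(mask, (10, 10), (5, 5), 1.0, 0.0) # invisible region
```

In the actual framework, the coefficients and shape parameters are regressed by the two CNNs, the per-label masks are fused into per-pixel label probabilities, and superpixel smoothing refines the final parsing.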