IEEE Trans Image Process. 2018 Mar;27(3):1049-1059. doi: 10.1109/TIP.2017.2740160. Epub 2017 Aug 14.
Crowd counting is a challenging task, mainly because of severe occlusion in dense crowds. This paper takes a broader view and addresses crowd counting from the perspective of semantic modeling. In essence, crowd counting is a pedestrian semantic analysis task involving three key factors: pedestrians, heads, and their context structure. Information about different body parts is an important cue for judging whether a person is present at a given position. Existing methods usually count crowds by directly modeling the visual properties of either the whole body or the head alone, without explicitly capturing the composite body-part semantic structure that is crucial for crowd counting. In our approach, we first formulate the key factors of crowd counting as semantic scene models. We then convert crowd counting into a multi-task learning problem in which the semantic scene models become separate sub-tasks. Finally, deep convolutional neural networks learn the sub-tasks in a unified scheme. Our approach encodes the semantic nature of crowd counting and provides a novel solution in terms of pedestrian semantic analysis. In experiments, our approach outperforms state-of-the-art methods on four benchmark crowd counting data sets, demonstrating that semantic structure information is an effective cue for crowd counting.
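The multi-task formulation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the sub-task outputs, the loss functions, or the weighting, so everything below is an assumption made for illustration: the task names (`density`, `head`), the use of mean squared error, the per-task weights, and the idea of reading the count off a density map by summing it are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Grid = List[List[float]]  # a 2-D map, e.g. a predicted density map


def mse(pred: Grid, target: Grid) -> float:
    """Mean squared error between two equally sized 2-D maps."""
    total = 0.0
    n = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            n += 1
    return total / n


def multi_task_loss(
    preds: Dict[str, Grid],
    targets: Dict[str, Grid],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses (hypothetical choice of MSE per task)."""
    return sum(w * mse(preds[name], targets[name]) for name, w in weights.items())


def count_from_density(density: Grid) -> float:
    """Assumed read-out: the crowd count is the sum over a density map."""
    return sum(sum(row) for row in density)


# Toy example with two hypothetical sub-tasks sharing one training objective.
preds = {"density": [[0.0, 0.0]], "head": [[2.0]]}
targets = {"density": [[1.0, 1.0]], "head": [[0.0]]}
weights = {"density": 1.0, "head": 0.5}
loss = multi_task_loss(preds, targets, weights)
```

In a real system the maps would come from a shared convolutional backbone with one output branch per sub-task; the sketch only shows how separate sub-task errors can be combined into a single objective.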