Department of Civil Engineering, The University of Tokyo, 4-6-1 Komaba, Meguro, Tokyo 1538505, Japan.
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro, Tokyo 1538505, Japan.
Sensors (Basel). 2020 Aug 27;20(17):4855. doi: 10.3390/s20174855.
Crowd counting neural networks have improved continually in recent years, and their accuracy has reached such a high level that further improvement is becoming very difficult. However, this high accuracy comes without deeper semantic information, such as social roles (e.g., student, company worker, or police officer) or location-based roles (e.g., pedestrian, tenant, or construction worker). Some of these roles can be learned from the same set of features that identifies an entity as human, whereas others require wider contextual information from the person's surroundings. The primary end goal of developing recognition software is to integrate it into autonomous decision-making systems. It must therefore be foolproof; that is, it must have a good semantic understanding of the input. In this study, we focus on counting pedestrians in helicopter footage and introduce a dataset created from helicopter videos for this purpose. We use semantic segmentation to extract the required additional contextual information from the surroundings of an entity, and we demonstrate that it is possible to increase pedestrian counting accuracy in this manner. Furthermore, we show that crowd counting and semantic segmentation can be achieved simultaneously, with comparable or even improved accuracy, by using the same crowd counting neural network for both tasks through hard parameter sharing. The presented method is generic and can be applied to arbitrary crowd density estimation methods. A link to the dataset is available at the end of the paper.
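To make the idea of hard parameter sharing concrete, the following is a minimal sketch and not the network used in the paper. It assumes a toy per-pixel model in plain Python: a single shared trunk feeds two task heads, one predicting a non-negative density value and one predicting class logits for segmentation. All names (`SharedTrunkModel`, `joint_loss`, `seg_weight`) and the choice of a weighted sum of mean squared error and cross-entropy are illustrative assumptions. No training loop is shown.

```python
import math
import random


def relu(values):
    return [max(0.0, x) for x in values]


def linear(weights, bias, values):
    # weights is a list of rows; each row has len(values) entries.
    return [sum(w * x for w, x in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SharedTrunkModel:
    """Hypothetical toy model: one shared trunk, two task-specific heads."""

    def __init__(self, n_features, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        # Hard parameter sharing: both tasks read the same trunk parameters.
        self.trunk = init_layer(rng, n_hidden, n_features)
        self.density_head = init_layer(rng, 1, n_hidden)
        self.seg_head = init_layer(rng, n_classes, n_hidden)

    def forward(self, pixel_features):
        hidden = relu(linear(*self.trunk, pixel_features))
        # Clamp at zero so the density value is never negative.
        density = max(0.0, linear(*self.density_head, hidden)[0])
        logits = linear(*self.seg_head, hidden)
        return density, logits


def joint_loss(model, pixels, target_density, target_labels, seg_weight=1.0):
    # Assumed objective: density MSE plus weighted segmentation cross-entropy.
    density_err = 0.0
    seg_err = 0.0
    for feats, d_true, label in zip(pixels, target_density, target_labels):
        d_pred, logits = model.forward(feats)
        density_err += (d_pred - d_true) ** 2
        seg_err += -math.log(softmax(logits)[label])
    n = len(pixels)
    return density_err / n + seg_weight * seg_err / n


def predicted_count(model, pixels):
    # In density-based counting, the count is the sum of the density map.
    return sum(model.forward(feats)[0] for feats in pixels)
```

Because both heads read the same hidden features, a gradient step on `joint_loss` would update the trunk from both tasks at once, which is the mechanism by which segmentation context can inform the count. Setting `seg_weight=0.0` reduces the objective to plain density regression.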