Xue Yao, Li Yonghui, Liu Siming, Zhang Xingjun, Qian Xueming
IEEE Trans Image Process. 2021;30:2745-2757. doi: 10.1109/TIP.2021.3049963. Epub 2021 Feb 12.
Crowd scene analysis is receiving growing attention because of its wide range of applications. Accurately locating the crowd is important for identifying high-risk regions. In this article, we propose a Compressed Sensing based Output Encoding (CSOE) scheme, which recasts the detection of small objects' pixel coordinates as a signal regression task in an encoded signal space. To prevent vanishing gradients, we derive a sparse reconstruction backpropagation rule that adapts to different implementations of sparse reconstruction and makes the whole model end-to-end trainable. With CSOE and this backpropagation rule, the proposed method is more robust to deep model training error, which is especially harmful to crowd counting and localization. The proposed method achieves state-of-the-art performance on four mainstream datasets, with especially strong results in highly crowded scenes. A series of analyses and experiments supports our claim that, for highly crowded scenes, regression in CSOE space is better than the traditional approach of detecting small objects' coordinates in pixel space.
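The general idea behind compressed-sensing output encoding can be illustrated with a small sketch. The abstract does not specify the sensing matrix, the reconstruction algorithm, or the network used in the paper, so everything below is an assumption for illustration only: a random Gaussian sensing matrix `A`, a toy 16×16 location map, and orthogonal matching pursuit (OMP) as the sparse reconstruction step. In a CSOE-style model a network would regress the encoded signal `y` from the image; here `y` is computed directly from the ground-truth coordinates.

```python
import numpy as np

def encode(coords, shape, A):
    """Flatten a binary location map and project it with the sensing matrix A."""
    x = np.zeros(shape[0] * shape[1])
    for r, c in coords:
        x[r * shape[1] + c] = 1.0
    return A @ x

def omp_decode(y, A, k):
    """Orthogonal matching pursuit: greedily select k columns of A that explain y."""
    residual = y.copy()
    support = []
    for _ in range(k):
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = -np.inf  # never select the same column twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of y on the columns selected so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    return sorted(support)

# Illustrative sizes (assumptions, not taken from the paper).
shape = (16, 16)          # 256 pixel positions
m = 96                    # length of the encoded signal
rng = np.random.default_rng(0)
A = rng.standard_normal((m, shape[0] * shape[1])) / np.sqrt(m)

coords = [(1, 2), (4, 7), (8, 3)]      # sparse object locations (row, col)
y = encode(coords, shape, A)           # regression target in the encoded space
indices = omp_decode(y, A, k=len(coords))
decoded = [divmod(i, shape[1]) for i in indices]
print(decoded)
```

Because the location map is sparse, the 96-dimensional signal carries enough information to recover the coordinates out of 256 positions. The sketch covers only encoding and decoding; it does not implement the backpropagation rule through sparse reconstruction that the abstract describes.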