Image Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya (UPC), 08034 Barcelona, Spain.
Sensors (Basel). 2022 Apr 21;22(9):3171. doi: 10.3390/s22093171.
Foreground object segmentation is a crucial first step for surveillance systems based on networks of video sensors. This problem in the context of dynamic scenes has been widely explored over the last two decades, but it still poses open research questions due to challenges such as strong shadows, background clutter and illumination changes. After years of solid work based on statistical background pixel modeling, most current proposals use convolutional neural networks (CNNs) either to model the background or to make the foreground/background decision. Although these new techniques achieve outstanding results, they usually require specific training for each scene, which is unfeasible if we aim at designing software for embedded video systems and smart cameras. Our approach to the problem requires no context- or scene-specific training, and thus no manual labeling. We propose a network that performs a refinement step on top of conventional state-of-the-art background subtraction systems. Because a statistical technique produces the rough mask, we do not need to train the network for each scene. The proposed method can take advantage of the specificity of the classic techniques, while obtaining the highly accurate segmentation that a deep learning system provides. We also show the advantage of using an adversarial network to improve the generalization ability of the network and produce more consistent results than an equivalent non-adversarial network. The reported results were obtained by training the network on a common database, without fine-tuning for specific scenes. Experiments on the unseen part of the CDNet database yielded an F-score of 0.82, and 0.87 was achieved on the LASIESTA database, which is unrelated to the training one. On this latter database, the results outperformed those available in the official table by 8.75%.
The results achieved for CDNet are well above those of methods not based on CNNs and, according to the literature, among the best for context-unsupervised CNN systems.
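The pipeline described in the abstract can be illustrated with a minimal sketch of its first, statistical stage: a per-pixel Gaussian background model produces a rough foreground mask, which would then be passed (together with the frame) to the scene-agnostic refinement CNN. This is a hypothetical simplification for illustration only; the function names, the running-average update rule, and the deviation threshold `k` are assumptions, not the paper's actual method.

```python
import numpy as np

def rough_mask(frame, bg_mean, bg_var, k=2.5):
    # Simplified statistical stage: a pixel is marked foreground when it
    # deviates more than k standard deviations from the background mean.
    return np.abs(frame - bg_mean) > k * np.sqrt(bg_var)

def update_background(frame, bg_mean, bg_var, alpha=0.05):
    # Exponential running update of the per-pixel background statistics
    # (a common, generic scheme; not necessarily the one used in the paper).
    diff = frame - bg_mean
    new_mean = bg_mean + alpha * diff
    new_var = (1 - alpha) * (bg_var + alpha * diff ** 2)
    return new_mean, new_var

# Toy usage: noisy static background with one bright moving block.
rng = np.random.default_rng(0)
bg = 50 + rng.normal(0, 2, size=(64, 64))
bg_mean, bg_var = bg.copy(), np.full((64, 64), 4.0)

frame = bg + rng.normal(0, 2, size=(64, 64))
frame[20:30, 20:30] += 100  # synthetic foreground object

mask = rough_mask(frame, bg_mean, bg_var)
bg_mean, bg_var = update_background(frame, bg_mean, bg_var)
# `mask` is the rough segmentation that the refinement network would clean up.
```

In the full system described above, this rough mask replaces per-scene supervision: the refinement network only ever learns to correct the statistical stage's errors, which is why it transfers to unseen scenes such as those in LASIESTA.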