Department of Computer Science and Engineering (CSE), University of Ioannina, 45110 Ioannina, Greece.
Institute for Language and Speech Processing (ILSP), Athena Research and Innovation Center, 15125 Athens, Greece.
Sensors (Basel). 2023 Nov 29;23(23):9510. doi: 10.3390/s23239510.
Visual tracking and the estimation of attributes such as age or gender for multiple people in a scene have become mature research topics with the advent of deep learning techniques. However, for indoor footage such as video sequences of retail consumers, the available data are not always sufficient or accurate enough to train effective models for consumer detection and tracking under various adverse conditions. This in turn degrades age and gender recognition for the detected instances. In this work, we introduce two novel datasets: the first comprises 145 video sequences that comply with personal information regulations as far as facial images are concerned, and the second is a set of cropped body images from each sequence that can be used for numerous computer vision tasks. We also propose an end-to-end framework that comprises CNNs as object detectors, LSTMs for motion forecasting in the tracklet association component, and a multi-attribute classification model for apparent demographic estimation of the detected outputs, aiming to capture useful metadata on consumer product preferences. The results obtained on tracking and age/gender prediction are promising with respect to reference systems and indicate the proposed model's potential for practical consumer metadata extraction.
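To make the tracklet association step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each tracklet is a list of past bounding boxes, replaces the LSTM forecaster with a simple constant-velocity extrapolation, and matches forecasts to new detections greedily by intersection over union (IoU). The names `iou`, `forecast_next`, `associate`, and the `min_iou` threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def forecast_next(history: List[Box]) -> Box:
    """Constant-velocity stand-in (assumption) for a learned motion forecaster."""
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    # Extrapolate each coordinate by the last observed displacement.
    return tuple(l + (l - p) for p, l in zip(prev, last))


def associate(
    tracklets: Dict[int, List[Box]],
    detections: List[Box],
    min_iou: float = 0.3,  # hypothetical threshold
) -> Dict[int, int]:
    """Greedily match each tracklet's forecast box to at most one detection."""
    pairs = []
    for tid, history in tracklets.items():
        predicted = forecast_next(history)
        for j, det in enumerate(detections):
            score = iou(predicted, det)
            if score >= min_iou:
                pairs.append((score, tid, j))
    pairs.sort(reverse=True)  # highest-overlap candidates first

    matches: Dict[int, int] = {}
    used = set()
    for _, tid, j in pairs:
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches


tracklets = {1: [(0, 0, 10, 10), (2, 0, 12, 10)], 2: [(50, 50, 60, 60)]}
detections = [(50, 50, 60, 60), (4, 0, 14, 10)]
matches = associate(tracklets, detections)  # tracklet id -> detection index
```

In a full system the forecast would come from the trained motion model, and the greedy matching could be replaced by an optimal assignment; the sketch only illustrates how forecasts and detections are tied together frame by frame.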