
Body and Hand-Object ROI-Based Behavior Recognition Using Deep Learning.

Affiliations

Interdisciplinary Program in IT-Bio Convergence System, Department of Electronics Engineering, Chosun University, Gwangju 61452, Korea.

Intelligent Robotics Research Division, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea.

Publication Information

Sensors (Basel). 2021 Mar 6;21(5):1838. doi: 10.3390/s21051838.

Abstract

Behavior recognition has applications in automatic crime monitoring, automatic sports video analysis, and context awareness for so-called silver robots. In this study, we employ deep learning to recognize behavior based on body and hand-object interaction regions of interest (ROIs). We propose an ROI-based four-stream ensemble convolutional neural network (CNN). Behavior recognition data consist mainly of images and skeletons. The first stream converts the 3D skeleton sequence into pose evolution images (PEIs) and feeds them to a pre-trained 2D-CNN. The second stream inputs the RGB video into a 3D-CNN to extract temporal and spatial features. The most important information in behavior recognition is the person performing the action; if the network is trained with ambient noise removed and the ROI placed on that person, feature analysis can focus on the behavior itself rather than on the entire frame. Accordingly, the third stream inputs the RGB video restricted to the body ROI into a 3D-CNN, and the fourth stream inputs the RGB video restricted to the hand-object interaction ROIs into a 3D-CNN. Finally, because combining the information of models trained with attention to these ROIs is expected to improve performance, the four stream scores are combined by late fusion. The Electronics and Telecommunications Research Institute (ETRI)-Activity3D dataset was used for the experiments. This dataset contains color images, skeleton data, and depth images of 55 daily behaviors performed by 50 elderly and 50 young individuals. The experimental results showed that the proposed model improved recognition accuracy by at least 4.27% and up to 20.97% compared with other behavior recognition methods.
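The late-fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax normalization, equal fusion weights, and the 55-class score vectors are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of class scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion(stream_scores, weights=None):
    """Fuse per-stream class scores by weighted averaging of probabilities.

    stream_scores: list of raw class-score vectors, one per stream.
    weights: optional per-stream weights (hypothetical; the abstract does
             not specify the fusion weights, so equal weights are assumed).
    Returns the index of the predicted class.
    """
    probs = np.stack([softmax(s) for s in stream_scores])
    if weights is None:
        weights = np.ones(len(stream_scores)) / len(stream_scores)
    fused = np.average(probs, axis=0, weights=weights)
    return int(np.argmax(fused))

# Four streams (PEI 2D-CNN, full-frame 3D-CNN, body-ROI 3D-CNN,
# hand-object-ROI 3D-CNN), each producing scores over the 55 classes
# of the ETRI-Activity3D dataset. Random scores stand in for real outputs.
rng = np.random.default_rng(0)
scores = [rng.normal(size=55) for _ in range(4)]
predicted_class = late_fusion(scores)
```

Fusing normalized probabilities rather than raw logits keeps streams with different score scales from dominating the ensemble, which is the usual motivation for score-level late fusion.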

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ec9/7961580/7bdadfe4afd9/sensors-21-01838-g001.jpg
