IEEE Trans Cybern. 2013 Oct;43(5):1383-94. doi: 10.1109/TCYB.2013.2276433. Epub 2013 Aug 27.
Recognizing complex human activities usually requires detecting and modeling individual visual features and the interactions between them. Current methods rely only on visual features extracted from 2-D images, and therefore often suffer from unreliable salient visual feature detection and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from a conventional camera and a depth sensor (e.g., Microsoft Kinect). We propose a novel complex activity recognition and localization framework that effectively fuses information from both grayscale and depth image channels at multiple levels of the video processing pipeline. At the individual visual feature detection level, depth-based filters are applied to the detected human/object rectangles to remove false detections. At the next level, interaction modeling, 3-D spatial and temporal contexts among human subjects or objects are extracted by integrating information from both grayscale and depth images. Depth information is also utilized to distinguish different types of indoor scenes. Finally, a latent structural model is developed to integrate information from multiple levels of video processing for activity detection. Extensive experiments on two activity recognition benchmarks (one with depth information) and a challenging grayscale + depth human activity database containing complex human-human, human-object, and human-surroundings interactions demonstrate the effectiveness of the proposed multilevel grayscale + depth fusion scheme. Higher recognition and localization accuracies are obtained relative to previous methods.