Zhang Chun-Yang, Xiao Yong-Yi, Lin Jin-Cheng, Chen C L Philip, Liu Wenxi, Tong Yu-Hong
IEEE Trans Cybern. 2022 Jan;52(1):398-410. doi: 10.1109/TCYB.2020.2973300. Epub 2022 Jan 11.
Data representation learning is one of the most important problems in machine learning. Unsupervised representation learning is particularly valuable because it requires no label information for the observed data. Because training deep-learning models is highly time-consuming, many machine-learning systems directly adopt well-trained deep models, obtained in a supervised, end-to-end manner, as feature extractors for distinct problems. However, different machine-learning tasks clearly require different representations of the original input data. Taking human action recognition as an example, human actions in a video sequence are 3-D signals containing both the visual appearance and the motion dynamics of humans and objects. Data representation approaches that can capture both spatial and temporal correlations in videos are therefore desirable. Most existing human motion recognition models build classifiers on deep-learning structures such as deep convolutional networks. These models require a large quantity of annotated training videos. Moreover, such supervised models cannot recognize samples from a different dataset without retraining. In this article, we propose a new 3-D deconvolutional network (3DDN) for representation learning of high-dimensional video data, in which the high-level features are obtained through optimization. The proposed 3DDN decomposes video frames into spatiotemporal features under a sparsity constraint in an unsupervised way. It can also serve as a building block for deep architectures through stacking. Although the resulting high-level representation of sequential input data can be used in multiple downstream machine-learning tasks, we evaluate the proposed 3DDN and its deep models on human action recognition. Experimental results on three datasets: 1) KTH; 2) HMDB-51; and 3) UCF-101, demonstrate that the proposed 3DDN is an alternative to feedforward convolutional neural networks (CNNs) that attains comparable results.
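The abstract describes inferring sparse spatiotemporal feature maps by optimization, with a 3-D deconvolution reconstructing the input clip from those features. The following is a minimal, hypothetical PyTorch sketch of that general idea, not the paper's actual method or hyperparameters; the clip size, number of dictionary atoms, kernel shape, step size, sparsity weight, and iteration count are all illustrative assumptions.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Toy grayscale clip: (batch, channels, frames, height, width).
    # Real inputs would be video volumes; this random tensor is a stand-in.
    x = torch.randn(1, 1, 16, 32, 32)

    # Decoder dictionary: 8 spatiotemporal atoms, in the layout
    # (feature_channels, input_channels, kT, kH, kW) that
    # conv_transpose3d expects. In the paper these filters would be
    # learned; here they are fixed random atoms for illustration.
    weight = 0.1 * torch.randn(8, 1, 3, 5, 5)

    # The feature maps are the unknowns: inferred per clip by optimization,
    # matching the abstract's "features obtained through optimization".
    z = torch.zeros(1, 8, 16, 32, 32, requires_grad=True)

    lam = 0.1                              # L1 sparsity weight (assumed)
    opt = torch.optim.SGD([z], lr=0.05)    # step size (assumed)

    for step in range(200):
        opt.zero_grad()
        # Reconstruct the clip from the current sparse feature maps;
        # padding keeps the output the same size as the input.
        x_hat = F.conv_transpose3d(z, weight, padding=(1, 2, 2))
        # Reconstruction error plus the sparse penalty on the features.
        loss = 0.5 * (x_hat - x).pow(2).sum() + lam * z.abs().sum()
        loss.backward()
        opt.step()

    # z now holds sparse spatiotemporal features for this clip; feeding z
    # into another such layer is one way to stack toward a deep model.

The design point the sketch illustrates is that, unlike a feedforward CNN, the features are not produced by a single forward pass: each clip's representation is the solution of a sparse reconstruction problem, which is what makes the learning unsupervised.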