Ullah Hayat, Munir Arslan
Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA.
J Imaging. 2023 Apr 14;9(4):82. doi: 10.3390/jimaging9040082.
Human action recognition has been actively explored over the past two decades to advance the video analytics domain. Numerous studies have investigated the complex sequential patterns of human actions in video streams. In this paper, we propose a knowledge distillation framework that distills spatio-temporal knowledge from a large teacher model to a lightweight student model using an offline knowledge distillation technique. The proposed framework takes two models: a large pre-trained 3DCNN (three-dimensional convolutional neural network) teacher model and a lightweight 3DCNN student model. The teacher model is pre-trained on the same dataset on which the student model is to be trained. During offline knowledge distillation training, the distillation algorithm trains only the student model, with the aim of bringing it to the same level of prediction accuracy as the teacher model. To evaluate the performance of the proposed method, we conduct extensive experiments on four benchmark human action datasets. The quantitative results confirm the efficiency and robustness of the proposed method, which improves accuracy by up to 35% over state-of-the-art human action recognition methods. Furthermore, we measure the inference time of the proposed method and compare it with that of the state-of-the-art methods. The experimental results show that the proposed method achieves up to a 50× improvement in frames per second (FPS) over the state-of-the-art methods. The short inference time and high accuracy make the proposed framework suitable for human activity recognition in real-time applications.
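The abstract does not state which distillation objective is used. As a minimal sketch, assuming the common soft-target formulation (a temperature-scaled KL divergence between teacher and student outputs, combined with a cross-entropy term on the ground-truth label), the loss for a single clip could look like the following. The function names, the temperature of 4.0, and the weight alpha of 0.5 are illustrative assumptions and are not taken from the paper. Only the student logits would receive gradients in offline distillation; the teacher logits are treated as fixed.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hypothetical soft-target distillation loss for one sample.

    temperature and alpha are assumed hyperparameters, not values
    reported in the paper.
    """
    # Soft term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard term: cross-entropy of the student against the true label.
    ce = -math.log(softmax(student_logits)[label])

    # The T^2 factor keeps the soft term's gradient scale comparable
    # across different temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]   # fixed outputs of the pre-trained teacher
    student = [0.0, 1.0, 2.0]   # outputs of the student being trained
    print(distillation_loss(student, teacher, label=0))
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted cross-entropy remains, which is why the loss pushes the student toward the teacher's predictions.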