Gao Zhen, Chen Xiaowen, Xu Jingning, Yu Rongjie, Zhang Heng, Yang Jinqiu
School of Computer Science and Technology, Tongji University, Shanghai 201804, China.
Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China.
Sensors (Basel). 2024 Dec 12;24(24):7948. doi: 10.3390/s24247948.
Drowsy driving is a leading cause of commercial vehicle traffic crashes. Current practice trains fatigue detection models with deep neural networks on driver video data, but high-level feature extraction remains coarse and incomplete, and network architectures are not well optimized. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection. By leveraging a Transformer architecture, the method extracts rich, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% accuracy improvement over the prevalent CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) end-to-end model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet pre-trained model. Moreover, compared with LSTM, the Transformer captures long-term dependencies among temporal features more flexibly, further raising the model's AUC by 4%.
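The pipeline described above (per-frame features from a frozen CLIP image encoder, followed by a Transformer over the frame sequence) can be illustrated with a minimal sketch. This is not the authors' released code; it assumes the OpenAI CLIP package (github.com/openai/CLIP) and PyTorch, and names such as FatigueTransformer, the frame count, and the two-class head are illustrative assumptions.

```python
# Hypothetical CT-Net-style sketch: CLIP frame features + Transformer temporal model.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git


class FatigueTransformer(nn.Module):
    """Temporal classifier over per-frame CLIP embeddings (illustrative, not the paper's exact head)."""

    def __init__(self, feat_dim: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.head = nn.Linear(feat_dim, 2)  # fatigued vs. alert

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, frame_feats], dim=1))
        return self.head(x[:, 0])  # logits read from the [CLS] position


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # frozen feature extractor
temporal_model = FatigueTransformer().to(device)

# frames: (batch, num_frames, 3, 224, 224), already preprocessed with `preprocess`
frames = torch.randn(2, 16, 3, 224, 224, device=device)
b, t = frames.shape[:2]
with torch.no_grad():
    feats = clip_model.encode_image(frames.flatten(0, 1)).float()  # (b*t, 512) for ViT-B/32
logits = temporal_model(feats.view(b, t, -1))
print(logits.shape)  # torch.Size([2, 2])
```

Keeping the CLIP encoder frozen and training only the Transformer head mirrors the paper's finding that CLIP's pre-trained visual features transfer better to driver faces and behaviors than ImageNet features, while the self-attention layers model long-range temporal dependencies that an LSTM handles less flexibly.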