Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection

Author Information

Gao Zhen, Chen Xiaowen, Xu Jingning, Yu Rongjie, Zhang Heng, Yang Jinqiu

Affiliations

School of Computer Science and Technology, Tongji University, Shanghai 201804, China.

Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China.

Publication Information

Sensors (Basel). 2024 Dec 12;24(24):7948. doi: 10.3390/s24247948.

Abstract

Drowsy driving is a leading cause of commercial vehicle traffic crashes. The prevailing approach trains fatigue detection models with deep neural networks on driver video data, but two challenges remain: high-level feature extraction is coarse and incomplete, and network architectures are under-optimized. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection, and by harnessing a Transformer architecture it extracts fine-grained, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% accuracy improvement over the prevalent CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) end-to-end model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further raising the AUC by 4%.
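The abstract describes a two-stage design: a CLIP vision encoder produces per-frame features, and a Transformer encoder models long-range temporal dependencies across the clip before a fatigue classifier. Below is a minimal sketch of that idea in PyTorch with Hugging Face's CLIP vision tower. The class name, dimensions, frozen backbone, positional embeddings, and mean-pooling head are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class CTNetSketch(nn.Module):
    """Illustrative CLIP-then-Transformer pipeline; NOT the authors' code."""
    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 d_model=768, n_heads=8, n_layers=4, max_frames=64):
        super().__init__()
        # Frame-level encoder: CLIP's vision tower, frozen here so the
        # language-image pre-training is preserved (the abstract finds CLIP
        # features outperform ImageNet pre-training by 7% AUC).
        self.backbone = CLIPVisionModel.from_pretrained(clip_name)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable positional embeddings over the frame axis (assumption).
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # one fatigue logit per clip

    def forward(self, frames):             # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        flat = frames.flatten(0, 1)        # (B*T, 3, 224, 224)
        with torch.no_grad():
            # One CLIP embedding per frame: (B*T, d_model)
            feats = self.backbone(pixel_values=flat).pooler_output
        seq = feats.view(b, t, -1) + self.pos[:, :t]
        seq = self.temporal(seq)           # long-range frame interactions
        return self.head(seq.mean(dim=1))  # mean-pool frames -> (B, 1)

# Usage: score a dummy batch of two 16-frame driver clips (random tensors
# stand in for preprocessed, CLIP-normalized frames).
model = CTNetSketch()
clips = torch.randn(2, 16, 3, 224, 224)
logits = model(clips)                      # (2, 1) fatigue logits
```

Self-attention in the temporal encoder lets every frame attend to every other frame directly, which is the flexibility over LSTM's sequential recurrence that the abstract credits with the further 4% AUC gain.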

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d269/11679248/a0883fe959a6/sensors-24-07948-g001.jpg
