Jiao Shiqin, Li Guoqi, Zhang Guiyang, Zhou Jiahao, Li Jihong
School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China.
Jinan Thomas School, Jinan, Shandong 250102, China.
Heliyon. 2024 Apr 16;10(8):e29596. doi: 10.1016/j.heliyon.2024.e29596. eCollection 2024 Apr 30.
Falls often pose significant safety risks to people living alone, especially the elderly, and a fast, efficient fall detection system is an effective way to address this hidden danger. We propose a multimodal method based on audio and video. While relying only on non-intrusive equipment, it reduces, to a certain extent, the false negatives that the most commonly used video-based methods face under insufficient lighting, when the subject leaves the monitored area, and in similar conditions. Methods based on audio-video fusion are therefore expected to become the best solution for fall detection in the foreseeable future. Specifically, this article outlines the following methodology: the video-based model uses YOLOv7-Pose to extract key skeleton joints, which are then fed into a two-stream Spatial Temporal Graph Convolutional Network (ST-GCN) for classification, while the audio-based model computes log-scaled mel spectrograms to capture complementary features, which are processed through the MobileNetV2 architecture for detection. The final decision-level fusion of the two results is achieved through linear weighting and Dempster-Shafer (D-S) theory. In our evaluation, the multimodal fall detection method significantly outperforms the single-modality methods; in particular, sensitivity increases from 81.67% for the video-only model to 96.67% (linear weighting) and 97.50% (D-S theory), underscoring the effectiveness of integrating video and audio data for more robust and reliable fall detection in complex and diverse daily-life environments.
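To make the decision-level fusion concrete, the following is a minimal sketch of the two fusion rules named in the abstract, applied to the fall probabilities produced by the video (ST-GCN) and audio (MobileNetV2) classifiers. The fusion weight, the fixed uncertainty mass, and the mapping from probabilities to D-S mass functions are illustrative assumptions; the paper's exact parameterization is not given in the abstract.

```python
def linear_fusion(p_video, p_audio, w_video=0.6):
    """Linearly weighted decision fusion.

    The weight w_video is an assumed value for illustration,
    not the one used in the paper.
    """
    return w_video * p_video + (1.0 - w_video) * p_audio


def to_mass(p_fall, uncertainty=0.1):
    """Convert a classifier's fall probability into a basic probability
    assignment over the frame {fall, no_fall}.

    The fixed 'uncertainty' mass on the whole frame (theta) is an
    illustrative assumption; a real system might derive it per sample.
    """
    return {
        "fall": (1.0 - uncertainty) * p_fall,
        "no_fall": (1.0 - uncertainty) * (1.0 - p_fall),
        "theta": uncertainty,
    }


def ds_fusion(m1, m2):
    """Dempster's rule of combination for two mass functions over
    the two-hypothesis frame {fall, no_fall}."""
    # Conflict K: total mass assigned to contradictory singleton pairs.
    K = m1["fall"] * m2["no_fall"] + m1["no_fall"] * m2["fall"]
    norm = 1.0 - K
    fused = {
        "fall": (m1["fall"] * m2["fall"]
                 + m1["fall"] * m2["theta"]
                 + m1["theta"] * m2["fall"]) / norm,
        "no_fall": (m1["no_fall"] * m2["no_fall"]
                    + m1["no_fall"] * m2["theta"]
                    + m1["theta"] * m2["no_fall"]) / norm,
    }
    fused["theta"] = 1.0 - fused["fall"] - fused["no_fall"]
    return fused


# Example: the video model is unsure (p=0.55) but the audio model is
# confident (p=0.90); fusion pushes the decision toward "fall".
p_video, p_audio = 0.55, 0.90
print("linear:", linear_fusion(p_video, p_audio))          # 0.69
fused = ds_fusion(to_mass(p_video), to_mass(p_audio))
print("D-S   :", fused)                                    # fall ~ 0.85
```

Note how D-S combination handles exactly the failure mode the abstract targets: when one modality is degraded (e.g., poor lighting weakens the video score), a confident audio score dominates after conflict renormalization, rather than being averaged away as in plain linear weighting.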