Suppr 超能文献

Audio-visual multi-modality driven hybrid feature learning model for crowd analysis and classification.

Author Information

Swathi H Y, Shivakumar G

Affiliations

Department of Electronics and Communication Engineering, Malnad College of Engineering, Visvesvaraya Technological University, Belagavi, India.

Department of Electronics and Communication Engineering, AMC Engineering College, Visvesvaraya Technological University, Belagavi, India.

Publication Information

Math Biosci Eng. 2023 May 25;20(7):12529-12561. doi: 10.3934/mbe.2023558.

DOI: 10.3934/mbe.2023558
PMID: 37501454
Abstract

The rapid emergence of advanced software systems, low-cost hardware and decentralized cloud-computing technologies has broadened the horizon for vision-based surveillance, monitoring and control. However, complex and inferior feature learning over visual artefacts or video streams, especially under extreme conditions, confines the majority of existing vision-based crowd analysis and classification systems. Retrieving event-sensitive or crowd-type-sensitive spatio-temporal features for different crowd types under extreme conditions is a highly complex task. Consequently, it results in lower accuracy and hence lower reliability, which limits existing methods for real-time crowd analysis. Despite numerous efforts in vision-based approaches, the lack of acoustic cues often creates ambiguity in crowd classification. On the other hand, the strategic amalgamation of audio-visual features can enable accurate and reliable crowd analysis and classification. Motivated by this, in this research a novel audio-visual multi-modality driven hybrid feature learning model is developed for crowd analysis and classification. In this work, a hybrid feature extraction model was applied to extract deep spatio-temporal features using the Gray-Level Co-occurrence Matrix (GLCM) and an AlexNet transfer-learning model. After extracting the different GLCM features and AlexNet deep features, horizontal concatenation was performed to fuse the different feature sets. Similarly, for acoustic feature extraction, the audio samples (from the input video) were processed with static (fixed-size) sampling, pre-emphasis, block framing and Hann windowing, followed by extraction of acoustic features such as GTCC, GTCC-Delta, GTCC-Delta-Delta, MFCC, Spectral Entropy, Spectral Flux, Spectral Slope and Harmonics-to-Noise Ratio (HNR).
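The audio pre-processing chain described above (pre-emphasis, block framing, Hann windowing) can be sketched in a few lines of NumPy. The frame length, hop size and pre-emphasis coefficient below are illustrative defaults, not values taken from the paper:

```python
import numpy as np

def frame_audio(x, frame_len=1024, hop=512, alpha=0.97):
    """Pre-emphasis, fixed-size block framing and Hann windowing.

    `frame_len`, `hop` and `alpha` are hypothetical choices for
    illustration; the paper fixes its own sampling parameters.
    """
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1], boosts high frequencies
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Block framing: slice the signal into overlapping fixed-size frames
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Hann windowing: taper frame edges to reduce spectral leakage
    return frames * np.hanning(frame_len)

sig = np.random.default_rng(0).standard_normal(8000)
frames = frame_audio(sig)
print(frames.shape)  # (14, 1024)
```

Features such as MFCC or GTCC would then be computed per windowed frame (e.g. via an FFT and a mel or gammatone filter bank).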
Finally, the extracted audio-visual features were fused to yield a composite multi-modal feature set, which was processed for classification using a random forest ensemble classifier. The multi-class classification yields a crowd-classification accuracy of 98.26%, precision of 98.89%, sensitivity of 94.82%, specificity of 95.57%, and an F-measure of 98.84%. The robustness of the proposed multi-modality-based crowd analysis model confirms its suitability for real-world crowd detection and classification tasks.
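The fusion step, horizontal concatenation of the per-modality feature sets into one composite vector per sample, is a single array operation. The feature dimensionalities below are made up for illustration (the record here does not state the actual sizes):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-clip feature matrices (100 samples each);
# the column counts are illustrative, not from the paper.
glcm_feats    = rng.standard_normal((100, 20))    # GLCM texture features
alexnet_feats = rng.standard_normal((100, 4096))  # AlexNet deep features
audio_feats   = rng.standard_normal((100, 60))    # GTCC/MFCC/spectral features

# Horizontal concatenation: fuse the modalities column-wise into one
# composite multi-modal feature vector per sample.
fused = np.concatenate([glcm_feats, alexnet_feats, audio_feats], axis=1)
print(fused.shape)  # (100, 4176)
```

In the paper, a matrix like `fused` is then fed to a random forest ensemble classifier; scikit-learn's `RandomForestClassifier` would be one way to reproduce that final step.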


Similar Articles

1. Audio-visual multi-modality driven hybrid feature learning model for crowd analysis and classification.
Math Biosci Eng. 2023 May 25;20(7):12529-12561. doi: 10.3934/mbe.2023558.
2. On effective cognitive state classification using novel feature extraction strategies.
Cogn Neurodyn. 2021 Dec;15(6):1125-1155. doi: 10.1007/s11571-021-09688-9. Epub 2021 Jun 22.
3. Hybrid and Deep Learning Approach for Early Diagnosis of Lower Gastrointestinal Diseases.
Sensors (Basel). 2022 May 27;22(11):4079. doi: 10.3390/s22114079.
4. Multi-Person Tracking and Crowd Behavior Detection via Particles Gradient Motion Descriptor and Improved Entropy Classifier.
Entropy (Basel). 2021 May 18;23(5):628. doi: 10.3390/e23050628.
5. Multilevel hybrid handcrafted feature extraction based depression recognition method using speech.
J Affect Disord. 2024 Nov 1;364:9-19. doi: 10.1016/j.jad.2024.08.002. Epub 2024 Aug 9.
6. Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition.
Sensors (Basel). 2021 Aug 12;21(16):5452. doi: 10.3390/s21165452.
7. DCNN for Pig Vocalization and Non-Vocalization Classification: Evaluate Model Robustness with New Data.
Animals (Basel). 2024 Jul 9;14(14):2029. doi: 10.3390/ani14142029.
8. High-Level CNN and Machine Learning Methods for Speaker Recognition.
Sensors (Basel). 2023 Mar 25;23(7):3461. doi: 10.3390/s23073461.
9. Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition.
Entropy (Basel). 2021 Nov 12;23(11):1502. doi: 10.3390/e23111502.
10. Deep feature classification of angiomyolipoma without visible fat and renal cell carcinoma in abdominal contrast-enhanced CT images with texture image patches and hand-crafted feature concatenation.
Med Phys. 2018 Apr;45(4):1550-1561. doi: 10.1002/mp.12828. Epub 2018 Mar 25.