Doungpaisan Pafan, Khunarsa Peerapol
Faculty of Industrial Technology and Management, King Mongkut's University of Technology North Bangkok, Bangkok 10800, Thailand.
Faculty of Science and Technology, Uttaradit Rajabhat University, Uttaradit 53000, Thailand.
J Imaging. 2025 Aug 21;11(8):281. doi: 10.3390/jimaging11080281.
Gunshot sound classification plays a crucial role in public safety, forensic investigations, and intelligent surveillance systems. This study evaluates the performance of deep learning models in classifying firearm sounds by analyzing twelve time-frequency spectrogram representations, including Mel, Bark, MFCC, CQT, Cochleagram, STFT, FFT, Reassigned, Chroma, Spectral Contrast, and Wavelet. The dataset consists of 2148 gunshot recordings from four firearm types, collected in a semi-controlled outdoor environment under multi-orientation conditions. To leverage advanced computer vision techniques, all spectrograms were converted into RGB images using perceptually informed colormaps. This enabled the application of image processing approaches and the fine-tuning of pre-trained Convolutional Neural Networks (CNNs) originally developed for natural image classification. Six CNN architectures (ResNet18, ResNet50, ResNet101, GoogLeNet, Inception-v3, and InceptionResNetV2) were trained on these spectrogram images. Experimental results indicate that CQT, Cochleagram, and Mel spectrograms consistently achieved high classification accuracy, exceeding 94% when paired with deep CNNs such as ResNet101 and InceptionResNetV2. These findings demonstrate that transforming time-frequency features into RGB images not only facilitates the use of image-based processing but also allows deep models to capture rich spectral-temporal patterns, providing a robust framework for accurate firearm sound classification.
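The spectrogram-to-RGB conversion described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses a synthetic impulsive signal as a stand-in for a gunshot recording, an STFT spectrogram (one of the twelve representations studied) computed with SciPy, and matplotlib's perceptually uniform viridis colormap; all library and parameter choices are assumptions.

```python
import numpy as np
from scipy import signal
from matplotlib import cm

# Synthetic exponentially decaying noise burst as a placeholder for a
# gunshot recording (assumption; the study used real field recordings).
fs = 22050  # sample rate in Hz
t = np.arange(fs) / fs
x = np.exp(-40 * t) * np.random.default_rng(0).standard_normal(fs)

# STFT magnitude spectrogram, converted to decibels.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=384)
S_db = 10 * np.log10(Sxx + 1e-10)

# Normalize to [0, 1] and apply a perceptually informed colormap,
# yielding an RGB image suitable as input to a pre-trained CNN.
S_norm = (S_db - S_db.min()) / (S_db.max() - S_db.min())
rgb = cm.viridis(S_norm)[..., :3]  # drop the alpha channel

print(rgb.shape)  # (freq_bins, time_frames, 3)
```

In practice the RGB array would then be resized to the input resolution expected by the chosen CNN (e.g. 224x224 for the ResNet family) before fine-tuning.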