Antony Anaswara, Theimer Wolfgang, Grossetti Giovanni, Friedrich Christoph M
Department of Computer Science, University of Applied Sciences and Arts (FH Dortmund), 44227 Dortmund, Germany.
Volkswagen Infotainment GmbH, 44803 Bochum, Germany.
Sensors (Basel). 2025 Apr 19;25(8):2591. doi: 10.3390/s25082591.
Autonomous driving technologies for environmental perception are mostly based on visual cues obtained from sensors like cameras, RADAR, or LiDAR. They capture the environment as if seen through "human eyes". If this visual information is complemented with auditory information, thereby also providing "ears", driverless cars can become more reliable and safer. In this paper, an Acoustic Event Detection model is presented that can detect various acoustic events in an automotive context along with their time of occurrence to create an audio scene description. The proposed detection methodology uses the pre-trained network Bidirectional Encoder representation from Audio Transformers (BEATs) and a single-layer neural network trained on a database of real audio recordings collected from different cars. The performance of the model is evaluated for different parameters and datasets. The segment-based results for a segment duration of 1 s show that the model performs well for 11 sound classes, with a mean accuracy of 0.93 and an F1-Score of 0.39 at a confidence threshold of 0.5. The threshold-independent metric mean Average Precision (mAP) has a value of 0.77. The model also performs well for sound mixtures containing two overlapping events, with mean accuracy, F1-Score, and mAP equal to 0.89, 0.42, and 0.658, respectively.
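The pipeline described in the abstract (a frozen pre-trained audio encoder followed by a single-layer classifier, with per-segment confidence scores thresholded at 0.5) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the random-weight `encode` function stands in for the pretrained BEATs encoder, the 128-dimensional input features, 768-dimensional embedding size, and all weight values are assumptions, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 11   # the paper evaluates 11 sound classes
EMBED_DIM = 768    # assumed embedding size for the encoder output
FEAT_DIM = 128     # hypothetical per-segment input feature size

# Stand-in for the frozen pretrained encoder (BEATs in the paper):
# maps each 1 s segment's features to an embedding vector.
W_enc = rng.standard_normal((FEAT_DIM, EMBED_DIM)) * 0.01

def encode(segments: np.ndarray) -> np.ndarray:
    return np.maximum(segments @ W_enc, 0.0)  # simple ReLU projection

# Single-layer classification head, as in the paper's setup.
W_head = rng.standard_normal((EMBED_DIM, NUM_CLASSES)) * 0.01
b_head = np.zeros(NUM_CLASSES)

def detect(segments: np.ndarray, threshold: float = 0.5):
    """Return per-class confidence scores and thresholded detections.

    A sigmoid (not softmax) is used because events may overlap,
    so detection is a multi-label problem per segment.
    """
    logits = encode(segments) @ W_head + b_head
    scores = 1.0 / (1.0 + np.exp(-logits))
    return scores, scores > threshold

# Four 1 s segments of dummy features -> (4, 11) score matrix.
scores, detections = detect(rng.standard_normal((4, FEAT_DIM)))
print(scores.shape)  # (4, 11)
```

The 0.5 threshold reproduces the confidence threshold used for the segment-based accuracy and F1 results; the raw `scores` matrix is what a threshold-independent metric such as mAP would be computed from.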