Saradopoulos Ioannis, Potamitis Ilyas, Ntalampiras Stavros, Rigakis Iraklis, Manifavas Charalampos, Konstantaras Antonios
Department of Electronics, Hellenic Mediterranean University, 73133 Chania, Crete, Greece.
Department of Music Technology & Acoustics, Hellenic Mediterranean University, 74133 Rethymno, Crete, Greece.
Sensors (Basel). 2025 Apr 20;25(8):2597. doi: 10.3390/s25082597.
We present a system that integrates diverse technologies to achieve real-time, distributed audio surveillance. The system employs a network of microphones mounted on ESP32 platforms, which transmit compressed audio chunks via the MQTT protocol to Raspberry Pi 5 devices for acoustic classification. These devices host an audio transformer model trained on the AudioSet dataset, enabling the real-time classification and timestamping of audio events with high accuracy. The output of the transformer is stored in a database of events and is subsequently converted into JSON format. The latter is further parsed into a graph structure that encapsulates the annotated soundscape, providing a rich and dynamic representation of audio environments. These graphs are then traversed and analyzed using dedicated Python code and large language models (LLMs), enabling the system to answer complex queries about the nature, relationships, and context of detected audio events. We introduce a novel graph parsing method that achieves low false-alarm rates. In the task of analyzing the audio of a 1 h 40 min movie featuring hazardous driving practices, our approach achieved an accuracy of 0.882, a precision of 0.8, a recall of 1.0, and an F1 score of 0.89. By combining the robustness of distributed sensing with the precision of transformer-based audio classification, our approach, which treats audio as text, paves the way for advanced applications in acoustic surveillance, environmental monitoring, and beyond.
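The JSON-to-graph step described above can be illustrated with a minimal sketch. The event schema (`label`, `start`, `end`, `score`), the `max_gap` parameter, and the rule that links temporally overlapping or near-adjacent events are assumptions for illustration, not the paper's actual parsing method.

```python
import json
from collections import defaultdict

# Hypothetical timestamped events, as a transformer classifier might emit them.
events_json = json.dumps([
    {"label": "car_engine",  "start": 0.0, "end": 4.2, "score": 0.91},
    {"label": "tire_squeal", "start": 3.8, "end": 5.1, "score": 0.87},
    {"label": "horn",        "start": 5.0, "end": 5.6, "score": 0.95},
])

def events_to_graph(raw, max_gap=1.0):
    """Parse JSON event records into a directed graph: nodes are event
    labels, and an edge links event A to event B when B overlaps A or
    starts within `max_gap` seconds after A ends."""
    events = sorted(json.loads(raw), key=lambda e: e["start"])
    edges = defaultdict(list)
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            if b["start"] - a["end"] <= max_gap:
                edges[a["label"]].append(b["label"])
    return dict(edges)

graph = events_to_graph(events_json)
print(graph)  # {'car_engine': ['tire_squeal', 'horn'], 'tire_squeal': ['horn']}
```

A downstream traversal (in Python or via an LLM prompt over a serialized graph) can then answer contextual queries, e.g. whether an engine sound was followed by tire squeal and a horn within a short window.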