EnViTSA: Ensemble of Vision Transformer with SpecAugment for Acoustic Event Classification.

Authors

Lim Kian Ming, Lee Chin Poo, Lee Zhi Yang, Alqahtani Ali

Affiliations

Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia.

DZH International Sdn. Bhd., Kuala Lumpur 55100, Malaysia.

Publication information

Sensors (Basel). 2023 Nov 10;23(22):9084. doi: 10.3390/s23229084.

DOI:10.3390/s23229084
PMID:38005472
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10674441/
Abstract

Recent successes in deep learning have inspired researchers to apply deep neural networks to Acoustic Event Classification (AEC). While deep learning methods can train effective AEC models, they are susceptible to overfitting due to the models' high complexity. In this paper, we introduce EnViTSA, an innovative approach that tackles key challenges in AEC. EnViTSA combines an ensemble of Vision Transformers with SpecAugment, a novel data augmentation technique, to significantly enhance AEC performance. Raw acoustic signals are transformed into Log Mel-spectrograms using Short-Time Fourier Transform, resulting in a fixed-size spectrogram representation. To address data scarcity and overfitting issues, we employ SpecAugment to generate additional training samples through time masking and frequency masking. The core of EnViTSA resides in its ensemble of pre-trained Vision Transformers, harnessing the unique strengths of the Vision Transformer architecture. This ensemble approach not only reduces inductive biases but also effectively mitigates overfitting. In this study, we evaluate the EnViTSA method on three benchmark datasets: ESC-10, ESC-50, and UrbanSound8K. The experimental results underscore the efficacy of our approach, achieving impressive accuracy scores of 93.50%, 85.85%, and 83.20% on ESC-10, ESC-50, and UrbanSound8K, respectively. EnViTSA represents a substantial advancement in AEC, demonstrating the potential of Vision Transformers and SpecAugment in the acoustic domain.
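
The two mechanisms named in the abstract, SpecAugment-style masking and ensembling of several classifiers, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names `spec_augment` and `soft_vote`, the mask-width parameters, the choice to fill masked regions with 0.0, and the use of probability averaging to combine models are all assumptions made for illustration; the abstract does not state how the ensemble combines predictions or what mask sizes were used.

```python
import random


def spec_augment(spec, max_freq_mask=8, max_time_mask=16, rng=None):
    """Return a masked copy of `spec`, a list of mel rows (each a list of frames).

    One band of consecutive mel channels and one span of consecutive frames
    are set to 0.0. Mask widths and the fill value are illustrative assumptions.
    """
    rng = rng or random.Random()
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency masking: zero f consecutive mel channels starting at f0.
    f = rng.randint(0, min(max_freq_mask, n_mels))
    f0 = rng.randint(0, n_mels - f)
    for m in range(f0, f0 + f):
        out[m] = [0.0] * n_frames

    # Time masking: zero t consecutive frames starting at t0 in every row.
    t = rng.randint(0, min(max_time_mask, n_frames))
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t
    return out


def soft_vote(prob_lists):
    """Average per-class probabilities from several models (assumed rule).

    Returns (index of the highest averaged probability, averaged probabilities).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg
```

In use, each training spectrogram would be passed through `spec_augment` one or more times to produce extra masked samples, and at inference the class probabilities from each Vision Transformer would be passed to `soft_vote` as one list per model.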


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/31047c3ca1a4/sensors-23-09084-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/59ad7dcec056/sensors-23-09084-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/edf8004b037e/sensors-23-09084-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/2c9939a3e3d4/sensors-23-09084-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/eb0d7818362e/sensors-23-09084-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b1d6/10674441/09c192bc318b/sensors-23-09084-g006.jpg

Similar articles

1. EnViTSA: Ensemble of Vision Transformer with SpecAugment for Acoustic Event Classification.
   Sensors (Basel). 2023 Nov 10;23(22):9084. doi: 10.3390/s23229084.
2. MetaV: A Pioneer in feature Augmented Meta-Learning Based Vision Transformer for Medical Image Classification.
   Interdiscip Sci. 2024 Jun;16(2):469-488. doi: 10.1007/s12539-024-00630-1. Epub 2024 Jun 29.
3. An Incremental Class-Learning Approach with Acoustic Novelty Detection for Acoustic Event Recognition.
   Sensors (Basel). 2021 Oct 5;21(19):6622. doi: 10.3390/s21196622.
4. MBT: Model-Based Transformer for retinal optical coherence tomography image and video multi-classification.
   Int J Med Inform. 2023 Oct;178:105178. doi: 10.1016/j.ijmedinf.2023.105178. Epub 2023 Aug 21.
5. Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator.
   Sensors (Basel). 2023 Mar 23;23(7):3398. doi: 10.3390/s23073398.
6. Power Transformer Voltages Classification with Acoustic Signal in Various Noisy Environments.
   Sensors (Basel). 2022 Feb 7;22(3):1248. doi: 10.3390/s22031248.
7. Ensemble of vision transformer architectures for efficient Alzheimer's Disease classification.
   Brain Inform. 2024 Oct 3;11(1):25. doi: 10.1186/s40708-024-00238-7.
8. A CNN Sound Classification Mechanism Using Data Augmentation.
   Sensors (Basel). 2023 Aug 5;23(15):6972. doi: 10.3390/s23156972.
9. Self-supervised learning-based underwater acoustical signal classification via mask modeling.
   J Acoust Soc Am. 2023 Jul 1;154(1):5-15. doi: 10.1121/10.0019937.
10. Transformers for Urban Sound Classification-A Comprehensive Performance Evaluation.
    Sensors (Basel). 2022 Nov 16;22(22):8874. doi: 10.3390/s22228874.

Cited by

1. Enhancing Situational Awareness with VAS-Compass Net for the Recognition of Directional Vehicle Alert Sounds.
   Sensors (Basel). 2024 Oct 24;24(21):6841. doi: 10.3390/s24216841.
