

TS-Resformer: a model based on multimodal fusion for the classification of music signals.

Author

Zhang Yilin

Affiliation

Dalian University of Foreign Languages, International Art College, Dalian, China.

Publication

Front Neurorobot. 2025 May 13;19:1568811. doi: 10.3389/fnbot.2025.1568811. eCollection 2025.

DOI: 10.3389/fnbot.2025.1568811
PMID: 40433555
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12106318/
Abstract

The number of musical works in different genres grows year by year, and manual classification is costly: it requires music-domain professionals to hand-design features, some of which do not generalize across genres. Deep learning has produced a large body of results in music classification, but existing methods still suffer from insufficient extraction of music feature information, low genre-classification accuracy, loss of time-series information, and slow training. To address the effect of differing music durations on genre-classification accuracy, we form Log Mel spectrograms from music audio cut to different durations. After discarding incomplete audio, we design data augmentation with different slice durations and verify its effect on accuracy and training time through comparison experiments. On this basis, the audio signal is framed, windowed, and short-time Fourier transformed, and the Log Mel spectrogram is then obtained by applying a Mel filter bank and logarithmic compression. To address the loss of temporal information, insufficient feature extraction, and low classification accuracy in music genre classification, we first propose a Res-Transformer model that fuses a residual network with Transformer encoder layers. The model consists of two branches: the left branch is an improved residual network that strengthens spectral feature extraction and network expressiveness while reducing dimensionality; the right branch uses four Transformer encoder layers to extract the time-series information of the Log Mel spectrogram. The output vectors of the two branches are concatenated and fed into a classifier to perform music genre classification.
Then, to further improve classification accuracy, we propose the TS-Resformer model, which builds on Res-Transformer by combining different attention mechanisms: we design a time-frequency attention mechanism that uses filters at different scales to fully extract low-level music features along the time and frequency dimensions, which serve as the inputs to the attention mechanism. Finally, experiments show that this method reaches 90.23% accuracy on the FMA-small dataset, an improvement in classification accuracy over classical models.
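The front-end pipeline described above (framing, windowing, short-time Fourier transform, Mel filtering, logarithmic compression) can be sketched in plain NumPy. This is a minimal illustrative sketch: the function name, default parameters (`n_fft`, `hop`, `n_mels`), and the triangular filterbank construction are assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=512, n_mels=64):
    """Log Mel spectrogram: frame -> window -> STFT -> Mel filter -> log."""
    # framing and windowing, then a short-time Fourier transform per frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2 + 1)

    # triangular Mel filterbank, equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope

    # Mel filtering followed by logarithmic compression
    return np.log(power @ fb.T + 1e-10).T              # (n_mels, frames)
```

One second of audio at 22.05 kHz with these defaults yields a 64-band spectrogram of 42 frames, which the two model branches then consume as an image and as a frame sequence, respectively.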
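The two-branch fusion described in the abstract can be sketched in PyTorch. Everything here is an illustrative stand-in under stated assumptions, not a reproduction of the paper's architecture: the channel counts, the plain residual block (in place of the paper's improved residual network), the pooling scheme, and the choice of 8 output classes (the number of genres in FMA-small) are all assumptions; the time-frequency attention mechanism of TS-Resformer is omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block standing in for the paper's improved ResNet stage."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.c2(torch.relu(self.c1(x))))

class ResTransformer(nn.Module):
    def __init__(self, n_mels=128, n_classes=8, d_model=128):
        super().__init__()
        # left branch: residual CNN over the (mel, time) spectrogram "image",
        # ending in global pooling as the dimensionality reduction step
        self.left = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            ResBlock(16),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # right branch: four Transformer encoder layers over the time frames
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.right = nn.TransformerEncoder(layer, num_layers=4)
        # classifier on the concatenated outputs of the two branches
        self.head = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                     # mel: (B, n_mels, T)
        left = self.left(mel.unsqueeze(1)).flatten(1)               # (B, 32)
        seq = self.proj(mel.transpose(1, 2))                        # (B, T, d_model)
        right = self.right(seq).mean(dim=1)                         # (B, d_model)
        return self.head(torch.cat([left, right], dim=1))           # (B, n_classes)
```

The key design point the sketch preserves is the split: the CNN branch treats the spectrogram spatially, while the Transformer branch treats it as a sequence of frames, and only the concatenated embeddings reach the classifier.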


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/fc2afc0582bd/fnbot-19-1568811-g015.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/10c45b741ad5/fnbot-19-1568811-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/dac1b7897471/fnbot-19-1568811-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/80fe319f6e58/fnbot-19-1568811-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/2b42164faff7/fnbot-19-1568811-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/922e1f3c5d91/fnbot-19-1568811-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/25d3cab3808a/fnbot-19-1568811-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/e7524cdd98a1/fnbot-19-1568811-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/408d9e537212/fnbot-19-1568811-g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/b27d583f418e/fnbot-19-1568811-g014.jpg

Similar articles

1
TS-Resformer: a model based on multimodal fusion for the classification of music signals.
Front Neurorobot. 2025 May 13;19:1568811. doi: 10.3389/fnbot.2025.1568811. eCollection 2025.
2
A Multimodal Convolutional Neural Network Model for the Analysis of Music Genre on Children's Emotions Influence Intelligence.
Comput Intell Neurosci. 2022 Aug 29;2022:5611456. doi: 10.1155/2022/5611456. eCollection 2022.
3
Music genre classification with parallel convolutional neural networks and capuchin search algorithm.
Sci Rep. 2025 Mar 20;15(1):9580. doi: 10.1038/s41598-025-90619-7.
4
Optimizing the configuration of deep learning models for music genre classification.
Heliyon. 2024 Jan 17;10(2):e24892. doi: 10.1016/j.heliyon.2024.e24892. eCollection 2024 Jan 30.
5
Design of Neural Network Model for Cross-Media Audio and Video Score Recognition Based on Convolutional Neural Network Model.
Comput Intell Neurosci. 2022 Jun 13;2022:4626867. doi: 10.1155/2022/4626867. eCollection 2022.
6
Rigdelet neural network and improved partial reinforcement effect optimizer for music genre classification from sound spectrum images.
Heliyon. 2024 Jul 4;10(14):e34067. doi: 10.1016/j.heliyon.2024.e34067. eCollection 2024 Jul 30.
7
An improved ViT model for music genre classification based on mel spectrogram.
PLoS One. 2025 Mar 13;20(3):e0319027. doi: 10.1371/journal.pone.0319027. eCollection 2025.
8
A Music Emotion Classification Model Based on the Improved Convolutional Neural Network.
Comput Intell Neurosci. 2022 Feb 14;2022:6749622. doi: 10.1155/2022/6749622. eCollection 2022.
9
Construction of Intelligent Recognition and Learning Education Platform of National Music Genre Under Deep Learning.
Front Psychol. 2022 May 26;13:843427. doi: 10.3389/fpsyg.2022.843427. eCollection 2022.
10
A Multi-Modal Convolutional Neural Network Model for Intelligent Analysis of the Influence of Music Genres on Children's Emotions.
Comput Intell Neurosci. 2022 Jul 19;2022:4957085. doi: 10.1155/2022/4957085. eCollection 2022.
