

TS-Resformer: a model based on multimodal fusion for the classification of music signals.

Author

Zhang Yilin

Affiliation

Dalian University of Foreign Languages, International Art College, Dalian, China.

Publication

Front Neurorobot. 2025 May 13;19:1568811. doi: 10.3389/fnbot.2025.1568811. eCollection 2025.

DOI: 10.3389/fnbot.2025.1568811
PMID: 40433555
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12106318/
Abstract

The number of musical works in different genres grows year by year, and manual classification is costly: it requires music-domain professionals to hand-design features, some of which do not generalize across genres. Deep learning has produced a large body of results in music classification, but existing methods still suffer from insufficient extraction of music feature information, low genre-classification accuracy, loss of time-series information, and slow training. To address the effect of differing music durations on genre-classification accuracy, we form Log Mel spectrograms from music audio cut to different durations. After discarding incomplete audio, we design data augmentation with different slice durations and verify its effect on accuracy and training time through comparison experiments. On this basis, the audio signal is framed, windowed, and short-time Fourier transformed, and the Log Mel spectrogram is then obtained by applying a Mel filter bank and logarithmic compression. To address the loss of temporal information, insufficient feature extraction, and low classification accuracy in music genre classification, we first propose a Res-Transformer model that fuses a residual network with Transformer encoder layers. The model consists of two branches: the left branch is an improved residual network that strengthens spectral feature extraction and network expressiveness while reducing dimensionality; the right branch uses four Transformer encoder layers to extract the time-series information of the Log Mel spectrogram. The output vectors of the two branches are concatenated and fed into a classifier to perform music genre classification.
Then, to further improve classification accuracy, we propose the TS-Resformer model, which builds on Res-Transformer by combining different attention mechanisms: we design a time-frequency attention mechanism that uses filters at different scales to fully extract low-level music features along the time and frequency dimensions, which serve as the inputs to the attention mechanism. Finally, experiments show that this method reaches 90.23% accuracy on the FMA-small dataset, an improvement in classification accuracy over classical models.
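The front-end pipeline described above (framing, windowing, short-time Fourier transform, Mel filtering, logarithmic compression) can be sketched in plain NumPy. This is a minimal illustrative sketch: the function name, default parameters (`n_fft`, `hop`, `n_mels`), and the triangular filterbank construction are assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=512, n_mels=64):
    """Log Mel spectrogram: frame -> window -> STFT -> Mel filter -> log."""
    # framing and windowing, then a short-time Fourier transform per frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2 + 1)

    # triangular Mel filterbank, equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope

    # Mel filtering followed by logarithmic compression
    return np.log(power @ fb.T + 1e-10).T              # (n_mels, frames)
```

One second of audio at 22.05 kHz with these defaults yields a 64-band spectrogram of 42 frames, which the two model branches then consume as an image and as a frame sequence, respectively.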
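The two-branch fusion described in the abstract can be sketched in PyTorch. Everything here is an illustrative stand-in under stated assumptions, not a reproduction of the paper's architecture: the channel counts, the plain residual block (in place of the paper's improved residual network), the pooling scheme, and the choice of 8 output classes (the number of genres in FMA-small) are all assumptions; the time-frequency attention mechanism of TS-Resformer is omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block standing in for the paper's improved ResNet stage."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.c2(torch.relu(self.c1(x))))

class ResTransformer(nn.Module):
    def __init__(self, n_mels=128, n_classes=8, d_model=128):
        super().__init__()
        # left branch: residual CNN over the (mel, time) spectrogram "image",
        # ending in global pooling as the dimensionality reduction step
        self.left = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            ResBlock(16),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # right branch: four Transformer encoder layers over the time frames
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.right = nn.TransformerEncoder(layer, num_layers=4)
        # classifier on the concatenated outputs of the two branches
        self.head = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                     # mel: (B, n_mels, T)
        left = self.left(mel.unsqueeze(1)).flatten(1)               # (B, 32)
        seq = self.proj(mel.transpose(1, 2))                        # (B, T, d_model)
        right = self.right(seq).mean(dim=1)                         # (B, d_model)
        return self.head(torch.cat([left, right], dim=1))           # (B, n_classes)
```

The key design point the sketch preserves is the split: the CNN branch treats the spectrogram spatially, while the Transformer branch treats it as a sequence of frames, and only the concatenated embeddings reach the classifier.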


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/fc2afc0582bd/fnbot-19-1568811-g015.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/10c45b741ad5/fnbot-19-1568811-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/dac1b7897471/fnbot-19-1568811-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/80fe319f6e58/fnbot-19-1568811-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/2b42164faff7/fnbot-19-1568811-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/922e1f3c5d91/fnbot-19-1568811-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/25d3cab3808a/fnbot-19-1568811-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/e7524cdd98a1/fnbot-19-1568811-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/408d9e537212/fnbot-19-1568811-g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e2c/12106318/b27d583f418e/fnbot-19-1568811-g014.jpg

Similar articles

1
TS-Resformer: a model based on multimodal fusion for the classification of music signals.
Front Neurorobot. 2025 May 13;19:1568811. doi: 10.3389/fnbot.2025.1568811. eCollection 2025.
2
A Multimodal Convolutional Neural Network Model for the Analysis of Music Genre on Children's Emotions Influence Intelligence.
Comput Intell Neurosci. 2022 Aug 29;2022:5611456. doi: 10.1155/2022/5611456. eCollection 2022.
3
Music genre classification with parallel convolutional neural networks and capuchin search algorithm.
Sci Rep. 2025 Mar 20;15(1):9580. doi: 10.1038/s41598-025-90619-7.
4
Optimizing the configuration of deep learning models for music genre classification.
Heliyon. 2024 Jan 17;10(2):e24892. doi: 10.1016/j.heliyon.2024.e24892. eCollection 2024 Jan 30.
5
Design of Neural Network Model for Cross-Media Audio and Video Score Recognition Based on Convolutional Neural Network Model.
Comput Intell Neurosci. 2022 Jun 13;2022:4626867. doi: 10.1155/2022/4626867. eCollection 2022.
6
Rigdelet neural network and improved partial reinforcement effect optimizer for music genre classification from sound spectrum images.
Heliyon. 2024 Jul 4;10(14):e34067. doi: 10.1016/j.heliyon.2024.e34067. eCollection 2024 Jul 30.
7
An improved ViT model for music genre classification based on mel spectrogram.
PLoS One. 2025 Mar 13;20(3):e0319027. doi: 10.1371/journal.pone.0319027. eCollection 2025.
8
A Music Emotion Classification Model Based on the Improved Convolutional Neural Network.
Comput Intell Neurosci. 2022 Feb 14;2022:6749622. doi: 10.1155/2022/6749622. eCollection 2022.
9
Construction of Intelligent Recognition and Learning Education Platform of National Music Genre Under Deep Learning.
Front Psychol. 2022 May 26;13:843427. doi: 10.3389/fpsyg.2022.843427. eCollection 2022.
10
A Multi-Modal Convolutional Neural Network Model for Intelligent Analysis of the Influence of Music Genres on Children's Emotions.
Comput Intell Neurosci. 2022 Jul 19;2022:4957085. doi: 10.1155/2022/4957085. eCollection 2022.
