Wilkes Ben, Vatolkin Igor, Müller Heinrich
Department of Computer Science, Technische Universität Dortmund, 44227 Dortmund, Germany.
Entropy (Basel). 2021 Nov 12;23(11):1502. doi: 10.3390/e23111502.
We present a multi-modal genre recognition framework that covers the audio, text, and image modalities through features extracted from the audio signals, lyrics, and album cover images of music tracks. In contrast to related work, which relies purely on features learned by a neural network, we also integrate handcrafted features designed for each modality, which makes the resulting models more interpretable and supports further theoretical analysis of the impact of individual features on genre prediction. Genre recognition is performed as a binary classification of a music track with respect to each genre, based on combinations of elementary features. Features are combined with a two-level technique that pairs aggregation into fixed-length feature vectors with confidence-based fusion of classification results. Extensive experiments were conducted with three classifier models (Naïve Bayes, Support Vector Machine, and Random Forest) and numerous feature combinations. The results are presented visually; to keep them perceptible, the data are reduced by multi-objective analysis and restricted to non-dominated data. Feature- and classifier-related hypotheses are formulated based on the data, and their statistical significance is formally analyzed. The statistical analysis shows that combining two modalities almost always leads to a significant increase in performance, and that combining three modalities does so in several cases.
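The two-level feature combination and the restriction to non-dominated data can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of mean and standard deviation as aggregation statistics, the "furthest from 0.5" confidence rule, and the two minimized objectives are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import mean, pstdev


def aggregate(frames):
    """Level 1 (assumed form): collapse a variable-length list of per-frame
    feature vectors into one fixed-length vector, here the per-dimension
    mean followed by the per-dimension population standard deviation."""
    columns = list(zip(*frames))
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]


def fuse_by_confidence(probabilities):
    """Level 2 (assumed form): each value is one classifier's estimated
    probability that the track belongs to the genre. The binary decision
    of the most confident classifier (furthest from 0.5) is returned."""
    most_confident = max(probabilities, key=lambda p: abs(p - 0.5))
    return most_confident >= 0.5


def non_dominated(points):
    """Keep only points that no other point dominates, with every
    objective minimized (hypothetical objectives, e.g. error and
    number of features)."""
    def dominates(a, b):
        # a is no worse than b everywhere and differs from b somewhere
        return all(x <= y for x, y in zip(a, b)) and a != b

    return [p for p in points if not any(dominates(q, p) for q in points)]
```

In this sketch, `aggregate` would be applied per modality before training one binary classifier per genre, `fuse_by_confidence` would combine the outputs of those classifiers for a single genre, and `non_dominated` would thin out the evaluated feature combinations before plotting.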