Anders Elowsson, Anders Friberg
KTH Royal Institute of Technology, School of Computer Science and Communication, Speech, Music and Hearing, Stockholm, Sweden.
J Acoust Soc Am. 2017 Mar;141(3):2224. doi: 10.1121/1.4978245.
By varying the dynamics in a musical performance, a musician can convey structure and different expressions. The spectral properties of most musical instruments change in a complex way with the performed dynamics, but dedicated audio features for modeling this parameter have been lacking. In this study, feature extraction methods were developed to capture relevant attributes related to spectral characteristics and spectral fluctuations, the latter through a sectional spectral flux. Previously, ground truth ratings of performed dynamics had been collected by asking listeners to rate how softly/loudly the musicians played in a set of audio files. The ratings, averaged over subjects, were used to train three different machine learning models, with the audio features developed for the study as input. The best result was produced by an ensemble of multilayer perceptrons, with a correlation (R) of 0.84. This result appears to be close to the upper bound, given the estimated uncertainty of the ground truth data. It is well above the performance of individual human listeners in the previous listening experiment, and on par with the performance achieved from the average rating of six listeners. The features were analyzed with a factorial design, which highlighted the importance of source separation in the feature extraction.
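For readers unfamiliar with the feature family the abstract refers to, spectral flux measures frame-to-frame change in the magnitude spectrum, and louder playing tends to produce both stronger high-frequency content and larger such changes. The sketch below shows only the generic, half-wave-rectified form of spectral flux in plain numpy; the function name spectral_flux and all parameter values are illustrative, and the paper's sectional variant and its source-separated inputs are not reproduced here.

```python
import numpy as np

def spectral_flux(x, frame_len=2048, hop=512):
    """Half-wave-rectified spectral flux of a mono signal (generic form).

    Illustrative only: the paper computes a *sectional* spectral flux
    on source-separated audio, which this sketch does not attempt.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectra per frame
    diff = np.diff(mag, axis=0)                    # frame-to-frame spectral change
    return np.sum(np.maximum(diff, 0.0), axis=1)   # keep only increases

# Example: a tone whose amplitude ramps up, i.e. rising performed dynamics
sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
tone = np.linspace(0.1, 1.0, t.size) * np.sin(2 * np.pi * 440 * t)
print(spectral_flux(tone)[:5])
```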