Friberg Anders, Schoonderwaldt Erwin, Hedblad Anton, Fabiani Marco, Elowsson Anders
KTH Royal Institute of Technology, School of Computer Science and Communication, Speech, Music and Hearing, Stockholm, Sweden.
Hanover University of Music, Drama and Media, Institute of Music Physiology and Musicians' Medicine, Hannover, Germany.
J Acoust Soc Am. 2014 Oct;136(4):1951-63. doi: 10.1121/1.4892767.
The notion of perceptual features is introduced for describing general music properties based on human perception. This is an attempt at rethinking the concept of features, aiming to approach the underlying mechanisms of human perception. Instead of using concepts from music theory such as tones, pitches, and chords, a set of nine features describing overall properties of the music was selected. They were chosen from qualitative measures used in psychology studies and motivated by an ecological approach. The perceptual features were rated in two listening experiments using two different data sets. They were modeled from both symbolic and audio data using different sets of computational features. Ratings of emotional expression were predicted using the perceptual features. The results indicate that (1) at least some of the perceptual features are reliable estimates; (2) emotion ratings could be predicted by a small combination of perceptual features, with an explained variance from 75% to 93% for the emotional dimensions activity and valence; (3) the perceptual features could be modeled only to a limited extent using existing audio features. The results clearly indicate that a small number of dedicated features were superior to a "brute force" model using a large number of general audio features.
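The abstract reports how much variance in emotion ratings a small combination of perceptual features explains. The following is a minimal sketch of how such a figure could be computed. It assumes ordinary least squares regression, which the abstract does not specify, and it runs on synthetic data rather than the study's ratings. The function name `explained_variance`, the feature weights, and the noise level are all hypothetical and chosen only for illustration.

```python
import numpy as np


def explained_variance(features, ratings):
    """Fit least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, ratings, rcond=None)
    residuals = ratings - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((ratings - ratings.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for listener ratings: 100 excerpts, nine features.
rng = np.random.default_rng(0)
n_excerpts, n_features = 100, 9
perceptual = rng.normal(size=(n_excerpts, n_features))

# Hypothetical assumption: the "activity" rating depends on only three
# of the nine features, plus rating noise.
true_weights = np.zeros(n_features)
true_weights[:3] = [0.8, 0.5, -0.4]
activity = perceptual @ true_weights + rng.normal(scale=0.3, size=n_excerpts)

r2_all = explained_variance(perceptual, activity)
r2_small = explained_variance(perceptual[:, :3], activity)

print(f"R^2 with all nine features: {r2_all:.2f}")
print(f"R^2 with three features:    {r2_small:.2f}")
```

Comparing the two values shows how a small feature subset can account for most of the variance that the full set explains. These are in-sample values, so a real analysis would also need cross-validation to guard against overfitting.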