Lostanlen Vincent, El-Hajj Christian, Rossignol Mathias, Lafay Grégoire, Andén Joakim, Lagrange Mathieu
LS2N, CNRS, Centrale Nantes, Nantes University, 1, rue de la Noe, Nantes, 44000 France.
Lonofi, 57 rue Letort, Paris, 75018 France.
EURASIP J Audio Speech Music Process. 2021;2021(1):3. doi: 10.1186/s13636-020-00187-z. Epub 2021 Jan 11.
Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called "ordinary" technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time-frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0% ± 1. An ablation study demonstrates that removing either the joint time-frequency scattering transform or the metric learning algorithm noticeably degrades performance.
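To make the retrieval metric concrete, the following is a minimal sketch of precision at rank k under one plausible reading of AP@5: for each query note, take its k nearest neighbors in feature space and count the fraction that share the query's cluster label, then average over queries. The function name `precision_at_k`, the plain Euclidean distance, and the toy data are assumptions for illustration only; the abstract does not specify the evaluation code, and in the article the distance would be the one learned by LMNN over scattering features.

```python
import numpy as np


def precision_at_k(features, labels, k=5):
    """Mean precision at rank k for query-by-example retrieval.

    Hypothetical helper: every item is used once as a query, and a
    retrieved item counts as relevant when it shares the query's label.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all items.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Exclude each query from its own result list.
    np.fill_diagonal(dists, np.inf)
    # Indices of the k nearest neighbors of each query.
    neighbors = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # Fraction of retrieved neighbors that share the query's label.
    hits = labels[neighbors] == labels[:, None]
    return float(hits.mean(axis=1).mean())


# Toy data (made up): two well-separated clusters of three points each.
toy_features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = [0, 0, 0, 1, 1, 1]
print(precision_at_k(toy_features, toy_labels, k=2))
```

Replacing the Euclidean distance with a learned Mahalanobis distance would amount to applying a linear map to `features` before calling the function, which is how a metric learning step could be evaluated with the same code.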