Angioni Simone, Lincoln-DeCusatis Nathan, Ibba Andrea, Reforgiato Recupero Diego
Department of Mathematics and Computer Science, University of Cagliari, Cagliari, Sardegna, Italy.
Department of Music, Fordham University, New York, United States of America.
PeerJ Comput Sci. 2023 Jun 19;9:e1410. doi: 10.7717/peerj-cs.1410. eCollection 2023.
Music is an extremely subjective art form whose commodification by the recording industry in the 20th century has led to an increasingly subdivided set of genre labels that attempt to organize musical styles into definite categories. Music psychology studies the processes through which music is perceived, created, responded to, and incorporated into everyday life, and modern artificial intelligence technology can be exploited in this direction. Music classification and generation are emerging fields that have recently gained much attention, especially with the latest advances in deep learning. Self-attention networks have brought substantial benefits to classification and generation tasks in different domains and with data of different types (text, images, videos, sounds). In this article, we analyze the effectiveness of Transformers for both classification and generation tasks, studying classification performance at different granularities and generation quality using different human and automatic metrics. The input data consist of MIDI files drawn from three datasets: soundtracks from 397 Nintendo Entertainment System (NES) video games, classical pieces, and rock songs by different composers and bands. We performed classification within each dataset to identify the type or composer of each sample (fine-grained), as well as classification at a higher level, where we combined the three datasets with the goal of labeling each sample simply as NES, rock, or classical (coarse-grained). The proposed Transformer-based approach outperformed competitors based on deep learning and machine learning approaches. Finally, the generation task was carried out on each dataset, and the resulting samples were evaluated using human and automatic metrics (local alignment).
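The abstract names local alignment as the automatic metric for generated samples. A minimal sketch of how such a metric could be computed is the Smith-Waterman local alignment score over two sequences of MIDI pitch numbers; the scoring parameters below (match, mismatch, gap penalties) and the use of raw pitch values are illustrative assumptions, not the paper's exact configuration.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score between two pitch sequences.

    Returns the score of the best-scoring local (contiguous, gapped)
    alignment; higher means a longer/closer shared passage.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best local alignment score ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 floor: a local alignment may start fresh anywhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A generated phrase that reappears verbatim inside an original piece
generated = [60, 62, 64, 65, 67]
original = [55, 60, 62, 64, 65, 67, 69]
print(local_alignment(generated, original))  # 10: five matched notes x 2
```

In a generation-evaluation setting, a high score against training data can flag plagiarized passages, while a very low score against any real piece can flag incoherent output.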