Department of Informatics, King's College London, 30 Aldwych, London, WC2B 4BG, UK.
Department of Computer Science and Engineering, University of Bologna, Mura Anteo Zamboni, 7, Bologna, 40126, Italy.
Sci Data. 2023 Sep 20;10(1):641. doi: 10.1038/s41597-023-02410-w.
Various disconnected chord datasets are currently available for music analysis and information retrieval, but they are often limited by their size, non-openness, lack of timed information, or poor interoperability. Together with the lack of overlapping repertoire coverage, this limits cross-corpus studies on harmony over time and across genres, and hampers research in computational music analysis (chord recognition, pattern mining, computational creativity), which needs access to large datasets. We help address this gap by releasing the Chord Corpus (ChoCo), a large-scale dataset that semantically integrates harmonic data from 18 different sources using heterogeneous representations and formats (Harte, Leadsheet, Roman numerals, ABC, etc.). We rely on JAMS (JSON Annotated Music Specification), a popular data structure for annotations in Music Information Retrieval, to represent and enrich chord-related information (chord, key, mode, etc.) in a uniform way. To achieve semantic integration, we design a novel ontology for modelling music annotations and the entities they involve (artists, scores, etc.), and we build a 30M-triple knowledge graph, including 4K+ links to other datasets (MIDI-LD, LED).
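As an illustration of how timed chord annotations can be held in a JAMS-style structure, the following minimal sketch builds one annotation with plain Python dictionaries and serializes it to JSON. It assumes the general JAMS layout (top-level `file_metadata`, `annotations`, and `sandbox`; observations with `time`, `duration`, `value`, and `confidence`) and the `chord_harte` namespace. The helper names, the example title, and the chord sequence are hypothetical, and the output is not validated against the JAMS schema or taken from ChoCo itself.

```python
import json


def make_chord_annotation(chords, namespace="chord_harte"):
    """Build a JAMS-style annotation dict from (time, duration, label) tuples.

    Hypothetical helper: it mirrors the JAMS field layout but performs
    no schema validation.
    """
    data = [
        {"time": t, "duration": d, "value": label, "confidence": None}
        for t, d, label in chords
    ]
    # The annotation spans from 0 to the end of the last chord.
    total = max((t + d for t, d, _ in chords), default=0.0)
    return {
        "namespace": namespace,
        "data": data,
        "annotation_metadata": {},
        "sandbox": {},
        "time": 0.0,
        "duration": total,
    }


def make_document(title, artist, annotations):
    """Wrap annotations in a JAMS-style top-level document (hypothetical helper)."""
    duration = max((a["duration"] for a in annotations), default=0.0)
    return {
        "file_metadata": {"title": title, "artist": artist, "duration": duration},
        "annotations": annotations,
        "sandbox": {},
    }


# Made-up chord sequence in Harte notation: (onset, duration, label), in seconds.
chords = [(0.0, 2.0, "C:maj"), (2.0, 2.0, "A:min"), (4.0, 2.0, "G:7")]
doc = make_document("Example Tune", "Unknown", [make_chord_annotation(chords)])
serialized = json.dumps(doc, indent=2)
```

Keeping every source in one such layout is what allows chord, key, and metadata fields to be read uniformly regardless of the original notation.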