Xu Yanze, Wang Weiqing, Cui Huahua, Xu Mingyang, Li Ming
Data Science Research Center, Duke Kunshan University, Kunshan, China.
Advanced Computing East China Sub-Center, Suzhou, China.
EURASIP J Audio Speech Music Process. 2022;2022(1):8. doi: 10.1186/s13636-022-00240-z. Epub 2022 Apr 15.
Humans can recognize a person's identity from their voice and describe its timbral phenomena. The singing voice likewise exhibits timbral phenomena. In vocal pedagogy, vocal teachers listen to and then describe the timbral phenomena of their students' singing voices. In this study, in order to enable machines to describe the singing voice from the vocal-pedagogy point of view, we perform a task called paralinguistic singing attribute recognition. To achieve this goal, we first construct and publish an open-source dataset named the Singing Voice Quality and Technique Database (SVQTD) for supervised learning. All the audio clips in SVQTD are downloaded from YouTube and processed by music source separation and silence detection. For annotation, seven paralinguistic singing attributes commonly used in vocal pedagogy are adopted as the labels. Furthermore, to explore different supervised machine learning algorithms for classifying each paralinguistic singing attribute, we adopt three main frameworks, namely openSMILE features with a support vector machine (SF-SVM), end-to-end deep learning (E2EDL), and deep embeddings with a support vector machine (DE-SVM). Our methods are based on existing frameworks commonly employed in other paralinguistic speech attribute recognition tasks. In SF-SVM, we separately use the feature set of the INTERSPEECH 2009 Challenge and that of the INTERSPEECH 2016 Challenge as the SVM classifier's input. In E2EDL, the end-to-end framework separately utilizes a ResNet and a transformer encoder as the feature extractor. In particular, to handle two-dimensional spectrogram input for a transformer, we adopt a sliced multi-head self-attention (SMSA) mechanism. In DE-SVM, we use the representation extracted by the E2EDL model as the input of the SVM classifier.
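To make the sliced-attention idea concrete, the following is a minimal numpy sketch of one plausible reading of it, in which the two-dimensional spectrogram's frequency axis is cut into contiguous bands and each head attends over time within its own band. The slicing convention, head count, and function names here are illustrative assumptions, not the paper's exact SMSA definition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliced_self_attention(spec, n_heads=4):
    """Illustrative sliced multi-head self-attention over a spectrogram.

    spec: (time, freq) array. The frequency axis is split into
    `n_heads` contiguous slices; each head performs scaled dot-product
    self-attention over time within its own frequency band.
    (This slicing scheme is an assumption for illustration.)
    """
    T, F = spec.shape
    assert F % n_heads == 0, "freq bins must divide evenly into heads"
    d = F // n_heads
    outputs = []
    for h in range(n_heads):
        x = spec[:, h * d:(h + 1) * d]      # (T, d) frequency slice
        scores = x @ x.T / np.sqrt(d)       # (T, T) attention scores
        attn = softmax(scores, axis=-1)     # rows sum to 1
        outputs.append(attn @ x)            # (T, d) attended slice
    return np.concatenate(outputs, axis=1)  # (T, F) same shape as input

rng = np.random.default_rng(0)
out = sliced_self_attention(rng.standard_normal((10, 16)), n_heads=4)
```

The output keeps the input's (time, freq) shape, so such a layer can be stacked like an ordinary transformer block.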
Experimental results on SVQTD show no absolute winner between E2EDL and DE-SVM, meaning that feeding the E2E-learned representation to a back-end SVM classifier does not necessarily improve performance. However, the DE-SVM that utilizes the ResNet as the feature extractor achieves the best average unweighted average recall (UAR), a 16% average improvement over the SF-SVM with INTERSPEECH's hand-crafted feature sets.
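The DE-SVM pipeline compared above (freeze a trained network, use its pooled activations as fixed utterance embeddings, then fit an SVM on them) can be sketched roughly as follows. To stay self-contained, a frozen random projection stands in for the trained ResNet/transformer encoder, and a bare-bones Pegasos-style linear SVM replaces the authors' actual classifier; all data here are synthetic toys.

```python
import numpy as np

rng = np.random.default_rng(42)

def embed(spec, W):
    """Stand-in deep embedding: project each frame, ReLU, mean-pool.
    (A frozen random projection replaces the trained encoder.)"""
    h = np.maximum(spec @ W, 0.0)  # (time, emb_dim) frame activations
    return h.mean(axis=0)          # (emb_dim,) utterance-level embedding

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Pegasos-style subgradient descent for a linear SVM, labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1:          # margin violated: hinge gradient
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                              # only weight decay
                w = (1 - eta * lam) * w
    return w

# Toy "spectrograms": two classes offset by a constant level.
y_toy = np.array([1, -1] * 20)
specs = [rng.standard_normal((20, 16)) + float(c) for c in y_toy]

W = rng.standard_normal((16, 8))               # frozen "encoder" weights
X = np.stack([embed(s, W) for s in specs])     # (40, 8) embeddings
X = np.hstack([X, np.ones((len(X), 1))])       # constant feature as bias term
w = train_linear_svm(X, y_toy)
acc = float(np.mean(np.sign(X @ w) == y_toy))
```

The key design point the abstract probes is exactly this split: whether training the SVM on frozen embeddings beats letting the end-to-end network classify directly, and the results suggest neither choice dominates.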