Virkkunen Anja, Rouhe Aku, Phan Nhan, Kurimo Mikko
Department of Information and Communications Engineering, Aalto University, Espoo, Finland.
Lang Resour Eval. 2023 Mar 27:1-26. doi: 10.1007/s10579-023-09650-7.
Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish Parliament ASR Corpus, the most extensive publicly available collection of manually transcribed speech data for Finnish with over 3000 h of speech and 449 speakers for which it provides rich demographic metadata. This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time. Similarly, there are two official, corrected test sets covering different times, setting an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We developed a complete Kaldi-based data preparation pipeline and ASR recipes for hidden Markov models (HMM), hybrid deep neural networks (HMM-DNN), and attention-based encoder-decoders (AED). For HMM-DNN systems, we provide results with time-delay neural networks (TDNN) as well as state-of-the-art wav2vec 2.0 pretrained acoustic models. We set benchmarks on the official test sets and multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that beyond their scale, HMM-TDNN ASR performance on the official test sets has reached a plateau. In contrast, other domains and larger wav2vec 2.0 models benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal data setting, with the HMM-DNN system consistently performing better. Finally, the variation of the ASR accuracy is compared between the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.