Ghyselen Anne-Sophie, Breitbarth Anne, Farasyn Melissa, Van Keymeulen Jacques, van Hessen Arjan
Department of Linguistics, Ghent University, Ghent, Belgium.
Variaties VZW, Umbrella Organisation for Dialects and Oral Heritage, Brussels, Belgium.
Front Artif Intell. 2020 Apr 15;3:10. doi: 10.3389/frai.2020.00010. eCollection 2020.
This paper discusses how the transcription hurdle in dialect corpus building can be cleared. While corpus analysis has gained strongly in popularity in linguistic research, dialect corpora are still relatively scarce. This scarcity can be attributed to several factors, one of which is the difficulty of transcribing dialects: many dialects lack orthographic norms, and few speech technological tools have been trained on dialect data. This paper addresses two questions: (i) how dialects can be transcribed efficiently and (ii) whether speech technological tools can lighten the transcription work. These questions are tackled using the Southern Dutch dialects (SDDs) as a case study, for which the usefulness of automatic speech recognition (ASR), respeaking, and forced alignment is considered. Tests with these tools indicate that dialects still constitute a major speech technological challenge. In the case of the SDDs, the decision was made to use speech technology only for the word-level segmentation of the audio files, as the transcription itself could not be sped up by ASR tools. The discussion does, however, indicate that the usefulness of ASR and related tools for a dialect corpus project is strongly determined by four factors: the sound quality of the dialect recordings, the availability of statistical dialect-specific models, the degree of linguistic differentiation between the dialects and the standard language, and the goals the transcripts have to serve.