Coto-Solano Rolando, Stanford James N, Reddy Sravana K
Dartmouth College, Hanover, NH, United States.
Front Artif Intell. 2021 Sep 24;4:662097. doi: 10.3389/frai.2021.662097. eCollection 2021.
In recent decades, computational approaches to sociophonetic vowel analysis have been steadily increasing, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction; Rosenfelder et al., 2011; Evanini et al., Proceedings of Interspeech, 2009), the Penn Aligner (Yuan and Liberman, J. Acoust. Soc. America, 2008, 123, 3878), and DARLA (Dartmouth Linguistic Automation; Reddy and Stanford, DARLA Dartmouth Linguistic Automation: Online Tools for Linguistic Research, 2015a). Yet these systems still have a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create manual transcriptions. This human step is painstaking, time-consuming, and resource intensive. If it could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously been unexplored, including legacy recordings that are underutilized for lack of transcriptions. Moreover, if sociolinguists could quickly and accurately extract phonetic information from the millions of hours of new audio content posted on the Internet every day, a virtual ocean of speech from newly created podcasts, videos, live-streams, and other audio content could inform research. How close are current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy et al., Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71-75) showed that an HMM-based automatic speech recognition (ASR) system, trained with CMU Sphinx (Lamere et al., 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcription. Even so, because that ASR system relied on a small training set, it produced numerous transcription errors. Six years have passed since that study, and in that time end-to-end ASR algorithms have shown considerable improvement in transcription quality. One example of such a system is the RNN/CTC-based DeepSpeech from Mozilla (Hannun et al., 2014). (RNN stands for recurrent neural network, the learning mechanism behind DeepSpeech; CTC stands for connectionist temporal classification, the mechanism that merges phones into words.) The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. Specifically, we applied these techniques to audio recordings from 352 North American English speakers in the International Dialects of English Archive (IDEA), extracting 88,500 stressed vowel tokens from spontaneous, free-speech passages. With this large dataset we conducted acoustic sociophonetic analyses of the Southern Vowel Shift and the Northern Cities Chain Shift in the North American IDEA speakers. We compared the results using three different sources of transcriptions: 1) IDEA's manual transcriptions as the baseline "ground truth," 2) the ASR built on CMU Sphinx used by Reddy et al. (Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71-75), and 3) the latest publicly available Mozilla DeepSpeech system.
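For readers unfamiliar with the ASR step of such a pipeline, the following is a minimal sketch of transcribing a recording with the publicly released Mozilla DeepSpeech 0.x Python package. The model and scorer file names and the WAV path are placeholders, not the paper's actual configuration; the sketch only illustrates the kind of call that produces the automated transcriptions later fed to DARLA.

```python
# Sketch: transcribing a 16 kHz mono WAV with Mozilla DeepSpeech (0.x Python API).
# Model/scorer/audio paths below are hypothetical placeholders.
import wave
import numpy as np
from deepspeech import Model

def transcribe(wav_path,
               model_path="deepspeech-0.9.3-models.pbmm",
               scorer_path="deepspeech-0.9.3-models.scorer"):
    model = Model(model_path)                 # load the RNN/CTC acoustic model
    model.enableExternalScorer(scorer_path)   # optional external language-model scorer
    with wave.open(wav_path, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1, \
            "DeepSpeech expects 16 kHz mono audio"
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return model.stt(audio)                   # returns the transcription string

if __name__ == "__main__":
    print(transcribe("speaker_001.wav"))      # hypothetical IDEA recording
```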
We input these three different transcriptions to DARLA, which automatically aligned and extracted the vowel formants for the 352 IDEA speakers. Our quantitative results indicate that newer ASR systems like DeepSpeech hold considerable promise for sociolinguistic applications like DARLA. We found that DeepSpeech's automated transcriptions had a significantly lower character error rate than those from the prior Sphinx system (35% vs. 46%). When we performed the sociolinguistic analysis of the vowel formants extracted by DARLA, we found that the automated transcriptions from DeepSpeech matched the ground-truth results for the Southern Vowel Shift (SVS): five vowels showed a shift in both transcriptions, and two vowels showed no shift in either transcription. The Northern Cities Shift (NCS) was more difficult to detect, but ground truth and DeepSpeech matched for four vowels: one vowel showed a clear shift, and three showed no shift in either transcription. Our study therefore shows how technology has made progress toward greater automation in vowel sociophonetics, while also showing what remains to be done. Our statistical modeling provides a quantified view of both the abilities and the limitations of a completely "hands-free" analysis of vowel shifts in a large dataset. Naturally, when comparing a completely automated system against a semi-automated system involving human manual work, there will always be a tradeoff between accuracy on the one hand and speed and replicability on the other [Kendall and Joseph, Towards best practices in sociophonetics (with Marianna DiPaolo), 2014]. The amount of "noise" that can be tolerated in a given study will depend on the particular research goals and the researchers' preferences. Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and these results can be generated quickly, efficiently, and with full replicability.
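To make the transcription-quality comparison concrete, below is a minimal sketch of a character error rate (CER) computation: the Levenshtein (edit) distance between a reference and a hypothesis transcription, divided by the reference length. The normalization choices (lowercasing, collapsing whitespace) are our own assumptions for illustration, not necessarily the exact procedure used in the study.

```python
# Sketch: character error rate (CER) = character-level Levenshtein distance
# between reference and hypothesis, divided by the reference length.
def levenshtein(ref: str, hyp: str) -> int:
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Normalization (our assumption): lowercase and collapse whitespace.
    ref = " ".join(reference.lower().split())
    hyp = " ".join(hypothesis.lower().split())
    return levenshtein(ref, hyp) / max(len(ref), 1)

# Example: manual (ground-truth) transcription vs. hypothetical ASR output.
print(round(cer("the quick brown fox", "the quik browne fox"), 3))
```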