Department of English, Linguistics Program, George Mason University, Fairfax, VA, USA.
Department of Linguistics, University of Alberta, Edmonton, AB, Canada.
Phonetica. 2024 Sep 5;81(5):451-508. doi: 10.1515/phon-2024-0015. Print 2024 Oct 28.
Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model as a tagger, rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique that allows boundaries more precise than the typical 10 ms limit in modern systems. During testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner at the 10 ms boundary placement tolerance threshold. The greatest difference achieved was a 28.13% relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting that the output targets for these models do not match phoneticians' conception of similarity between phones. Reconciling this tension may require rethinking the task and its output targets, or how speech itself should be segmented.
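The evaluation described above rests on two quantities: the share of predicted boundaries that fall within a tolerance of the reference boundaries, and the relative increase of one system's score over another's. The following is a minimal sketch of how such figures could be computed, not the authors' evaluation code. The function names, the one-to-one pairing of predicted and reference boundaries, and the toy boundary times are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy_within_tolerance(
    predicted_ms: Sequence[float],
    reference_ms: Sequence[float],
    tolerance_ms: float,
) -> float:
    """Fraction of predicted boundaries within tolerance_ms of the reference.

    Assumes (hypothetically) that boundaries are already paired one-to-one,
    which is plausible in forced alignment because the segment sequence is
    fixed by the transcription.
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("predicted and reference must have the same length")
    if not reference_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        abs(p - r) <= tolerance_ms for p, r in zip(predicted_ms, reference_ms)
    )
    return hits / len(reference_ms)


def relative_increase(new_score: float, baseline_score: float) -> float:
    """Relative increase of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Toy boundary times in milliseconds (made up for illustration only).
reference = [10.0, 55.0, 100.0, 200.0]
predicted = [12.0, 48.0, 101.0, 160.0]

for tolerance in (5.0, 10.0, 50.0):
    score = accuracy_within_tolerance(predicted, reference, tolerance)
    print(f"{tolerance:.0f} ms tolerance: {score:.2f}")
```

Reporting the score at several tolerances, as in the loop above, is what makes a crossover such as the one near 30 ms visible: a system can lead at tight tolerances and trail at looser ones.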