Kadambi Prad, Mahr Tristan J, Hustad Katherine C, Berisha Visar
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe.
College of Health Solutions, Arizona State University, Tempe.
J Speech Lang Hear Res. 2025 Jul 29;68(7S):3583-3601. doi: 10.1044/2024_JSLHR-24-00347. Epub 2025 Mar 31.
Phonetic forced alignment has many applications in the automated analysis of speech, particularly in studying nonstandard speech such as children's speech. Manual alignment is tedious but serves as the gold standard for clinical-grade alignment. Current tools do not support direct training on manual alignments. Thus, a trainable, speaker-adaptive phonetic forced alignment system, Wav2TextGrid, was developed for children's speech. The source code for the method is publicly available, along with a graphical user interface, at https://github.com/pkadambi/Wav2TextGrid.
We propose a trainable, speaker-adaptive, neural forced aligner developed using a speech corpus from 42 neurotypical children aged 3 to 6 years. The aligner was evaluated on both child speech and the TIMIT corpus to demonstrate its performance across age and dialectal variation.
The trainable alignment tool markedly improved accuracy over baseline across several alignment quality metrics and for all phoneme categories. Accuracy for plosives and affricates in children's speech improved more than 40% over baseline. Performance matched existing methods using approximately 13 min of labeled data, while approximately 45-60 min of labeled alignments yielded significant improvement.
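To make "alignment accuracy" concrete, the sketch below scores a predicted alignment against a manual one by the fraction of phoneme onsets that fall within a fixed tolerance of the manual onset. This is an assumption for illustration only: the abstract does not name the metrics used, and the function name `onset_accuracy`, the 20 ms tolerance, and the toy intervals are hypothetical rather than taken from Wav2TextGrid.

```python
from typing import List, Tuple

# (phoneme label, onset in seconds, offset in seconds)
Interval = Tuple[str, float, float]


def onset_accuracy(manual: List[Interval],
                   predicted: List[Interval],
                   tolerance: float = 0.020) -> float:
    """Fraction of predicted onsets within `tolerance` seconds of the manual onset.

    Hypothetical metric for illustration; assumes both alignments contain
    the same phoneme sequence in the same order.
    """
    if not manual or len(manual) != len(predicted):
        raise ValueError("alignments must be non-empty and of equal length")
    hits = 0
    for (m_label, m_on, _), (p_label, p_on, _) in zip(manual, predicted):
        if m_label != p_label:
            raise ValueError(f"label mismatch: {m_label!r} vs {p_label!r}")
        if abs(m_on - p_on) <= tolerance:
            hits += 1
    return hits / len(manual)


# Toy example (made-up times): the second onset is off by 30 ms.
manual = [("k", 0.10, 0.18), ("ae", 0.18, 0.35), ("t", 0.35, 0.42)]
predicted = [("k", 0.11, 0.21), ("ae", 0.21, 0.355), ("t", 0.355, 0.43)]
accuracy = onset_accuracy(manual, predicted)  # 2 of 3 onsets within 20 ms
```

Grouping such scores by phoneme class (for example, plosives versus vowels) would give a per-category comparison of the kind summarized above.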
The Wav2TextGrid tool enables alternative alignment workflows in which forced alignments are tailored, through training, to match clinical-grade manual alignments.