Hosom John-Paul
Center for Spoken Language Understanding, School of Science & Engineering, Oregon Health & Science University, 20000 NW Walker Road, Beaverton, OR 97006 USA
Speech Commun. 2009 Apr;51(4):352-368. doi: 10.1016/j.specom.2008.11.003.
Determining the location of phonemes is important to a number of speech applications, including training of automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Agreement among human labelers on the location of phonemes is, on average, 93.78% within 20 msec on a variety of corpora, and 93.49% within 20 msec on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system with several modifications to this baseline. The modifications are the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. The baseline system achieves 91.48% agreement within 20 msec on the test partition of the TIMIT corpus, and the proposed system achieves 93.36% within 20 msec on the same partition. The proposed system thus yields a 22% relative reduction in error over the baseline system, and a 14% reduction in error over results from a non-HMM alignment system. To our knowledge, this 93.36% agreement is the best result reported on the TIMIT corpus.
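The two figures of merit quoted in the abstract, agreement within 20 msec and relative reduction in error, can be illustrated with a minimal Python sketch. The function names, the toy boundary times, and the choice of an inclusive tolerance (a deviation of exactly 20 msec counts as agreement) are assumptions made for illustration; the abstract does not specify how the evaluation was implemented. Only the 91.48% and 93.36% figures come from the abstract.

```python
def agreement_within(ref_ms, hyp_ms, tol_ms=20.0):
    """Percentage of hypothesized boundaries within tol_ms of the reference.

    Assumes the two lists are already paired one-to-one (same phoneme
    sequence), as in forced alignment. The inclusive comparison is an
    assumption, not something stated in the abstract.
    """
    if len(ref_ms) != len(hyp_ms):
        raise ValueError("reference and hypothesis must have the same length")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(abs(r - h) <= tol_ms for r, h in zip(ref_ms, hyp_ms))
    return 100.0 * hits / len(ref_ms)


def relative_error_reduction(baseline_acc, proposed_acc):
    """Relative reduction in error (percent) between two accuracies (percent)."""
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Toy boundary times in msec (hypothetical, not taken from TIMIT).
reference = [100, 250, 400, 620]
hypothesis = [105, 275, 395, 630]
print(agreement_within(reference, hypothesis))  # 3 of 4 within 20 msec -> 75.0

# Accuracies reported in the abstract: error falls from 8.52% to 6.64%.
print(round(relative_error_reduction(91.48, 93.36)))  # -> 22
```

With the reported accuracies, the error rate falls from 8.52% to 6.64%, and (8.52 - 6.64) / 8.52 is approximately 0.22, which matches the 22% relative reduction stated in the abstract.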