Heterophonic speech recognition using composite phones.

Author information

Alkhairy Ashraf, Jafri Afshan

Affiliations

King Abdul Aziz City for Science and Technology, Riyadh, Saudi Arabia.

King Saud University, Riyadh, Saudi Arabia.

Publication information

Springerplus. 2016 Nov 24;5(1):2008. doi: 10.1186/s40064-016-3332-9. eCollection 2016.

DOI:10.1186/s40064-016-3332-9

PMID:27933264

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5121111/

Abstract

Heterophones, words that share a spelling but differ in pronunciation, pose challenges when training automatic speech recognition (ASR) systems because the orthographic form of a word leaves its pronunciation ambiguous. This paper addresses the problem of heterophonic languages by developing the concept of a Composite Phoneme (CP) as a basic pronunciation unit for speech recognition. A CP is a set of alternative sequences of phonemes. CPs are developed specifically in the context of Arabic by defining phonetic units that are consonant-centric and absorb the phonemically contrastive short vowels and gemination that are not represented in Arabic Modern Orthography (MO). CPs alleviate the need to diacritize MO into Classical Orthography (CO), in order to represent short vowels and stress, before generating pronunciations in terms of Simple Phonemes (SP). We develop algorithms that generate CP pronunciations from MO and SP pronunciations from CO, so that each word maps to a single pronunciation. We investigate the performance of CP, SP, UG (Undiacritized Grapheme), and DG (Diacritized Grapheme) ASR systems. The experimental results suggest that UG and DG are inferior to SP and CP. For the A-SpeechDB corpus with an MO vocabulary of 8,000 words, the word error rates (WERs) with a bigram model and context-dependent phones are 11.78%, 12.64%, and 13.59% for CP, SP_M (SP from manually diacritized CO), and SP_A (SP from automatically diacritized MO), respectively. For a vocabulary of 24,000 MO words, the corresponding WERs are 13.69%, 15.08%, and 16.86%. With a uniform statistical model, SP has a lower WER than CP. With context-independent (CI) phones, CP has a lower WER than SP.
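To make the idea of a CP as "a set of alternative sequences of phonemes" concrete, the following is a minimal sketch and not the authors' algorithm. It assumes, purely for illustration, that each consonant's CP covers the bare consonant, its geminated form, and either form followed by one of three short vowels (a, i, u). The function names, the `CP_` label prefix, and the Latin transliteration are all hypothetical; the paper's actual CP inventory and generation rules may differ.

```python
from itertools import product

# Assumption: "no vowel" plus three short vowels; not taken from the paper.
SHORT_VOWELS = ("", "a", "i", "u")


def composite_phoneme(consonant):
    """Return the simple-phoneme sequences that one hypothetical CP stands for."""
    alternatives = set()
    for geminated, vowel in product((False, True), SHORT_VOWELS):
        # Gemination is modelled here as a doubled consonant symbol.
        seq = [consonant, consonant] if geminated else [consonant]
        if vowel:
            seq.append(vowel)
        alternatives.add(tuple(seq))
    return alternatives


def cp_pronunciation(mo_consonants):
    """Map an undiacritized consonant skeleton to one CP sequence."""
    return ["CP_" + c for c in mo_consonants]


def sp_expansions(mo_consonants):
    """Enumerate the SP pronunciations covered by the single CP pronunciation."""
    choices = [sorted(composite_phoneme(c)) for c in mo_consonants]
    # Concatenate the chosen alternative of each consonant into one sequence.
    return [sum(combo, ()) for combo in product(*choices)]


if __name__ == "__main__":
    skeleton = ["k", "t", "b"]  # transliterated consonant skeleton of an MO word
    print(cp_pronunciation(skeleton))
    print(len(sp_expansions(skeleton)), "candidate SP pronunciations")
```

Under these assumptions, the undiacritized skeleton k-t-b receives a single CP pronunciation, while the SP view must choose among many diacritized readings such as "kataba" and "kutub". That choice is the ambiguity the abstract says CPs avoid resolving before training.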
