Department of Otolaryngology, Head and Neck Surgery, Beijing TongRen Hospital, Capital Medical University, Beijing 100730, People's Republic of China.
House Ear Institute, Los Angeles, California 90057, USA.
J Acoust Soc Am. 2018 May;143(5):2886. doi: 10.1121/1.5037590.
Mandarin is a tonal language, and it is important to preserve lexical tone information in synthesized speech. With natural speech, Chinese cochlear implant (CI) users have difficulty perceiving voice pitch cues important for lexical tone perception; it is unclear whether this difficulty persists in Mandarin synthesized speech. In this study, intelligibility of naturally produced and synthesized Mandarin speech was measured in Chinese CI listeners; intelligibility was also measured in a control group of normal-hearing (NH) listeners. Five synthesized voices were selected to represent different talker genders (male, female, child), speaking rates (normal, slow), and speaking styles (emotional, accent). The data showed that while modern Mandarin text-to-speech (TTS) systems can provide perfect speech intelligibility for NH listeners, overall intelligibility was much poorer for CI than for NH listeners. CI performance was significantly poorer with synthesized speech than with natural speech (p < 0.001). CI listeners were highly sensitive to the "extra-atypical" synthesized emotional and accented speech. Performance with each of the synthesized speech types was significantly correlated with performance with natural speech in CI users (p < 0.01 in all cases). While modern TTS systems offer educational and communication benefits to CI users and hearing-impaired individuals, the selection of synthesized voices should be carefully considered in education applications of TTS for hearing-impaired individuals, especially CI children, since poor intelligibility performance may affect language learning.