Vojtech Jennifer M, Chan Michael D, Shiwani Bhawna, Roy Serge H, Heaton James T, Meltzner Geoffrey S, Contessa Paola, De Luca Gianluca, Patel Rupal, Kline Joshua C
Delsys/Altec, Inc., Natick, MA.
Boston University, MA.
J Speech Lang Hear Res. 2021 Jun 18;64(6S):2134-2153. doi: 10.1044/2021_JSLHR-20-00257. Epub 2021 May 12.
Purpose: This study aimed to evaluate a novel communication system designed to translate surface electromyographic (sEMG) signals from articulatory muscles into speech using a personalized, digital voice. The system was evaluated for word recognition, prosodic classification, and listener perception of synthesized speech. Method: sEMG signals were recorded from the face and neck as speakers with (n = 4) and without (n = 4) laryngectomy subvocally recited (silently mouthed) a speech corpus comprising 750 phrases (150 phrases with variable phrase-level stress). Corpus tokens were then translated into speech via personalized voice synthesis (n = 8 synthetic voices) and compared against phrases produced by each speaker when using their typical mode of communication (n = 4 natural voices, n = 4 electrolaryngeal [EL] voices). Naïve listeners (n = 12) evaluated synthetic, natural, and EL speech for acceptability and intelligibility in a visual sort-and-rate task, as well as phrasal stress discriminability via a classification mechanism. Results: Recorded sEMG signals were processed to translate sEMG muscle activity into lexical content and categorize variations in phrase-level stress, achieving a mean accuracy of 96.3% (SD = 3.10%) and 91.2% (SD = 4.46%), respectively. Synthetic speech was significantly higher in acceptability and intelligibility than EL speech, also leading to greater phrasal stress classification accuracy, whereas natural speech was rated as the most acceptable and intelligible, with the greatest phrasal stress classification accuracy. Conclusion: This proof-of-concept study establishes the feasibility of using subvocal sEMG-based alternative communication not only for lexical recognition but also for prosodic communication in healthy individuals, as well as those living with vocal impairments and residual articulatory function. Supplemental Material: https://doi.org/10.23641/asha.14558481.