Meltzner Geoffrey S, Heaton James T, Deng Yunbin, De Luca Gianluca, Roy Serge H, Kline Joshua C
VocaliD, Inc. Belmont, MA, 02478, USA.
Harvard Medical School in the Department of Surgery, Massachusetts General Hospital Voice Center, Boston, MA 02114.
IEEE/ACM Trans Audio Speech Lang Process. 2017 Dec;25(12):2386-2398. doi: 10.1109/TASLP.2017.2740000. Epub 2017 Nov 28.
Each year thousands of individuals require surgical removal of their larynx (voice box) due to trauma or disease, and thereby require an alternative voice source or assistive device to communicate verbally. Although natural voice is lost after laryngectomy, most muscles controlling speech articulation remain intact. Surface electromyographic (sEMG) activity of speech musculature can be recorded from the neck and face and used for automatic speech recognition to provide speech-to-text or synthesized speech as an alternative means of communication. This is true even when speech is mouthed or spoken in a silent (subvocal) manner, making it an appropriate communication platform after laryngectomy. In this study, 8 individuals at least 6 months after total laryngectomy were recorded using 8 sEMG sensors on their face (4) and neck (4) while reading phrases constructed from a 2,500-word vocabulary. A unique set of phrases was used for training phoneme-based recognition models for each of the 39 commonly used phonemes in English, and the remaining phrases were used for testing word recognition of the models based on phoneme identification from running speech. Word error rates were on average 10.3% for the full 8-sensor set (averaging 9.5% for the top 4 participants), and 13.6% when reducing the sensor set to 4 locations per individual (n=7). This study provides a compelling proof-of-concept for sEMG-based alaryngeal speech recognition, with strong potential to further improve recognition performance.
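The abstract reports word error rates but does not say how they were scored. As a minimal sketch, assuming the conventional definition (substitutions + deletions + insertions, divided by the number of reference words, from a word-level edit-distance alignment), the metric could be computed as follows. The function name `word_error_rate` and the example phrases are hypothetical and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate under the conventional definition (an assumption here):
    minimum word-level edits (substitute, delete, insert) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical phrases for illustration only.
    print(word_error_rate("please call me back later", "please call be back"))
```

Under this definition, a rate of 10.3% corresponds to roughly one word-level error for every ten reference words.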