VocaliD, Inc. 50 Leonard St, Belmont, MA 02478, United States of America.
J Neural Eng. 2018 Aug;15(4):046031. doi: 10.1088/1741-2552/aac965. Epub 2018 Jun 1.
Speech is among the most natural forms of human communication, making it an attractive modality for human-machine interaction through automatic speech recognition (ASR). However, the limitations of ASR, including degradation in the presence of ambient noise, limited privacy, and poor accessibility for those with significant speech disorders, have motivated the need for alternative non-acoustic modalities of subvocal or silent speech recognition (SSR).
We have developed a new system of face- and neck-worn sensors and signal processing algorithms that can recognize silently mouthed words and phrases entirely from the surface electromyographic (sEMG) signals recorded from the muscles of the face and neck involved in speech production. The algorithms were developed by successively evolving the speech recognition models: first for recognizing isolated words by extracting speech-related features from sEMG signals, then for recognizing sequences of words from patterns of sEMG signals using grammar models, and finally for recognizing a vocabulary of previously untrained words using phoneme-based models. The final recognition algorithms were integrated with specially designed multi-point, miniaturized sensors that can be arranged in flexible geometries to record high-fidelity sEMG measurements from the small articulator muscles of the face and neck.
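The abstract does not give implementation details for the feature-extraction front end, but the isolated-word stage implies converting each sEMG channel into short-time features. As a minimal illustrative sketch only (the sampling rate, window sizes, and specific features below are common time-domain sEMG descriptors, not the authors' published design), such a front end might look like:

```python
import numpy as np

def semg_features(signal, fs=2000, win_ms=50, hop_ms=25):
    """Per-window time-domain features for one sEMG channel.

    Features: root-mean-square amplitude, zero-crossing count, and
    waveform length -- standard descriptors in the sEMG literature.
    All parameters here are illustrative assumptions.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win]
        rms = np.sqrt(np.mean(w ** 2))                        # signal energy
        zc = np.sum(np.signbit(w[:-1]) != np.signbit(w[1:]))  # sign changes
        wl = np.sum(np.abs(np.diff(w)))                       # waveform length
        feats.append((rms, zc, wl))
    return np.array(feats)

# Example: 1 s of synthetic "muscle burst" activity (shaped noise)
rng = np.random.default_rng(0)
sig = rng.standard_normal(2000) * np.hanning(2000)
F = semg_features(sig)
print(F.shape)  # one row per window, three features each
```

Feature matrices like `F` (stacked across channels) would then feed a word- or phoneme-level sequence model, as the pipeline above describes.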
We tested the system of sensors and algorithms in a series of subvocal speech experiments involving more than 1200 phrases generated from a 2200-word vocabulary, achieving an 8.9% word error rate (91.1% recognition rate), far surpassing previous attempts in the field.
These results demonstrate the viability of our system as an alternative modality of communication for a multitude of applications, including persons with speech impairments following a laryngectomy, military personnel requiring hands-free covert communication, and consumers in need of privacy while speaking on a mobile phone in public.