Vojtech Jennifer M, Mitchell Claire L, Raiff Laura, Kline Joshua C, De Luca Gianluca
Delsys, Inc., Natick, MA 01760, USA.
Altec, Inc., Natick, MA 01760, USA.
Vibration. 2022 Dec;5(4):692-710. doi: 10.3390/vibration5040041. Epub 2022 Oct 13.
Silent speech interfaces (SSIs) enable speech recognition and synthesis in the absence of an acoustic signal. Yet the archetypal SSI fails to convey the expressive attributes of prosody, such as pitch and loudness, leading to lexical ambiguities. The aim of this study was to determine the efficacy of using surface electromyography (sEMG) to predict continuous acoustic estimates of prosody. Ten participants performed a series of vocal tasks, including sustained vowels, phrases, and monologues, while acoustic data were recorded simultaneously with sEMG activity from muscles of the face and neck. A battery of time-, frequency-, and cepstral-domain features extracted from the sEMG signals was used to train deep regression neural networks to predict the fundamental frequency and intensity contours derived from the acoustic signals. We achieved an average accuracy of 0.01 semitones (ST) and precision of 0.56 ST for the estimation of fundamental frequency, and an average accuracy of 0.21 dB SPL and precision of 3.25 dB SPL for the estimation of intensity. This work highlights the importance of using sEMG as an alternative means of detecting prosody and shows promise for improving SSIs in future development.
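The abstract reports fundamental frequency errors in semitones but does not define "accuracy" and "precision". As a minimal sketch, assuming accuracy means the mean signed error between predicted and reference contours and precision means the standard deviation of that error, the two quantities could be computed as below. The 100 Hz reference frequency, the function names, and the sample values are all hypothetical and are not taken from the study.

```python
import math
import statistics


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz.

    ref_hz = 100.0 is an illustrative choice, not a value from the study.
    """
    return 12.0 * math.log2(f0_hz / ref_hz)


def bias_and_spread(predicted, reference):
    """Return (mean signed error, sample standard deviation of the error).

    Assumption: these stand in for the abstract's "accuracy" and
    "precision"; the abstract itself does not give the definitions.
    """
    errors = [p - r for p, r in zip(predicted, reference)]
    return statistics.mean(errors), statistics.stdev(errors)


# Made-up contours in Hz, for illustration only (not study data).
reference_hz = [110.0, 120.0, 130.0, 125.0]
predicted_hz = [112.0, 118.0, 131.0, 124.0]

reference_st = [hz_to_semitones(f) for f in reference_hz]
predicted_st = [hz_to_semitones(f) for f in predicted_hz]

bias_st, spread_st = bias_and_spread(predicted_st, reference_st)
print(f"mean error: {bias_st:.3f} ST, error SD: {spread_st:.3f} ST")
```

Because a semitone difference is a frequency ratio, the error between two contours does not depend on the reference frequency chosen. The same `bias_and_spread` helper would apply unchanged to intensity contours already expressed in dB SPL.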