Xiong Wenjing, Ma Lin, Li Haifeng
Faculty of Computing, Harbin Institute of Technology, Harbin, China.
Front Neurosci. 2025 Apr 17;19:1565848. doi: 10.3389/fnins.2025.1565848. eCollection 2025.
Decoding natural language directly from neural activity is of great interest to people with limited means of communication. As a non-invasive and convenient approach, electroencephalography (EEG) offers better portability and wider application potential. We present an EEG-to-speech system (ETS) that synthesizes audible, highly intelligible speech from the EEG of imagined speech. The ETS uses a specially designed X-shaped deep neural network (DNN) to learn an end-to-end mapping between imagined-speech EEG and the corresponding speech. The system incorporates dynamic time warping into the network's training process, using EEG recorded during actual speech as a bridge to temporally align imagined-speech EEG signals with speech signals. ETS performance was evaluated on 13 participants who had been trained in advance on four Chinese disyllabic words. These words cover all four tones and 40% of the phonemes in Chinese. The average accuracy of the synthesized utterances across all subjects was 91.23%, with an average mean opinion score (MOS) of 3.50; the best accuracy was 99%, with an MOS of 3.99. Furthermore, a partial feedback mechanism for the DNN and spectral subtraction-based speech enhancement are introduced to improve the quality of the generated speech. Our findings show that non-invasive approaches can be a significant step toward regaining verbal language ability.
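The abstract does not spell out how dynamic time warping is applied inside training, so the following is only a minimal, generic sketch of the alignment idea, not the authors' implementation. It aligns two one-dimensional toy sequences that stand in for feature tracks of different duration; the function name `dtw_path` and the toy data are assumptions made for illustration.

```python
def dtw_path(a, b):
    """Classic dynamic time warping between two 1-D sequences.

    Returns the total alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs. Generic textbook DTW; hypothetical
    helper, not taken from the paper.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy stand-ins: the second sequence lingers one step longer on the value 1.
imagined = [0.0, 1.0, 2.0, 3.0]
spoken = [0.0, 1.0, 1.0, 2.0, 3.0]
distance, path = dtw_path(imagined, spoken)
# Re-sample the first sequence along the path so it matches the second in time.
warped = [imagined[i] for i, _ in path]
```

In a setting like the one described, the two inputs would be multichannel feature sequences with a vector distance in place of `abs`, and the resulting path would be used to pair imagined-speech frames with time-locked speech frames.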