Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network.

Affiliations

Department of Electronic Engineering, Hanyang University, Seoul, South Korea.

Department of Communication Disorders, Ewha Womans University, Seoul, South Korea.

Publication information

Comput Biol Med. 2024 Nov;182:109090. doi: 10.1016/j.compbiomed.2024.109090. Epub 2024 Sep 3.

Abstract

Silent speech interfaces (SSIs) have emerged as innovative non-acoustic communication methods, and our previous study demonstrated the significant potential of three-axis accelerometer-based SSIs to identify silently spoken words with high classification accuracy. The developed accelerometer-based SSI with only four accelerometers and a small training dataset outperformed a conventional surface electromyography (sEMG)-based SSI. In this study, motivated by the promising initial results, we investigated the feasibility of synthesizing spoken speech from three-axis accelerometer signals. This exploration aimed to assess the potential of accelerometer-based SSIs for practical silent communication applications. Nineteen healthy individuals participated in our experiments. Five accelerometers were attached to the face to acquire speech-related facial movements while the participants read 270 Korean sentences aloud. For the speech synthesis, we used a convolution-augmented Transformer (Conformer)-based deep neural network model to convert the accelerometer signals into a Mel spectrogram, from which an audio waveform was synthesized using HiFi-GAN. To evaluate the quality of the generated Mel spectrograms, ten-fold cross-validation was performed, and the Mel cepstral distortion (MCD) was chosen as the evaluation metric. As a result, an average MCD of 5.03 ± 0.65 was achieved using four optimized accelerometers based on our previous study. Furthermore, the quality of generated Mel spectrograms was significantly enhanced by adding one more accelerometer attached under the chin, achieving an average MCD of 4.86 ± 0.65 (p < 0.001, Wilcoxon signed-rank test). Although an objective comparison is difficult, these results surpass those obtained using conventional SSIs based on sEMG, electromagnetic articulography, and electropalatography with the fewest sensors and a similar or smaller number of sentences to train the model. 
Our proposed approach will contribute to the widespread adoption of accelerometer-based SSIs by leveraging the advantages of accelerometers, such as low power consumption, invulnerability to physiological artifacts, and high portability.
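
The abstract reports Mel cepstral distortion (MCD) as its evaluation metric but does not spell out the formula. As a rough illustration only, the sketch below uses the commonly cited definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. The function name `mel_cepstral_distortion`, the choice to skip the 0th (energy) coefficient, and the assumption that the two sequences are already time-aligned are all assumptions made here, not details taken from the paper.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD (in dB) between two time-aligned mel-cepstral sequences.

    Both inputs have shape (frames, coefficients). Column 0 is treated as
    the energy term and excluded, which is a common convention but an
    assumption here, not something stated in the abstract.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.ndim != 2 or ref.shape != syn.shape:
        raise ValueError("inputs must be 2-D arrays with identical shapes")

    # Per-frame squared differences over coefficients 1..D.
    diff = ref[:, 1:] - syn[:, 1:]
    squared_sum = np.sum(diff ** 2, axis=1)

    # Standard scaling constant converts the distance to decibels.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * squared_sum)
    return float(per_frame.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(100, 25))  # hypothetical 25-dim features
    synthesized = reference + 0.1 * rng.normal(size=(100, 25))
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.2f} dB")
```

Lower values indicate a closer match between the synthesized and reference spectra. In practice the mel-cepstral coefficients would first be extracted from the Mel spectrograms and the sequences aligned (for example by dynamic time warping) if their lengths differ; neither step is shown here.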
