Integrated visual transformer and flash attention for lip-to-speech generation GAN.

Affiliations

School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China.

Shaanxi Key Laboratory of Clothing Intelligence, School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China.

Publication information

Sci Rep. 2024 Feb 24;14(1):4525. doi: 10.1038/s41598-024-55248-6.

DOI:10.1038/s41598-024-55248-6

PMID:38402265

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10894270/

Abstract

Lip-to-Speech (LTS) generation is an emerging, rapidly evolving technology that has attracted wide attention and support. LTS has a range of promising applications, including assisting people with speech impairments and improving speech interaction in virtual assistants and robots. However, the technique faces two challenges: (1) recognition accuracy for Chinese lip-to-speech generation is poor, and (2) the generated speech, which varies widely, is poorly aligned with lip movements. Addressing these challenges will help advance LTS technology, enhance the communication abilities of individuals with disabilities, and improve their quality of life. Current lip-to-speech generation techniques usually employ a GAN architecture, but their primary weakness is insufficient joint modeling of local and global lip movements, which results in visual ambiguities and inadequate image representations. To solve these problems, we design Flash Attention GAN (FA-GAN) with the following features: (1) Vision and audio are encoded separately, and lip motion is jointly modeled to improve speech recognition accuracy. (2) A multilevel Swin-transformer is introduced to improve image representation. (3) A hierarchical iterative generator is introduced to improve speech generation. (4) A flash attention mechanism is introduced to improve computational efficiency. Extensive experiments indicate that FA-GAN outperforms existing architectures on both Chinese and English datasets; in particular, its recognition error rate on Chinese is 43.19%, the lowest among methods of the same type.
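
The abstract credits the flash attention mechanism with improving computational efficiency but does not describe how. As a rough illustration only, and not the authors' implementation, the NumPy sketch below shows the numerical idea that flash-attention-style methods rely on: attention is computed over key/value blocks with a running ("online") softmax, so the full query-by-key score matrix is never stored at once. The function names, shapes, and block size are assumptions made for this example. A real flash attention kernel is a fused GPU routine that also tiles the queries and keeps blocks in on-chip memory, which this sketch does not attempt to model.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference softmax attention that materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def blockwise_attention(q, k, v, block_size=4):
    """Same result as naive_attention, computed one key/value block at a time.

    Hypothetical helper for illustration: it keeps a running row maximum and
    a running softmax denominator, and rescales the partial output whenever
    the maximum changes.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running max of scores per query
    row_sum = np.zeros((n, 1))          # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        scores = (q @ k_blk.T) * scale  # only an (n, block) slice of scores
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))

        # Rescale what was accumulated under the old maximum.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

Because the rescaling is exact, the blockwise result should match the naive one up to floating-point error; the saving is in peak memory rather than in the arithmetic performed.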
