Continuous Talking Face Generation Based on Gaussian Blur and Dynamic Convolution.

Authors

Tang Ying, Liu Yazhi, Li Wei

Affiliations

College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China.

Publication Information

Sensors (Basel). 2025 Mar 18;25(6):1885. doi: 10.3390/s25061885.

Abstract

In the field of talking face generation, two-stage audio-based generation methods have attracted significant research interest. However, these methods still struggle to achieve lip-audio synchronization during face generation, and the rendered videos often show a visible discontinuity between the generated region and the original face. To overcome these challenges, this paper proposes a two-stage talking face generation method. The first stage is the landmark generation stage, in which a dynamic convolutional transformer generator is designed to capture complex facial movements; a dual-pipeline parallel processing mechanism strengthens the temporal correlation of the input features and the ability to model detail at the spatial scale. In the second stage, a dynamic Gaussian renderer (adaptive Gaussian renderer) is designed to join the upper- and lower-boundary regions seamlessly and naturally through a Gaussian blur masking technique. We conducted quantitative analyses on the LRS2, HDTF, and MEAD neutral-expression datasets. Experimental results demonstrate that, compared with existing methods, our approach significantly improves the realism and lip-audio synchronization of talking face videos. In particular, on the LRS2 dataset, the lip-audio synchronization rate improved by 18.16% and the peak signal-to-noise ratio by 12.11% over state-of-the-art methods.
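
The abstract does not spell out the generator's internals, but the core idea of dynamic convolution, i.e., input-conditioned kernels mixed from a small bank of experts, can be sketched as follows. This is a generic PyTorch illustration (after Chen et al.'s attention-over-kernels formulation), not the paper's actual layer; the module name, expert count, and 1D temporal setting are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Dynamic convolution: a per-sample mixture of K expert kernels.

    Generic sketch only; the paper's generator design is not given
    in the abstract.
    """
    def __init__(self, in_ch, out_ch, kernel_size, num_experts=4):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.kernel_size = kernel_size
        # K expert kernels, aggregated per sample at run time.
        self.weight = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size) * 0.02)
        # Tiny attention head: global pooling -> logits over experts.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(in_ch, num_experts))

    def forward(self, x):                     # x: (B, C_in, T)
        b = x.size(0)
        pi = F.softmax(self.attn(x), dim=-1)  # (B, K) expert weights
        # Assemble each sample's kernel: (B, C_out, C_in, k).
        w = torch.einsum('bk,koik->boik'.replace('koik', 'kois')
                         .replace('boik', 'bois'), pi, self.weight) \
            if False else torch.einsum('bk,kois->bois', pi, self.weight)
        # Grouped conv applies each sample's own kernel in one call.
        x = x.reshape(1, b * self.in_ch, -1)
        w = w.reshape(b * self.out_ch, self.in_ch, self.kernel_size)
        y = F.conv1d(x, w, padding=self.kernel_size // 2, groups=b)
        return y.reshape(b, self.out_ch, -1)
```

Because the kernel is assembled per input, the same layer can adapt its response to different audio or landmark sequences, which is what makes dynamic convolution attractive for modeling varied facial motion.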

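Likewise, the Gaussian blur masking that the renderer uses to join the regenerated region with the original frame amounts to alpha blending with a feathered mask. Below is a minimal sketch assuming OpenCV and a rectangular region of interest; the function name, the `mouth_box` parameter, and the `sigma` default are illustrative, not the paper's adaptive renderer.

```python
import cv2
import numpy as np

def blend_with_gaussian_mask(original, generated, mouth_box, sigma=15):
    """Paste the generated lower-face region into the original frame,
    feathering the seam with a Gaussian-blurred mask.

    original, generated: HxWx3 uint8 frames of the same size.
    mouth_box: (x, y, w, h) covering the re-rendered region (assumed).
    """
    h, w = original.shape[:2]
    x, y, bw, bh = mouth_box
    # Binary mask over the regenerated region ...
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y:y + bh, x:x + bw] = 1.0
    # ... softened so the boundary fades instead of cutting hard.
    mask = cv2.GaussianBlur(mask, (0, 0), sigmaX=sigma)
    mask = mask[..., None]  # broadcast over the color channels
    out = mask * generated.astype(np.float32) \
        + (1.0 - mask) * original.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Blurring the binary mask turns the hard paste boundary into a gradual transition, so the seam between generated and original pixels is far less visible from frame to frame.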

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/5509d7ce38cc/sensors-25-01885-g001.jpg
