

Continuous Talking Face Generation Based on Gaussian Blur and Dynamic Convolution

Authors

Tang Ying, Liu Yazhi, Li Wei

Affiliations

College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China.

Publication

Sensors (Basel). 2025 Mar 18;25(6):1885. doi: 10.3390/s25061885.

DOI: 10.3390/s25061885
PMID: 40293012
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11945506/
Abstract

In the field of talking face generation, two-stage audio-based generation methods have attracted significant research interest. However, these methods still face challenges in achieving lip-audio synchronization during face generation, as well as issues with the discontinuity between the generated parts and original face in rendered videos. To overcome these challenges, this paper proposes a two-stage talking face generation method. The first stage is the landmark generation stage. A dynamic convolutional transformer generator is designed to capture complex facial movements. A dual-pipeline parallel processing mechanism is adopted to enhance the temporal feature correlation of input features and the ability to model details at the spatial scale. In the second stage, a dynamic Gaussian renderer (adaptive Gaussian renderer) is designed to realize seamless and natural connection of the upper- and lower-boundary areas through a Gaussian blur masking technique. We conducted quantitative analyses on the LRS2, HDTF, and MEAD neutral expression datasets. Experimental results demonstrate that, compared with existing methods, our approach significantly improves the realism and lip-audio synchronization of talking face videos. In particular, on the LRS2 dataset, the lip-audio synchronization rate was improved by 18.16% and the peak signal-to-noise ratio was improved by 12.11% compared to state-of-the-art works.
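The second-stage idea the abstract describes — feathering the boundary between the generated face region and the original frame with a Gaussian-blurred mask — can be sketched in a few lines. This is a minimal illustration of the general blending technique, not the paper's actual renderer; the function names, the rectangular mask geometry, and the sigma value are all assumptions for the example.

```python
# Illustrative sketch: paste a generated region into the original frame,
# softening the seam with a Gaussian-blurred binary mask so the transition
# between generated and original pixels is gradual rather than a hard edge.
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur2d(img, sigma):
    """Separable Gaussian blur of a 2-D array (reflect padding)."""
    radius = int(3 * sigma)
    k = gaussian_kernel1d(sigma, radius)
    pad = np.pad(img, radius, mode="reflect")
    # Blur rows first, then columns; 'valid' trims the padding back off.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, tmp)

def blend_with_gaussian_mask(original, generated, box, sigma=5.0):
    """Blend `generated` into `original` inside `box` = (y0, y1, x0, x1),
    using a Gaussian-blurred mask to feather the boundary."""
    y0, y1, x0, x1 = box
    mask = np.zeros(original.shape[:2], dtype=np.float64)
    mask[y0:y1, x0:x1] = 1.0
    soft = blur2d(mask, sigma)[..., None]  # soft edge weights in [0, 1]
    return soft * generated + (1.0 - soft) * original

# Toy usage: a 64x64 black "frame" whose lower half is replaced by a
# generated (here: all-white) region.
frame = np.zeros((64, 64, 3))
fake = np.ones((64, 64, 3))
out = blend_with_gaussian_mask(frame, fake, box=(32, 64, 0, 64))
```

Deep inside the box the output is the generated content; far outside it, the original frame is untouched; near y = 32 the two are mixed with smoothly varying weights, which is what removes the visible seam between the rendered and original face areas.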


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/5509d7ce38cc/sensors-25-01885-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/7cb5ca3e4f5d/sensors-25-01885-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/2b2824bafd46/sensors-25-01885-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/cacac50591d1/sensors-25-01885-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/9d5f9b332e24/sensors-25-01885-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10ab/11945506/67802ba87b09/sensors-25-01885-g006.jpg

Similar articles

1. Continuous Talking Face Generation Based on Gaussian Blur and Dynamic Convolution.
Sensors (Basel). 2025 Mar 18;25(6):1885. doi: 10.3390/s25061885.
2. VPT: Video portraits transformer for realistic talking face generation.
Neural Netw. 2025 Apr;184:107122. doi: 10.1016/j.neunet.2025.107122. Epub 2025 Jan 9.
3. Talking Face Generation With Audio-Deduced Emotional Landmarks.
IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.
4. Learn2Talk: 3D Talking Face Learns from 2D Talking Face.
IEEE Trans Vis Comput Graph. 2024 Oct 7;PP. doi: 10.1109/TVCG.2024.3476275.
5. GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance.
IEEE Trans Vis Comput Graph. 2025 May 2;PP. doi: 10.1109/TVCG.2025.3566382.
6. Toward Fine-Grained Talking Face Generation.
IEEE Trans Image Process. 2023;32:5794-5807. doi: 10.1109/TIP.2023.3323452. Epub 2023 Oct 24.
7. Deep Audio-Visual Speech Recognition.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):8717-8727. doi: 10.1109/TPAMI.2018.2889052. Epub 2022 Nov 7.
8. StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads.
IEEE Trans Pattern Anal Mach Intell. 2024 Jun;46(6):4331-4347. doi: 10.1109/TPAMI.2024.3357808. Epub 2024 May 7.
9. 3D Talking Face With Personalized Pose Dynamics.
IEEE Trans Vis Comput Graph. 2023 Feb;29(2):1438-1449. doi: 10.1109/TVCG.2021.3117484. Epub 2022 Dec 29.
10. Generating Talking Face With Controllable Eye Movements by Disentangled Blinking Feature.
IEEE Trans Vis Comput Graph. 2023 Dec;29(12):5050-5061. doi: 10.1109/TVCG.2022.3199412. Epub 2023 Nov 10.

Cited by

1. Design of Realistic and Artistically Expressive 3D Facial Models for Film AIGC: A Cross-Modal Framework Integrating Audience Perception Evaluation.
Sensors (Basel). 2025 Jul 26;25(15):4646. doi: 10.3390/s25154646.

References

1. Accurate Real-Time Live Face Detection Using Snapshot Spectral Imaging Method.
Sensors (Basel). 2025 Feb 5;25(3):952. doi: 10.3390/s25030952.
2. Pose-Aware 3D Talking Face Synthesis Using Geometry-Guided Audio-Vertices Attention.
IEEE Trans Vis Comput Graph. 2025 Mar;31(3):1758-1771. doi: 10.1109/TVCG.2024.3371064. Epub 2025 Jan 30.
3. DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation.
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):2997-3012. doi: 10.1109/TPAMI.2023.3339964. Epub 2024 Apr 3.