
VPT: Video portraits transformer for realistic talking face generation.

Authors

Zhang Zhijun, Zhang Jian, Mai Weijian

Affiliations

School of Automation Science and Engineering, South China University of Technology, China; Key Laboratory of Autonomous Systems and Network Control, Ministry of Education, China; Jiangxi Thousand Talents Plan, Nanchang University, China; College of Computer Science and Engineering, Jishou University, China; Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), China; Shaanxi Provincial Key Laboratory of Industrial Automation, School of Mechanical Engineering, Shaanxi University of Technology, Hanzhong, China; School of Information Science and Engineering, Changsha Normal University, Changsha, China; School of Automation Science and Engineering, and also with the Institute of Artificial Intelligence and Automation, Guangdong University of Petrochemical Technology, Maoming, China; Key Laboratory of Large-Model Embodied-Intelligent Humanoid Robot (2024KSYS004), China; The Institute for Super Robotics (Huangpu), Guangzhou, China.

School of Automation Science and Engineering, South China University of Technology, China; The Institute for Super Robotics (Huangpu), Guangzhou, China.

Publication

Neural Netw. 2025 Apr;184:107122. doi: 10.1016/j.neunet.2025.107122. Epub 2025 Jan 9.

DOI: 10.1016/j.neunet.2025.107122
PMID: 39799718
Abstract

Talking face generation is a promising approach within various domains, such as digital assistants, video editing, and virtual video conferences. Previous works with audio-driven talking faces focused primarily on the synchronization between audio and video. However, existing methods still have certain limitations in synthesizing photo-realistic video with high identity preservation, audiovisual synchronization, and facial details like blink movements. To solve these problems, a novel talking face generation framework, termed video portraits transformer (VPT) with controllable blink movements is proposed and applied. It separates the process of video generation into two stages, i.e., audio-to-landmark and landmark-to-face stages. In the audio-to-landmark stage, the transformer encoder serves as the generator used for predicting whole facial landmarks from given audio and continuous eye aspect ratio (EAR). During the landmark-to-face stage, the video-to-video (vid-to-vid) network is employed to transfer landmarks into realistic talking face videos. Moreover, to imitate real blink movements during inference, a transformer-based spontaneous blink generation module is devised to generate the EAR sequence. Extensive experiments demonstrate that the VPT method can produce photo-realistic videos of talking faces with natural blink movements, and the spontaneous blink generation module can generate blink movements close to the real blink duration distribution and frequency.
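The eye aspect ratio (EAR) the abstract conditions on is the standard blink-detection measure computed from six eye landmarks: the ratio of the two vertical lid distances to the horizontal corner-to-corner distance, so it stays roughly constant while the eye is open and drops toward zero during a blink. A minimal sketch (the landmark coordinates below are illustrative, not from the paper):

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR from six 2-D eye landmarks ordered p1..p6:
    p1/p4 are the eye corners, p2/p3 the upper lid, p6/p5 the lower lid."""
    # vertical distances between upper- and lower-lid landmark pairs
    v1 = np.linalg.norm(eye[1] - eye[5])  # p2 - p6
    v2 = np.linalg.norm(eye[2] - eye[4])  # p3 - p5
    # horizontal distance between the eye corners
    h = np.linalg.norm(eye[0] - eye[3])   # p1 - p4
    return (v1 + v2) / (2.0 * h)

# Toy landmarks (hypothetical): an open eye is tall relative to its width,
# so its EAR is clearly larger than that of a nearly closed eye.
open_eye = np.array([[0, 0], [2, 1], [4, 1], [6, 0], [4, -1], [2, -1]], float)
closed_eye = np.array([[0, 0], [2, 0.1], [4, 0.1], [6, 0], [4, -0.1], [2, -0.1]], float)

print(eye_aspect_ratio(open_eye))    # ~0.333
print(eye_aspect_ratio(closed_eye))  # ~0.033
```

A continuous sequence of such EAR values per frame is what the paper's blink-generation module produces and what the audio-to-landmark stage consumes as its eye-openness control signal.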


Similar Articles

1. VPT: Video portraits transformer for realistic talking face generation. Neural Netw. 2025 Apr;184:107122. doi: 10.1016/j.neunet.2025.107122. Epub 2025 Jan 9.
2. Talking Face Generation With Audio-Deduced Emotional Landmarks. IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.
3. Generating Talking Face With Controllable Eye Movements by Disentangled Blinking Feature. IEEE Trans Vis Comput Graph. 2023 Dec;29(12):5050-5061. doi: 10.1109/TVCG.2022.3199412. Epub 2023 Nov 10.
4. Continuous Talking Face Generation Based on Gaussian Blur and Dynamic Convolution. Sensors (Basel). 2025 Mar 18;25(6):1885. doi: 10.3390/s25061885.
5. Toward Fine-Grained Talking Face Generation. IEEE Trans Image Process. 2023;32:5794-5807. doi: 10.1109/TIP.2023.3323452. Epub 2023 Oct 24.
6. StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads. IEEE Trans Pattern Anal Mach Intell. 2024 Jun;46(6):4331-4347. doi: 10.1109/TPAMI.2024.3357808. Epub 2024 May 7.
7. Blink synchronization is an indicator of interest while viewing videos. Int J Psychophysiol. 2019 Jan;135:1-11. doi: 10.1016/j.ijpsycho.2018.10.012. Epub 2018 Nov 11.
8. GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance. IEEE Trans Vis Comput Graph. 2025 May 2;PP. doi: 10.1109/TVCG.2025.3566382.
9. Adjusting eye aspect ratio for strong eye blink detection based on facial landmarks. PeerJ Comput Sci. 2022 Apr 18;8:e943. doi: 10.7717/peerj-cs.943. eCollection 2022.
10. Learn2Talk: 3D Talking Face Learns from 2D Talking Face. IEEE Trans Vis Comput Graph. 2024 Oct 7;PP. doi: 10.1109/TVCG.2024.3476275.

Cited By

1. TS-Resformer: a model based on multimodal fusion for the classification of music signals. Front Neurorobot. 2025 May 13;19:1568811. doi: 10.3389/fnbot.2025.1568811. eCollection 2025.
2. Advances in Zeroing Neural Networks: Bio-Inspired Structures, Performance Enhancements, and Applications. Biomimetics (Basel). 2025 Apr 29;10(5):279. doi: 10.3390/biomimetics10050279.