Zhang Zhijun, Zhang Jian, Mai Weijian
School of Automation Science and Engineering, South China University of Technology, China; Key Laboratory of Autonomous Systems and Network Control, Ministry of Education, China; Jiangxi Thousand Talents Plan, Nanchang University, China; College of Computer Science and Engineering, Jishou University, China; Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), China; Shaanxi Provincial Key Laboratory of Industrial Automation, School of Mechanical Engineering, Shaanxi University of Technology, Hanzhong, China; School of Information Science and Engineering, Changsha Normal University, Changsha, China; School of Automation Science and Engineering, and also with the Institute of Artificial Intelligence and Automation, Guangdong University of Petrochemical Technology, Maoming, China; Key Laboratory of Large-Model Embodied-Intelligent Humanoid Robot (2024KSYS004), China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
School of Automation Science and Engineering, South China University of Technology, China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
Neural Netw. 2025 Apr;184:107122. doi: 10.1016/j.neunet.2025.107122. Epub 2025 Jan 9.
Talking face generation has promising applications in various domains, such as digital assistants, video editing, and virtual video conferencing. Previous work on audio-driven talking faces focused primarily on the synchronization between audio and video. However, existing methods still have limitations in synthesizing photo-realistic video with high identity preservation, audiovisual synchronization, and facial details such as blink movements. To address these problems, a novel talking face generation framework, termed the video portraits transformer (VPT), with controllable blink movements is proposed and applied. It separates video generation into two stages, i.e., an audio-to-landmark stage and a landmark-to-face stage. In the audio-to-landmark stage, a transformer encoder serves as the generator, predicting the full set of facial landmarks from the given audio and a continuous eye aspect ratio (EAR) signal. In the landmark-to-face stage, a video-to-video (vid-to-vid) network transfers the landmarks into realistic talking face videos. Moreover, to imitate real blink movements during inference, a transformer-based spontaneous blink generation module is devised to generate the EAR sequence. Extensive experiments demonstrate that VPT produces photo-realistic talking face videos with natural blink movements, and that the spontaneous blink generation module generates blinks close to the real blink duration distribution and frequency.
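The abstract does not define EAR, but the standard eye aspect ratio from the blink-detection literature (Soukupová and Čech, 2016) is computed from six eye landmarks; the following minimal Python sketch assumes that definition, with the landmark ordering of the common 68-point face annotation (the function name and input layout are illustrative):

import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """Eye aspect ratio (EAR) from six 2-D eye landmarks.

    `eye` is a (6, 2) array ordered p1..p6: p1/p4 are the horizontal
    eye corners, (p2, p6) and (p3, p5) the upper/lower lid pairs.
    Assumed definition, following Soukupova and Cech (2016).
    """
    a = np.linalg.norm(eye[1] - eye[5])  # ||p2 - p6||, first vertical span
    b = np.linalg.norm(eye[2] - eye[4])  # ||p3 - p5||, second vertical span
    c = np.linalg.norm(eye[0] - eye[3])  # ||p1 - p4||, horizontal span
    return (a + b) / (2.0 * c)           # EAR dips toward 0 during a blink

A blink thus appears as a brief dip in the per-frame EAR sequence, which is the signal the spontaneous blink generation module is described as producing.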
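For a concrete sense of the audio-to-landmark stage, the sketch below shows a hypothetical transformer encoder that maps per-frame audio features, concatenated with the continuous EAR control value, to 2-D facial landmarks. All dimensions, feature choices, and names are assumptions made for illustration, not the paper's implementation:

import torch
import torch.nn as nn

class AudioToLandmark(nn.Module):
    """Hypothetical sketch of the audio-to-landmark stage: a transformer
    encoder predicts 68 2-D facial landmarks per frame from audio
    features plus a scalar EAR control signal. Sizes are illustrative."""

    def __init__(self, audio_dim=80, d_model=256, n_landmarks=68):
        super().__init__()
        self.proj = nn.Linear(audio_dim + 1, d_model)  # +1 for the EAR scalar
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats, ear):
        # audio_feats: (B, T, audio_dim), e.g. mel-spectrogram frames
        # ear:         (B, T) continuous eye-aspect-ratio control signal
        x = torch.cat([audio_feats, ear.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.proj(x))
        return self.head(h).view(*h.shape[:2], -1, 2)  # (B, T, 68, 2)

In the full pipeline described by the abstract, the landmark sequence produced by such a stage would then be rendered into photo-realistic frames by the vid-to-vid network, with the EAR input at inference supplied by the spontaneous blink generation module.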