IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.
The goal of talking face generation is to synthesize a sequence of face images of a specified identity, ensuring that the mouth movements are synchronized with the given audio. Recently, image-based talking face generation has emerged as a popular approach. It can generate talking face images synchronized with the audio from only a facial image of an arbitrary identity and an audio clip. Despite its accessible inputs, it neglects the emotion carried by the audio, causing the generated faces to suffer from unsynchronized emotion, inaccurate mouth shapes, and deficient image quality. In this article, we build a two-stage audio emotion-aware talking face generation (AMIGO) framework to generate high-quality talking face videos with cross-modally synchronized emotion. Specifically, in the first stage, we propose a sequence-to-sequence (seq2seq) cross-modal emotional landmark generation network to generate vivid landmarks whose lip movements and emotion are both synchronized with the input audio. Meanwhile, we utilize a coordinated visual emotion representation to improve the extraction of its audio counterpart. In the second stage, a feature-adaptive visual translation network is designed to translate the synthesized landmarks into facial images. Concretely, we propose a feature-adaptive transformation module to fuse the high-level representations of landmarks and images, significantly improving image quality. We perform extensive experiments on the multi-view emotional audio-visual dataset (MEAD) and crowd-sourced emotional multimodal actors dataset (CREMA-D) benchmarks, demonstrating that our model outperforms state-of-the-art methods.
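The abstract does not specify the internals of the feature-adaptive transformation module, so the following is a minimal sketch, assuming a SPADE/AdaIN-style conditioning scheme in which encoded landmark features predict per-location scale and shift parameters that modulate normalized image-decoder features. The class name `FeatureAdaptiveTransform`, the tensor shapes, and the `hidden` width are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a feature-adaptive fusion block: landmark features
# predict scale (gamma) and shift (beta) maps that modulate normalized image
# features, fusing the two high-level representations. Names and shapes are
# illustrative; the paper's actual module may differ.
import torch
import torch.nn as nn


class FeatureAdaptiveTransform(nn.Module):
    """Modulate image features with parameters predicted from landmark features."""

    def __init__(self, img_channels: int, lmk_channels: int, hidden: int = 128):
        super().__init__()
        # Shared trunk over the landmark feature map.
        self.shared = nn.Sequential(
            nn.Conv2d(lmk_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Separate heads predict per-location scale (gamma) and shift (beta).
        self.to_gamma = nn.Conv2d(hidden, img_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, img_channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(img_channels, affine=False)

    def forward(self, img_feat: torch.Tensor, lmk_feat: torch.Tensor) -> torch.Tensor:
        # Match the landmark features' spatial resolution to the image features.
        lmk_feat = nn.functional.interpolate(
            lmk_feat, size=img_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        h = self.shared(lmk_feat)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # Normalize the image features, then re-scale and re-shift them with
        # the landmark-conditioned parameters.
        return self.norm(img_feat) * (1.0 + gamma) + beta


if __name__ == "__main__":
    fat = FeatureAdaptiveTransform(img_channels=256, lmk_channels=64)
    img_feat = torch.randn(2, 256, 32, 32)  # decoder features of the face image
    lmk_feat = torch.randn(2, 64, 16, 16)   # encoded synthesized landmarks
    out = fat(img_feat, lmk_feat)
    print(out.shape)  # torch.Size([2, 256, 32, 32])
```

Under this reading, fusing landmarks via learned modulation rather than channel concatenation lets the landmark signal steer every decoder layer without overwriting the identity texture carried by the image features, which is one plausible route to the image-quality gains the abstract reports.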