Memories are One-to-Many Mapping Alleviators in Talking Face Generation.

Author information

Tang Anni, He Tianyu, Tan Xu, Ling Jun, Li Runnan, Zhao Sheng, Bian Jiang, Song Li

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8758-8770. doi: 10.1109/TPAMI.2024.3409380. Epub 2024 Nov 6.

DOI:10.1109/TPAMI.2024.3409380

Abstract

Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio. According to the nature of audio to lip motions mapping, the same speech content may have different appearances even for the same person at different occasions. Such one-to-many mapping problem brings ambiguity during training and thus causes inferior visual results. Although this one-to-many mapping could be alleviated in part by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it is still insufficient since the prediction is produced without enough information (e.g., emotions, wrinkles, etc.). In this paper, we propose MemFace to complement the missing information with an implicit memory and an explicit memory that follow the sense of the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that our proposed MemFace surpasses all the state-of-the-art results across multiple scenarios consistently and significantly.
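The abstract does not specify how either memory is read. As a rough illustration only, the sketch below shows one generic way a learned key-value memory can supply extra information to a feature: a soft attention lookup. This is not the MemFace implementation. The function names (`memory_read`, `complement`), the scaled dot-product scoring, and the additive fusion are all assumptions made for this example.

```python
import math


def memory_read(query, keys, values):
    """Soft attention read from a key-value memory (illustrative only).

    query:  list of d floats (e.g. a hypothetical audio feature)
    keys:   list of M key vectors, each of length d
    values: list of M value vectors, each of length d_v
    Returns (weights, read): attention weights over the M slots and
    the weighted sum of the value vectors.
    """
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Numerically stable softmax over the memory slots.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    d_v = len(values[0])
    read = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(d_v)
    ]
    return weights, read


def complement(feature, keys, values):
    """Hypothetical fusion step: add the retrieved vector to the feature.

    Assumes the value vectors have the same length as the feature.
    """
    _, read = memory_read(feature, keys, values)
    return [f + r for f, r in zip(feature, read)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[2.0, 0.0], [0.0, 2.0]]
    weights, read = memory_read([1.0, 0.0], keys, values)
    print(weights, read)
```

In this toy setup the query is closer to the first key, so the first slot receives the larger weight and dominates the retrieved vector. In a trained model the keys and values would be learned parameters rather than hand-written lists.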

Similar articles

Talking Face Generation With Audio-Deduced Emotional Landmarks.

IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.

Learn2Talk: 3D Talking Face Learns from 2D Talking Face.

IEEE Trans Vis Comput Graph. 2024 Oct 7;PP. doi: 10.1109/TVCG.2024.3476275.

Photorealistic Audio-driven Video Portraits.

IEEE Trans Vis Comput Graph. 2020 Dec;26(12):3457-3466. doi: 10.1109/TVCG.2020.3023573. Epub 2020 Nov 10.

Toward Fine-Grained Talking Face Generation.

IEEE Trans Image Process. 2023;32:5794-5807. doi: 10.1109/TIP.2023.3323452. Epub 2023 Oct 24.

High-Fidelity and High-Efficiency Talking Portrait Synthesis With Detail-Aware Neural Radiance Fields.

IEEE Trans Vis Comput Graph. 2025 Sep;31(9):6022-6035. doi: 10.1109/TVCG.2024.3488960.

Audio2Gestures: Generating Diverse Gestures From Audio.

IEEE Trans Vis Comput Graph. 2024 Aug;30(8):4752-4766. doi: 10.1109/TVCG.2023.3276973. Epub 2024 Jul 1.

"Look who's talking!" Gaze Patterns for Implicit and Explicit Audio-Visual Speech Synchrony Detection in Children With High-Functioning Autism.

Autism Res. 2015 Jun;8(3):307-16. doi: 10.1002/aur.1447. Epub 2015 Jan 24.

DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation.

IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):2997-3012. doi: 10.1109/TPAMI.2023.3339964. Epub 2024 Apr 3.

Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement.

IEEE Trans Vis Comput Graph. 2024 Mar;30(3):1803-1820. doi: 10.1109/TVCG.2022.3230541. Epub 2024 Jan 30.
