
Talking Face Generation With Audio-Deduced Emotional Landmarks.

Publication Info

IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.

DOI: 10.1109/TNNLS.2023.3274676
PMID: 37216233
Abstract

The goal of talking face generation is to synthesize a sequence of face images of a specified identity, ensuring the mouth movements are synchronized with the given audio. Recently, image-based talking face generation has emerged as a popular approach. It can generate talking face images synchronized with the audio from nothing more than a facial image of arbitrary identity and an audio clip. Despite this accessible input, it forgoes exploitation of the audio's emotion, causing the generated faces to suffer from emotion unsynchronization, mouth inaccuracy, and image quality deficiency. In this article, we build a bistage audio emotion-aware talking face generation (AMIGO) framework to generate high-quality talking face videos with cross-modally synced emotion. Specifically, we propose a sequence-to-sequence (seq2seq) cross-modal emotional landmark generation network to generate vivid landmarks whose lip motion and emotion are both synchronized with the input audio. Meanwhile, we utilize a coordinated visual emotion representation to improve the extraction of its audio counterpart. In stage two, a feature-adaptive visual translation network is designed to translate the synthesized landmarks into facial images. Concretely, we propose a feature-adaptive transformation module to fuse the high-level representations of landmarks and images, resulting in significant improvement in image quality. We perform extensive experiments on the multi-view emotional audio-visual dataset (MEAD) and crowd-sourced emotional multimodal actors dataset (CREMA-D) benchmark datasets, demonstrating that our model outperforms state-of-the-art benchmarks.
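The two-stage dataflow described in the abstract (audio → emotional landmarks → face frames) can be sketched as below. This is a minimal illustrative skeleton, not the authors' implementation: all shapes, function names, and the linear/gain placeholders standing in for the seq2seq landmark network and the feature-adaptive translation network are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_landmark_generator(audio_feats, n_landmarks=68):
    """Stage 1 (sketch): map an audio feature sequence to a per-frame set of
    2-D facial landmarks. Stands in for the seq2seq cross-modal emotional
    landmark generation network; a random linear map is a placeholder."""
    T, d = audio_feats.shape
    W = rng.standard_normal((d, n_landmarks * 2)) * 0.01  # hypothetical weights
    return (audio_feats @ W).reshape(T, n_landmarks, 2)

def stage2_visual_translation(landmarks, identity_image):
    """Stage 2 (sketch): render one face frame per landmark set, conditioned
    on a single identity image. Stands in for the feature-adaptive visual
    translation network; per-frame scalar modulation is a placeholder for
    the feature-adaptive transformation module's fusion."""
    T = landmarks.shape[0]
    frames = np.repeat(identity_image[None], T, axis=0).astype(float)
    gain = 1.0 + 0.01 * landmarks.mean(axis=(1, 2))  # (T,) modulation factors
    return frames * gain[:, None, None, None]

audio = rng.standard_normal((25, 80))   # 25 frames of 80-d audio features
identity = rng.random((64, 64, 3))      # one reference face image
lms = stage1_landmark_generator(audio)  # (25, 68, 2) landmark sequence
video = stage2_visual_translation(lms, identity)
print(video.shape)                      # (25, 64, 64, 3)
```

The point of the sketch is the interface: stage 1 consumes only audio and emits geometry, so emotion/lip synchronization is resolved in landmark space before any pixels are generated, and stage 2 only has to solve identity-preserving rendering.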


Similar Articles

1. Talking Face Generation With Audio-Deduced Emotional Landmarks.
   IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14099-14111. doi: 10.1109/TNNLS.2023.3274676. Epub 2024 Oct 7.
2. VPT: Video portraits transformer for realistic talking face generation.
   Neural Netw. 2025 Apr;184:107122. doi: 10.1016/j.neunet.2025.107122. Epub 2025 Jan 9.
3. A fine-grained human facial key feature extraction and fusion method for emotion recognition.
   Sci Rep. 2025 Feb 20;15(1):6153. doi: 10.1038/s41598-025-90440-2.
4. StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads.
   IEEE Trans Pattern Anal Mach Intell. 2024 Jun;46(6):4331-4347. doi: 10.1109/TPAMI.2024.3357808. Epub 2024 May 7.
5. Enhancing Emotion Recognition: A Dual-Input Model for Facial Expression Recognition Using Images and Facial Landmarks.
   Annu Int Conf IEEE Eng Med Biol Soc. 2024 Jul;2024:1-5. doi: 10.1109/EMBC53108.2024.10782924.
6. Multimodal interaction enhanced representation learning for video emotion recognition.
   Front Neurosci. 2022 Dec 19;16:1086380. doi: 10.3389/fnins.2022.1086380. eCollection 2022.
7. Learn2Talk: 3D Talking Face Learns from 2D Talking Face.
   IEEE Trans Vis Comput Graph. 2024 Oct 7;PP. doi: 10.1109/TVCG.2024.3476275.
8. Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition.
   Sensors (Basel). 2021 Aug 12;21(16):5452. doi: 10.3390/s21165452.
9. Continuous Talking Face Generation Based on Gaussian Blur and Dynamic Convolution.
   Sensors (Basel). 2025 Mar 18;25(6):1885. doi: 10.3390/s25061885.
10. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
    IEEE Trans Affect Comput. 2014 Oct-Dec;5(4):377-390. doi: 10.1109/TAFFC.2014.2336244.