
Learn2Talk: 3D Talking Face Learns from 2D Talking Face.

Author Information

Zhuang Yixiang, Cheng Baoping, Cheng Yao, Jin Yuntao, Liu Renshuai, Li Chengyang, Cheng Xuan, Liao Jing, Lin Juncong

Publication Information

IEEE Trans Vis Comput Graph. 2024 Oct 7. doi: 10.1109/TVCG.2024.3476275. Epub ahead of print.

Abstract

Speech-driven facial animation is generally divided into two main types: 3D and 2D talking face. Both have garnered considerable research attention in recent years. However, to our knowledge, research on 3D talking face has not progressed as far as that on 2D talking face, particularly in terms of lip-sync and perceptual mouth movements. Lip-sync requires precise synchronization between mouth motion and speech audio, and the speech perceived from the mouth movements should match that conveyed by the driving audio. To bridge the gap between the two sub-fields, we propose Learn2Talk, a learning framework that enhances a 3D talking face network by integrating two key insights from the field of 2D talking face. First, drawing inspiration from audio-video sync networks, we develop a 3D sync-lip expert model to pursue lip-sync between audio and 3D facial motions. Second, we use a teacher model, carefully chosen from among 2D talking face methods, to guide the training of the audio-to-3D-motions regression network, thereby increasing the accuracy of 3D vertex movements. Extensive experiments demonstrate the superiority of the proposed framework over state-of-the-art methods in terms of lip-sync, vertex accuracy and perceptual movements. Finally, we showcase two applications of the framework: audio-visual speech recognition and speech-driven 3D Gaussian-Splatting-based avatar animation. The project page of this paper is: https://lkjkjoiuiu.github.io/Learn2Talk/.
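The 3D sync-lip expert described in the abstract plays the role SyncNet plays in the 2D domain: it embeds a short audio window and the corresponding window of 3D mouth motion into a shared space and scores their agreement. Below is a minimal PyTorch sketch of such an expert with a contrastive training loss; the two-branch MLP encoders, the feature sizes (80-bin mel frames, 468 vertices), the window length, and the margin are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipSyncExpert3D(nn.Module):
    """Scores how well a short audio window matches a window of 3D mouth motion.

    Shapes (all assumed): audio_win (B, window, audio_dim),
    motion_win (B, window, vertex_dim) where vertex_dim = 3 * num_vertices.
    """
    def __init__(self, audio_dim=80, vertex_dim=3 * 468, embed_dim=256, window=5):
        super().__init__()
        # Audio branch: encodes a window of mel-spectrogram frames.
        self.audio_enc = nn.Sequential(
            nn.Linear(audio_dim * window, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )
        # Motion branch: encodes the matching window of vertex offsets.
        self.motion_enc = nn.Sequential(
            nn.Linear(vertex_dim * window, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, audio_win, motion_win):
        a = F.normalize(self.audio_enc(audio_win.flatten(1)), dim=-1)
        m = F.normalize(self.motion_enc(motion_win.flatten(1)), dim=-1)
        return (a * m).sum(-1)  # cosine similarity in [-1, 1]

def sync_loss(expert, audio_win, motion_win):
    """Contrastive loss: matched windows score high, mismatched windows low."""
    pos = expert(audio_win, motion_win)
    # Roll the batch to pair each audio window with someone else's motion.
    neg = expert(audio_win, motion_win.roll(shifts=1, dims=0))
    return F.relu(1.0 - pos + neg).mean()  # margin ranking loss, margin 1

# Smoke test with random tensors (batch of 4, window of 5 frames).
expert = LipSyncExpert3D()
audio = torch.randn(4, 5, 80)
motion = torch.randn(4, 5, 3 * 468)
print(sync_loss(expert, audio, motion))
```

Once trained, such an expert is typically frozen and reused as a perceptual loss: the regression network is penalized whenever its predicted mouth motion scores poorly against the driving audio.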

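The second ingredient, teacher guidance, can be folded into the training objective as a distillation term alongside the usual reconstruction loss. The sketch below (reusing sync_loss and the frozen expert from above) assumes the 2D teacher's output has already been lifted to per-frame 3D vertex offsets; the loss weights and the naive windowing are placeholder assumptions, not the paper's reported settings.

```python
import torch.nn.functional as F

def train_step(student, expert, mel, gt_verts, teacher_verts, window=5):
    """One hedged training step for the audio-to-3D-motions regression network.

    mel: (B, T, 80) audio features; gt_verts / teacher_verts: (B, T, 3 * 468)
    per-frame vertex offsets from capture data and from the lifted 2D teacher.
    """
    pred = student(mel)                        # (B, T, 3 * 468) predicted motion
    loss_rec = F.mse_loss(pred, gt_verts)      # fit captured 3D ground truth
    loss_kd = F.mse_loss(pred, teacher_verts)  # follow the 2D teacher's motion
    # Score lip-sync on the first window (real code would sample windows).
    loss_sync = sync_loss(expert, mel[:, :window], pred[:, :window])
    return loss_rec + 0.5 * loss_kd + 0.1 * loss_sync
```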
