Yang Daowu, Liu Ying, Yang Qiyun, Li Ruihui
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China; Hunan University of Finance and Economics, Changsha, 410205, China.
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.
Neural Netw. 2025 Nov;191:107714. doi: 10.1016/j.neunet.2025.107714. Epub 2025 Jul 12.
Current audio-driven dyadic-interaction methods struggle to capture the uncertain relationship between a speaker's audio and an interlocutor's facial movements. To address this, we propose a video generation pipeline built on a cross-modal Transformer. First, a Transformer decoder partitions facial features into upper and lower regions, capturing lower-face features that are closely tied to the audio and upper-face features that remain independent of visual cues. Second, we design a cross-modal attention module that combines an alignment bias with a causal attention mechanism to handle the subtle motion variations between adjacent frames of a facial sequence. To mitigate uncertainty in long-term contexts, we widen the self-attention range of the Transformer encoder and integrate self-supervised pretrained speech representations to alleviate data scarcity. Finally, by optimizing the audio-to-motion mapping and incorporating an enhanced neural renderer, we achieve fine-grained control over facial movements while generating high-quality portrait images. Extensive experiments validate the effectiveness and superiority of our approach for interactive video generation.
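The second step of the pipeline, cross-modal attention combining an alignment bias with causal masking, is concrete enough to sketch. The following is a minimal illustration rather than the authors' implementation: the class name CrossModalAttention, the parameter bias_scale, and the exact form of the alignment bias (a linear penalty on the time offset between a motion frame and an audio frame) are all assumptions made for exposition.

# Minimal sketch (assumed, not the paper's code) of causal cross-modal
# attention with an alignment bias: facial-motion queries attend to audio
# keys/values, with scores biased toward temporally aligned audio frames
# and masked so that no motion frame sees future audio.
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, bias_scale: float = 1.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.bias_scale = bias_scale  # strength of the alignment penalty (assumed hyperparameter)

    def forward(self, motion: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # motion: (B, T_m, d_model) facial-motion queries
        # audio:  (B, T_a, d_model) audio keys/values, assumed on the same frame clock
        B, T_m, _ = motion.shape
        T_a = audio.shape[1]
        q = self.q_proj(motion).view(B, T_m, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(audio).view(B, T_a, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(audio).view(B, T_a, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T_m, T_a)

        # Alignment bias (assumed linear form): penalize each score by the
        # temporal distance between motion frame i and audio frame j, so
        # roughly aligned audio dominates the attention distribution.
        t_m = torch.arange(T_m, device=motion.device).unsqueeze(1)  # (T_m, 1)
        t_a = torch.arange(T_a, device=motion.device).unsqueeze(0)  # (1, T_a)
        scores = scores - self.bias_scale * (t_m - t_a).abs()

        # Causal mask: motion frame i may attend only to audio frames j <= i,
        # which keeps adjacent-frame transitions dependent on past audio only.
        scores = scores.masked_fill(t_a > t_m, float("-inf"))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T_m, -1)
        return self.out_proj(out)

# Usage with placeholder dimensions:
layer = CrossModalAttention(d_model=256, n_heads=4)
motion = torch.randn(2, 50, 256)  # 50 facial-motion frames
audio = torch.randn(2, 50, 256)   # 50 audio feature frames
out = layer(motion, audio)        # -> (2, 50, 256)

The linear distance penalty is one plausible reading of "alignment bias"; a learned or Gaussian-shaped bias would fit the same description, and the paper itself should be consulted for the exact formulation.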