Decoupling upper and lower face transformers for binary interactive video generation.

Author information

Yang Daowu, Liu Ying, Yang Qiyun, Li Ruihui

Affiliations

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China; Hunan University of Finance and Economics, Changsha, 410205, China.

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Publication information

Neural Netw. 2025 Nov;191:107714. doi: 10.1016/j.neunet.2025.107714. Epub 2025 Jul 12.

DOI: 10.1016/j.neunet.2025.107714
PMID: 40674787

Abstract

Current audio-driven binary interaction methods have limitations in capturing the uncertain relationship between a speaker's audio and an interlocutor's facial movements. To address this issue, we propose a video generation pipeline based on a cross-modal Transformer. First, a Transformer decoder partitions facial features into upper and lower regions, capturing lower features that are closely linked to the audio and upper features that remain independent of visual cues. Second, we design a cross-modal attention module that combines alignment bias with causal attention mechanisms to effectively manage subtle motion variations between adjacent frames in facial sequences. To mitigate uncertainties in long-term contexts, we expand the self-attention range of the Transformer encoder and integrate self-supervised pretrained speech representations to alleviate data scarcity. Finally, by optimizing the audio-to-action mapping and incorporating an enhanced neural renderer, we achieve fine control over facial movements while generating high-quality portrait images. Extensive experiments validate the effectiveness and superiority of our approach in interactive video generation.
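
The abstract does not give the formulation of its cross-modal attention module, so the following is only a minimal sketch of one common way to combine a causal mask with an alignment bias, not the authors' implementation. It assumes additive masks (0 to allow, negative infinity to block), a fixed number of audio frames per motion frame, and single-head dot-product attention. All function and parameter names (`causal_mask`, `alignment_mask`, `masked_attention`, `audio_per_motion`) are hypothetical.

```python
import math

NEG_INF = float("-inf")


def causal_mask(num_frames):
    """Additive mask: motion frame i may attend to frames 0..i only."""
    return [[0.0 if j <= i else NEG_INF for j in range(num_frames)]
            for i in range(num_frames)]


def alignment_mask(num_motion_frames, num_audio_frames, audio_per_motion=1):
    """Additive mask: motion frame t may attend only to its aligned audio frames.

    Assumption: audio frames [t * k, (t + 1) * k) belong to motion frame t,
    where k = audio_per_motion.
    """
    mask = []
    for t in range(num_motion_frames):
        lo = t * audio_per_motion
        hi = lo + audio_per_motion
        mask.append([0.0 if lo <= j < hi else NEG_INF
                     for j in range(num_audio_frames)])
    return mask


def softmax(row):
    """Numerically stable softmax; blocked (-inf) entries get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with an additive mask."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale + mask[i][j]
                  for j, k in enumerate(keys)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out
```

In this sketch, passing `causal_mask(T)` to `masked_attention` for self-attention over motion frames keeps each frame from seeing future frames. Passing `alignment_mask(T, S, k)` for attention from motion frames to audio features ties each motion frame to its own slice of the audio. Every row of the mask must allow at least one position, otherwise the softmax is undefined.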
