

Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding.

Authors

Fares Mireille, Pelachaud Catherine, Obin Nicolas

Affiliations

The Institute of Intelligent Systems and Robotics (ISIR), Sciences et Technologies de la Musique et du Son (STMS), Sorbonne University, Paris, France.

Centre National de la Recherche Scientifique (CNRS), The Institute of Intelligent Systems and Robotics (ISIR), Sorbonne University, Paris, France.

Publication Information

Front Artif Intell. 2023 Jun 12;6:1142997. doi: 10.3389/frai.2023.1142997. eCollection 2023.

DOI: 10.3389/frai.2023.1142997
PMID: 37377638
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10291316/
Abstract

Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive: while speaking, it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of a speaker whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two input modalities, Mel spectrogram and text semantics. The second goal is to condition the source speaker's predicted gestures on the multimodal behavior embedding of a target speaker. The third goal is to allow zero-shot style transfer to speakers unseen during training without retraining the model. Our system consists of two main components: (1) a speaker style encoder that learns to generate a fixed-dimensional speaker embedding from a target speaker's multimodal data (mel-spectrogram, pose, and text) and (2) a gesture generator that synthesizes gestures based on the content of the input modalities (text and mel-spectrogram) of a source speaker, conditioned on the speaker style embedding. We show that our model is able to synthesize gestures of a source speaker given the two input modalities and to transfer the knowledge of target-speaker style variability learned by the speaker style encoder to the gesture generation task in a zero-shot setup, indicating that the model has learned a high-quality speaker representation. We conduct objective and subjective evaluations to validate our approach and compare it with baselines.
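To make the two-component design concrete, here is a minimal PyTorch sketch of the pipeline the abstract describes: a speaker style encoder that maps a target speaker's multimodal data (mel-spectrogram, pose, text) to a fixed-dimensional style embedding, and a gesture generator that predicts pose sequences from a source speaker's mel-spectrogram and text features, conditioned on that embedding. All layer types and dimensions (GRUs, a 128-d style embedding, a 57-d pose vector) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-component architecture described in the abstract.
# Module choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerStyleEncoder(nn.Module):
    """Maps a target speaker's multimodal data to a fixed-dimensional style embedding."""
    def __init__(self, mel_dim=80, pose_dim=57, text_dim=300, style_dim=128):
        super().__init__()
        # One GRU per modality; final hidden states are fused into one embedding.
        self.mel_rnn = nn.GRU(mel_dim, 128, batch_first=True)
        self.pose_rnn = nn.GRU(pose_dim, 128, batch_first=True)
        self.text_rnn = nn.GRU(text_dim, 128, batch_first=True)
        self.fuse = nn.Linear(3 * 128, style_dim)

    def forward(self, mel, pose, text):
        _, h_mel = self.mel_rnn(mel)     # h: (1, B, 128)
        _, h_pose = self.pose_rnn(pose)
        _, h_text = self.text_rnn(text)
        h = torch.cat([h_mel[-1], h_pose[-1], h_text[-1]], dim=-1)
        return self.fuse(h)              # (B, style_dim)

class GestureGenerator(nn.Module):
    """Predicts poses from source-speaker content, conditioned on a style embedding."""
    def __init__(self, mel_dim=80, text_dim=300, style_dim=128, pose_dim=57):
        super().__init__()
        self.content_rnn = nn.GRU(mel_dim + text_dim, 256, batch_first=True)
        # Style conditioning: the style embedding is appended at every time step.
        self.out = nn.Linear(256 + style_dim, pose_dim)

    def forward(self, mel, text, style):
        content, _ = self.content_rnn(torch.cat([mel, text], dim=-1))  # (B, T, 256)
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, style_seq], dim=-1))       # (B, T, pose_dim)

# Zero-shot usage: the style embedding of an unseen speaker is obtained by a
# single forward pass of the encoder, with no retraining or fine-tuning.
B, T = 2, 100
enc, gen = SpeakerStyleEncoder(), GestureGenerator()
style = enc(torch.randn(B, T, 80), torch.randn(B, T, 57), torch.randn(B, T, 300))
poses = gen(torch.randn(B, T, 80), torch.randn(B, T, 300), style)
print(poses.shape)  # torch.Size([2, 100, 57])
```

Zero-shot transfer falls out of this design: because the style embedding is produced by a forward pass rather than learned per speaker, a new speaker's style can be applied to the generator without touching the trained weights.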

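The "adversarial disentanglement" in the title typically works by penalizing speaker-identifiable information in the content pathway, so that style is carried only by the style embedding. Below is a minimal sketch using a gradient reversal layer, a standard adversarial-training device; the paper's exact adversarial losses may differ, and the classifier shape and speaker count are assumptions.

```python
# Sketch of adversarial content/style disentanglement via gradient reversal.
# A speaker classifier is trained on content features; reversing its gradient
# pushes the content encoder to discard speaker identity.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reversed, scaled gradient

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

n_speakers = 25                      # multi-speaker setup; count is illustrative
speaker_clf = nn.Linear(256, n_speakers)

content = torch.randn(4, 256, requires_grad=True)   # stand-in content features
logits = speaker_clf(grad_reverse(content))
adv_loss = nn.functional.cross_entropy(logits, torch.randint(0, n_speakers, (4,)))
adv_loss.backward()  # upstream encoder receives gradients that remove speaker info
```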

Figures 1-12 (PMC full text):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/ebf40400a6e1/frai-06-1142997-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/fe90d1f86c6a/frai-06-1142997-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/ed2e58766697/frai-06-1142997-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/e399d6f6305f/frai-06-1142997-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/e02c5191dcd2/frai-06-1142997-g0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/6ee20c18cd09/frai-06-1142997-g0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/57bbdc17ca02/frai-06-1142997-g0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/b7357791f493/frai-06-1142997-g0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/842fe8e8ad29/frai-06-1142997-g0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/a139f6aed07a/frai-06-1142997-g0010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/b18ff5aa1ac2/frai-06-1142997-g0011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/241ba503afb6/frai-06-1142997-g0012.jpg

Similar Articles

1. Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding.
Front Artif Intell. 2023 Jun 12;6:1142997. doi: 10.3389/frai.2023.1142997. eCollection 2023.
2. StyleTTS-VC: One-shot voice conversion by knowledge transfer from style-based TTS models.
SLT Workshop Spok Lang Technol. 2023 Jan;2022:920-927. doi: 10.1109/slt54892.2023.10022498.
3. Attention-based speech feature transfer between speakers.
Front Artif Intell. 2024 Feb 26;7:1259641. doi: 10.3389/frai.2024.1259641. eCollection 2024.
4. Evaluation of text-to-gesture generation model using convolutional neural network.
Neural Netw. 2022 Jul;151:365-375. doi: 10.1016/j.neunet.2022.03.041. Epub 2022 Apr 4.
5. Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder.
Sensors (Basel). 2023 Dec 3;23(23):9591. doi: 10.3390/s23239591.
6. A speaker's gesture style can affect language comprehension: ERP evidence from gesture-speech integration.
Soc Cogn Affect Neurosci. 2015 Sep;10(9):1236-43. doi: 10.1093/scan/nsv011. Epub 2015 Feb 16.
7. Noise-robust voice conversion with domain adversarial training.
Neural Netw. 2022 Apr;148:74-84. doi: 10.1016/j.neunet.2022.01.003. Epub 2022 Jan 13.
8. Cycle consistent network for end-to-end style transfer TTS training.
Neural Netw. 2021 Aug;140:223-236. doi: 10.1016/j.neunet.2021.03.005. Epub 2021 Mar 16.
9. Zero-shot prompt-based video encoder for surgical gesture recognition.
Int J Comput Assist Radiol Surg. 2025 Feb;20(2):311-321. doi: 10.1007/s11548-024-03257-1. Epub 2024 Sep 17.
10. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation.
IEEE Trans Pattern Anal Mach Intell. 2008 Aug;30(8):1330-45. doi: 10.1109/TPAMI.2007.70797.

Cited By

1. Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication.
Sensors (Basel). 2024 Jun 5;24(11):3671. doi: 10.3390/s24113671.

References

1. Automating the Production of Communicative Gestures in Embodied Characters.
Front Psychol. 2018 Jul 9;9:1144. doi: 10.3389/fpsyg.2018.01144. eCollection 2018.
2. A speaker's gesture style can affect language comprehension: ERP evidence from gesture-speech integration.
Soc Cogn Affect Neurosci. 2015 Sep;10(9):1236-43. doi: 10.1093/scan/nsv011. Epub 2015 Feb 16.
3. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation.
IEEE Trans Pattern Anal Mach Intell. 2008 Aug;30(8):1330-45. doi: 10.1109/TPAMI.2007.70797.