Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding.

Authors

Fares Mireille, Pelachaud Catherine, Obin Nicolas

Affiliations

The Institute of Intelligent Systems and Robotics (ISIR), Sciences et Technologies de la Musique et du Son (STMS), Sorbonne University, Paris, France.

Centre National de la Recherche Scientifique (CNRS), The Institute of Intelligent Systems and Robotics (ISIR), Sorbonne University, Paris, France.

Publication

Front Artif Intell. 2023 Jun 12;6:1142997. doi: 10.3389/frai.2023.1142997. eCollection 2023.

Abstract

Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including speakers unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive: while speaking, it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This content-style disentanglement scheme allows us to directly infer the style embedding even of a speaker whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two input modalities, mel-spectrogram and text semantics. The second goal is to condition the source speaker's predicted gestures on the multimodal behavior embedding of a target speaker. The third goal is to allow zero-shot style transfer of speakers unseen during training without re-training the model. Our system consists of two main components: (1) a speaker style encoder that learns to generate a fixed-dimensional speaker embedding from a target speaker's multimodal data (mel-spectrogram, pose, and text) and (2) a gesture generator that synthesizes gestures based on the content of the input modalities (text and mel-spectrogram) of a source speaker, conditioned on the speaker style embedding. Our evaluations show that the model is able to synthesize gestures of a source speaker given the two input modalities, and to transfer the knowledge of target speaker style variability learned by the speaker style encoder to the gesture generation task in a zero-shot setup, indicating that the model has learned a high-quality speaker representation. We conduct objective and subjective evaluations to validate our approach and compare it with baselines.
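
The abstract describes a two-part architecture: a speaker style encoder that compresses a target speaker's multimodal data into one fixed-dimensional style embedding, and a gesture generator that maps the source speaker's mel-spectrogram and text features to a pose sequence conditioned on that embedding. The PyTorch sketch below illustrates only this data flow; the GRU layers, feature dimensions, temporal pooling, fusion, and conditioning scheme are assumptions made for illustration, and the adversarial content-style disentanglement losses described in the paper are omitted.

```python
# Minimal sketch of the two-component pipeline from the abstract (not the
# authors' implementation): style encoder -> fixed-dimensional embedding,
# gesture generator conditioned on that embedding. Text is assumed to be
# pre-embedded into frame-aligned feature vectors.
import torch
import torch.nn as nn


class SpeakerStyleEncoder(nn.Module):
    """Encodes a target speaker's (mel, pose, text) sequences into one style vector."""

    def __init__(self, mel_dim=80, pose_dim=36, text_dim=300, style_dim=128):
        super().__init__()
        self.rnn = nn.GRU(mel_dim + pose_dim + text_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, style_dim)

    def forward(self, mel, pose, text):
        # mel/pose/text: (batch, time, feature), frame-aligned
        h, _ = self.rnn(torch.cat([mel, pose, text], dim=-1))
        # temporal average pooling -> fixed-dimensional speaker style embedding
        return self.proj(h.mean(dim=1))


class GestureGenerator(nn.Module):
    """Maps source-speaker content (mel + text) to gestures, conditioned on style."""

    def __init__(self, mel_dim=80, text_dim=300, style_dim=128, pose_dim=36):
        super().__init__()
        self.encoder = nn.GRU(mel_dim + text_dim, 256, batch_first=True)
        self.decoder = nn.GRU(256 + style_dim, 256, batch_first=True)
        self.out = nn.Linear(256, pose_dim)

    def forward(self, mel, text, style):
        content, _ = self.encoder(torch.cat([mel, text], dim=-1))
        # broadcast the style embedding over time and condition the decoder on it
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.decoder(torch.cat([content, style_seq], dim=-1))
        return self.out(h)  # (batch, time, pose_dim) predicted gesture sequence


# Zero-shot transfer: infer the style embedding of an unseen target speaker in a
# single forward pass (no re-training), then drive the generator with the source
# speaker's mel-spectrogram and text features.
style_enc, generator = SpeakerStyleEncoder(), GestureGenerator()
tgt_mel, tgt_pose, tgt_text = torch.randn(1, 200, 80), torch.randn(1, 200, 36), torch.randn(1, 200, 300)
src_mel, src_text = torch.randn(1, 150, 80), torch.randn(1, 150, 300)

with torch.no_grad():
    style = style_enc(tgt_mel, tgt_pose, tgt_text)   # target speaker's style
    gestures = generator(src_mel, src_text, style)   # source content, target style
print(gestures.shape)  # torch.Size([1, 150, 36])
```

The zero-shot property rests on the last two calls: the style embedding of a previously unseen speaker is simply inferred and used to condition the generator, with no fine-tuning of either component.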


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/10291316/ebf40400a6e1/frai-06-1142997-g0001.jpg
