Yang Fan, Fang Lei, Suo Rui, Zhang Jing, Whang Mincheol
Department of Emotion Engineering, Sangmyung University, Seoul 03016, Republic of Korea.
College of Physical Education and Health Engineering, Hebei University of Engineering, Handan 056038, China.
Sensors (Basel). 2025 Aug 18;25(16):5117. doi: 10.3390/s25165117.
With the increasing complexity of human-computer interaction scenarios, conventional digital human facial expression systems show notable limitations in handling multi-emotion co-occurrence, dynamic expression, and semantic responsiveness. This paper proposes a digital human system framework that integrates multimodal emotion recognition with compound facial expression generation. The system establishes a complete pipeline for real-time interaction and compound emotional expression, following the sequence "speech semantic parsing → multimodal emotion recognition → Action Unit (AU)-level 3D facial expression control." First, a ResNet18-based model is employed for robust emotion classification on the AffectNet dataset. Second, an AU motion-curve driving module is built on the Unreal Engine platform, where dynamic synthesis of basic emotions is achieved via a state-machine mechanism. Finally, a Generative Pre-trained Transformer (GPT) performs semantic analysis, generating structured emotional weight vectors that are mapped to the AU layer to enable language-driven facial responses. Experimental results show that the proposed system significantly improves facial animation quality, with naturalness ratings rising from 3.54 to 3.94 and semantic congruence from 3.44 to 3.80. These results validate the system's capability to generate realistic and emotionally coherent expressions in real time. This research provides a complete technical framework and practical foundation for high-fidelity digital humans with affective interaction capabilities.
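As a rough illustration of the recognition stage, the sketch below fine-tunes a ResNet18 backbone for emotion classification. The backbone choice comes from the abstract, but the eight-class AffectNet label set, the ImageNet-pretrained weights, and the preprocessing pipeline are assumptions, not the paper's reported configuration.

```python
# Minimal sketch of the emotion-classification stage (assumptions noted above).
import torch
import torch.nn as nn
from torchvision import models, transforms

# Assumed label set: the 8-class AffectNet taxonomy.
AFFECTNET_CLASSES = [
    "neutral", "happy", "sad", "surprise",
    "fear", "disgust", "anger", "contempt",
]

def build_emotion_model(num_classes: int = len(AFFECTNET_CLASSES)) -> nn.Module:
    """ResNet18 with its final fully connected layer replaced for emotion classes."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Standard ImageNet-style preprocessing; the paper's exact pipeline may differ.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def classify(model: nn.Module, face_image) -> dict:
    """Return a per-emotion probability dict for one PIL face crop."""
    model.eval()
    logits = model(preprocess(face_image).unsqueeze(0))
    probs = torch.softmax(logits, dim=1).squeeze(0)
    return {c: float(p) for c, p in zip(AFFECTNET_CLASSES, probs)}
```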
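The AU motion-curve synthesis can be pictured as a small state machine that cross-fades per-AU curves when the active emotion changes. The sketch below is a conceptual Python analogue of that mechanism, not Unreal Engine code; the curve shapes, blend time, and per-state AU sets are all invented for illustration.

```python
# Conceptual sketch of an AU motion-curve state machine with cross-fading.
from typing import Callable, Dict

AUCurve = Callable[[float], float]  # t in seconds -> intensity in [0, 1]

def ramp_hold(peak: float, attack: float = 0.3) -> AUCurve:
    """Rise linearly to `peak` over `attack` seconds, then hold."""
    return lambda t: peak * min(1.0, max(0.0, t / attack))

# Hypothetical per-state AU curve sets for two basic emotions.
STATES: Dict[str, Dict[str, AUCurve]] = {
    "neutral": {},
    "happy":   {"AU6": ramp_hold(0.8), "AU12": ramp_hold(1.0)},
    "sad":     {"AU1": ramp_hold(0.7), "AU15": ramp_hold(0.8)},
}

class ExpressionStateMachine:
    def __init__(self, blend_time: float = 0.5):
        self.state, self.prev = "neutral", "neutral"
        self.t, self.blend_time = 0.0, blend_time

    def transition(self, new_state: str) -> None:
        """Switch the target emotion and restart the cross-fade clock."""
        self.prev, self.state, self.t = self.state, new_state, 0.0

    def tick(self, dt: float) -> Dict[str, float]:
        """Advance time and return cross-faded AU intensities for this frame."""
        self.t += dt
        alpha = min(1.0, self.t / self.blend_time)  # fade-in weight of new state
        aus: Dict[str, float] = {}
        for au, curve in STATES[self.prev].items():
            aus[au] = (1.0 - alpha) * curve(self.t)
        for au, curve in STATES[self.state].items():
            aus[au] = aus.get(au, 0.0) + alpha * curve(self.t)
        return aus
```

In an engine, the returned dict would drive the corresponding facial animation curves once per frame; the cross-fade keeps transitions between basic emotions continuous rather than snapping between poses.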
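For the language-driven stage, GPT is described as emitting a structured emotional weight vector that the system maps onto the AU layer. The sketch below shows one plausible form of that mapping step; the emotion-to-AU table is illustrative, loosely based on FACS conventions, and is not the paper's actual mapping.

```python
# Hedged sketch: blending a GPT-produced emotion weight vector into AU values.
from typing import Dict

# Hypothetical per-emotion AU recipes: AU id -> base intensity in [0, 1].
EMOTION_TO_AUS: Dict[str, Dict[str, float]] = {
    "happy":    {"AU6": 0.8, "AU12": 1.0},              # cheek raiser, lip corner puller
    "sad":      {"AU1": 0.7, "AU4": 0.5, "AU15": 0.8},  # inner brow raiser, brow lowerer, lip corner depressor
    "surprise": {"AU1": 0.9, "AU2": 0.9, "AU26": 0.7},  # brow raisers, jaw drop
    "anger":    {"AU4": 1.0, "AU7": 0.6, "AU23": 0.7},  # brow lowerer, lid tightener, lip tightener
}

def blend_aus(emotion_weights: Dict[str, float]) -> Dict[str, float]:
    """Combine weighted emotions into one AU intensity vector, clamped to [0, 1]."""
    aus: Dict[str, float] = {}
    for emotion, weight in emotion_weights.items():
        for au, intensity in EMOTION_TO_AUS.get(emotion, {}).items():
            aus[au] = min(1.0, aus.get(au, 0.0) + weight * intensity)
    return aus

if __name__ == "__main__":
    # Example: a compound "bittersweet" weight vector such as the semantic
    # analysis stage might return for an utterance.
    print(blend_aus({"happy": 0.6, "sad": 0.4}))
    # -> {'AU6': 0.48, 'AU12': 0.6, 'AU1': 0.28, 'AU4': 0.2, 'AU15': 0.32}
```

Because the weights blend rather than select, a single vector can express multi-emotion co-occurrence (the compound expressions the abstract targets) instead of forcing a single winning category.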