College of Economics and Trade, Guangdong Mechanical & Electrical Polytechnic, Guangzhou, 510545, China.
School of Software Engineering, South China University of Technology, Guangzhou, 510641, China.
Sci Rep. 2024 Aug 26;14(1):19759. doi: 10.1038/s41598-024-70463-x.
Reinforcement learning (RL) is an effective method for training dialogue policies to steer a conversation toward successful task completion. However, most RL-based methods rely only on semantic inputs and lack empathy because they ignore user emotion information. Moreover, these methods suffer from delayed rewards, since the user simulator returns a meaningful signal only at the end of the dialogue. Recently, some methods have been proposed that learn the reward function together with user emotions, but they fail to consider user emotion at each dialogue turn. In this paper, we propose an emotion-sensitive dialogue policy model (ESDP) that incorporates user emotion information into the dialogue policy and selects the optimal action by combining the top-k candidate actions with the user's emotion. The user emotion information at each turn is used as an immediate reward for the current dialogue state, which mitigates sparse rewards and the dependency on dialogue termination. Extensive experiments validate that our method outperforms baseline approaches when combined with different Q-learning algorithms, and also surpasses other popular existing dialogue policies.
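A minimal sketch of the two mechanisms the abstract describes, assuming a DQN-style agent over a discrete dialogue action space: (1) the policy takes the top-k actions ranked by Q-value and re-scores them with a predicted per-action user-emotion signal, and (2) the current turn's emotion score is added as an immediate reward to supplement the sparse end-of-dialogue signal. All names and parameters here (`select_action`, `shaped_reward`, `alpha`, `beta`, `k`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_action(q_values, emotion_scores, k=5, alpha=0.5):
    """Pick an action from the top-k Q-value candidates, re-ranked by a
    predicted user-emotion score (higher = more positive user emotion).

    q_values:       (n_actions,) Q-value estimates for the current state
    emotion_scores: (n_actions,) predicted emotion score per candidate action
    alpha:          assumed weight balancing Q-value against emotion
    """
    top_k = np.argsort(q_values)[-k:]  # indices of the k best actions by Q
    combined = alpha * q_values[top_k] + (1 - alpha) * emotion_scores[top_k]
    return top_k[np.argmax(combined)]

def shaped_reward(env_reward, turn_emotion, beta=0.1):
    """Add the current turn's user-emotion score as an immediate reward,
    so learning is not driven only by the end-of-dialogue outcome."""
    return env_reward + beta * turn_emotion

# Toy usage: 20 dialogue actions, random Q-values and emotion predictions.
q = np.random.randn(20)
e = np.random.uniform(-1.0, 1.0, 20)
a = select_action(q, e, k=5)
r = shaped_reward(env_reward=0.0, turn_emotion=e[a])
```

The combination weight and the source of the per-action emotion scores (e.g. an emotion classifier over the user's last utterance) are design choices left open by the abstract.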