Khorram Soheil, McInnis Melvin G, Mower Provost Emily
Research Fellow in the Departments of Computer Science and Engineering (College of Engineering) and Psychiatry (School of Medicine), University of Michigan.
Thomas B and Nancy Upjohn Woodworth Professor of Bipolar Disorder and Depression, Department of Psychiatry, University of Michigan School of Medicine.
IEEE Trans Affect Comput. 2021 Oct-Dec;12(4):1069-1083. doi: 10.1109/taffc.2019.2917047. Epub 2019 May 16.
Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal because of delays caused by reaction time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. The aligner network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
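The idea of a time-shifted low-pass sinc filter with a single delay learned by gradient descent can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`delayed_sinc_kernel`, `apply_delay`, `fit_delay`), the kernel half-width, the cutoff frequency, the squared-error loss, the learning rate, and the synthetic Gaussian test signal are all assumptions chosen for illustration, and the gradient is written out by hand where a deep-learning framework would use backpropagation.

```python
import numpy as np

HALF_WIDTH = 64  # kernel spans lags -64..64 (hypothetical choice)


def _lags():
    return np.arange(-HALF_WIDTH, HALF_WIDTH + 1, dtype=float)


def delayed_sinc_kernel(delay, cutoff):
    """Truncated low-pass sinc impulse response, shifted by `delay` samples.

    `cutoff` is in cycles per sample (0 < cutoff <= 0.5).
    np.sinc(u) is the normalized sinc, sin(pi*u) / (pi*u).
    """
    u = 2.0 * cutoff * (_lags() - delay)
    return 2.0 * cutoff * np.sinc(u)


def kernel_grad_wrt_delay(delay, cutoff):
    """Analytic derivative of the kernel with respect to `delay`."""
    u = 2.0 * cutoff * (_lags() - delay)
    safe_u = np.where(u == 0.0, 1.0, u)  # avoid division by zero at u == 0
    # d/du sinc(u) = (cos(pi*u) - sinc(u)) / u, and 0 at u == 0
    dsinc = np.where(u == 0.0, 0.0, (np.cos(np.pi * u) - np.sinc(u)) / safe_u)
    # chain rule: du/d(delay) = -2 * cutoff
    return -((2.0 * cutoff) ** 2) * dsinc


def filter_same(x, kernel):
    """Convolve and crop so output index t lines up with input index t."""
    full = np.convolve(x, kernel, mode="full")
    return full[HALF_WIDTH:HALF_WIDTH + len(x)]


def apply_delay(x, delay, cutoff=0.25):
    """Low-pass filter `x` and shift it later in time by `delay` samples."""
    return filter_same(x, delayed_sinc_kernel(delay, cutoff))


def fit_delay(x, target, cutoff=0.25, init=0.0, lr=5.0, steps=200):
    """Gradient descent on 0.5 * sum((apply_delay(x, d) - target)**2) over d."""
    delay = init
    for _ in range(steps):
        err = apply_delay(x, delay, cutoff) - target
        dy_ddelay = filter_same(x, kernel_grad_wrt_delay(delay, cutoff))
        delay -= lr * np.sum(err * dy_ddelay)
    return delay


# Toy data: a smooth bump stands in for a feature track, and the "label"
# is the same bump delayed by 8 samples (a synthetic reaction lag).
t = np.arange(256)
x = np.exp(-0.5 * ((t - 100) / 10.0) ** 2)
target = apply_delay(x, 8.0)

learned = fit_delay(x, target)
print("learned delay (samples):", round(float(learned), 2))
```

In this toy setting the estimate should approach the true delay of 8 samples, because the target was generated by the same filter. The delay is a continuous parameter of the kernel, so fractional shifts are representable and the loss is differentiable with respect to it. Handling a delay that varies with the acoustic content, as the abstract describes, would require combining several such layers, which this sketch does not attempt.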