Using Hybrid HMM/DNN Embedding Extractor Models in Computational Paralinguistic Tasks.

Affiliations

Institute of Informatics, University of Szeged, H-6720 Szeged, Hungary.

ELKH-SZTE Research Group on Artificial Intelligence, H-6720 Szeged, Hungary.

Publication information

Sensors (Basel). 2023 May 30;23(11):5208. doi: 10.3390/s23115208.

DOI:10.3390/s23115208

PMID:37299935

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10256007/

Abstract

The field of computational paralinguistics emerged from automatic speech processing and covers a wide range of tasks involving different phenomena present in human speech. It focuses on the non-verbal content of speech, with tasks such as spoken emotion recognition, conflict intensity estimation and sleepiness detection, which makes it directly applicable to remote monitoring with acoustic sensors. The two main technical issues in computational paralinguistics are (1) handling varying-length utterances with traditional classifiers and (2) training models on relatively small corpora. In this study, we present a method that combines automatic speech recognition (ASR) and paralinguistic approaches and addresses both issues. Specifically, we trained an HMM/DNN hybrid acoustic model on a general ASR corpus and used it as a source of embeddings, which served as features for several paralinguistic tasks. To convert the local (frame-level) embeddings into utterance-level features, we experimented with five aggregation methods: mean, standard deviation, skewness, kurtosis and the ratio of non-zero activations. Our results show that the proposed feature extraction technique consistently outperforms the widely used x-vector method, which served as our baseline, regardless of the paralinguistic task investigated. Furthermore, the aggregation techniques could be combined effectively, leading to further improvements that depend on the task and on the neural network layer serving as the source of the local embeddings. Overall, based on our experimental results, the proposed method can be considered a competitive and resource-efficient approach for a wide range of computational paralinguistic tasks.
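The aggregation step named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `aggregate_embeddings`, the `eps` threshold, the use of the population standard deviation, and the choice of excess kurtosis are all assumptions made here for illustration. The sketch takes a matrix of frame-level embeddings (one row per frame) and returns a single fixed-size vector, so utterances of any length yield features of the same dimensionality.

```python
import numpy as np


def aggregate_embeddings(frames: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Collapse a (num_frames, dim) matrix of frame-level embeddings into one
    utterance-level vector of length 5 * dim.

    Hypothetical helper: the name, eps, and the exact statistical definitions
    below are assumptions, not taken from the paper.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, dim) array")

    mean = frames.mean(axis=0)
    std = frames.std(axis=0)  # population standard deviation (ddof=0)

    # Standardise each dimension; constant dimensions are mapped to zeros so
    # that the higher moments below do not become NaN.
    has_variance = std > eps
    safe_std = np.where(has_variance, std, 1.0)
    z = np.where(has_variance, (frames - mean) / safe_std, 0.0)

    skewness = (z ** 3).mean(axis=0)
    # Excess kurtosis (normal distribution -> 0); set to 0 for constant dims.
    kurtosis = np.where(has_variance, (z ** 4).mean(axis=0) - 3.0, 0.0)

    # Fraction of frames in which each unit has a non-zero activation.
    nonzero_ratio = (frames != 0).mean(axis=0)

    return np.concatenate([mean, std, skewness, kurtosis, nonzero_ratio])
```

Concatenating all five statistics is only one option; since the abstract reports that combinations of aggregation methods help to a degree that depends on the task and the layer, one could equally keep any subset of the five blocks before passing the vector to a classifier.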
