Detecting Vocal Fatigue with Neural Embeddings

Author information

Bayerl Sebastian P, Wagner Dominik, Baumann Ilja, Bocklet Tobias, Riedhammer Korbinian

Affiliation

Technische Hochschule Nürnberg Georg Simon Ohm.

Publication information

J Voice. 2023 Feb 9. doi: 10.1016/j.jvoice.2023.01.012.

DOI:10.1016/j.jvoice.2023.01.012

PMID:36774263

Abstract

Vocal fatigue is the sensation of tiredness and weakness in the voice after prolonged use. This paper investigates the effectiveness of neural embeddings for detecting vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that neural embeddings capture information about how a speaker's vocal characteristics change during prolonged voice use. We show that vocal fatigue can be reliably predicted from all three types of neural embeddings after 40 minutes of continuous speaking when temporal smoothing and normalization are applied to the extracted embeddings. Using support vector machines for classification, we achieve accuracies of 81% with x-vectors, 85% with ECAPA-TDNN embeddings, and 82% with wav2vec 2.0 embeddings as input features. When the trained system is applied to a different speaker and recording environment without any adaptation, it reaches an accuracy of 76%.
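The processing chain named in the abstract (temporal smoothing, normalization, then a support vector machine) can be illustrated with a minimal sketch. This is not the authors' implementation. The moving-average window, the per-session z-normalization, the RBF kernel, the early-versus-late labeling, and the synthetic "embeddings" are all assumptions made for illustration. Real x-vector, ECAPA-TDNN, or wav2vec 2.0 embeddings would replace the random data.

```python
import numpy as np
from sklearn.svm import SVC


def smooth_embeddings(emb: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving average over time for a (segments, dims) array; shape is preserved."""
    kernel = np.ones(window) / window
    # Edge-pad so the first and last segments are averaged over a full window.
    pad_left = window // 2
    pad_right = window - 1 - pad_left
    padded = np.pad(emb, ((pad_left, pad_right), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(emb.shape[1])],
        axis=1,
    )


def normalize_embeddings(emb: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-normalize each dimension over one session (assumed normalization scheme)."""
    mean = emb.mean(axis=0, keepdims=True)
    std = emb.std(axis=0, keepdims=True)
    return (emb - mean) / (std + eps)


def make_session(rng, direction, n_segments=200):
    """Synthetic stand-in for one recording: noise plus a slow drift over time."""
    t = np.linspace(0.0, 1.0, n_segments)[:, None]
    emb = rng.normal(size=(n_segments, direction.size)) + 2.0 * t * direction
    # Hypothetical labels: 0 = first half of the session, 1 = second half.
    labels = (t[:, 0] >= 0.5).astype(int)
    return emb, labels


rng = np.random.default_rng(0)
drift_direction = rng.normal(size=16)  # shared "fatigue" direction (assumption)

raw_train, train_y = make_session(rng, drift_direction)
raw_test, test_y = make_session(rng, drift_direction)

# Smooth over time, then normalize each session separately.
train_x = normalize_embeddings(smooth_embeddings(raw_train))
test_x = normalize_embeddings(smooth_embeddings(raw_test))

clf = SVC(kernel="rbf")
clf.fit(train_x, train_y)
accuracy = clf.score(test_x, test_y)
print(f"held-out session accuracy: {accuracy:.2f}")
```

Training on one session and scoring on another loosely mirrors the cross-condition test mentioned in the abstract, but the number printed here reflects only the synthetic data and says nothing about the reported results.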
