Singla Karan, Chen Zhuohao, Atkins David C, Narayanan Shrikanth
University of Southern California, Los Angeles, USA.
University of Washington, Seattle, WA, USA.
Proc Conf Assoc Comput Linguist Meet. 2020 Jul;2020:3797-3803. doi: 10.18653/v1/2020.acl-main.351.
Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization, and automatic speech recognition (ASR). We propose a novel framework for predicting utterance-level labels directly from speech features, thus removing the dependency on first generating transcripts and enabling transcription-free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as a bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode the speech features of a word using an objective similar to Word2Vec. Our proposed approach uses only speech features and word segmentation information to predict spoken utterance-level target labels. We show that our model achieves results competitive with other state-of-the-art approaches that use transcribed text for the task of predicting psychotherapy-relevant behavior codes.
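To make the described setup concrete, below is a minimal sketch, assuming PyTorch, of a pipeline in the spirit of the abstract: frame-level speech features for each word segment are compressed into a word-level vector by a "Speech-2-Vector"-style bottleneck encoder, and an utterance-level classifier maps the sequence of word vectors to behavior-code logits. The class names, BiLSTM choice, dimensions, and number of labels are illustrative assumptions, not the authors' implementation; in the paper the encoder is pretrained with a Word2Vec-like objective, which is omitted here.

```python
import torch
import torch.nn as nn


class Speech2VecEncoder(nn.Module):
    """Encodes the frames of one word segment into a fixed-size vector.

    Placeholder for the pretrained Speech-2-Vector bottleneck; here it is
    simply an untrained BiLSTM.
    """

    def __init__(self, feat_dim: int = 40, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, word_frames: torch.Tensor) -> torch.Tensor:
        # word_frames: (num_frames, feat_dim) for a single word segment
        _, (h_n, _) = self.rnn(word_frames.unsqueeze(0))
        # Concatenate final forward/backward hidden states -> (2 * hidden_dim,)
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)


class UtteranceClassifier(nn.Module):
    """Aggregates word-level vectors and predicts an utterance-level label."""

    def __init__(self, word_dim: int = 256, hidden_dim: int = 128,
                 num_labels: int = 8):
        super().__init__()
        self.rnn = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (num_words, word_dim) for one utterance
        _, (h_n, _) = self.rnn(word_vectors.unsqueeze(0))
        return self.out(h_n[-1, 0])  # logits over behavior codes


# Dummy usage: one utterance of 5 words, each with 30 frames of 40-dim features.
# Word boundaries are assumed given, as in the paper's segmentation input.
encoder = Speech2VecEncoder()
classifier = UtteranceClassifier()
segments = [torch.randn(30, 40) for _ in range(5)]
word_vecs = torch.stack([encoder(seg) for seg in segments])
logits = classifier(word_vecs)
print(logits.shape)  # torch.Size([8])
```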