Apiwat Ditthapron, Emmanuel O. Agu, Adam C. Lammert
Computer Science Department, Worcester Polytechnic Institute, Worcester, MA 01609, USA.
Biomedical Engineering Department, Worcester Polytechnic Institute, Worcester, MA 01609, USA.
IEEE Open J Eng Med Biol. 2021 Mar 4;2:304-313. doi: 10.1109/OJEMB.2021.3063994. eCollection 2021.
Smartphones can be used to passively assess and monitor patients' speech impairments caused by ailments such as Parkinson's disease, Traumatic Brain Injury (TBI), Post-Traumatic Stress Disorder (PTSD), and neurodegenerative diseases such as Alzheimer's disease and dementia. However, passive audio recordings in natural settings often capture the speech of non-target speakers (cross-talk). Consequently, speaker separation, which identifies the target speaker's speech in audio recordings containing two or more speakers' voices, is a crucial pre-processing step in such scenarios. Prior speech separation methods analyzed raw audio. However, in order to preserve speaker privacy, passively recorded smartphone audio and machine learning-based speech assessment are often performed on derived speech features such as Mel-Frequency Cepstral Coefficients (MFCCs). In this paper, we propose Deep MFCC-bAsed SpeaKer Separation (Deep-MASKS), a novel approach that uses an autoencoder to reconstruct the MFCC components of an individual's speech from an i-vector, x-vector, or d-vector representation of their speech learned during the enrollment period. Deep-MASKS utilizes a Deep Neural Network (DNN) for MFCC signal reconstruction, which yields a more accurate, higher-order function compared to prior work that utilized a mask. Unlike prior work that operates on utterances, Deep-MASKS operates on continuous audio recordings. Deep-MASKS outperforms baselines, reducing the Mean Squared Error (MSE) of MFCC reconstruction by up to 44% and the number of additional bits required to represent clean speech entropy by 36%.
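To make the data flow concrete, the sketch below illustrates the general idea of a DNN that maps cross-talk MFCC frames, conditioned on a target speaker's enrollment embedding (e.g. a d-vector), to reconstructed clean-speech MFCCs. This is a minimal toy illustration with untrained weights, not the authors' implementation; the layer sizes, embedding dimension, and class name are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical dimensions (assumptions for illustration, not from the paper):
N_MFCC = 40    # MFCC coefficients per frame
EMB_DIM = 256  # speaker-embedding (e.g. d-vector) size
HIDDEN = 512   # hidden-layer width

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


class MFCCReconstructor:
    """Toy DNN: (mixed-MFCC frame ++ speaker embedding) -> clean-MFCC frame."""

    def __init__(self):
        d_in = N_MFCC + EMB_DIM
        self.W1 = rng.normal(0.0, 0.02, (d_in, HIDDEN))
        self.b1 = np.zeros(HIDDEN)
        self.W2 = rng.normal(0.0, 0.02, (HIDDEN, N_MFCC))
        self.b2 = np.zeros(N_MFCC)

    def forward(self, mixed_mfcc, speaker_emb):
        # mixed_mfcc: (T, N_MFCC) frames containing cross-talk;
        # speaker_emb: (EMB_DIM,) enrollment embedding of the target speaker.
        T = mixed_mfcc.shape[0]
        emb = np.tile(speaker_emb, (T, 1))  # condition every frame on the speaker
        h = relu(np.concatenate([mixed_mfcc, emb], axis=1) @ self.W1 + self.b1)
        return h @ self.W2 + self.b2  # reconstructed target-speaker MFCCs


# Forward pass on dummy data (weights are untrained; shapes only).
model = MFCCReconstructor()
mixed = rng.normal(size=(100, N_MFCC))  # 100 frames of mixed-speech MFCCs
dvec = rng.normal(size=(EMB_DIM,))      # target speaker's enrollment embedding
clean_hat = model.forward(mixed, dvec)
print(clean_hat.shape)  # one reconstructed MFCC vector per input frame
```

In training, such a network would minimize the MSE between `clean_hat` and the true clean-speech MFCCs — the metric the abstract reports improving by up to 44% — which is what distinguishes a learned reconstruction function from a simple multiplicative mask applied to the mixture.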