Xu Sean Shensheng, Ke Xiaoquan, Mak Man-Wai, Wong Ka Ho, Meng Helen, Kwok Timothy C Y, Gu Jason, Zhang Jian, Tao Wei, Chang Chunqi
School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China.
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessment (MoCA) sessions.
This paper proposes three enhancements to conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, a multi-scale channel-interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments, producing a speaker-turn-aware scoring matrix for the subsequent clustering step. Third, to further enhance diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn-aware scoring matrix contains both local and global information across the segments.
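The SE unit mentioned above can be illustrated with a minimal NumPy sketch of squeeze-and-excitation channel re-weighting: global average pooling over time ("squeeze"), a small bottleneck network with a sigmoid gate ("excitation"), and per-channel scaling. The shapes, the plain two-layer bottleneck, and the function name are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def squeeze_excitation(features, w1, w2):
    """Channel re-weighting in the style of a squeeze-and-excitation (SE) unit.

    features: (channels, frames) feature map from one Res2Net-style block.
    w1, w2:   bottleneck weight matrices (hypothetical shapes, for illustration).
    """
    # Squeeze: global average pooling over the time axis -> one value per channel.
    z = features.mean(axis=1)                                  # (channels,)
    # Excitation: ReLU bottleneck followed by a sigmoid gate in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # (channels,)
    # Scale: re-weight each channel by its learned importance.
    return features * s[:, None]

rng = np.random.default_rng(0)
C, T, r = 8, 50, 2                  # channels, frames, bottleneck reduction
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excitation(x, w1, w2)
print(y.shape)                      # (8, 50)
```

Because the gate is a sigmoid, each output channel is the input channel scaled by a factor in (0, 1), which lets the embedding extractor emphasize channels that are informative about speaker identity.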
Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.
The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.