Xu Sean Shensheng, Ke Xiaoquan, Mak Man-Wai, Wong Ka Ho, Meng Helen, Kwok Timothy C Y, Gu Jason, Zhang Jian, Tao Wei, Chang Chunqi
School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China.
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessment (MoCA) sessions.
This paper proposes three enhancements to conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, a multi-scale channel-interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments, producing a speaker-turn-aware scoring matrix for the subsequent clustering step. Third, to further enhance diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn-aware scoring matrix contains both local and global information across the segments.
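The SE unit mentioned above can be illustrated with a minimal NumPy sketch of squeeze-and-excitation channel re-weighting: global average pooling over time ("squeeze"), a small bottleneck network with a sigmoid gate ("excitation"), and per-channel scaling. The shapes, the plain two-layer bottleneck, and the function name are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def squeeze_excitation(features, w1, w2):
    """Channel re-weighting in the style of a squeeze-and-excitation (SE) unit.

    features: (channels, frames) feature map from one Res2Net-style block.
    w1, w2:   bottleneck weight matrices (hypothetical shapes, for illustration).
    """
    # Squeeze: global average pooling over the time axis -> one value per channel.
    z = features.mean(axis=1)                                  # (channels,)
    # Excitation: ReLU bottleneck followed by a sigmoid gate in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # (channels,)
    # Scale: re-weight each channel by its learned importance.
    return features * s[:, None]

rng = np.random.default_rng(0)
C, T, r = 8, 50, 2                  # channels, frames, bottleneck reduction
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excitation(x, w1, w2)
print(y.shape)                      # (8, 50)
```

Because the gate is a sigmoid, each output channel is the input channel scaled by a factor in (0, 1), which lets the embedding extractor emphasize channels that are informative about speaker identity.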
Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.
The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.