Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments.

Authors

Zhang Xu, Zhang Xiangcheng, Chen Weisi, Li Chenlong, Yu Chengyuan

Affiliations

School of Software Engineering, Xiamen University of Technology, Xiamen, 361024, China.

School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, 361024, China.

Publication information

Sci Rep. 2024 Apr 25;14(1):9543. doi: 10.1038/s41598-024-60278-1.

DOI:10.1038/s41598-024-60278-1

PMID:38664511

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11045867/

Abstract

Depression is a pervasive mental disorder worldwide that profoundly affects daily life. Although many deep learning studies have addressed depression detection through speech analysis, the shortage of large annotated datasets hampers the development of effective models. To address this challenge, we introduce a transfer learning approach for detecting depression from speech that aims to overcome the constraints of limited resources. For feature representation, we obtain depression-related features by fine-tuning wav2vec 2.0. By integrating a 1D-CNN with attention pooling, we generate high-level features at the segment level, enhancing the model's ability to capture temporal relationships within audio frames. For prediction, we combine an LSTM with a self-attention mechanism, which assigns greater weights to segments associated with depression and thereby improves the model's ability to discern depression-related information. In our experiments, the model achieves F1 scores of 79% on the DAIC-WOZ dataset and 90.53% on the CMDC dataset, outperforming recent baseline models for speech-based depression detection. This offers a promising solution for effective depression detection in low-resource environments.
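The abstract does not spell out how attention pooling turns frame-level features into a segment-level vector. As a rough illustration only, the sketch below assumes dot-product scoring against a single query vector, which is one common formulation and not necessarily the authors'. The names `attention_pool` and `query` are hypothetical, and in a trained model the query would be learned rather than fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame vectors (each of length D) into one segment vector.

    Assumption: each frame is scored by its dot product with `query`;
    the paper's actual scoring function may differ.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of the frames, computed one feature dimension at a time.
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


# Toy input: three 2-dimensional "frames" and a hand-picked query vector.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
segment_vector, weights = attention_pool(frames, query)
```

Frames that align with the query receive larger weights, so they dominate the pooled vector. The same weighting idea, applied across segments rather than frames, is what lets a model emphasize the segments most associated with depression.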
