Jin Nani, Ye Renjia, Li Peng
Materdicine Lab, School of Life Sciences, Shanghai University, Shanghai, China.
Research Department, Third Xiangya Hospital of Central South University, Changsha, China.
Front Psychiatry. 2025 Jan 28;16:1508772. doi: 10.3389/fpsyt.2025.1508772. eCollection 2025.
Depression is a serious mental health disorder. Traditional scale-based diagnostic methods are often highly subjective and prone to misdiagnosis, so developing automatic diagnostic tools based on objective indicators is particularly important.
This study proposes a deep learning method that fuses multimodal data to automatically diagnose depression from facial video and audio recordings. We use a spatiotemporal attention module to enhance the extraction of visual features, and combine a Graph Convolutional Network (GCN) with a Long Short-Term Memory (LSTM) network to analyze the audio features. Through multimodal feature fusion, the model can effectively capture the distinct feature patterns associated with depression.
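To make the pipeline concrete, the sketch below illustrates the three stages named above with minimal numpy stand-ins: attention-weighted pooling over video frames, one GCN propagation step over a graph of audio segments, and concatenation-based fusion feeding a regression head. All tensor shapes, weights, and the chain-graph adjacency are illustrative assumptions, and the LSTM is replaced here by mean pooling for brevity; the paper's actual architecture is not specified at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(frames, w):
    # frames: (T, D) per-frame visual features; w: (D,) attention query.
    # A stand-in for the spatiotemporal attention module: score each
    # frame, softmax over time, and return the attention-weighted sum.
    scores = softmax(frames @ w)          # (T,) attention weights
    return scores @ frames                # (D,) attended visual summary

def gcn_layer(X, A, W):
    # X: (N, Da) audio-segment node features; A: (N, N) adjacency with
    # self-loops. One row-normalised GCN propagation step with ReLU.
    A_norm = A / A.sum(axis=1, keepdims=True)
    return np.maximum(A_norm @ X @ W, 0)

# Toy dimensions (assumed, not from the paper).
T, N, Dv, Da, H = 8, 5, 16, 12, 10
visual = rng.normal(size=(T, Dv))          # video frame features
audio = rng.normal(size=(N, Da))           # audio segment features
A = np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)  # chain graph + self-loops

v_vec = temporal_attention_pool(visual, rng.normal(size=Dv))   # (Dv,)
a_vec = gcn_layer(audio, A, rng.normal(size=(Da, H))).mean(0)  # (H,)
# (the paper feeds GCN outputs to an LSTM; mean pooling stands in here)

fused = np.concatenate([v_vec, a_vec])     # multimodal feature fusion
phq8_hat = float(fused @ rng.normal(size=fused.shape[0]))  # regression head
```

With trained weights, `phq8_hat` would be the model's PHQ-8 severity estimate for one interview; here the weights are random, so only the data flow is meaningful.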
We conduct extensive experiments on a publicly available clinical dataset, the Extended Distress Analysis Interview Corpus (E-DAIC). The results show that our model performs robustly on E-DAIC, achieving a Mean Absolute Error (MAE) of 3.51 when estimating PHQ-8 scores from recorded interviews.
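For clarity on the reported metric: MAE is the mean of the absolute differences between the true and predicted PHQ-8 scores (which range from 0 to 24). The scores below are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical PHQ-8 scores for five interviews (not real E-DAIC data).
y_true = np.array([4, 10, 15, 7, 20])
y_pred = np.array([6, 9, 12, 7, 17])

# Mean Absolute Error: average magnitude of the prediction error.
mae = np.abs(y_true - y_pred).mean()
print(mae)  # 1.8
```

An MAE of 3.51 thus means the model's PHQ-8 estimate is off by about 3.5 points on average.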
Compared with existing methods, our model shows excellent performance in multimodal information fusion, making it well suited to the early assessment of depression.