Zhao Shiwen, Zhang Yunze, Su Yikai, Su Kaifeng, Liu Jiemin, Wang Tao, Yu Shiqi
HACI Laboratory, Sydney Smart Technology College, Northeastern University, Shenyang 110167, China.
School of Computer Science and Engineering, Northeastern University, Shenyang 110167, China.
Sensors (Basel). 2025 Jul 21;25(14):4520. doi: 10.3390/s25144520.
The global prevalence of depression necessitates the application of technological solutions, particularly sensor-based systems, to augment scarce resources for early diagnostic purposes. In this study, we use benchmark datasets that contain multimodal data including video, audio, and transcribed text. To address depression detection as a chronic long-term disorder reflected by temporal behavioral patterns, we propose a novel framework that segments videos into utterance-level instances using GRU for contextual representation, and then constructs graphs where utterance embeddings serve as nodes connected through dual relationships capturing both chronological development and intermittent relevant information. Graph neural networks are employed to learn multi-dimensional edge relationships and align multimodal representations across different temporal dependencies. Our approach achieves superior performance with an MAE of 5.25 and RMSE of 6.75 on AVEC2014, and CCC of 0.554 and RMSE of 4.61 on AVEC2019, demonstrating significant improvements over existing methods that focus primarily on momentary expressions.
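The dual-relationship graph construction described in the abstract can be sketched as follows. This is a minimal illustration only: the toy embeddings, the use of cosine similarity, and the top-k selection of "intermittent" neighbors are assumptions introduced here for clarity, not the paper's exact formulation (the paper learns multi-dimensional edge relationships with a graph neural network over GRU-encoded utterance embeddings).

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_dual_edges(embeddings, k=1):
    """Build two edge sets over utterance nodes:
    - chronological edges link consecutive utterances (temporal development);
    - intermittent edges link each utterance to its k most similar
      non-adjacent utterances (hypothetical similarity-based criterion).
    """
    n = len(embeddings)
    chrono = [(i, i + 1) for i in range(n - 1)]
    intermittent = []
    for i in range(n):
        # Rank all non-adjacent utterances by similarity to utterance i.
        sims = sorted(
            ((cosine(embeddings[i], embeddings[j]), j)
             for j in range(n) if abs(i - j) > 1),
            reverse=True,
        )
        intermittent.extend((i, j) for _, j in sims[:k])
    return chrono, intermittent

# Toy utterance embeddings (stand-ins for GRU contextual representations).
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
chrono, intermittent = build_dual_edges(emb, k=1)
print(chrono)        # sequential edges: (0,1), (1,2), (2,3)
print(intermittent)  # one similarity edge per node to a non-adjacent node
```

The two edge sets would then be passed, with per-edge features, to a GNN that aligns the multimodal node representations across the two kinds of temporal dependency.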