
Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data.

Affiliations

School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China.

Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China.

Publication Information

Sensors (Basel). 2024 Jun 7;24(12):3714. doi: 10.3390/s24123714.

Abstract

Depression is a major psychological disorder with a growing impact worldwide. Traditional methods for detecting the risk of depression, predominantly reliant on psychiatric evaluations and self-assessment questionnaires, are often criticized for their inefficiency and lack of objectivity. Advancements in deep learning have paved the way for innovations in depression risk detection methods that fuse multimodal data. This paper introduces a novel framework, the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), designed to amalgamate auditory, visual, and textual cues for a comprehensive analysis of depression risk. Our approach encompasses three dedicated branches (Audio Branch, Video Branch, and Text Branch), each responsible for extracting salient features from the corresponding modality. These features are subsequently fused through a multimodal fusion (MMF) module, yielding a robust feature vector that feeds into a predictive modeling layer. To further our research, we devised an emotion elicitation paradigm based on two distinct tasks (reading and interviewing), implemented to gather a rich, sensor-based depression risk detection dataset. The sensory equipment, such as cameras, captures subtle facial expressions and vocal characteristics essential for our analysis. The research thoroughly investigates the data generated by varying emotional stimuli and evaluates the contribution of different tasks to emotion evocation. In our experiments, the AVTF-TBN model performs best when the data from the two tasks are used simultaneously for detection, achieving an F1 score of 0.78, a precision of 0.76, and a recall of 0.81. Our experimental results confirm the validity of the paradigm and demonstrate the efficacy of the AVTF-TBN model in detecting depression risk, showcasing the crucial role of sensor-based data in mental health detection.
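The data flow the abstract describes (per-modality feature extraction, fusion into a single vector, then a prediction head) can be sketched minimally as follows. This is not the authors' AVTF-TBN implementation: the abstract does not specify the internals of the branches or the MMF module, so the sketch assumes simple concatenation-based fusion, stands in random projections for the learned branch networks, and uses arbitrary feature dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch(x, out_dim=32):
    """Stand-in for a learned per-modality branch: a fixed random
    projection followed by a nonlinearity, just to illustrate shapes."""
    w = rng.normal(size=(x.shape[-1], out_dim))
    return np.tanh(x @ w)

# Hypothetical per-sample features for a batch of 4 recordings.
audio = rng.normal(size=(4, 128))   # e.g. acoustic descriptors
video = rng.normal(size=(4, 256))   # e.g. facial-expression features
text  = rng.normal(size=(4, 64))    # e.g. transcript embeddings

# Fusion: the abstract's MMF module combines the three branch outputs;
# concatenation is one common (assumed) choice.
fused = np.concatenate([branch(audio), branch(video), branch(text)], axis=-1)

# Prediction head: a logistic layer mapping the fused vector to a risk score.
w_out = rng.normal(size=(fused.shape[-1],))
risk = 1.0 / (1.0 + np.exp(-(fused @ w_out)))

print(fused.shape)  # (4, 96)
print(risk.shape)   # (4,)
```

As a sanity check on the reported metrics, the standard relation F1 = 2PR/(P + R) with P = 0.76 and R = 0.81 gives about 0.784, consistent with the reported F1 of 0.78.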


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/35dc/11207438/a365e77e6bd2/sensors-24-03714-g001.jpg
