Callyope, Paris, France.
Department of Psychiatry, Saint-Antoine Hospital, Sorbonne University, Assistance publique - Hôpitaux de Paris, Paris, France.
J Med Internet Res. 2024 Oct 31;26:e58572. doi: 10.2196/58572.
While speech analysis holds promise for mental health assessment, research often focuses on single symptoms, even though symptoms co-occur and interact. In addition, predictive models in mental health rarely assess the limitations of speech-based systems, such as uncertainty and fairness, that must be understood for safe clinical deployment.
We investigated the predictive potential of mobile-collected speech data for detecting and estimating depression, anxiety, fatigue, and insomnia in the general population, focusing on factors other than mere accuracy.
We included 865 healthy adults and recorded their answers regarding their perceived mental and sleep states. We asked how they felt and whether they had slept well lately. Clinically validated questionnaires measuring depression, anxiety, insomnia, and fatigue severity were also used. We developed a novel speech and machine learning pipeline involving voice activity detection, feature extraction, and model training. We automatically modeled speech with deep learning models pretrained on a large, open, and free database, and we selected the best one on the validation set. Based on the best speech modeling approach, we evaluated clinical threshold detection, individual score prediction, model uncertainty estimation, and performance fairness across demographics (age, sex, and education). We used a train-validation-test split for all evaluations: to develop our models, select the best ones, and assess generalizability on held-out data.
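The feature pooling and class rebalancing steps mentioned in this pipeline can be illustrated with a minimal sketch. It assumes frame-level features are already available as a NumPy array (for example, from a pretrained speech encoder) and uses plain random oversampling with replacement; the function names, array shapes, and the oversampling variant are illustrative assumptions, not details reported by the study.

```python
import numpy as np


def max_pool_embedding(frames: np.ndarray) -> np.ndarray:
    """Collapse (n_frames, dim) frame-level features into one (dim,) vector
    by taking the maximum of each feature dimension over time."""
    return frames.max(axis=0)


def random_oversample(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Resample every class with replacement until it matches the size of
    the largest class. Assumption: simple random oversampling; the study
    does not state which oversampling variant was used."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for cls, count in zip(classes, counts):
        members = np.flatnonzero(y == cls)
        extra = rng.choice(members, size=target - count, replace=True)
        selected.append(np.concatenate([members, extra]))
    idx = np.concatenate(selected)
    return X[idx], y[idx]


# Hypothetical data: 10 recordings, each with 50 frames of 8-dim features.
rng = np.random.default_rng(42)
recordings = [rng.normal(size=(50, 8)) for _ in range(10)]
X = np.stack([max_pool_embedding(r) for r in recordings])  # shape (10, 8)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # imbalanced labels

X_bal, y_bal = random_oversample(X, y)
```

Oversampling should be applied to the training split only, so that the validation and test splits keep their original class balance.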
The best model was Whisper M with a max pooling and oversampling method. Our methods achieved good detection performance for all symptoms: depression (Patient Health Questionnaire-9: area under the curve [AUC]=0.76; F-score=0.49 and Beck Depression Inventory: AUC=0.78; F-score=0.65), anxiety (Generalized Anxiety Disorder 7-item scale: AUC=0.77; F-score=0.50), insomnia (Athens Insomnia Scale: AUC=0.73; F-score=0.62), and fatigue (Multidimensional Fatigue Inventory total score: AUC=0.68; F-score=0.88). The system also performed well when allowed to abstain from making predictions, as shown by low abstention rates for depression detection with the Beck Depression Inventory and for fatigue detection, with risk-coverage AUCs below 0.4. Individual symptom scores were accurately predicted (all correlations were significant, with Pearson coefficients between 0.31 and 0.49). Fairness analysis revealed that models were consistent across sex (average disparity ratio [DR] 0.86, SD 0.13), less consistent across education levels (average DR 0.47, SD 0.30), and least consistent across age groups (average DR 0.33, SD 0.30).
This study demonstrates the potential of speech-based systems for multifaceted mental health assessment in the general population, not only for detecting clinical thresholds but also for estimating symptom severity. Addressing fairness and incorporating uncertainty estimation with selective classification are key contributions that can enhance the clinical utility and responsible implementation of such systems.
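To make the selective classification and fairness metrics above more concrete, here is a minimal sketch of a risk-coverage curve and a disparity ratio. The abstract does not define either computation, so both are assumptions: risk is taken as the error rate among the most confident predictions at each coverage level, and the disparity ratio is taken as the lowest group-level metric divided by the highest. All function names and the toy data are hypothetical.

```python
import numpy as np


def risk_coverage_curve(y_true, y_pred, confidence):
    """Sort predictions from most to least confident, then report the
    error rate (risk) among the top-k predictions for each k.
    Coverage is the fraction of samples the model does not abstain on."""
    order = np.argsort(-confidence, kind="stable")
    errors = (y_true[order] != y_pred[order]).astype(float)
    kept = np.arange(1, len(errors) + 1)
    coverage = kept / len(errors)
    risk = np.cumsum(errors) / kept
    return coverage, risk


def risk_coverage_auc(coverage, risk):
    """Trapezoidal area under the risk-coverage curve (lower is better)."""
    return float(np.sum((risk[1:] + risk[:-1]) / 2 * np.diff(coverage)))


def disparity_ratio(metric_by_group):
    """Assumed definition: worst group score divided by best group score,
    so 1.0 means identical performance across groups."""
    values = list(metric_by_group.values())
    return min(values) / max(values)


# Toy example: the single error is also the least confident prediction.
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0])
confidence = np.array([0.9, 0.8, 0.55, 0.7])

coverage, risk = risk_coverage_curve(y_true, y_pred, confidence)
auc = risk_coverage_auc(coverage, risk)

# Hypothetical per-group scores, for illustration only.
dr = disparity_ratio({"group_a": 0.8, "group_b": 0.4})
```

In this toy case, abstaining on the least confident sample removes the only error, which is the behavior a low risk-coverage AUC is meant to reward.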