Department of Population Health Sciences, School of Life Course and Population Sciences, Faculty of Life Sciences & Medicine, King's College London, London, SE1 1UL, United Kingdom.
Department of Intelligent Medical Engineering, School of Biomedical Engineering, Anhui Medical University, Hefei, 230032, China.
J Am Med Inform Assoc. 2024 Oct 1;31(10):2394-2404. doi: 10.1093/jamia/ocae189.
This study is a systematic review and meta-analysis of the diagnostic accuracy of deep learning (DL) for detecting depression from speech samples.
This review included studies reporting diagnostic results of DL algorithms for depression using speech data, published from database inception to January 31, 2024, in PubMed, Medline, Embase, PsycINFO, Scopus, IEEE, and Web of Science. Pooled accuracy, sensitivity, and specificity were obtained with random-effects models. The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was used to assess the risk of bias.
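As an illustration of what random-effects pooling involves, the following is a minimal sketch using the DerSimonian-Laird estimator. The abstract does not state which estimator, transformation, or software the review used, so the choice of estimator, the function names, and the per-study numbers below are all assumptions made for illustration only; they are not data from the included studies.

```python
import math


def proportion_variance(p, n):
    """Approximate binomial variance of a proportion p observed on n samples."""
    return p * (1.0 - p) / n


def pool_random_effects(estimates, variances):
    """Pool per-study estimates with the DerSimonian-Laird random-effects model.

    Returns (pooled estimate, 95% confidence interval, between-study variance tau^2).
    """
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2


# Hypothetical per-study accuracies and test-set sizes (NOT from the review).
accuracies = [0.84, 0.91, 0.78, 0.88]
sample_sizes = [120, 200, 90, 150]
variances = [proportion_variance(p, n) for p, n in zip(accuracies, sample_sizes)]

pooled, ci, tau2 = pool_random_effects(accuracies, variances)
print(f"pooled accuracy = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), tau^2 = {tau2:.4f}")
```

In practice, meta-analyses of proportions often apply a logit or double-arcsine transformation before pooling; this sketch pools raw proportions to keep the arithmetic visible.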
A total of 25 studies met the inclusion criteria, and 8 of them were included in the meta-analysis. The pooled estimates of accuracy, specificity, and sensitivity for depression detection models were 0.87 (95% CI, 0.81-0.93), 0.85 (95% CI, 0.78-0.91), and 0.82 (95% CI, 0.71-0.94), respectively. When stratified by model structure, the highest pooled diagnostic accuracy was 0.89 (95% CI, 0.81-0.97), in the handcrafted-feature group.
To our knowledge, this is the first meta-analysis of the diagnostic performance of DL for depression detection from speech samples. All studies included in the meta-analysis used convolutional neural network (CNN) models, which makes it difficult to assess the performance of other DL algorithms. Models using handcrafted features performed better than end-to-end models in speech-based depression detection.
DL applied to speech offers a useful tool for depression detection. CNN models with handcrafted acoustic features could help improve diagnostic performance.
The study protocol was registered on PROSPERO (CRD42023423603).