Leveraging large language models for automated depression screening.

Author Information

Teferra Bazen Gashaw, Perivolaris Argyrios, Hsiang Wei-Ni, Sidharta Christian Kevin, Rueda Alice, Parkington Karisa, Wu Yuqi, Soni Achint, Samavi Reza, Jetly Rakesh, Zhang Yanbo, Cao Bo, Rambhatla Sirisha, Krishnan Sri, Bhat Venkat

Affiliations

Interventional Psychiatry Program, St. Michael's Hospital, Unity Health Toronto, Toronto, Ontario, Canada.

Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada.

Publication Information

PLOS Digit Health. 2025 Jul 28;4(7):e0000943. doi: 10.1371/journal.pdig.0000943. eCollection 2025 Jul.

Abstract

Mental health diagnoses present unique challenges that often lead to nuanced difficulties in managing an individual's well-being and daily functioning. Self-report questionnaires are common practice in clinical settings to help mitigate the challenges involved in screening for mental health disorders. However, these questionnaires rely on an individual's subjective responses, which can be influenced by various factors. Despite advances in Large Language Models (LLMs), quantifying self-reported experiences with natural language processing has yielded imperfect accuracy. This project aims to demonstrate the effectiveness of zero-shot LLMs for screening depression and assessing individual item scales. The DAIC-WOZ is a publicly available mental health dataset containing textual data from clinical interviews and self-report questionnaires with relevant mental health disorder labels. The RISEN prompt-engineering framework was used to evaluate the LLMs' effectiveness in predicting depression symptoms based on individual PHQ-8 items. Several LLMs, including GPT models, Llama3_8B, Cohere, and Gemini, were assessed on performance. The GPT models, especially GPT-4o, consistently outperformed the other LLMs (Llama3_8B, Cohere, Gemini) across all eight items of the PHQ-8 scale in accuracy (M = 75.9%) and F1 score (0.74). GPT models were able to predict PHQ-8 items related to emotional and cognitive states, Llama3_8B demonstrated superior detection of anhedonia-related symptoms, and the Cohere LLM's strength was identifying and predicting psychomotor-activity symptoms. This study provides a novel outlook on the potential of LLMs for predicting self-reported questionnaire scores from textual interview data. The promising preliminary performance of the various models indicates that they could effectively assist in depression screening. Further research is needed to establish a framework for determining which LLM should be used for specific mental health symptoms and other disorders, and analysis of additional datasets together with model fine-tuning should be explored.
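
To make the zero-shot setup concrete, below is a minimal sketch of how a single PHQ-8 item might be scored from an interview transcript with a chat-completion LLM. The prompt layout follows one common reading of the RISEN framework (Role, Instructions, Steps, End goal, Narrowing); the exact wording, the 0-3 rubric, the score_phq8_item helper, and the use of the OpenAI gpt-4o chat API are illustrative assumptions, not the authors' published pipeline.

```python
# Minimal sketch: zero-shot scoring of one PHQ-8 item from interview text.
# Assumptions (not from the paper): the prompt wording, the 0-3 rubric, and
# the use of the OpenAI chat completions API with the gpt-4o model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrased PHQ-8 item descriptions, keyed by item number.
PHQ8_ITEMS = {
    1: "little interest or pleasure in doing things (anhedonia)",
    2: "feeling down, depressed, or hopeless",
    3: "trouble falling or staying asleep, or sleeping too much",
    4: "feeling tired or having little energy",
    5: "poor appetite or overeating",
    6: "feeling bad about yourself, or feeling like a failure",
    7: "trouble concentrating on things",
    8: "moving or speaking slowly, or being fidgety or restless (psychomotor)",
}

def score_phq8_item(transcript: str, item: int) -> str:
    """Ask the model for a single PHQ-8 item score (0-3) given a transcript."""
    prompt = (
        "Role: You are a clinical assistant screening for depression.\n"
        "Instruction: Read the interview transcript and rate the participant on "
        f"the PHQ-8 item '{PHQ8_ITEMS[item]}'.\n"
        "Steps: Quote the supporting evidence, then give a score from 0 "
        "(not at all) to 3 (nearly every day).\n"
        "End goal: Output only the final integer score on the last line.\n"
        "Narrowing: Base the rating only on the transcript; do not speculate.\n\n"
        f"Transcript:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    # Return the last line of the reply, which the prompt asks to be the score.
    return response.choices[0].message.content.strip().splitlines()[-1]
```

In the study's setting, per-item predictions of this kind would then be compared against the participants' self-reported PHQ-8 answers in DAIC-WOZ to compute the per-item accuracy and F1 scores reported above.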

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c695/12303271/297e5b070aa7/pdig.0000943.g001.jpg
