Holley Dan, Daly Brian, Beverly Briana, Wamsley Blaken, Brooks Amanda, Zaubler Tom
Clinical Operation, NeuroFlow, Philadelphia, PA, USA.
Drexel University, Philadelphia, PA, USA.
BMC Psychiatry. 2025 Aug 1;25(1):753. doi: 10.1186/s12888-025-07088-5.
Over 700,000 individuals die by suicide globally each year, with rapid progression from suicidal ideation (SI) to attempt often precluding opportunities for intervention. Digital behavioral health (DBH) platforms offer novel means of collecting SI indicators outside the clinic, but the actionable utility of these data may be limited by clinician-dependent workflows such as reviewing patients' journaling exercises for signs of SI. Large language models (LLMs) provide a methodology to streamline this task by rapidly risk-stratifying text based on the presence and severity of SI; however, this application has yet to be reliably evaluated. To test this approach, we first generated and validated a corpus of 125 synthetic journal responses to prompts from a real-world DBH platform. The responses varied on the presence and severity of suicidal ideation, readability, length, use of emojis, and other common language features, allowing for over 1 trillion feature permutations. Next, five collaborating behavioral health experts worked independently to stratify these responses as no-, low-, moderate-, or high-risk SI. Finally, we risk-stratified the responses using several tailored implementations of OpenAI's Generative Pretrained Transformer (GPT) models and compared the results to those of our raters. Using clinician consensus as "ground truth," our ensemble LLM performed significantly above chance (30.38%) in exact risk-assessment agreement (65.60%; χ² = 86.58). The ensemble model also aligned with 92% of clinicians' "do/do not intervene" decisions (Cohen's kappa = 0.84) and achieved 94% sensitivity and 91% specificity in that task. Additional results of precision-recall, time-to-decision, and cost analyses are reported. While further testing and exploration of ethical considerations remain critical, our results offer preliminary evidence that LLM-powered risk stratification can serve as a powerful and cost-effective tool to enhance suicide prevention frameworks.
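To make the evaluation concrete, the sketch below illustrates the general shape of such a pipeline: prompting a GPT model to assign one of four SI risk tiers to a journal entry, then scoring agreement against clinician consensus labels (exact-tier agreement, Cohen's kappa, sensitivity, and specificity on a binarized "do/do not intervene" decision). This is a minimal illustration under stated assumptions, not the authors' actual prompts, ensemble, or thresholds; the model name, prompt wording, fallback behavior, and the "intervene at moderate or high risk" cutoff are all illustrative assumptions.

```python
# Minimal sketch (not the authors' pipeline): one model call per journal entry,
# then agreement metrics against clinician consensus. Assumes the openai and
# scikit-learn packages and an OPENAI_API_KEY in the environment.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, confusion_matrix

RISK_TIERS = ["no", "low", "moderate", "high"]

client = OpenAI()

def stratify(entry: str, model: str = "gpt-4o") -> str:
    """Ask the model for a single SI risk tier; return one of RISK_TIERS."""
    response = client.chat.completions.create(
        model=model,  # illustrative model choice, not the paper's ensemble
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("You are a behavioral health triage assistant. "
                         "Classify the suicidal-ideation risk of the journal "
                         "entry as exactly one word: no, low, moderate, or high.")},
            {"role": "user", "content": entry},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in RISK_TIERS else "moderate"  # conservative fallback

def agreement_metrics(llm_labels, clinician_labels):
    """Exact-tier agreement plus binarized 'do/do not intervene' metrics."""
    exact = sum(a == b for a, b in zip(llm_labels, clinician_labels)) / len(llm_labels)
    # Assumed binarization: intervene on moderate- or high-risk responses.
    llm_bin = [tier in ("moderate", "high") for tier in llm_labels]
    clin_bin = [tier in ("moderate", "high") for tier in clinician_labels]
    kappa = cohen_kappa_score(clin_bin, llm_bin)
    tn, fp, fn, tp = confusion_matrix(clin_bin, llm_bin, labels=[False, True]).ravel()
    return {"exact_agreement": exact,
            "kappa": kappa,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}
```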