McCoy Thomas H, Perlis Roy H
Center for Quantitative Health, Massachusetts General Hospital, Boston, Massachusetts, USA.
Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, USA.
BMJ Ment Health. 2025 May 11;28(1):e301654. doi: 10.1136/bmjment-2025-301654.
We previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.
With the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.
From 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident and matched each with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined the aspects of model reasoning, that is, the topics in the chain of thought, associated with correct or incorrect predictions.
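The 5:1 control selection described above can be sketched as follows. This is an illustrative sketch only, not the study's actual matching procedure: the function names and the simple random draw are assumptions, and the study matched on criteria not shown here.

```python
import random

def match_controls(case_ids, control_pool, k=5, seed=0):
    """Illustrative sketch: draw k controls per case without replacement.
    The real study applied matching criteria beyond random selection."""
    rng = random.Random(seed)
    remaining = list(control_pool)
    matched = {}
    for case in case_ids:
        picks = rng.sample(remaining, k)  # k distinct controls for this case
        for p in picks:
            remaining.remove(p)          # controls are not reused across cases
        matched[case] = picks
    return matched

# Hypothetical identifiers, purely for demonstration
cases = ["case_a", "case_b"]
pool = [f"ctrl_{i}" for i in range(100)]
m = match_controls(cases, pool)
print(sum(len(v) for v in m.values()))  # prints 10 (5 controls per case)
```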
The cohort included 1995 individuals who died by suicide or accidental death and 9975 control individuals matched 5:1, totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Gray regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than that of the GPT-4o model (0.67 (0.66-0.68)). In chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure, and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect predictions.
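The c-statistic reported above measures discrimination: the probability that a randomly chosen case receives a higher model-estimated risk than a randomly chosen control. A minimal sketch of that computation, using hypothetical scores and outcomes rather than the study's data:

```python
def c_statistic(scores, events):
    """Probability that a randomly chosen event (1) is scored higher
    than a randomly chosen non-event (0); ties count as 0.5."""
    case_scores = [s for s, e in zip(scores, events) if e == 1]
    control_scores = [s for s, e in zip(scores, events) if e == 0]
    pairs = concordant = 0.0
    for cs in case_scores:
        for ns in control_scores:
            pairs += 1
            if cs > ns:
                concordant += 1
            elif cs == ns:
                concordant += 0.5
    return concordant / pairs

# Hypothetical model-estimated risks and observed outcomes (1 = death)
scores = [0.9, 0.7, 0.4, 0.3, 0.2, 0.8]
events = [1, 0, 0, 1, 0, 0]
print(round(c_statistic(scores, events), 3))  # prints 0.625
```

A value of 0.5 indicates chance-level discrimination and 1.0 perfect ranking, so the 0.64 versus 0.67 gap between the local and GPT-4o models represents a modest loss of ranking accuracy.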
Application of a reasoning model run on local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.
Smaller models can yield more secure, scalable and transparent risk prediction.