Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study.

Affiliation

Brightside Health, San Francisco, CA, United States.

Publication information

JMIR Ment Health. 2024 Aug 2;11:e58129. doi: 10.2196/58129.

Abstract

BACKGROUND

Due to recent advances in artificial intelligence, large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis and summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction.

OBJECTIVE

This study aimed to evaluate the performance of LLMs, specifically OpenAI's generative pretrained transformer 4 (GPT-4), in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform.

METHODS

Deidentified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI), and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients, treated during the same time period, who never endorsed SI. In total, 6 senior Brightside clinicians (3 psychologists and 3 psychiatrists) were shown patients' self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and other reported symptoms, including SI. They were asked a simple yes or no question regarding their prediction of endorsement of SI with plan, along with their confidence level about the prediction. GPT-4 was provided with similar information and asked to answer the same questions, enabling us to directly compare the performance of artificial intelligence and clinicians.
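
The abstract does not disclose the exact prompt wording, model configuration, or API settings used with GPT-4. As a minimal, purely illustrative sketch, a yes-or-no query of this kind might be issued as follows; the model string, prompt text, temperature, and function names are assumptions, not the study's protocol:

# Illustrative sketch only: the study's actual prompt, model settings, and data
# fields are not published in this abstract. All names below are assumptions.
from openai import OpenAI  # requires the openai Python package (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_si_with_plan(chief_complaint: str, attempt_history: str) -> str:
    """Ask GPT-4 for a yes/no prediction of SI with plan, plus a confidence level."""
    prompt = (
        "You are reviewing a telemental health intake.\n"
        f"Chief complaint: {chief_complaint}\n"
        f"Self-reported suicide attempt history: {attempt_history}\n"
        "Will this patient endorse suicidal ideation with a plan? "
        "Answer 'yes' or 'no', then state your confidence (low/medium/high)."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a reproducible comparison
    )
    return response.choices[0].message.content

# Example call with hypothetical intake text (not patient data)
print(predict_si_with_plan("Feeling hopeless and overwhelmed for months.", "One prior attempt"))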

RESULTS

Overall, the clinicians' average precision (0.7) was higher than that of GPT-4 (0.6) in identifying the SI with plan at intake (n=140) versus no SI (n=200) when using the chief complaint alone, while sensitivity was higher for the GPT-4 (0.62) than the clinicians' average (0.53). The addition of suicide attempt history increased the clinicians' average sensitivity (0.59) and precision (0.77) while increasing the GPT-4 sensitivity (0.59) but decreasing the GPT-4 precision (0.54). Performance decreased comparatively when predicting future SI with plan (n=120) versus no SI (n=200) with a chief complaint only for the clinicians (average sensitivity=0.4; average precision=0.59) and the GPT-4 (sensitivity=0.46; precision=0.48). The addition of suicide attempt history increased performance comparatively for the clinicians (average sensitivity=0.46; average precision=0.69) and the GPT-4 (sensitivity=0.74; precision=0.48).
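
For reference, sensitivity here is the proportion of true SI-with-plan cases that a rater flags, and precision is the proportion of flagged cases that truly are SI with plan. A minimal sketch of how such figures are computed from yes/no predictions follows; the labels are invented for illustration and are not study data:

# Minimal sketch of sensitivity and precision for binary yes/no predictions.
# The example labels below are invented for illustration, not study data.

def sensitivity_precision(y_true, y_pred):
    """y_true/y_pred are lists of 1 (SI with plan) and 0 (no SI)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed cases
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of true cases caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of flags that are correct
    return sensitivity, precision

# Toy example: 4 true cases, the rater catches 3 of them and raises 2 false alarms
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
print(sensitivity_precision(y_true, y_pred))  # (0.75, 0.6)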

CONCLUSIONS

GPT-4, with a simple prompt design, produced results on some metrics that approached those of a trained clinician. Additional work must be done before such a model can be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.

