Samanvith Thotapalli, Musa Yilanli, Ian McKay, William Leever, Eric Youngstrom, Karah Harvey-Nuckles, Kimberly Lowder, Steffanie Schweitzer, Erin Sunderland, Daniel I Jackson, Emre Sezgin
Department of Psychiatry and Behavioral Health, The Ohio State University, Columbus, Ohio, USA.
Nationwide Children's Hospital, Columbus, Ohio, USA.
PCN Rep. 2025 Jul 15;4(3):e70159. doi: 10.1002/pcn5.70159. eCollection 2025 Sep.
Large language models, such as GPT-4, are increasingly integrated into healthcare to support clinicians in making informed decisions. Given ChatGPT's potential, it is necessary to explore such applications as support tools, particularly within mental health telephone triage services. This study evaluates whether GPT models can accurately triage psychiatric emergency vignettes and compares their performance to that of clinicians.
A cross-sectional study was performed to assess the performance of three GPT-4 models (GPT-4o, GPT-4o Mini, and GPT-4 Legacy) in psychiatric emergency triage. Twenty-two psychiatric emergency vignettes, intended to represent realistic prehospital triage scenarios, were initially drafted using ChatGPT and subsequently reviewed and refined by the research team to ensure clinical accuracy and relevance. The GPT-4 models independently generated clinical responses to the vignettes over three iterations to ensure consistency. Two advanced practice nurse practitioners then independently assessed these responses using a 3-point Likert-type scale for the main triage criteria: risk level (low = 1 to high = 3), necessity of hospital admission (no = 1; yes = 2), and urgency of clinical evaluation (low = 1 to high = 3). The nurse practitioners also provided their own clinical judgments independently for each vignette. Interrater reliability was evaluated by comparing the responses generated by the GPT models with the nurse practitioners' independent clinical assessments, and agreement was quantified using Cohen's kappa. A clinical expert committee (n = 3) conducted qualitative analyses of the GPT models' responses using a systematic coding method to evaluate triage accuracy, clarity, completeness, and total score, focusing on the same three triage criteria and scales described above.
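As an illustration of the agreement analysis described above, the following is a minimal Python sketch (not the study's actual code) of computing Cohen's kappa between a GPT model's admission ratings and a clinician's independent ratings; all ratings shown are hypothetical.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)     # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical admission ratings for 10 vignettes (no = 1, yes = 2)
gpt       = [2, 1, 2, 2, 1, 2, 1, 2, 2, 1]
clinician = [2, 1, 2, 1, 1, 2, 1, 2, 2, 1]
print(f"Cohen's kappa (admission): {cohens_kappa(gpt, clinician):.2f}")
```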
The GPT models had a mean admission score of 1.73 (standard deviation [SD] = 0.45; scale: no = 1, yes = 2), indicating a general trend toward recommending hospital admission. Risk (mean = 2.12, SD = 0.83) and urgency (mean = 2.27, SD = 0.44) assessments suggested moderate-to-high perceived risk and urgency (scale: low = 1, high = 3), reflecting conservative decision-making. Interrater reliability between clinicians and the GPT-4 models was substantial across the three triage criteria, with Cohen's kappa values of 0.77 (admission), 0.78 (risk), and 0.76 (urgency). Mean triage-criteria scores for the GPT-4 models and clinicians exhibited consistent patterns with minimal variability. Overall, the GPT models tended toward slight over-triage, indicated by four false-positive admission recommendations and zero false negatives.
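To make the over-triage tabulation concrete, the sketch below (again with hypothetical data, not the study's) treats the clinician's admission decision as the reference: a false positive is a GPT "admit" where the clinician said "do not admit", and a false negative is the reverse.

```python
# Hypothetical admission decisions (no = 1, yes = 2)
gpt       = [2, 2, 1, 2, 2, 1, 2, 1]
clinician = [2, 1, 1, 2, 2, 1, 2, 1]

# False positive: GPT recommends admission, clinician does not (over-triage)
fp = sum(g == 2 and c == 1 for g, c in zip(gpt, clinician))
# False negative: GPT recommends against admission, clinician admits (under-triage)
fn = sum(g == 1 and c == 2 for g, c in zip(gpt, clinician))

print(f"False positives (over-triage): {fp}")
print(f"False negatives (under-triage): {fn}")
```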
This study indicates that GPT models may serve as decision-support tools in mental health telephone triage, particularly for psychiatric emergencies. Although response variability across iterations was minimal, most discrepancies in admission decisions were false positives, suggesting that GPT models may tend to over-triage relative to clinician judgment. Further investigation is needed to establish robust structures that increase alignment with clinical decisions and improve response relevance in clinical practice.