Department of Applied Statistics, Yonsei University, Seoul, Republic of Korea.
Department of Psychiatry, Yonsei University College of Medicine, Seoul, Republic of Korea.
JMIR Form Res. 2024 Oct 24;8:e58418. doi: 10.2196/58418.
Recent advancements in large language models (LLMs) have accelerated their use across various domains. Psychiatric interviews, which are goal-oriented and structured, represent a significantly underexplored area where LLMs can provide substantial value. In this study, we explore the application of LLMs to enhance psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced traumatic events and mental health issues.
This study aims to investigate whether LLMs can (1) delineate parts of the conversation that suggest psychiatric symptoms and identify those symptoms, and (2) summarize stressors and symptoms based on the interview dialogue transcript.
Given the interview transcripts, we align the LLMs to perform 3 tasks: (1) extracting stressors from the transcripts, (2) delineating symptoms and the transcript sections indicative of them, and (3) summarizing the patients based on the extracted stressors and symptoms. These 3 tasks address the 2 objectives: symptom delineation draws on the output of the second task, and the interview summary incorporates the outputs of all 3 tasks. The transcript data were labeled by mental health experts for training and evaluating the LLMs.
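The three-task pipeline above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `call_llm` is a stub standing in for any chat-completion endpoint (e.g., GPT-4 Turbo), and the prompt texts are assumptions.

```python
# Hypothetical sketch of the 3-task interview-analysis pipeline.
# `call_llm` is a stub; a real system would send the prompt to an LLM API.

def call_llm(prompt: str) -> str:
    # Stub response so the pipeline structure is runnable end to end.
    return "LLM output for: " + prompt.splitlines()[0]

def extract_stressors(transcript: str) -> str:
    # Task 1: identify stressors mentioned in the interview.
    return call_llm("List the stressors mentioned in this transcript.\n" + transcript)

def delineate_symptoms(transcript: str) -> str:
    # Task 2: mark symptom-indicative sections and name each symptom.
    return call_llm("Mark the sections suggesting psychiatric symptoms "
                    "and name each symptom.\n" + transcript)

def summarize_patient(transcript: str, stressors: str, symptoms: str) -> str:
    # Task 3: summarize the patient using the outputs of tasks 1 and 2.
    prompt = ("Summarize the patient using these stressors and symptoms.\n"
              f"Stressors: {stressors}\nSymptoms: {symptoms}\n" + transcript)
    return call_llm(prompt)

def run_pipeline(transcript: str) -> dict:
    stressors = extract_stressors(transcript)
    symptoms = delineate_symptoms(transcript)
    summary = summarize_patient(transcript, stressors, symptoms)
    return {"stressors": stressors, "symptoms": symptoms, "summary": summary}
```

The key design point, as described in the abstract, is that the summarization prompt conditions on the outputs of the two upstream tasks rather than on the raw transcript alone.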
First, we present the performance of LLMs in estimating (1) the transcript sections related to psychiatric symptoms and (2) the names of the corresponding symptoms. In the zero-shot inference setting with the GPT-4 Turbo model, 73 out of 102 transcript segments achieved a recall mid-token distance of d&lt;20 when estimating the sections associated with the symptoms. For identifying the names of the corresponding symptoms, fine-tuning outperforms zero-shot inference with the GPT-4 Turbo model: on average, the fine-tuned model achieves an accuracy of 0.82, a precision of 0.83, a recall of 0.82, and an F1-score of 0.82. Second, the transcripts are used to generate summaries for each interviewee with LLMs. This generative task was evaluated using metrics such as Generative Evaluation (G-Eval) and Bidirectional Encoder Representations from Transformers Score (BERTScore). The summaries generated by the GPT-4 Turbo model, using both symptom and stressor information, achieve high average G-Eval scores: coherence of 4.66, consistency of 4.73, fluency of 2.16, and relevance of 4.67. Furthermore, retrieval-augmented generation did not lead to a significant improvement in performance.
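The two kinds of evaluation reported above can be illustrated with short metric functions. These are sketches under assumptions: the mid-token distance is taken here as the absolute distance between the midpoints of the predicted and gold token spans (the abstract does not give the exact definition), and the macro-averaged precision/recall/F1 is standard multiclass scoring, not the authors' exact protocol.

```python
def mid_token_distance(pred_span, gold_span):
    """Distance between the midpoints of a predicted and a gold token span.

    Spans are (start, end) token indices. A predicted section would count
    as a hit when this distance is below a threshold such as d < 20.
    (Definition assumed from the abstract's description.)
    """
    pred_mid = (pred_span[0] + pred_span[1]) / 2
    gold_mid = (gold_span[0] + gold_span[1]) / 2
    return abs(pred_mid - gold_mid)


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for symptom-name labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        n_pred = sum(p == lab for p in y_pred)  # predicted positives
        n_true = sum(t == lab for t in y_true)  # actual positives
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
    p = sum(precisions) / len(labels)
    r = sum(recalls) / len(labels)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, a predicted span of tokens (110, 150) against a gold span of (120, 160) gives a mid-token distance of 10, which would count as a hit under d&lt;20.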
LLMs, using either (1) appropriate prompting techniques or (2) fine-tuning methods with data labeled by mental health experts, achieved an accuracy of over 0.8 for the symptom delineation task when measured across all segments in the transcript. Additionally, they attained a G-Eval score of over 4.6 for coherence in the summarization task. This research contributes to the emerging field of applying LLMs in psychiatric interviews and demonstrates their potential effectiveness in assisting mental health practitioners.