Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.
Affiliations
Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, India.
Department of Computer Science & Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India.
Publication information
JMIR Ment Health. 2024 Jul 23;11:e57306. doi: 10.2196/57306.
BACKGROUND
Comprehensive session summaries enable effective continuity in mental health counseling and facilitate informed therapy planning. However, manual summarization is a significant burden, diverting experts' attention from the core counseling process. Leveraging advances in automatic summarization to streamline this process addresses the issue, as it gives mental health professionals access to concise summaries of lengthy therapy sessions and thereby increases their efficiency. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions.
OBJECTIVE
This study evaluates the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance.
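As a concrete illustration of aspect-based (counseling-component-guided) summarization, the sketch below composes a prompt that restricts a model to a single counseling component. The component names and prompt wording are hypothetical placeholders; the abstract does not specify the study's actual components or instructions.

```python
# Hypothetical sketch of counseling-component-guided (aspect-based) prompting.
# The component names and prompt wording below are illustrative assumptions,
# not the study's actual definitions.

COUNSELING_COMPONENTS = {
    "symptoms_and_history": "the client's reported symptoms and their history",
    "discussion_topics": "the main topics discussed during the session",
    "counselor_interventions": "the techniques and suggestions the counselor offered",
}

def build_prompt(transcript: str, component: str) -> str:
    """Compose an instruction that restricts the summary to one counseling aspect."""
    focus = COUNSELING_COMPONENTS[component]
    return (
        "Summarize the following counseling session, focusing only on "
        f"{focus}. Omit content unrelated to this aspect.\n\n"
        f"Session transcript:\n{transcript}\n\n"
        "Focused summary:"
    )

# Example: one transcript yields 3 component-guided prompts, one per aspect.
prompts = {c: build_prompt("Counselor: ... Client: ...", c)
           for c in COUNSELING_COMPONENTS}
```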
METHODS
We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs in addressing the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.
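A minimal sketch of the quantitative evaluation step follows, assuming the open-source rouge-score and bert-score Python packages; the abstract names the metrics but not the implementation or configuration actually used.

```python
# Sketch of scoring one generated summary against its reference summary.
# Library choices (rouge-score, bert-score) and the use of F-measures are
# assumptions, not the study's confirmed tooling.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(reference: str, generated: str) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    rouge = scorer.score(reference, generated)
    # bert_score takes parallel lists of candidates and references and
    # returns precision, recall, and F1 tensors.
    _, _, f1 = bert_score([generated], [reference], lang="en")
    return {
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-2": rouge["rouge2"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
        "BERTScore-F1": float(f1[0]),
    }
```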
RESULTS
Our findings demonstrated the superior performance of task-specific LLMs, such as MentalLlama, Mistral, and MentalBART, across all counseling components when evaluated with standard quantitative metrics: Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score (BERTScore). Furthermore, expert evaluation revealed that Mistral surpassed both MentalLlama and MentalBART across 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models share a common weakness: all leave room for improvement on the opportunity costs and perceived effectiveness parameters.
CONCLUSIONS
While LLMs fine-tuned specifically on mental health domain data display better performance based on automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical application. Further refinement and validation are necessary before their implementation in practice.