


Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.

Affiliations

Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, India.

Department of Computer Science & Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India.

Publication Information

JMIR Ment Health. 2024 Jul 23;11:e57306. doi: 10.2196/57306.

DOI: 10.2196/57306
PMID: 39042893
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11303879/
Abstract

BACKGROUND

Comprehensive session summaries enable effective continuity in mental health counseling and facilitate informed therapy planning. However, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. Leveraging advances in automatic summarization to streamline this process addresses the issue: it gives mental health professionals concise summaries of lengthy therapy sessions, thereby increasing their efficiency. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions.

OBJECTIVE

This study evaluates the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance.

METHODS

We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs in addressing the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.
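Counseling-component-guided summarization of the kind described above amounts to steering a summarizer toward one aspect of the session at a time. A minimal prompt-construction sketch is shown below; the component names and descriptions are illustrative placeholders, not the paper's actual three-component taxonomy:

```python
def build_component_prompt(transcript: str, component: str) -> str:
    """Compose an aspect-based summarization prompt for one counseling component.

    NOTE: these component names/descriptions are hypothetical examples,
    not the taxonomy used in the benchmark data set.
    """
    components = {
        "symptoms": "the symptoms and concerns the client describes",
        "interventions": "the techniques and suggestions the counselor offers",
        "progress": "changes in the client's state across the session",
    }
    if component not in components:
        raise ValueError(f"unknown component: {component}")
    return (
        "You are summarizing a mental health counseling session.\n"
        f"Focus only on {components[component]}.\n"
        "Ignore content outside this aspect.\n\n"
        f"Session transcript:\n{transcript}\n\nSummary:"
    )
```

The same transcript can then be summarized several times, once per component, and each prompt sent to the LLM under evaluation.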

RESULTS

Our findings demonstrated the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART across all counseling components, evaluated using standard quantitative metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score (BERTScore). Furthermore, expert evaluation revealed that Mistral outperformed both MentalLlama and MentalBART across 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models share a common weakness: room for improvement on the opportunity costs and perceived effectiveness parameters.
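The ROUGE metrics reported here measure n-gram and longest-common-subsequence overlap between a generated summary and a reference summary. A minimal pure-Python sketch of ROUGE-1 and ROUGE-L recall, assuming simple whitespace tokenization (production evaluations typically use a dedicated package such as `rouge-score`):

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    return sum((ref & cand).values()) / sum(ref.values())

def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L recall: longest common subsequence length over reference length."""
    r, c = reference.lower().split(), candidate.lower().split()
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, rt in enumerate(r, 1):
        for j, ct in enumerate(c, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if rt == ct else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)] / len(r) if r else 0.0
```

For example, `rouge_n("the client reported reduced anxiety", "client reported less anxiety")` is 0.6, since 3 of the 5 reference unigrams appear in the candidate.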

CONCLUSIONS

While LLMs fine-tuned specifically on mental health domain data display better performance based on automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical application. Further refinement and validation are necessary before their implementation in practice.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e2f/11303879/0755844f2fbb/mental_v11i1e57306_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e2f/11303879/e7f11e10a1f2/mental_v11i1e57306_fig2.jpg

Similar Articles

1
Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.
JMIR Ment Health. 2024 Jul 23;11:e57306. doi: 10.2196/57306.
2
Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences.
BMC Med Inform Decis Mak. 2024 Mar 14;24(1):75. doi: 10.1186/s12911-024-02481-8.
3
Exploiting Intersentence Information for Better Question-Driven Abstractive Summarization: Algorithm Development and Validation.
JMIR Med Inform. 2022 Aug 15;10(8):e38052. doi: 10.2196/38052.
4
Knowledge-Infused Abstractive Summarization of Clinical Diagnostic Interviews: Framework Development Study.
JMIR Ment Health. 2021 May 10;8(5):e20865. doi: 10.2196/20865.
5
Impact of a Digital Scribe System on Clinical Documentation Time and Quality: Usability Study.
JMIR AI. 2024 Sep 23;3:e60020. doi: 10.2196/60020.
6
Leveraging GPT-4 for food effect summarization to enhance product-specific guidance development via iterative prompting.
J Biomed Inform. 2023 Dec;148:104533. doi: 10.1016/j.jbi.2023.104533. Epub 2023 Nov 2.
7
Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts.
Res Sq. 2023 Oct 30:rs.3.rs-3483777. doi: 10.21203/rs.3.rs-3483777/v1.
8
Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
9
Development and Evaluation of a Digital Scribe: Conversation Summarization Pipeline for Emergency Department Counseling Sessions towards Reducing Documentation Burden.
medRxiv. 2023 Dec 7:2023.12.06.23299573. doi: 10.1101/2023.12.06.23299573.
10
Large Language Models for Mental Health Applications: Systematic Review.
JMIR Ment Health. 2024 Oct 18;11:e57400. doi: 10.2196/57400.

Cited By

1
Multimodal Sensing-Enabled Large Language Models for Automated Emotional Regulation: A Review of Current Technologies, Opportunities, and Challenges.
Sensors (Basel). 2025 Aug 1;25(15):4763. doi: 10.3390/s25154763.
2
The Application and Ethical Implication of Generative AI in Mental Health: Systematic Review.
JMIR Ment Health. 2025 Jun 27;12:e70610. doi: 10.2196/70610.
3
The Applications of Large Language Models in Mental Health: Scoping Review.
J Med Internet Res. 2025 May 5;27:e69284. doi: 10.2196/69284.
4
Responsible Design, Integration, and Use of Generative AI in Mental Health.
JMIR Ment Health. 2025 Jan 20;12:e70439. doi: 10.2196/70439.
5
An Ethical Perspective on the Democratization of Mental Health With Generative AI.
JMIR Ment Health. 2024 Oct 17;11:e58011. doi: 10.2196/58011.

References

1
Novel framework for dialogue summarization based on factual-statement fusion and dialogue segmentation.
PLoS One. 2024 Apr 16;19(4):e0302104. doi: 10.1371/journal.pone.0302104. eCollection 2024.
2
Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences.
BMC Med Inform Decis Mak. 2024 Mar 14;24(1):75. doi: 10.1186/s12911-024-02481-8.
3
Adapted large language models can outperform medical experts in clinical text summarization.
Nat Med. 2024 Apr;30(4):1134-1142. doi: 10.1038/s41591-024-02855-5. Epub 2024 Feb 27.
4
Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes.
Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:461-467. doi: 10.18653/v1/2023.bionlp-1.43.
5
Leveraging Summary Guidance on Medical Report Summarization.
IEEE J Biomed Health Inform. 2023 Oct;27(10):5066-5075. doi: 10.1109/JBHI.2023.3304376. Epub 2023 Oct 5.
6
Learning to Summarize Chinese Radiology Findings With a Pre-Trained Encoder.
IEEE Trans Biomed Eng. 2023 Dec;70(12):3277-3287. doi: 10.1109/TBME.2023.3280987. Epub 2023 Nov 21.
7
Exploring optimal granularity for extractive summarization of unstructured health records: Analysis of the largest multi-institutional archive of health records in Japan.
PLOS Digit Health. 2022 Sep 15;1(9):e0000099. doi: 10.1371/journal.pdig.0000099. eCollection 2022 Sep.
8
Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models.
Proc Int Conf Comput Ling. 2022 Oct;2022:2979-2991.
9
Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.
AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:605-614. eCollection 2021.
10
Knowledge-Infused Abstractive Summarization of Clinical Diagnostic Interviews: Framework Development Study.
JMIR Ment Health. 2021 May 10;8(5):e20865. doi: 10.2196/20865.