

Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions.

Author information

Eberhardt Steffen T, Vehlen Antonia, Schaffrath Jana, Schwartz Brian, Baur Tobias, Schiller Dominik, Hallmen Tobias, André Elisabeth, Lutz Wolfgang

Affiliations

Department of Psychology, Trier University, Wissenschaftspark 25+27, 54296 Trier, Germany.

Chair for Human-Centered Artificial Intelligence, Augsburg University, Augsburg, Germany.

Publication information

Sci Rep. 2025 Aug 12;15(1):29541. doi: 10.1038/s41598-025-14923-y.

Abstract

Rating scales have shaped psychological research, but are resource-intensive and can burden participants. Large Language Models (LLMs) offer a tool to assess latent constructs in text. This study introduces LLM rating scales, which use LLM responses instead of human ratings. We demonstrate this approach with an LLM rating scale measuring patient engagement in therapy transcripts. Automatically transcribed videos of 1,131 sessions from 155 patients were analyzed using DISCOVER, a software framework for local multimodal human behavior analysis. Llama 3.1 8B LLM rated 120 engagement items, averaging the top eight into a total score. Psychometric evaluation showed a normal distribution, strong reliability (ω = 0.953), and acceptable fit (CFI = 0.968, SRMR = 0.022), except RMSEA = 0.108. Validity was supported by significant correlations with engagement determinants (e.g., motivation, r = .413), processes (e.g., between-session efforts, r = .390), and outcomes (e.g., symptoms, r = - .304). Results remained robust across bootstrap resampling and cross-validation, accounting for nested data. The LLM rating scale exhibited strong psychometric properties, demonstrating the potential of the approach as an assessment tool. Importantly, this automated approach uses interpretable items, ensuring clear understanding of measured constructs, while supporting local implementation and protecting confidential data.
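As a rough illustration of the approach described above (not the authors' DISCOVER pipeline), the sketch below shows how transcript-level item ratings from a locally hosted Llama 3.1 8B model could be collected and averaged into a scale score. It assumes a local Ollama server at localhost:11434, a hypothetical 1-5 response format, and that the scale's items have already been chosen; in the study itself, the top eight of 120 items were selected psychometrically before averaging.

```python
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: a local Ollama server is running
MODEL = "llama3.1:8b"                           # local Llama 3.1 8B, as used in the paper

def rate_item(transcript: str, item: str) -> float:
    """Ask the local LLM to rate one engagement item for a session transcript.

    The prompt wording and the 1-5 anchors are illustrative, not the study's items.
    """
    prompt = (
        "You rate patient engagement in a psychotherapy session transcript.\n"
        f"Item: {item}\n"
        "Answer with a single number from 1 (not at all) to 5 (very much).\n\n"
        f"Transcript:\n{transcript}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    match = re.search(r"[1-5]", content)  # extract the first rating-like digit
    return float(match.group()) if match else float("nan")

def engagement_score(transcript: str, selected_items: list[str]) -> float:
    """Average the LLM ratings of a pre-selected item set into a total scale score.

    In the paper, the eight items entering this average were chosen via
    psychometric evaluation of all 120 candidate items; here the selection
    is simply passed in.
    """
    ratings = [rate_item(transcript, item) for item in selected_items]
    return sum(ratings) / len(ratings)
```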


Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/023f/12343941/50d7e47bcf68/41598_2025_14923_Fig1_HTML.jpg
