Eberhardt Steffen T, Vehlen Antonia, Schaffrath Jana, Schwartz Brian, Baur Tobias, Schiller Dominik, Hallmen Tobias, André Elisabeth, Lutz Wolfgang
Department of Psychology, Trier University, Trier, Germany.
Chair for Human-Centered Artificial Intelligence, Augsburg University, Wissenschaftspark 25+27, 54296, Trier, Germany.
Sci Rep. 2025 Aug 12;15(1):29541. doi: 10.1038/s41598-025-14923-y.
Rating scales have shaped psychological research but are resource-intensive and can burden participants. Large Language Models (LLMs) offer a tool to assess latent constructs in text. This study introduces LLM rating scales, which use LLM responses instead of human ratings. We demonstrate this approach with an LLM rating scale measuring patient engagement in therapy transcripts. Automatically transcribed videos of 1,131 sessions from 155 patients were analyzed using DISCOVER, a software framework for local multimodal human behavior analysis. A Llama 3.1 8B LLM rated 120 engagement items, and the top eight items were averaged into a total score. Psychometric evaluation showed a normal distribution, strong reliability (ω = 0.953), and acceptable fit (CFI = 0.968, SRMR = 0.022), except RMSEA = 0.108. Validity was supported by significant correlations with engagement determinants (e.g., motivation, r = .413), processes (e.g., between-session efforts, r = .390), and outcomes (e.g., symptoms, r = -.304). Results remained robust across bootstrap resampling and cross-validation, accounting for nested data. The LLM rating scale exhibited strong psychometric properties, demonstrating the potential of the approach as an assessment tool. Importantly, this automated approach uses interpretable items, ensuring a clear understanding of the measured constructs, while supporting local implementation and protecting confidential data.
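The scoring step described in the abstract (rating 120 items with an LLM, then averaging a retained subset of eight items into a total score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the item identifiers, the rating range, and the rule used to select the eight retained items are assumptions for demonstration purposes.

```python
# Hedged sketch of the "LLM rating scale" total score from the abstract:
# per-item LLM ratings are collected, and a pre-selected subset of the
# best-performing items is averaged into a single engagement score.
# Item IDs and the 1-5 rating range are hypothetical.

def total_score(item_ratings: dict[str, float], retained_items: list[str]) -> float:
    """Average the ratings of a pre-selected subset of items."""
    return sum(item_ratings[name] for name in retained_items) / len(retained_items)

# Hypothetical ratings for 120 items, mostly 3.0, one rated 5.0.
ratings = {f"item_{k:03d}": 3.0 for k in range(1, 121)}
ratings["item_001"] = 5.0

# Hypothetical set of eight retained items (the paper selects the top
# eight of 120 by psychometric performance).
retained = [f"item_{k:03d}" for k in range(1, 9)]

print(total_score(ratings, retained))  # (5.0 + 7 * 3.0) / 8 = 3.25
```

The selection of the eight items would in practice be driven by the psychometric evaluation (reliability and factor fit); the sketch only shows the aggregation once that subset is fixed.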