Department of Computer Science, University of Colorado Boulder, United States; Institute of Cognitive Science, University of Colorado Boulder, United States.
Institute of Cognitive Science, University of Colorado Boulder, United States.
Psychiatry Res. 2024 Nov;341:116119. doi: 10.1016/j.psychres.2024.116119. Epub 2024 Aug 3.
Natural Language Processing (NLP) methods have shown promise for the assessment of formal thought disorder, a hallmark feature of schizophrenia in which disturbances to the structure, organization, or coherence of thought can manifest as disordered or incoherent speech. We investigated the suitability of modern Large Language Models (LLMs; e.g., GPT-3.5, GPT-4, and Llama 3) for predicting expert-generated ratings on three dimensions of thought disorder (coherence, content, and tangentiality) assigned to speech samples collected from both patients with a diagnosis of schizophrenia (n = 26) and healthy control participants (n = 25). In addition to (1) evaluating the accuracy of LLM-generated ratings relative to human experts, we also (2) investigated the degree to which the LLMs produced consistent ratings across multiple trials, and (3) sought to understand the factors that affected the consistency of LLM-generated output. We found that machine-generated ratings of the level of thought disorder in speech compared favorably with those of expert humans, and we identified a tradeoff between accuracy and consistency in LLM ratings. Unlike traditional NLP methods, LLMs were not always consistent in their predictions, but these inconsistencies could be mitigated with careful parameter selection and ensemble methods. We discuss implications for NLP-based assessment of thought disorder and recommend best practices for integrating these methods in the field of psychiatry.
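The abstract notes that trial-to-trial inconsistency in LLM ratings can be mitigated with ensemble methods. A minimal sketch of one such scheme, aggregating repeated rating trials by their median, is shown below; `rate_speech_sample` is a hypothetical stand-in for a real LLM call (stubbed here so the example runs without an API key), and the 0-4 rating scale and trial count are assumptions, not details from the paper.

```python
import statistics

def rate_speech_sample(sample: str, trial: int) -> int:
    """Hypothetical stub for an LLM call returning a 0-4
    thought-disorder rating. Deterministic here so the example
    is self-contained; a real call would sample from the model."""
    simulated_trials = [2, 3, 2, 2, 3]
    return simulated_trials[trial % len(simulated_trials)]

def ensemble_rating(sample: str, n_trials: int = 5) -> float:
    """Aggregate repeated rating trials with the median, which
    damps occasional outlier ratings from a stochastic model."""
    ratings = [rate_speech_sample(sample, t) for t in range(n_trials)]
    return statistics.median(ratings)

print(ensemble_rating("example speech transcript"))  # median of [2, 3, 2, 2, 3] -> 2
```

In practice the spread of the repeated ratings (e.g., their standard deviation) could also be reported as a per-sample consistency measure alongside the ensembled score.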