Kim-Dufor Deok-Hee, Walter Michel, Krebs Marie-Odile, Haralambous Yannis, Lenca Philippe, Lemey Christophe
Limics, Sorbonne Université, Université Sorbonne Paris-Nord, INSERM, Paris, France.
Unité de Recherche Clinique en Psychiatrie (URCP), Department of Psychiatry, Centre Hospitalier Universitaire (CHU) de Brest, Brest, France.
Front Psychiatry. 2025 Jun 16;16:1595197. doi: 10.3389/fpsyt.2025.1595197. eCollection 2025.
Peculiar use of language and even language deficits are among the well-known signs of schizophrenia. Different language features analyzed using natural language processing and machine learning have been reported to differentiate patients at ultra-high risk for psychosis. However, it has not always been explained how, and to what extent, those linguistic markers distinguish patients. This study aims to find relevant linguistic markers for classifying patients at ultra-high risk and to explain how the detected markers contribute to the classification.
The first consultations of 68 patients (15 not-at-risk patients, 45 at-risk patients, and 8 patients with first episode psychosis) with a psychiatrist were recorded, transcribed verbatim, and annotated for analysis using natural language processing. A gradient-boosted decision tree algorithm was tested to evaluate its potential to correctly classify the three categories of patients and to find relevant linguistic markers at the level of lexical richness, semantic coherence, speech disfluency, and syntactic complexity. The Synthetic Minority Oversampling Technique (SMOTE) was used to handle the imbalanced data, and SHapley Additive exPlanations (SHAP) values were computed to measure feature importance and each feature's contribution to the classification.
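The oversampling step can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the two-dimensional example points are assumptions made for illustration; the abstract does not state which implementation or settings the authors used.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length tuples of floats (at least two points).
    k: number of nearest minority neighbours considered for each base point.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical feature vectors for a small minority class (illustrative values only)
example_minority = [(0.2, 1.0), (0.3, 1.2), (0.1, 0.9), (0.4, 1.1)]
example_synthetic = smote_sketch(example_minority, n_new=5, k=2)
```

Because every synthetic point is a convex combination of two real minority points, it always stays within the range already covered by that class. Oversampling of this kind is normally applied to the training data only, so that evaluation samples remain real observations; whether the study did so is not stated in the abstract.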
The model yielded good performance (0.82 accuracy, 0.82 F2-score, 0.85 precision, 0.82 recall, and 0.86 ROC-AUC score) based on four linguistic variables that concern weak coherence, the use of "I," and filled pauses.
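For readers unfamiliar with the F2-score, it is the F-beta score with beta = 2, which weights recall more heavily than precision. The short sketch below shows the standard formula; the helper name `f_beta` is an assumption for illustration and is not taken from the study.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favours recall; beta = 2 gives the F2-score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Applying the formula to the reported precision and recall
f2_from_reported = f_beta(0.85, 0.82, beta=2.0)
```

With the reported precision (0.85) and recall (0.82), the formula gives roughly 0.826, close to the reported F2-score of 0.82. An exact match is not expected if the published metrics were averaged over the three patient classes, which the abstract does not specify.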
The findings in this study suggest that weak coherence plays a key role in classification. No significant differences in the use of "I" and filled pauses were found between groups using a statistical test, but an explainability model showed their different contributions. The contribution of each linguistic feature to the classification by patient group provided deeper insight into the linguistic manifestations of each patient group and their subtle differences, which could help better analyze and understand patients' language behaviors.
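The idea of a per-feature contribution can be made concrete with an exact Shapley value computation on a toy example. A feature's Shapley value is its marginal effect on the model score averaged over all subsets of the other features, which is why a feature can contribute to a classification even when a group-level test finds no significant difference. The feature names below echo the abstract, but the value function `toy_value` and every number in it are hypothetical and not taken from the study; libraries such as SHAP use efficient algorithms for tree models rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

FEATURES = ["weak_coherence", "first_person_I", "filled_pauses"]


def toy_value(subset):
    """Hypothetical model score when only the features in `subset` are known."""
    score = 0.0
    if "weak_coherence" in subset:
        score += 0.4
    if "first_person_I" in subset:
        score += 0.1
    if "filled_pauses" in subset:
        score += 0.1
    # Made-up interaction: the two features matter more when seen together
    if {"first_person_I", "filled_pauses"} <= subset:
        score += 0.2
    return score


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


contributions = shapley_values(toy_value, FEATURES)
```

In this toy setting the interaction term is split evenly between the two interacting features, and the contributions sum to the difference between the score with all features and the score with none.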