Tang Sunny X, Kriz Reno, Cho Sunghye, Park Suh Jung, Harowitz Jenna, Gur Raquel E, Bhati Mahendra T, Wolf Daniel H, Sedoc João, Liberman Mark Y
Zucker Hillside Hospital, Department of Psychiatry, 75-59 263rd St., Glen Oaks, NY, USA.
University of Pennsylvania, Department of Psychiatry, 3400 Spruce St, Gates Building, Philadelphia, PA, USA.
NPJ Schizophr. 2021 May 14;7(1):25. doi: 10.1038/s41537-021-00154-3.
Computerized natural language processing (NLP) allows for objective and sensitive detection of speech disturbance, a hallmark of schizophrenia spectrum disorders (SSD). We explored several methods for characterizing speech changes in participants with SSD (n = 20) compared to healthy control (HC) participants (n = 11) and approached linguistic phenotyping on three levels: individual words, parts-of-speech (POS), and sentence-level coherence. NLP features were compared with a clinical gold standard, the Scale for the Assessment of Thought, Language and Communication (TLC). We utilized Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art embedding algorithm incorporating bidirectional context. Through the POS approach, we found that participants with SSD used more pronouns but fewer adverbs, adjectives, and determiners (e.g., "the," "a"). Analysis of individual word usage was notable for more frequent use of first-person singular pronouns among individuals with SSD and of first-person plural pronouns among HC. There was a striking increase in incomplete words among participants with SSD. Sentence-level analysis using BERT reflected increased tangentiality in the SSD group, with greater sentence embedding distances. The SSD sample had low speech disturbance on average, and there was no difference in group means for TLC scores. However, NLP measures of language disturbance appear to be sensitive to these subclinical differences and showed greater ability to discriminate between HC and SSD than a model based on clinical ratings alone. These intriguing exploratory results from a small sample prompt further inquiry into NLP methods for characterizing language disturbance in SSD and suggest that NLP measures may yield clinically relevant and informative biomarkers.
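The sentence-level measure described in the abstract (greater distances between sentence embeddings as a marker of tangentiality) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state the distance metric, how BERT outputs were pooled into sentence vectors, or which sentence pairs were compared. The sketch assumes cosine distance between consecutive sentences and takes precomputed embedding vectors as input; the function names `cosine_distance` and `mean_adjacent_distance` and the toy two-dimensional vectors are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_adjacent_distance(embeddings: List[Sequence[float]]) -> float:
    """Average cosine distance between each sentence and the next one.

    Assumption: one embedding per sentence, in spoken order. In practice
    these would come from a model such as BERT; here they are supplied
    by the caller so the sketch has no external dependencies.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two sentence embeddings")
    distances = [
        cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])
    ]
    return sum(distances) / len(distances)


# Toy vectors (hypothetical, not real BERT output).
on_topic = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # sentences stay aligned
drifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # each sentence shifts topic

on_topic_score = mean_adjacent_distance(on_topic)
drifting_score = mean_adjacent_distance(drifting)
```

Under these assumptions, a transcript whose consecutive sentences point in similar directions scores near 0, while one that shifts direction at every sentence scores higher. Real sentence embeddings are high-dimensional, so observed differences between groups would be far less clear-cut than in this toy case.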