Cohen Alex S, Rodriguez Zachary, Warren Kiara K, Cowan Tovah, Masucci Michael D, Edvard Granrud Ole, Holmlund Terje B, Chandler Chelsea, Foltz Peter W, Strauss Gregory P
Louisiana State University, Department of Psychology, Baton Rouge, LA, USA.
Louisiana State University, Center for Computation and Technology, Baton Rouge, LA, USA.
Schizophr Bull. 2022 Sep 1;48(5):939-948. doi: 10.1093/schbul/sbac051.
Despite decades of "proof of concept" findings supporting the use of Natural Language Processing (NLP) in psychosis research, clinical implementation has been slow. One obstacle is the lack of comprehensive psychometric evaluation of these measures. There is overwhelming evidence that criterion and content validity can be achieved for many purposes, particularly using machine learning procedures. However, there has been very little evaluation of test-retest reliability, divergent validity (sufficient to address concerns of a "generalized deficit"), and potential biases from demographics and other individual differences.
This article highlights these concerns in the development of an NLP measure for tracking clinically rated paranoia from video "selfies" recorded on smartphones. Patients with schizophrenia or bipolar disorder were recruited and tracked over a week-long epoch. A small NLP-based feature set from 499 language samples was modeled on clinically rated paranoia using regularized regression.
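As an illustration of what "regularized regression" means here, the following is a minimal sketch of ridge (L2-penalized) regression fitted by gradient descent. It is not the authors' model: the abstract does not say which penalty or software was used, and the three features, the coefficients, and the `fit_ridge` helper are all hypothetical. The data are synthetic and noiseless, chosen so the effect of the penalty is easy to see.

```python
from itertools import product


def fit_ridge(X, y, alpha, lr=0.1, steps=500):
    """Fit ridge regression by batch gradient descent.

    Minimizes (1/2n) * sum(residual^2) + (alpha/2) * sum(w^2).
    The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(steps):
        # Residuals under the current parameters.
        resid = [
            sum(wj * xj for wj, xj in zip(w, row)) + b - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the penalized loss with respect to each weight.
        grad_w = [
            sum(r * row[j] for r, row in zip(resid, X)) / n + alpha * w[j]
            for j in range(p)
        ]
        grad_b = sum(resid) / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Synthetic, standardized "language features" (hypothetical): a full
# two-level design, so the three columns are uncorrelated with mean zero.
X = [list(row) for row in product([-1.0, 1.0], repeat=3)]
# Synthetic "paranoia rating" (hypothetical): depends on the first two
# features only, plus a constant offset.
y = [2.0 * f1 - 1.0 * f2 + 0.5 for f1, f2, f3 in X]

w_unpenalized, b_unpenalized = fit_ridge(X, y, alpha=0.0)
w_penalized, b_penalized = fit_ridge(X, y, alpha=1.0)

print("alpha=0:", [round(v, 3) for v in w_unpenalized], round(b_unpenalized, 3))
print("alpha=1:", [round(v, 3) for v in w_penalized], round(b_penalized, 3))
```

With this orthogonal toy design, the penalty shrinks each weight toward zero by a factor of 1 / (1 + alpha), while the unpenalized intercept is unchanged. On real, correlated features the shrinkage is not uniform, and the penalty strength would normally be chosen by cross-validation.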
While test-retest reliability was high, criterion and convergent/divergent validity were achieved only when considering moderating variables, notably whether a patient was away from home, around strangers, or alone at the time of the recording. Moreover, there were systematic racial and sex biases in the model, in part reflecting whether patients submitted videos when they were away from home, around strangers, or alone.
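To make the role of a moderator concrete, the sketch below computes a criterion-validity correlation (model score against clinician rating) both pooled and within strata of a context variable. It is an assumption-laden illustration only: the records are synthetic, the `alone` flag stands in for any of the contextual moderators named above, and the abstract does not state which statistic the authors used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Synthetic records (hypothetical): (model_score, clinician_rating, alone).
records = [
    (1, 1, True), (2, 2, True), (3, 3, True), (4, 4, True),
    (1, 3, False), (2, 1, False), (3, 4, False), (4, 2, False),
]


def validity_by_stratum(records):
    """Return the pooled correlation and one correlation per moderator level."""
    pooled = pearson_r([r[0] for r in records], [r[1] for r in records])
    by_level = {}
    for level in (True, False):
        subset = [r for r in records if r[2] is level]
        by_level[level] = pearson_r(
            [r[0] for r in subset], [r[1] for r in subset]
        )
    return pooled, by_level


pooled, by_level = validity_by_stratum(records)
print("pooled r:", round(pooled, 3))
print("alone r:", round(by_level[True], 3))
print("not alone r:", round(by_level[False], 3))
```

In this toy data the pooled correlation is moderate, yet it is driven entirely by one stratum. That is the kind of pattern a moderator analysis is meant to expose. The same `pearson_r` helper could be applied to scores from two sessions as a simple test-retest index, although repeated-measures data would usually call for an intraclass correlation or a mixed model.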
Advancing NLP measures for psychosis will require deliberate consideration of test-retest reliability, divergent validity, systematic biases, and the potential role of moderators. In our example, a comprehensive psychometric evaluation revealed clear strengths and weaknesses that can be systematically addressed in future research.