Lao Cecilia, Lane Jo, Suominen Hanna
School of Computing, College of Engineering and Computer Science, The Australian National University, Canberra, ACT, Australia.
National Centre for Epidemiology and Population Health, College of Health and Medicine, The Australian National University, Canberra, ACT, Australia.
JMIR Form Res. 2022 Aug 30;6(8):e35563. doi: 10.2196/35563.
Effective suicide risk assessments and interventions are vital for suicide prevention. Although assessing such risks is best done by health care professionals, people experiencing suicidal ideation may not seek help. Hence, machine learning (ML) and computational linguistics can provide analytical tools for understanding and analyzing risks, which in turn can facilitate suicide intervention and prevention.
This study aims to explore, using statistical analyses and ML, whether computerized language analysis could be applied to assess and better understand a person's suicide risk on social media.
We used the University of Maryland Suicidality Dataset comprising text posts written by users (N=866) of mental health-related forums on Reddit. Each user was assigned a suicide risk rating (no, low, moderate, or severe) by either medical experts or crowdsourced annotators, denoting their estimated likelihood of dying by suicide. In the language analysis, the Linguistic Inquiry and Word Count lexicon was used to assess sentiment, thinking styles, and parts of speech, whereas readability was explored using the TextStat library. The Mann-Whitney U test identified differences between at-risk (low, moderate, and severe risk) and no-risk users. Meanwhile, the Kruskal-Wallis test and the Spearman correlation coefficient were used for granular analysis between risk levels and to identify redundancy, respectively. In the ML experiments, gradient boost, random forest, and support vector machine models were trained using 10-fold cross-validation. The area under the receiver operating characteristic curve and the F-score were the primary measures. Finally, permutation importance uncovered the features that contributed the most to each model's decision-making.
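The statistical workflow described above can be sketched in Python with scipy. The data below are synthetic stand-ins (the group sizes mirror the study's 195 no-risk and 671 at-risk users, but the feature values are simulated), so this illustrates only the shape of the analysis, not the study's actual LIWC or TextStat features.

```python
# Sketch of the statistical analyses on synthetic per-user feature values.
import numpy as np
from scipy.stats import mannwhitneyu, kruskal, spearmanr

rng = np.random.default_rng(0)

# Hypothetical linguistic feature (e.g., first-person-pronoun rate) per group.
no_risk = rng.normal(4.0, 1.0, 195)
low = rng.normal(4.5, 1.0, 250)
moderate = rng.normal(5.0, 1.0, 250)
severe = rng.normal(5.5, 1.0, 171)
at_risk = np.concatenate([low, moderate, severe])

# At-risk vs no-risk users: two-sided Mann-Whitney U test.
u_stat, u_p = mannwhitneyu(at_risk, no_risk, alternative="two-sided")

# Granular comparison across the three at-risk levels: Kruskal-Wallis test.
h_stat, h_p = kruskal(low, moderate, severe)

# Redundancy between two features: Spearman rank correlation.
feature_a = rng.normal(0, 1, 866)
feature_b = 0.9 * feature_a + rng.normal(0, 0.3, 866)  # correlated by construction
rho, rho_p = spearmanr(feature_a, feature_b)

print(f"Mann-Whitney U p={u_p:.2g}, Kruskal-Wallis p={h_p:.2g}, rho={rho:.2f}")
```

A high Spearman ρ between two features, as in the constructed pair above, signals redundancy and motivates aggregating them rather than treating them as independent predictors.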
Statistically significant differences (P&lt;.05) were identified between the at-risk (671/866, 77.5%) and no-risk (195/866, 22.5%) groups. This was true for both the crowd- and expert-annotated samples. Overall, at-risk users had higher median values for most variables (authenticity, first-person pronouns, and negation), with the notable exception of clout, which indicated that at-risk users were less likely to engage in social posturing. A high positive correlation (ρ&gt;0.84) was present between the part-of-speech variables, which implied redundancy and demonstrated the utility of aggregate features. All ML models performed similarly in their area under the curve (0.66-0.68); however, the random forest and gradient boost models were noticeably better in their F-scores (0.65 and 0.62) than the support vector machine (0.52). The features that contributed the most to the ML models were authenticity, clout, and negative emotions.
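The classification and feature-attribution setup can be sketched with scikit-learn: a random forest evaluated via 10-fold cross-validation on ROC AUC and F-score, followed by permutation importance. The data and feature names below are illustrative assumptions, not the study's actual dataset or LIWC variables, so the printed scores will not reproduce the reported results.

```python
# Minimal sketch: 10-fold cross-validated random forest + permutation importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_validate

# Synthetic binary task: 866 "users" with a class imbalance roughly
# matching the study (22.5% no risk, 77.5% at risk).
X, y = make_classification(n_samples=866, n_features=6, n_informative=3,
                           weights=[0.225, 0.775], random_state=42)
feature_names = ["authenticity", "clout", "negative_emotion",
                 "first_person", "negation", "readability"]  # hypothetical

model = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_validate(model, X, y, cv=10, scoring=["roc_auc", "f1"])
print(f"AUC={scores['test_roc_auc'].mean():.2f}, "
      f"F1={scores['test_f1'].mean():.2f}")

# Permutation importance: shuffle each feature and measure the score drop,
# revealing which features drive the model's decisions.
model.fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
ranked = sorted(zip(feature_names, result.importances_mean),
                key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Permutation importance is model-agnostic, which makes it suitable for comparing feature contributions across the gradient boost, random forest, and support vector machine models on equal footing.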
In summary, our statistical analyses found linguistic features associated with suicide risk, such as social posturing (eg, authenticity and clout), first-person singular pronouns, and negation. These findings increased our understanding of the behavioral and thought patterns of social media users and provided insights into the mechanisms behind the ML models. We also demonstrated the potential of ML to assist health care professionals in assessing and managing individuals experiencing suicide risk.