Natural Language Processing for Depression Prediction on Sina Weibo: Method Study and Analysis

Affiliation

Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University, Lanzhou, China.

Publication information

JMIR Ment Health. 2024 Sep 4;11:e58259. doi: 10.2196/58259.

DOI:10.2196/58259

PMID:39233477

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11391090/

Abstract

BACKGROUND

Depression represents a pressing global public health concern, impacting the physical and mental well-being of hundreds of millions worldwide. Notwithstanding advances in clinical practice, an alarming number of individuals at risk for depression continue to face significant barriers to timely diagnosis and effective treatment, thereby exacerbating a burgeoning social health crisis.

OBJECTIVE

This study seeks to develop a novel online depression risk detection method using natural language processing technology to identify individuals at risk of depression on the Chinese social media platform Sina Weibo.

METHODS

First, we collected approximately 527,333 posts publicly shared over 1 year from 1600 individuals with depression and 1600 individuals without depression on the Sina Weibo platform. We then developed a hierarchical transformer network for learning user-level semantic representations, which consists of 3 primary components: a word-level encoder, a post-level encoder, and a semantic aggregation encoder. The word-level encoder learns semantic embeddings from individual posts, while the post-level encoder explores features in user post sequences. The semantic aggregation encoder aggregates post sequence semantics to generate a user-level semantic representation that can be classified as depressed or nondepressed. Next, a classifier is employed to predict the risk of depression. Finally, we conducted statistical and linguistic analyses of the post content from individuals with and without depression using the Chinese Linguistic Inquiry and Word Count.
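The three-stage data flow described above (words to post vectors, post vectors to context-aware post vectors, post vectors to one user vector, then a classifier) can be illustrated with a minimal sketch. This is not the authors' model: it replaces the trained transformer encoders with parameter-free attention in plain Python, and every name and value here (`TOY_EMBEDDINGS`, `word_query`, `post_query`, the classifier weights) is a hypothetical placeholder chosen only to make the example run.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def attention_pool(vectors: List[Vector], query: Vector) -> Vector:
    """Collapse a sequence of vectors into one, weighting each by its match to `query`."""
    weights = softmax([dot(v, query) for v in vectors])
    return weighted_sum(weights, vectors)


def self_attention(vectors: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with no learned projections."""
    scale = math.sqrt(len(vectors[0]))
    contextual = []
    for v in vectors:
        weights = softmax([dot(v, other) / scale for other in vectors])
        contextual.append(weighted_sum(weights, vectors))
    return contextual


def encode_user(posts: List[List[str]], embeddings: Dict[str, Vector],
                word_query: Vector, post_query: Vector) -> Vector:
    # Stage 1 (word level): each post's tokens become one post vector.
    # Assumes every token has an entry in `embeddings`.
    post_vectors = [attention_pool([embeddings[t] for t in post], word_query)
                    for post in posts]
    # Stage 2 (post level): each post vector is updated using the other posts.
    contextual_posts = self_attention(post_vectors)
    # Stage 3 (aggregation): the post sequence becomes one user-level vector.
    return attention_pool(contextual_posts, post_query)


def depression_risk(user_vector: Vector, weights: Vector, bias: float) -> float:
    """Logistic classifier on the user vector; returns a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(dot(user_vector, weights) + bias)))


# Hypothetical 2-dimensional embeddings and two already-tokenized posts.
TOY_EMBEDDINGS = {
    "tired": [0.9, 0.1], "sad": [0.8, 0.2],
    "happy": [0.1, 0.9], "walk": [0.3, 0.6],
}
posts = [["tired", "sad"], ["happy", "walk"]]
user_vector = encode_user(posts, TOY_EMBEDDINGS,
                          word_query=[1.0, 0.0], post_query=[1.0, 0.0])
risk = depression_risk(user_vector, weights=[1.0, -1.0], bias=0.0)
print(user_vector, risk)
```

In a trained model the query vectors, the attention projections, and the classifier weights would all be learned from the labeled users; here they are fixed by hand so that only the hierarchy itself is on display.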

RESULTS

We divided the original data set into training, validation, and test sets. The training set consisted of 1000 individuals with depression and 1000 individuals without depression. The validation and test sets each comprised 600 users, with 300 individuals from each cohort (depression and nondepression). Without sampling techniques, our method achieved an accuracy of 84.62%, precision of 84.43%, recall of 84.50%, and F1-score of 84.32% on the test set. With our proposed retrieval-based sampling strategy, performance improved markedly: an accuracy of 95.46%, precision of 95.30%, recall of 95.70%, and F1-score of 95.43%. These results demonstrate the effectiveness of our proposed depression risk detection model and retrieval-based sampling technique and provide new insights for large-scale depression detection through social media. Through language behavior analysis, we discovered that individuals with depression are more likely to use negation words (the value of "swear" is 0.001253). This may indicate the presence of negative emotions, rejection, doubt, disagreement, or aversion in individuals with depression. Additionally, our analysis revealed that individuals with depression tend to use negative emotional vocabulary in their expressions ("NegEmo": 0.022306; "Anx": 0.003829; "Anger": 0.004327; "Sad": 0.005740), which may reflect their internal negative emotions and psychological state. This frequent use of negative vocabulary could be a way for individuals with depression to express negative feelings toward life, themselves, or their surrounding environment.
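For readers less familiar with the four reported metrics, the sketch below shows how accuracy, precision, recall, and F1-score are derived from confusion-matrix counts, and checks that the stated split sizes add up to 1600 users per cohort. The confusion counts are hypothetical values for a 600-user test set, not figures from the study, and the function computes metrics for the depression class only; the abstract does not state which averaging scheme the authors used.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Binary-classification metrics for the positive (depression) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Split sizes per cohort as stated in the abstract: train, validation, test.
split_per_cohort = {"train": 1000, "validation": 300, "test": 300}
users_per_cohort = sum(split_per_cohort.values())  # 1600

# Hypothetical counts on a balanced 600-user test set (300 per cohort).
metrics = classification_metrics(tp=255, fp=48, fn=45, tn=252)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, a single-class F1 always lies between the two; a reported F1 below both values, as in the 84.32% figure, would be consistent with averaging across classes or runs, though the abstract does not say.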

CONCLUSIONS

The research results indicate the feasibility and effectiveness of using deep learning methods to detect the risk of depression. These findings provide insights into the potential for large-scale, automated, and noninvasive prediction of depression among online social media users.
