Understanding Mental Health Issues in Different Subdomains of Social Networking Services: Computational Analysis of Text-Based Reddit Posts.

BACKGROUND: Users increasingly use social networking services (SNSs) to share their feelings and emotions. For those with mental disorders, SNSs can also be used to seek advice on mental health issues. One available SNS is Reddit, in which users can freely discuss such matters on relevant health diagnostic subreddits.

OBJECTIVE: In this study, we analyzed the distinctive linguistic characteristics in users' posts on specific mental disorder subreddits (depression, anxiety, bipolar disorder, borderline personality disorder, schizophrenia, autism, and mental health) and further validated their distinctiveness externally by comparing them with posts of subreddits not related to mental illness. We also confirmed that these differences in linguistic formulations can be learned through a machine learning process.

METHODS: Reddit posts uploaded by users were collected for our research. We used various statistical analysis methods in Linguistic Inquiry and Word Count (LIWC) software, including 1-way ANOVA and subsequent post hoc tests, to see sentiment differences in various lexical features within mental health-related subreddits and against unrelated ones. We also applied 3 supervised and unsupervised clustering methods for both cases after extracting textual features from posts on each subreddit using bidirectional encoder representations from transformers (BERT) to ensure that our data set is suitable for further machine learning or deep learning tasks.

RESULTS: We collected 3,133,509 posts of 919,722 Reddit users. The results using the data indicated that there are notable linguistic differences among the subreddits, consistent with the findings of prior research. The findings from LIWC analyses revealed that patients with each mental health issue show significantly different lexical and semantic patterns, such as word count or emotion, throughout their online social networking activities, with P<.001 for all cases. Furthermore, distinctive features of each subreddit group were successfully identified through supervised and unsupervised clustering methods, using the BERT embeddings extracted from textual posts. This distinctiveness was reflected in the Davies-Bouldin scores ranging from 0.222 to 0.397 and the silhouette scores ranging from 0.639 to 0.803 in the former case, with scores of 1.638 and 0.729, respectively, in the latter case.

CONCLUSIONS: By taking a multifaceted approach, analyzing textual posts related to mental health issues using statistical, natural language processing, and machine learning techniques, our approach provides insights into aspects of recent lexical usage and information about the linguistic characteristics of patients with specific mental health issues, which can inform clinicians about patients' mental health in diagnostic terms to aid online intervention. Our findings can further promote research areas involving linguistic analysis and machine learning approaches for patients with mental health issues by identifying and detecting mentally vulnerable groups of people online.
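
ILLUSTRATION: The two cluster-validity measures reported in the results can be made concrete with a small sketch. The code below is not the authors' implementation; it is a minimal, standard-library version of the usual definitions of the silhouette score (higher means better-separated clusters, with a maximum of 1) and the Davies-Bouldin score (lower means better-separated clusters). The 2-dimensional toy points and their labels are assumptions made purely for illustration; in the study, the inputs would be BERT embeddings of posts grouped by subreddit.

```python
import math


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_score(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    clusters = group_by_label(points, labels)
    centers = {k: centroid(v) for k, v in clusters.items()}
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two compact, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

sil = silhouette_score(points, labels)
dbi = davies_bouldin_score(points, labels)
print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
```

On this toy input the clusters are tight and far apart, so the silhouette score is close to 1 and the Davies-Bouldin score is close to 0. Read against these definitions, the reported Davies-Bouldin scores of 0.222 to 0.397 indicate more compact, better-separated groupings than the reported score of 1.638.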
