Li Genghao, Li Bing, Huang Langlin, Hou Sibing
School of Information Technology & Management, University of International Business and Economics, Beijing, China.
Graduate School of Art and Science, Columbia University, New York, NY, United States.
JMIR Med Inform. 2020 Jun 23;8(6):e17650. doi: 10.2196/17650.
According to a World Health Organization report in 2017, almost 1 in every 20 people in China had depression. However, depression is usually difficult to diagnose clinically owing to slow observation, high cost, and patient resistance. Meanwhile, with the rapid emergence of social networking sites, people frequently share their daily lives and disclose their inner feelings online, which makes it possible to identify mental conditions from this rich text information. Much has been achieved with English web-based corpora, but in research in China so far, the extraction of language features from web-related depression signals is still at a relatively early stage.
The purpose of this study was to propose an effective approach for constructing a depression-domain lexicon. This lexicon will contain language features that could help identify social media users who potentially have depression. Our study also compared the performance of detection with and without our lexicon.
We automatically constructed a depression-domain lexicon using Word2Vec, a semantic relationship graph, and the label propagation algorithm. Combined, these methods performed well on a specific corpus during construction. The lexicon was built from 111,052 Weibo microblogs posted by 1868 users who were either depressed or nondepressed. For depression detection, we considered six features and used five classification methods to test detection performance.
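The construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy vectors stand in for Word2Vec embeddings, and the seed words, the similarity threshold, the iteration count, and the weighted-average update rule are all assumptions chosen only to show how labels can spread from seed words through a similarity graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(vectors, threshold=0.6):
    """Link every pair of words whose similarity reaches the (assumed) threshold."""
    words = list(vectors)
    graph = {w: {} for w in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


def propagate_labels(graph, seeds, iterations=20):
    """Spread seed scores (+1 depression-related, -1 unrelated) over the graph.

    Seed words stay clamped to their labels; every other word takes the
    similarity-weighted average of its neighbours' current scores.
    """
    scores = {w: seeds.get(w, 0.0) for w in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbours in graph.items():
            if word in seeds:
                updated[word] = seeds[word]
                continue
            total = sum(neighbours.values())
            updated[word] = (
                sum(wt * scores[n] for n, wt in neighbours.items()) / total
                if total
                else 0.0
            )
        scores = updated
    return scores


# Hypothetical toy embeddings; a real run would use vectors trained on the corpus.
vectors = {
    "sad": (1.0, 0.1),
    "hopeless": (0.9, 0.2),
    "insomnia": (0.8, 0.3),
    "happy": (0.1, 1.0),
    "picnic": (0.2, 0.9),
}
seeds = {"sad": 1.0, "happy": -1.0}  # hypothetical seed labels

graph = build_graph(vectors)
scores = propagate_labels(graph, seeds)
lexicon = sorted(w for w, s in scores.items() if s > 0)
print(lexicon)
```

In this toy setup, unlabeled words that sit close to the positive seed in the embedding space end up with positive scores and enter the lexicon, while words near the negative seed do not.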
The experimental results showed that, in terms of the F1 value, our autoconstruction method performed 1% to 6% better than baseline approaches and was more effective and more stable. When applied to detection models such as logistic regression and support vector machine, our lexicon improved model performance by 2% to 9% and raised the final accuracy of potential depression detection.
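For reference, the F1 value is conventionally defined as the harmonic mean of precision and recall. The symbols below (TP, FP, and FN for true positives, false positives, and false negatives) are the standard ones and are assumed here, since the abstract does not spell them out.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```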
Our depression-domain lexicon proved to be a meaningful input for classification algorithms, providing linguistic insights into the depressive status of test subjects. We believe that this lexicon will enhance early detection of depression among social media users. Future work will need to use a larger corpus and more complex methods.