Aladağ Ahmet Emre, Muderrisoglu Serra, Akbas Naz Berfu, Zahmacioglu Oguzhan, Bingol Haluk O
Department of Computer Engineering, Bogazici University, Istanbul, Turkey.
Amazon Research, Madrid, Spain.
J Med Internet Res. 2018 Jun 21;20(6):e215. doi: 10.2196/jmir.9840.
In 2016, 44,965 people in the United States died by suicide. It is common to see people with suicidal ideation seek help or leave suicide notes on social media before attempting suicide. Many prefer to express their feelings with longer passages on forums such as Reddit and blogs. Because these expressive posts follow regular language patterns, potential suicide attempts can be prevented by detecting suicidal posts as they are written.
This study aims to build a classifier that differentiates suicidal from nonsuicidal forum posts using text mining methods applied to post titles and bodies.
A total of 508,398 Reddit posts longer than 100 characters, posted between 2008 and 2016 on the SuicideWatch, Depression, Anxiety, and ShowerThoughts subreddits, were downloaded from the publicly available Reddit dataset. Of these, 10,785 posts were randomly selected and 785 were manually annotated as suicidal or nonsuicidal. Features were extracted from post titles and bodies using term frequency-inverse document frequency (TF-IDF), Linguistic Inquiry and Word Count (LIWC), and sentiment analysis. Logistic regression, random forest, and support vector machine (SVM) classification algorithms were applied to the resulting corpus, and prediction performance was evaluated.
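The TF-IDF step named above can be illustrated with a minimal sketch. It uses the textbook weighting (term frequency times the log of inverse document frequency), which is an assumption: the abstract does not state the exact variant, tokenizer, or library the authors used. The toy posts and the `tf_idf` helper are hypothetical, and the LIWC and sentiment features are not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumes whitespace tokenization and non-empty documents; this is the
    textbook tf * log(N / df) weighting, not necessarily the study's variant.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy posts, not taken from the study's data.
posts = [
    "i feel hopeless tonight",
    "i feel fine tonight",
    "shower thought about cereal",
]
vectors = tf_idf(posts)
for post, vector in zip(posts, vectors):
    print(post, "->", {term: round(weight, 3) for term, weight in vector.items()})
```

Terms that appear in only one post (such as "hopeless") receive a higher weight than terms shared across posts (such as "feel"), which is what makes the resulting vectors useful as classifier input.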
The logistic regression and SVM classifiers correctly identified the suicidality of posts with 80% to 92% accuracy and F1 score, depending on the data composition, closely followed by random forest. By comparison, the baseline ZeroR algorithm achieved 50% accuracy and a 66% F1 score.
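The ZeroR baseline figures can be reproduced with a short worked example. This sketch assumes a balanced evaluation set and a baseline that always predicts the suicidal class. Neither assumption is stated in the abstract, but together they give 50% accuracy and an F1 score of about two thirds, in line with the reported 50% and 66%. The `zero_r_scores` helper and the label counts are hypothetical.

```python
def zero_r_scores(labels, positive=1):
    """Score a baseline that always predicts the `positive` class.

    Assumes at least one positive label, so precision and recall are defined.
    """
    predictions = [positive] * len(labels)
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical balanced sample: 50 suicidal (1) and 50 nonsuicidal (0) posts.
labels = [1] * 50 + [0] * 50
accuracy, f1 = zero_r_scores(labels)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

With precision 0.5 and recall 1.0, F1 is 2 × 0.5 × 1.0 / 1.5 = 2/3, so any useful classifier has to clear that bar rather than merely exceed 50% accuracy.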
This study demonstrated that it is possible to detect people with suicidal ideation on online forums with high accuracy. The logistic regression classifier from this study could potentially be embedded in blogs and forums to decide whether to offer real-time online counseling when a suicidal post is being written.