

Detecting Potentially Harmful and Protective Suicide-Related Content on Twitter: Machine Learning Approach.

Affiliations

Section for the Science of Complex Systems, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria.

Unit Suicide Research and Mental Health Promotion, Center for Public Health, Medical University of Vienna, Vienna, Austria.

Publication Information

J Med Internet Res. 2022 Aug 17;24(8):e34705. doi: 10.2196/34705.

Abstract

BACKGROUND

Research has repeatedly shown that exposure to suicide-related news media content is associated with suicide rates, with some content characteristics likely having harmful and others potentially protective effects. Although good evidence exists for a few selected characteristics, systematic and large-scale investigations are lacking. Moreover, the growing importance of social media, particularly among young adults, calls for studies on the effects of the content posted on these platforms.

OBJECTIVE

This study applies natural language processing and machine learning methods to classify large quantities of social media data according to characteristics identified as potentially harmful or beneficial in media effects research on suicide and prevention.

METHODS

We manually labeled 3202 English tweets using a novel annotation scheme that classifies suicide-related tweets into 12 categories. Based on these categories, we trained a benchmark of machine learning models for a multiclass and a binary classification task. As models, we included a majority classifier, an approach based on word frequency (term frequency-inverse document frequency with a linear support vector machine) and 2 state-of-the-art deep learning models (Bidirectional Encoder Representations from Transformers [BERT] and XLNet). The first task classified posts into 6 main content categories, which are particularly relevant for suicide prevention based on previous evidence. These included personal stories of either suicidal ideation and attempts or coping and recovery, calls for action intending to spread either problem awareness or prevention-related information, reporting of suicide cases, and other tweets irrelevant to these 5 categories. The second classification task was binary and separated posts in the 11 categories referring to actual suicide from posts in the off-topic category, which use suicide-related terms in another meaning or context.
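For illustration, the sketch below shows the kind of word-frequency baseline described above: term frequency-inverse document frequency features fed into a linear support vector machine. It assumes scikit-learn and uses a few invented placeholder tweets and label names; it is not the authors' annotation scheme, corpus, or exact pipeline, and the transformer models (BERT and XLNet) would be fine-tuned separately on the same labeled data.

```python
# Minimal sketch of a TF-IDF + linear SVM tweet classifier, assuming scikit-learn.
# The tweets and label names below are invented placeholders, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "I attempted years ago and slowly found my way back to life",
    "If you are struggling, please reach out to a crisis helpline",
    "A person died by suicide in the city yesterday, police report",
    "That deadline nearly killed me, this job is suicide",
]
labels = ["coping_recovery", "prevention_info", "suicide_case_report", "off_topic"]

# TF-IDF turns each tweet into a sparse vector of weighted word and bigram counts;
# LinearSVC then fits one-vs-rest linear decision boundaries over those vectors.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
baseline.fit(tweets, labels)

print(baseline.predict(["Please share this helpline number with anyone who needs it"]))
```

In the actual benchmark, such a baseline would be trained on the 3202 manually labeled tweets and compared against the fine-tuned deep learning models on a held-out test set.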

RESULTS

In both tasks, the performance of the 2 deep learning models was very similar and better than that of the majority or the word frequency classifier. BERT and XLNet reached accuracy scores above 73% on average across the 6 main categories in the test set and F-scores between 0.69 and 0.85 for all but the suicidal ideation and attempts category (F=0.55). In the binary classification task, they correctly labeled around 88% of the tweets as about suicide versus off-topic, with BERT achieving F-scores of 0.93 and 0.74, respectively. These classification performances were similar to human performance in most cases and were comparable with state-of-the-art models on similar tasks.
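To make the reported metrics concrete, the short sketch below computes accuracy and per-class F-scores for the binary about-suicide versus off-topic task, again assuming scikit-learn; the label vectors are invented toy data, not the study's predictions.

```python
# Toy illustration of the metrics reported above, assuming scikit-learn.
# 1 = tweet about actual suicide, 0 = off-topic use of suicide-related terms.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0]   # invented gold labels
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]   # invented model predictions

print(accuracy_score(y_true, y_pred))          # share of correctly labeled tweets
print(f1_score(y_true, y_pred, pos_label=1))   # F-score for the about-suicide class
print(f1_score(y_true, y_pred, pos_label=0))   # F-score for the off-topic class
```

Reporting the F-score (the harmonic mean of precision and recall) separately for each class, as in the 0.93 and 0.74 figures above, guards against an imbalanced class distribution making accuracy alone look deceptively high.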

CONCLUSIONS

The achieved performance scores highlight machine learning as a useful tool for media effects research on suicide. The clear advantage of BERT and XLNet suggests that there is crucial information about meaning in the context of words beyond mere word frequencies in tweets about suicide. By making data labeling more efficient, this work has enabled large-scale investigations on harmful and protective associations of social media content with suicide rates and help-seeking behavior.

