Department of Computer Science, Johannes Gutenberg University, Mainz, 55128, Germany.
Department of Biology, Johannes Gutenberg University, Institute of Molecular Biology, Mainz, 55128, Germany.
Pac Symp Biocomput. 2020;25:319-330.
Social media is a largely untapped source of information on the side effects of drugs. Twitter in particular is widely used to report on everyday events and personal ailments. However, labeling this noisy data is difficult because labeled training data is sparse and automatic labeling is error-prone. Crowdsourcing can help in such a scenario to obtain more reliable labels, but it is comparatively expensive because workers have to be paid. To remedy this, semi-supervised active learning may reduce the amount of labeled data needed and focus the manual labeling process on important information.
We extracted data from Twitter using the public API. We then used Amazon Mechanical Turk in combination with a state-of-the-art semi-supervised active learning method to label tweets with their associated drugs and side effects in two stages. Our results show that this method is an effective way of discovering side effects in tweets, with the F-measure improving from 53% to 67% compared to a one-stage workflow. Additionally, we show that the active learning scheme reduces the labeling cost in comparison to a non-active baseline.
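To illustrate how active learning can focus manual labeling on the most informative tweets, the following is a minimal sketch of plain uncertainty sampling. It is not the semi-supervised method used in this work. The function names, the tweet ids, and the predicted probabilities are all hypothetical and chosen only for illustration.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(pool_probs, batch_size):
    """Return the ids of the batch_size tweets the model is least sure about.

    pool_probs maps a tweet id to the classifier's predicted probability
    that the tweet mentions a side effect (hypothetical values below).
    """
    ranked = sorted(pool_probs, key=lambda tid: entropy(pool_probs[tid]), reverse=True)
    return ranked[:batch_size]


# Hypothetical unlabeled pool: tweet id -> predicted probability.
pool_probs = {"t1": 0.97, "t2": 0.52, "t3": 0.10, "t4": 0.45}

# The two tweets closest to 0.5 would be sent to crowd workers first.
print(select_for_labeling(pool_probs, 2))
```

In this sketch, confidently classified tweets such as `t1` are left unlabeled, which is the general mechanism by which an active scheme can lower labeling cost relative to labeling tweets at random.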
Code and data will be published on https://github.com/kramerlab.