Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions.

Authors

Tekumalla Ramya, Banda Juan M

Affiliation

Department of Computer Science, Georgia State University, Atlanta, GA USA.

Publication information

Neural Comput Appl. 2021 Oct 29:1-9. doi: 10.1007/s00521-021-06614-2.

DOI:10.1007/s00521-021-06614-2

PMID:34728902

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8554513/

Abstract

Twitter has been a remarkable resource for pharmacovigilance research over the last decade. Traditionally, rule- or lexicon-based methods have been used to automatically extract drug tweets for human annotation. Human annotation to create labeled sets for machine learning models is laborious, time-consuming, and not scalable. In this work, we demonstrate the feasibility of applying weak supervision (noisy labeling) to select drug data and of building machine learning models from large amounts of noisy labeled data instead of limited gold standard labeled sets. Our results show that models built with large amounts of noisy data achieve performance similar to that of models trained on limited gold standard datasets. Weak supervision therefore helps reduce reliance on manual annotation, allowing more data to be labeled easily and used in downstream machine learning applications, in this case drug mention identification.
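
To make the idea of noisy labeling concrete, the following is a minimal sketch of a lexicon-based labeling heuristic. It is an illustration only, not the authors' pipeline: the abstract does not describe the labeling rules or the lexicon used, so the terms in `DRUG_LEXICON`, the label constants, and the `noisy_label` function are all hypothetical.

```python
import re

# Hypothetical mini-lexicon (assumption): the abstract does not list the
# drug terms or the heuristics actually used in the paper.
DRUG_LEXICON = {"ibuprofen", "metformin", "xanax"}

DRUG, NOT_DRUG = 1, 0


def noisy_label(tweet: str) -> int:
    """Assign a noisy label: DRUG if any lexicon term occurs as a whole token."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    return DRUG if tokens & DRUG_LEXICON else NOT_DRUG


tweets = [
    "Took ibuprofen for my headache",
    "Great weather today",
]

# Pair each tweet with its noisy label; a downstream classifier could be
# trained on such pairs in place of manually annotated examples.
noisy_training_set = [(text, noisy_label(text)) for text in tweets]
```

Labels produced this way are noisy by construction: a tweet can mention a lexicon term without referring to the drug, and misspelled drug names are missed. The abstract's claim is that a large enough volume of such labels can still yield models comparable to those trained on small gold standard sets.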
