Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts.
Author information
Cocos Anne, Fiks Alexander G, Masino Aaron J
Affiliation
Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, USA.
Publication information
J Am Med Inform Assoc. 2017 Jul 1;24(4):813-821. doi: 10.1093/jamia/ocw180.
OBJECTIVE
Social media is an important pharmacovigilance data source for adverse drug reaction (ADR) identification. Human review of social media data is infeasible because of its volume, so natural language processing techniques are necessary. Social media includes informal vocabulary and irregular grammar, which challenge natural language processing methods. Our objective is to develop a scalable, deep-learning approach that exceeds state-of-the-art ADR detection performance in social media.
MATERIALS AND METHODS
We developed a recurrent neural network (RNN) model that labels words in an input sequence with ADR membership tags. The only input features are word-embedding vectors, which can be formed through task-independent pretraining or during ADR detection training.
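The labeling step described above can be illustrated with a minimal sketch. It assumes a single-layer, unidirectional RNN with random, untrained weights, a hypothetical two-tag set ("O", "ADR"), and toy random embeddings; the abstract does not specify the architecture details, tag scheme, or dimensions, so every name and size below is illustrative rather than the authors' implementation.

```python
import math
import random

# Hypothetical tag set: the abstract does not list the actual labels.
TAGS = ["O", "ADR"]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def tag_sequence(tokens, embeddings, W_xh, W_hh, W_hy):
    """Assign one tag per token; word embeddings are the only input features."""
    hidden = [0.0] * len(W_hh)
    unk = [0.0] * len(W_xh[0])  # fallback for out-of-vocabulary words
    labels = []
    for tok in tokens:
        x = embeddings.get(tok.lower(), unk)
        # The hidden state carries context from the words seen so far.
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]
        scores = matvec(W_hy, hidden)
        labels.append(TAGS[scores.index(max(scores))])
    return labels

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 8, 16  # illustrative sizes only
tokens = "this drug gave me a headache".split()
# Toy embeddings; in practice these would be pretrained or learned in training.
embeddings = {t: [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for t in tokens}
W_xh = rand_matrix(HIDDEN_DIM, EMBED_DIM, rng)
W_hh = rand_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
W_hy = rand_matrix(len(TAGS), HIDDEN_DIM, rng)

labels = tag_sequence(tokens, embeddings, W_xh, W_hh, W_hy)
print(list(zip(tokens, labels)))
```

Because the weights are untrained, the printed tags are arbitrary; the sketch only shows how a per-word tag is derived from embeddings plus a recurrent hidden state.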
RESULTS
Our best-performing RNN model used pretrained word embeddings created from a large, non-domain-specific Twitter dataset. It achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random field model. Feature analysis indicated that semantic information in pretrained word embeddings boosted sensitivity and, combined with contextual awareness captured in the RNN, precision.
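As a rough illustration of how an approximate match F-measure might be computed, the sketch below assumes that a predicted ADR span counts as correct when it overlaps a gold span by at least one token. The abstract does not define the matching criterion, so this overlap rule, the "ADR"/"O" tags, and the function names are assumptions.

```python
def spans_from_tags(tags, positive="ADR"):
    """Collapse runs of the positive tag into half-open (start, end) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == positive and start is None:
            start = i
        elif tag != positive and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def overlaps(a, b):
    """True when two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def approximate_match_f1(gold_tags, pred_tags):
    """Return (precision, recall, F1) under the assumed overlap criterion."""
    gold = spans_from_tags(gold_tags)
    pred = spans_from_tags(pred_tags)
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: one of two gold spans is partially recovered,
# and one of two predicted spans is spurious.
gold = ["O", "O", "ADR", "ADR", "O", "O", "ADR"]
pred = ["O", "O", "O", "ADR", "O", "ADR", "O"]
result = approximate_match_f1(gold, pred)
print(result)  # (0.5, 0.5, 0.5)
```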
DISCUSSION
Our model required no task-specific feature engineering, suggesting generalizability to additional sequence-labeling tasks. Learning curve analysis showed that our model reached optimal performance with fewer training examples than the other models.
CONCLUSION
ADR detection performance in social media is significantly improved by using a contextually aware model and word embeddings formed from large, unlabeled datasets. The approach reduces manual data-labeling requirements and is scalable to large social media datasets.