Portable automatic text classification for adverse drug reaction detection via multi-corpus training.

Authors

Sarker Abeed, Gonzalez Graciela

Affiliation

Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd., Scottsdale, AZ 85259, USA.

Publication information

J Biomed Inform. 2015 Feb;53:196-207. doi: 10.1016/j.jbi.2014.11.002. Epub 2014 Nov 8.

DOI:10.1016/j.jbi.2014.11.002

PMID:25451103

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4355323/

Abstract

OBJECTIVE

Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available; these data have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and to use those features in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate whether combining training data from distinct corpora can improve automatic classification accuracies.

METHODS

One of our three data sets contains annotated sentences from clinical reports; the other two, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in an attempt to boost classification accuracies.
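The pooling idea described above can be illustrated with a minimal sketch. It assumes a toy feature extractor (unigram counts plus a single lexicon-based polarity count) and toy example sentences; the lexicon, function names, and data are hypothetical and are not taken from the paper, whose actual feature set is much larger. No classifier is trained here; the sketch only shows that a shared feature representation lets examples from different corpora be merged into one training set.

```python
from collections import Counter

# Hypothetical toy lexicon; the abstract does not list the actual
# sentiment or polarity resources used by the authors.
NEGATIVE_WORDS = {"pain", "nausea", "dizzy", "rash", "headache"}


def extract_features(text):
    """Map a short text segment to a sparse feature dict:
    unigram counts plus one toy negative-polarity count."""
    tokens = text.lower().split()
    features = Counter(f"word={token}" for token in tokens)
    features["polarity_negative"] = sum(token in NEGATIVE_WORDS for token in tokens)
    return dict(features)


def combine_corpora(*corpora):
    """Pool (text, label) pairs from several corpora into one list of
    (features, label) pairs, using the same feature extractor for all."""
    return [
        (extract_features(text), label)
        for corpus in corpora
        for text, label in corpus
    ]


# Toy stand-ins for a clinical-report corpus and a social media corpus
# (label 1 = ADR mention, label 0 = no ADR mention).
clinical = [
    ("patient developed rash after penicillin", 1),
    ("dose was increased", 0),
]
social = [
    ("this med gives me nausea", 1),
    ("picked up my refill today", 0),
]

training_set = combine_corpora(clinical, social)
print(len(training_set))
```

Because every example passes through the same `extract_features` function, the pooled list could be handed to any classifier that accepts sparse feature dictionaries.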

RESULTS

Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538, and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the two in-house data sets to 0.597 (an improvement of 0.059, or 5.9 points) and 0.704 (an improvement of 0.026, or 2.6 points), respectively.
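For readers unfamiliar with the metric, the sketch below shows how a per-class F-score for the ADR class can be computed, and checks the reported improvements arithmetically. It assumes the standard F1 definition (harmonic mean of precision and recall), which the abstract does not spell out; the gold and predicted labels are made-up toy values, not data from the study.

```python
def adr_f_score(gold, predicted, positive=1):
    """F1 for the positive (ADR) class: harmonic mean of precision and recall.
    Assumes the standard F1 definition; labels are compared pairwise."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = ADR, 0 = non-ADR): 2 true positives, 1 false positive,
# 1 false negative, so precision = recall = 2/3.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
toy_f = adr_f_score(gold, predicted)

# Improvements reported in the abstract, expressed in points (F-score x 100).
gain_first = round((0.597 - 0.538) * 100, 1)
gain_second = round((0.704 - 0.678) * 100, 1)
print(toy_f, gain_first, gain_second)
```

Scoring only the ADR class, rather than overall accuracy, matters when the positive class is rare, since a classifier that labels everything as non-ADR would still achieve high accuracy.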

CONCLUSIONS

Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future.

Similar articles

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features.

J Am Med Inform Assoc. 2015 May;22(3):671-81. doi: 10.1093/jamia/ocu041. Epub 2015 Mar 9.

Filtering big data from social media--Building an early warning system for adverse drug reactions.

J Biomed Inform. 2015 Apr;54:230-40. doi: 10.1016/j.jbi.2015.01.011. Epub 2015 Feb 14.

Utilizing social media data for pharmacovigilance: A review.

J Biomed Inform. 2015 Apr;54:202-12. doi: 10.1016/j.jbi.2015.02.004. Epub 2015 Feb 23.

SOCIAL MEDIA MINING SHARED TASK WORKSHOP.

Pac Symp Biocomput. 2016;21:581-92.

Classifying adverse drug reactions from imbalanced twitter data.

Int J Med Inform. 2019 Sep;129:122-132. doi: 10.1016/j.ijmedinf.2019.05.017. Epub 2019 May 30.

From narrative descriptions to MedDRA: automagically encoding adverse drug reactions.

J Biomed Inform. 2018 Aug;84:184-199. doi: 10.1016/j.jbi.2018.07.001. Epub 2018 Jul 4.

On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

J Biomed Inform. 2015 Aug;56:318-32. doi: 10.1016/j.jbi.2015.06.016. Epub 2015 Jun 30.

Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts.

J Biomed Inform. 2016 Aug;62:148-58. doi: 10.1016/j.jbi.2016.06.007. Epub 2016 Jun 27.

Cadec: A corpus of adverse drug event annotations.

J Biomed Inform. 2015 Jun;55:73-81. doi: 10.1016/j.jbi.2015.03.010. Epub 2015 Mar 27.

Cited by

Adverse drug reaction signal detection via the long short-term memory model.

Front Pharmacol. 2025 Jun 23;16:1554650. doi: 10.3389/fphar.2025.1554650. eCollection 2025.

Position-context additive transformer-based model for classifying text data on social media.

Sci Rep. 2025 Mar 8;15(1):8085. doi: 10.1038/s41598-025-90738-1.

Exploiting question-answer framework with multi-GRU to detect adverse drug reaction on social media.

Sci Rep. 2025 Feb 4;15(1):4157. doi: 10.1038/s41598-025-87724-y.

The Value of Social Media Analysis for Adverse Events Detection and Pharmacovigilance: Scoping Review.

JMIR Public Health Surveill. 2024 Sep 6;10:e59167. doi: 10.2196/59167.

Identification of patients' smoking status using an explainable AI approach: a Danish electronic health records case study.

BMC Med Res Methodol. 2024 May 17;24(1):114. doi: 10.1186/s12874-024-02231-4.

A taxonomy for advancing systematic error analysis in multi-site electronic health record-based clinical concept extraction.

J Am Med Inform Assoc. 2024 Jun 20;31(7):1493-1502. doi: 10.1093/jamia/ocae101.

#ChronicPain: Automated Building of a Chronic Pain Cohort from Twitter Using Machine Learning.

Health Data Sci. 2023;3. doi: 10.34133/hds.0078. Epub 2023 Jul 4.

A risk identification model for detection of patients at risk of antidepressant discontinuation.

Front Artif Intell. 2023 Aug 24;6:1229609. doi: 10.3389/frai.2023.1229609. eCollection 2023.

The Role of Social Media for Identifying Adverse Drug Events Data in Pharmacovigilance: Protocol for a Scoping Review.

JMIR Res Protoc. 2023 Aug 2;12:e47068. doi: 10.2196/47068.

Transferability Based on Drug Structure Similarity in the Automatic Classification of Noncompliant Drug Use on Social Media: Natural Language Processing Approach.

J Med Internet Res. 2023 May 3;25:e44870. doi: 10.2196/44870.
