Sarker Abeed, Gonzalez Graciela
Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd., Scottsdale, AZ 85259, USA.
J Biomed Inform. 2015 Feb;53:196-207. doi: 10.1016/j.jbi.2014.11.002. Epub 2014 Nov 8.
Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available and could be used for pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and to use them in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate whether combining training data from distinct corpora can improve automatic classification accuracy.
One of our three data sets contains annotated sentences from clinical reports; the other two, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features that represent semantic properties (e.g., sentiment, polarity, and topic) of short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in an attempt to boost classification accuracy.
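As a rough illustration of the idea (not the feature set or pipeline used in the study), the sketch below generates a few n-gram and polarity-count features from short text segments and pools labeled examples from more than one corpus. The word lists, the function names `extract_features` and `combine_corpora`, and the example sentences are all hypothetical, introduced here only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy polarity word lists; the abstract does not specify
# which lexicons or feature generators the study actually used.
NEGATIVE_WORDS = {"pain", "nausea", "dizzy", "rash", "worse"}
POSITIVE_WORDS = {"better", "relief", "helped", "improved"}


def extract_features(text):
    """Map a short text segment to a sparse bag of named features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Surface features: unigram and bigram counts.
    features = Counter(f"unigram={t}" for t in tokens)
    features.update(f"bigram={a}_{b}" for a, b in zip(tokens, tokens[1:]))
    # Crude semantic features: counts of polarity-bearing words.
    features["neg_count"] = sum(t in NEGATIVE_WORDS for t in tokens)
    features["pos_count"] = sum(t in POSITIVE_WORDS for t in tokens)
    return features


def combine_corpora(*named_corpora):
    """Pool (text, label) pairs from several corpora into one training set.

    Each argument is a (corpus_name, examples) pair. The corpus name is kept
    with every instance so its origin can still be traced after pooling.
    """
    combined = []
    for name, examples in named_corpora:
        for text, label in examples:
            combined.append((extract_features(text), label, name))
    return combined


# Made-up examples; label 1 = ADR-assertive, 0 = not.
clinical = [("Patient developed a rash after the second dose.", 1)]
social = [("Feeling better today!", 0), ("this med made me so dizzy", 1)]

training_set = combine_corpora(("clinical", clinical), ("social", social))
```

Because every corpus passes through the same feature extractor, the pooled instances share one feature space, which is what allows a single classifier to be trained on them together.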
Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538, and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the two in-house data sets to 0.597 (an improvement of 5.9 points, from 0.538) and 0.704 (an improvement of 2.6 points, from 0.678), respectively.
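For readers less familiar with the metric, the following sketch shows how a class-level F-score is computed from true positives, false positives, and false negatives, and reproduces the arithmetic behind the reported gains. The confusion counts in the first example are invented, and the labels "in-house set 1" and "in-house set 2" are placeholders, since the abstract does not name the data sets; only the four F-score values come from the text.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall for a single class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts, purely to show the calculation.
example = f_score(tp=8, fp=2, fn=2)

# ADR F-scores reported in the abstract for the two in-house data sets.
single_corpus = {"in-house set 1": 0.538, "in-house set 2": 0.678}
multi_corpus = {"in-house set 1": 0.597, "in-house set 2": 0.704}

# Gain expressed in points, where one point is 0.01 of F-score.
gains = {
    name: round((multi_corpus[name] - single_corpus[name]) * 100, 1)
    for name in single_corpus
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```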
Our results indicate that using advanced NLP techniques to generate information-rich features from text can significantly improve classification accuracy over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integrating information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and cost of annotating data in the future.