Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, Republic of China; Post Baccalaureate Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China.
Big Data laboratories of Chunghwa Telecom Laboratories, Taoyuan, Taiwan, Republic of China.
Int J Med Inform. 2019 Sep;129:122-132. doi: 10.1016/j.ijmedinf.2019.05.017. Epub 2019 May 30.
Social media are now widely used by the general public to create and share messages about their health. As social media usage grows worldwide, users increasingly post information related to adverse drug reactions (ADRs). Mining social media data for this type of information can support pharmacological post-marketing surveillance and monitoring. Although the case for using social media to facilitate pharmacovigilance is convincing, building automatic ADR detection systems remains a challenge because corpora compiled from social media tend to be highly imbalanced, which is a major obstacle to developing classifiers with reliable performance.
Several methods have been proposed to address the challenge of imbalanced corpora. However, we are not aware of any study that has investigated the effectiveness of strategies for dealing with imbalanced data in the context of ADR detection from social media. We therefore evaluated a variety of techniques for handling class imbalance and proposed a novel word embedding-based synthetic minority over-sampling technique (WESMOTE), which synthesizes new training examples from sentence representations based on word embeddings. We compared the performance of all methods on two large imbalanced datasets released for the purpose of detecting ADR posts.
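The idea of synthesizing minority examples in a sentence-embedding space can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-pooling sentence representation, the toy two-dimensional embeddings, and the helper names `sentence_vector` and `smote_on_embeddings` are assumptions made purely for illustration. Only the general SMOTE-style step, interpolating between a minority example and one of its nearest minority neighbours, follows the description above.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Represent a post as the mean of its word vectors (assumed pooling choice)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def smote_on_embeddings(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic minority vectors by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other minority points
    # Pairwise Euclidean distances between minority sentence vectors.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                   # random minority example
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy embeddings and posts (hypothetical values, for illustration only).
toy_embeddings = {
    "drug": np.array([1.0, 0.0]),
    "headache": np.array([0.0, 1.0]),
    "nausea": np.array([0.2, 0.8]),
    "dizzy": np.array([0.1, 0.9]),
}
minority_posts = [["drug", "headache"], ["drug", "nausea"], ["drug", "dizzy"]]
X_min = np.array([sentence_vector(p, toy_embeddings, 2) for p in minority_posts])
X_syn = smote_on_embeddings(X_min, n_new=5)
```

Because each synthetic vector lies on the segment between two real minority vectors, the new examples stay inside the region of embedding space already occupied by ADR posts.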
Compared with state-of-the-art approaches, the classifiers that incorporated imbalanced classification techniques achieved comparable or better F-scores. All of our best-performing configurations combined random under-sampling with other techniques, including the proposed WESMOTE, boosting, and ensemble methods, implying that integrating these approaches with under-sampling provides a reliable solution for large imbalanced social media datasets. Furthermore, ensemble-based methods such as vote-based under-sampling (VUE) and random under-sampling boosting can serve as alternatives to the hybrid synthetic methods because both increase the diversity of the weak classifiers they create, leading to better recall and overall F-scores for the minority classes.
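A rough sketch of how random under-sampling can be combined with a voting ensemble is given here. It is written in the spirit of vote-based under-sampling but is not the configuration evaluated in the study: the nearest-centroid base learner, the number of models, the 0/1 label coding, and the function names are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in base learner (assumption; any classifier could be used)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(dist, axis=1)]

def fit_undersampled_ensemble(X, y, n_models=5, minority_label=1, seed=0):
    """Train each model on all minority rows plus an equal-sized random majority sample."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    models = []
    for _ in range(n_models):
        sampled = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sampled])
        models.append(NearestCentroid().fit(X[idx], y[idx]))
    return models

def predict_by_vote(models, X, minority_label=1):
    """Return 1 where more than half of the models predict the minority class, else 0."""
    votes = np.stack([m.predict(X) == minority_label for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic, heavily imbalanced toy data (200 majority rows vs. 10 minority rows).
rng = np.random.default_rng(42)
X_maj = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
X_minority = rng.normal(loc=3.0, scale=0.5, size=(10, 2))
X = np.vstack([X_maj, X_minority])
y = np.array([0] * 200 + [1] * 10)

models = fit_undersampled_ensemble(X, y)
pred = predict_by_vote(models, np.array([[3.0, 3.0], [0.0, 0.0]]))
```

Each model sees a different random slice of the majority class, which is one way to obtain the classifier diversity mentioned above while every model still trains on a balanced subset.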
Data collected from social media are usually very large and highly imbalanced. To maximize the performance of a classifier trained on such data, strategies for handling class imbalance are required. We considered several practical methods for handling imbalanced Twitter data and assessed their performance on the binary ADR classification task. The following practical insights were gained: 1) For text classification, the proposed word embedding-based synthetic minority over-sampling technique is more effective than traditional synthetic over-sampling methods. 2) When large amounts of training data are available, imbalance-handling strategies combined with under-sampling techniques are preferred. 3) Advanced methods do not guarantee better performance than simpler ones such as VUE, which achieved high performance while offering faster building time and easier development.