Department of Health Services Policy and Management, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA.
School of Medicine, University of South Carolina, Columbia, South Carolina, USA.
Stud Health Technol Inform. 2022 Jun 6;290:552-556. doi: 10.3233/SHTI220138.
As Twitter has emerged as an important data source for pharmacovigilance, the heterogeneous veracity of its data has become a major concern for extracted adverse drug reactions (ADRs). Our objective was to categorize different levels of data veracity and to explore linguistic features of tweets and Twitter variables that may be used to automatically screen for high-veracity tweets containing ADR-related information. We annotated a published Twitter corpus with linguistic features drawn from existing studies and from clinical experts. Multinomial logistic regression models found that use of first-person pronouns, expression of negative sentiment, and the ADR and drug name appearing in the same sentence were significantly associated with higher levels of data veracity (p<0.05); use of medical terminology and fewer indications were associated with good data veracity (p<0.05); and a smaller number of drugs was marginally associated with good data veracity (p=0.053). These findings suggest opportunities for developing machine learning models that automatically screen ADR-related tweets using key linguistic features, Twitter variables, and association rules.
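As a minimal sketch, three of the linguistic features named in the abstract (first-person pronouns, an ADR and a drug name in the same sentence, and the number of drugs) might be computed for a tweet roughly as follows. The lexicons, the function names, and the naive sentence splitter are hypothetical and serve illustration only; they are not taken from the study, which relied on manual annotation.

```python
import re

# Hypothetical toy lexicons for illustration only; a real system would
# rely on curated drug and ADR vocabularies.
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}
DRUG_LEXICON = {"ibuprofen", "sertraline", "metformin"}
ADR_LEXICON = {"nausea", "headache", "dizziness", "insomnia"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive split on '.', '!' and '?' (an assumption, adequate for a sketch)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_features(tweet):
    """Return a dict with three illustrative features for one tweet."""
    tokens = tokenize(tweet)

    # Does any single sentence mention both a drug and an ADR term?
    same_sentence = False
    for sentence in split_sentences(tweet):
        words = set(tokenize(sentence))
        if words & DRUG_LEXICON and words & ADR_LEXICON:
            same_sentence = True
            break

    return {
        "first_person": any(t in FIRST_PERSON for t in tokens),
        "drug_adr_same_sentence": same_sentence,
        # Count distinct drug names, not repeated mentions.
        "drug_count": len({t for t in tokens if t in DRUG_LEXICON}),
    }
```

Features of this kind could then be passed as predictors to a multinomial logistic regression or another classifier; which features matter, and how strongly, would have to be estimated from annotated data.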