Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
Database (Oxford). 2023 Feb 3;2023. doi: 10.1093/database/baac108.
This study presents the outcomes of the BioCreative VII shared task competition (Task 3), which focused on extracting medication names from a Twitter user's publicly available tweets (the user's 'timeline'). Detecting health-related tweets is notoriously challenging for natural language processing tools. Aside from the informality of the language used, the main challenge is that people tweet about any and all topics, and most of their tweets are not related to health. Finding the tweets in a user's timeline that mention specific health-related concepts such as medications therefore requires addressing extreme class imbalance. Task 3 called for detecting the tweets in a user's timeline that mention a medication name and, for each detected mention, extracting its span. The organizers made available a corpus of 182 049 tweets publicly posted by 212 Twitter users, with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. The task was an opportunity for participants to evaluate methods that go beyond simple lexical matching and are robust to class imbalance. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.
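To make the scale of the imbalance and the shape of the task concrete, here is a minimal sketch in Python. It computes the positive rate from the corpus figures given above and runs a simple lexical-match span extractor, the kind of baseline the task asked participants to go beyond. The lexicon, function names, and example tweets are hypothetical and for illustration only; they are not taken from the corpus or from any participating system.

```python
import re
from typing import List, Tuple

# Corpus figures as reported in the abstract.
TOTAL_TWEETS = 182_049
POSITIVE_TWEETS = 442

# Hypothetical toy lexicon (an assumption, not the task's resource).
LEXICON = ["ibuprofen", "tylenol", "metformin"]

# Case-insensitive whole-word match over the lexicon entries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in LEXICON) + r")\b",
    re.IGNORECASE,
)


def positive_rate(positives: int, total: int) -> float:
    """Fraction of tweets that mention at least one medication."""
    return positives / total


def negatives_per_positive(positives: int, total: int) -> float:
    """Number of negative tweets for each positive tweet."""
    return (total - positives) / positives


def extract_spans(tweet: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each lexicon match, end exclusive."""
    return [(m.start(), m.end(), m.group(0)) for m in _PATTERN.finditer(tweet)]


if __name__ == "__main__":
    print(f"positive rate: {positive_rate(POSITIVE_TWEETS, TOTAL_TWEETS):.4%}")
    print(f"negatives per positive: "
          f"{negatives_per_positive(POSITIVE_TWEETS, TOTAL_TWEETS):.1f}")
    for text in ["Took some Tylenol for my headache", "Great game last night!"]:
        print(text, "->", extract_spans(text))
```

A lexicon matcher of this kind cannot find misspellings or medications missing from the list, and it will flag lexicon words used in a non-medication sense. With roughly 411 negative tweets for every positive one, even a small false-positive rate produces more false alarms than true mentions, which is why the task emphasizes methods that are robust to class imbalance.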