Chemoinformatics and Molecular Modeling Laboratory, The Alexander Butlerov Institute of Chemistry, Kazan Federal University, Kazan 420008, Russian Federation.
Samsung-PDMI AI Center, Steklov Institute of Mathematics at St. Petersburg, St. Petersburg 191023, Russian Federation.
Bioinformatics. 2021 Apr 19;37(2):243-249. doi: 10.1093/bioinformatics/btaa675.
Drugs and diseases play a central role in many areas of biomedical research and healthcare. Aggregating knowledge about these entities across a broader range of domains and languages is critical for information extraction (IE) applications. To facilitate text mining methods that analyze patients' health conditions and adverse drug reactions reported on the Internet and compare them with traditional sources such as drug labels, we present a new corpus of Russian-language health reviews.
The Russian Drug Reaction Corpus (RuDReC) is a new, partially annotated corpus of Russian-language consumer reviews of pharmaceutical products, intended for the detection of health-related named entities and of the effectiveness of pharmaceutical products. The corpus consists of two parts, a raw part and a labeled part. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labeled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Sentence-level labels indicate the presence or absence of health-related issues. Sentences that mention a health-related issue are additionally labeled at the expression level to identify fine-grained subtypes such as drug classes and drug forms, drug indications and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multilabel sentence classification tasks on this corpus. Our RuDR-BERT model achieved a macro F1 score of 74.85% on the NER task. On the sentence classification task, our model achieves a macro F1 score of 68.82%, a gain of 7.47% over the score of a BERT model trained on Russian data.
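Since both results above are reported as macro F1, a minimal sketch of how that metric can be computed for multilabel sentence classification may be useful. It is not the authors' evaluation code. The label names (`ADR`, `INDICATION`), the toy data, and the convention of scoring a label with no gold and no predicted instances as 0 are all assumptions made for illustration.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 scores over a set of sentences."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs in gold or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label set and toy annotations, one set of labels per sentence.
LABELS = ["ADR", "INDICATION"]
gold = [{"ADR"}, {"INDICATION"}, set(), {"ADR", "INDICATION"}]
pred = [{"ADR"}, set(), set(), {"ADR"}]

score = macro_f1(gold, pred, LABELS)
print(f"macro F1 = {score:.2f}")
```

In this toy case the `ADR` label is predicted perfectly (F1 of 1.0) and `INDICATION` is never predicted (F1 of 0.0), so the macro average is 0.5. Because every label contributes equally regardless of frequency, macro F1 penalizes a model that ignores rare labels.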
We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC.
Supplementary data are available at Bioinformatics online.