Kazan Federal University, 18 Kremlyovskaya Street, Kazan 420008, Russian Federation.
Kazan Federal University, 18 Kremlyovskaya Street, Kazan 420008, Russian Federation; St. Petersburg Department of the Steklov Mathematical Institute, 27 Fontanka, St. Petersburg 191023, Russian Federation; Insilico Medicine Hong Kong Ltd, Pak Shek Kok, New Territories, Hong Kong.
J Biomed Inform. 2020 Mar;103:103382. doi: 10.1016/j.jbi.2020.103382. Epub 2020 Feb 3.
Relation extraction aims to discover relational facts about entity mentions in plain text. In this work, we focus on clinical relation extraction: given a medical record with mentions of drugs and their attributes, we identify relations between these entities. We propose a machine learning model with a novel set of knowledge-based and BioSentVec embedding features. We systematically investigate the impact of these features in combination with standard distance- and word-based features, conducting experiments on two benchmark datasets of clinical texts from the MADE 2018 and n2c2 2018 shared tasks. For comparison with the feature-based model, we use state-of-the-art models and three BERT-based models, including BioBERT and Clinical BERT. Our results demonstrate that distance and word features provide significant benefits to the classifier. Knowledge-based features improve classification results only for particular types of relations. Of the features explored, the sentence embedding feature provides the largest improvement on the MADE corpus. The classifier obtains state-of-the-art performance in clinical relation extraction, with an F-measure of 92.6% on the MADE corpus, an improvement of 3.5%.
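To make the distance- and word-based features concrete, the following is a minimal sketch of how such features might be computed for one candidate pair of entity mentions. It is an illustration only, not the feature set used in the paper: the `Mention` class, the `pair_features` function, the feature names, the entity labels, and the example sentence are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """Hypothetical entity mention; offsets are token indices."""
    text: str
    label: str   # illustrative entity type, e.g. "Drug" or "Dose"
    start: int   # index of the first token
    end: int     # index one past the last token


def pair_features(tokens, a, b):
    """Build simple distance- and word-based features for a mention pair."""
    # Order the two mentions by position so the gap is well defined.
    first, second = (a, b) if a.start <= b.start else (b, a)
    between = tokens[first.end:second.start]
    return {
        # Distance feature: number of tokens separating the mentions.
        "token_distance": max(0, second.start - first.end),
        # Type feature: entity labels in argument order.
        "type_pair": f"{a.label}->{b.label}",
        # Word feature: lowercased tokens between the mentions.
        "words_between": [t.lower() for t in between],
    }


# Made-up example sentence, not taken from MADE or n2c2.
tokens = "Patient was started on warfarin 5 mg daily for atrial fibrillation".split()
drug = Mention("warfarin", "Drug", 4, 5)
dose = Mention("5 mg", "Dose", 5, 7)
reason = Mention("atrial fibrillation", "Indication", 9, 11)

drug_dose = pair_features(tokens, drug, dose)
drug_reason = pair_features(tokens, drug, reason)
```

A dictionary like this could be vectorized and passed to any standard classifier; knowledge-based and sentence embedding features would be added as further entries.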