relSCAN - A system for extracting chemical-induced disease relation from biomedical literature.

Author information

Department of Applied Mathematics and Computer Science, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey.

Department of Mathematics, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey.

Publication information

J Biomed Inform. 2018 Nov;87:79-87. doi: 10.1016/j.jbi.2018.09.018. Epub 2018 Oct 6.

DOI:10.1016/j.jbi.2018.09.018

PMID:30296491

Abstract

This paper proposes an effective and robust approach for Chemical-Induced Disease (CID) relation extraction from PubMed articles. The study was performed on the Chemical Disease Relation (CDR) task of BioCreative V track-3 corpus. The proposed system, named relSCAN, is an efficient CID relation extraction system with two phases to classify relation instances from the Co-occurrence and Non-Co-occurrence mention levels. We describe the case of chemical and disease mentions that occur in the same sentence as 'Co-occurrence', or as 'Non-Co-occurrence' otherwise. In the first phase, the relation instances are constructed on both mention levels. In the second phase, we employ a hybrid feature set to classify the relation instances at both of these mention levels using the combination of two Machine Learning (ML) classifiers (Support Vector Machine (SVM) and J48 Decision tree). This system is entirely corpus dependent and does not rely on information from external resources in order to boost its performance. We achieved good results, which are comparable with the other state-of-the-art CID relation extraction systems on the BioCreative V corpus. Furthermore, our system achieves the best performance on the Non-Co-occurrence mention level.
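
The split between the two mention levels described above can be illustrated with a minimal sketch. It assumes that each mention already carries the index of the sentence it appears in; the `Mention` type, the `build_instances` function, and the sample mentions are hypothetical and are not taken from relSCAN itself. The sketch covers only how instances might be constructed in the first phase, not the feature set or the SVM/J48 classification.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    text: str
    kind: str      # "Chemical" or "Disease"
    sentence: int  # index of the sentence containing the mention (assumed given)


def build_instances(mentions):
    """Pair every chemical mention with every disease mention and tag the level."""
    chemicals = [m for m in mentions if m.kind == "Chemical"]
    diseases = [m for m in mentions if m.kind == "Disease"]
    instances = []
    for chem, dis in product(chemicals, diseases):
        # Same sentence -> Co-occurrence; otherwise Non-Co-occurrence.
        same_sentence = chem.sentence == dis.sentence
        level = "Co-occurrence" if same_sentence else "Non-Co-occurrence"
        instances.append((chem.text, dis.text, level))
    return instances


# Illustrative mentions only; not drawn from the BioCreative V corpus.
mentions = [
    Mention("lithium", "Chemical", 0),
    Mention("tremor", "Disease", 0),
    Mention("hypothyroidism", "Disease", 2),
]
for instance in build_instances(mentions):
    print(instance)
```

In this sketch every chemical-disease mention pair becomes one candidate instance, so each level can later be handed to its own classifier.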
