Peng Yifan, Wei Chih-Hsuan, Lu Zhiyong
National Center for Biotechnology Information, Bethesda, MD 20894 USA ; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA.
National Center for Biotechnology Information, Bethesda, MD 20894 USA.
J Cheminform. 2016 Oct 7;8:53. doi: 10.1186/s13321-016-0165-z. eCollection 2016.
Because identifying relations between chemicals and diseases is important for new drug discovery and for improving chemical safety, there is growing interest in automatic relation extraction systems that capture these relations from the rich and rapidly growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for chemical-induced disease (CID) relations.
We propose a rich-feature approach with a Support Vector Machine (SVM) to aid in the extraction of CID relations from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule-based and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only the BioCreative V training and development sets, our system achieves an F-score of 57.51%, which already compares favorably to previous methods. The F-score improves further to 61.01% when training is augmented with additional automatically generated weakly labeled data.
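The weak-labeling idea and the F-score used for evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the knowledge base is available as a set of curated (chemical, disease) identifier pairs and that each document comes with the sets of chemical and disease identifiers mentioned in it. The function names and the identifiers such as `CHEM:1` are hypothetical placeholders.

```python
from itertools import product


def weak_labels(doc_chemicals, doc_diseases, kb_pairs):
    """Pair every chemical with every disease found in one document and
    mark a pair positive if the (assumed) knowledge base lists it."""
    candidates = product(sorted(doc_chemicals), sorted(doc_diseases))
    return [((chem, dis), (chem, dis) in kb_pairs) for chem, dis in candidates]


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over
    sets of predicted and gold relation pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical knowledge base and document annotations (placeholder IDs).
kb = {("CHEM:1", "DIS:2")}
labeled = weak_labels({"CHEM:1"}, {"DIS:1", "DIS:2"}, kb)

# Score a toy prediction against a toy gold standard.
score = f_score(
    {("CHEM:1", "DIS:1"), ("CHEM:1", "DIS:2")},
    {("CHEM:1", "DIS:2"), ("CHEM:2", "DIS:2")},
)
```

In a setup like the one described, pairs labeled this way would be added to the human-annotated training examples before the classifier is trained. The labels are noisy because a curated pair may be listed in the knowledge base without the document itself asserting the relation.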
Our text-mining approach demonstrates state-of-the-art performance in chemical-disease relation extraction. More importantly, this work exemplifies the use of freely available curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.