DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India.
PLoS One. 2018 Jul 26;13(7):e0200699. doi: 10.1371/journal.pone.0200699. eCollection 2018.
Biomedical literature contains a wealth of knowledge about relations between genes and their associated diseases. Mining these associations from the literature can support research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this work considerably. Biomedical text mining is therefore a crucial technology, and relation extraction shows promising results for identifying genes associated with diseases. We address this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties is learned jointly with word embeddings and used to train an ensemble of support vector machines for extracting gene-disease relations from four gold standard corpora. On evaluation, the approach achieves promising F-measures of 85.34%, 83.93%, 87.39% and 85.57% on the EUADR, GAD, CoMAGC and PolySearch corpora, respectively. We strongly believe that this novel approach, which combines a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and is evaluated on four gold standard corpora, can act as a new baseline for future work on gene-disease relation extraction from the literature.
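As a rough illustration of the pipeline the abstract describes, the following minimal sketch concatenates hand-crafted features with an embedding vector, trains a small ensemble of linear support vector machines, and scores predictions with the F-measure. It is not the authors' implementation. The bagging scheme with majority voting, the Pegasos-style training loop, every function name, and all toy feature values are assumptions made for illustration; the abstract does not specify how the ensemble is constructed or which kernel is used.

```python
import random

def combine_features(handcrafted, embedding):
    """Concatenate hand-crafted features with an embedding vector and a constant bias term."""
    return list(handcrafted) + list(embedding) + [1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM via Pegasos-style stochastic subgradient descent; labels are +1 / -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            violated = y[i] * dot(w, X[i]) < 1       # hinge loss is active
            w = [(1.0 - 1.0 / t) * wj for wj in w]   # shrink step from L2 regularization
            if violated:
                step = y[i] / (lam * t)
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
    return w

def train_ensemble(X, y, n_members=5, seed=0):
    """Bagging (an assumption): each member is trained on a bootstrap resample."""
    rng = random.Random(seed)
    members = []
    for m in range(n_members):
        idx = [rng.randrange(len(X)) for _ in X]
        members.append(train_linear_svm([X[i] for i in idx],
                                        [y[i] for i in idx], seed=seed + m))
    return members

def predict(members, x):
    """Majority vote over the ensemble members."""
    votes = sum(1 if dot(w, x) >= 0 else -1 for w in members)
    return 1 if votes > 0 else -1

def f_measure(gold, predicted):
    """Balanced F-measure (harmonic mean of precision and recall) for the +1 class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == -1 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: (hand-crafted features, sentence embedding) per candidate pair.
# The values are invented and centered around zero purely for illustration.
X_train = [
    combine_features([2.0, 1.0], [0.9, 0.2]),
    combine_features([2.0, 0.0], [0.8, 0.1]),
    combine_features([1.5, 1.0], [0.7, 0.3]),
    combine_features([-2.0, -1.0], [-0.9, -0.1]),
    combine_features([-2.0, 0.0], [-0.8, -0.3]),
    combine_features([-1.5, -1.0], [-0.6, -0.2]),
]
y_train = [1, 1, 1, -1, -1, -1]  # +1: gene-disease relation, -1: no relation

members = train_ensemble(X_train, y_train)
predictions = [predict(members, x) for x in X_train]
print("Training F-measure:", f_measure(y_train, predictions))
```

In a realistic setting the feature vectors would come from parsed sentences and embeddings trained on biomedical text, and the F-measure would be computed on held-out data from each corpus rather than on the training set.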