Panyam Nagesh C, Verspoor Karin, Cohn Trevor, Ramamohanarao Kotagiri
School of Computing and Information Systems, University of Melbourne, Melbourne, Australia.
J Biomed Semantics. 2018 Jan 30;9(1):7. doi: 10.1186/s13326-017-0168-3.
BACKGROUND: Relation extraction from biomedical publications is an important task in the semantic mining of text. Kernel methods for supervised relation extraction are often preferred over manual feature engineering when classifying highly ordered structures, such as the trees and graphs obtained from syntactic parsing of a sentence. Tree kernels such as the Subset Tree Kernel and Partial Tree Kernel have been shown to be effective for classifying constituency parse trees and basic dependency parse graphs of a sentence. Graph kernels such as the All Path Graph (APG) kernel and the Approximate Subgraph Matching (ASM) kernel have been shown to be suitable for classifying general graphs with cycles, such as the enhanced dependency parse graph of a sentence. In this work, we present a high-performance Chemical-Induced Disease (CID) relation extraction system. We present a comparative study of kernel methods for the CID task and extend the study to Protein-Protein Interaction (PPI) extraction, another important biomedical relation extraction task. We discuss novel modifications to the ASM kernel that boost its performance, and a method for applying graph kernels to extract relations expressed across multiple sentences.

RESULTS: Our system for CID relation extraction attains an F-score of 60% without using external knowledge sources or task-specific heuristics or rules. In comparison, the state-of-the-art Chemical-Disease Relation Extraction system achieves an F-score of 56% using an ensemble of multiple machine learning methods, which is then boosted to 61% by a rule-based system employing task-specific post-processing rules. For the CID task, graph kernels outperform tree kernels substantially, and the best performance is obtained with the APG kernel, which attains an F-score of 60%, followed by the ASM kernel at 57%. The performance difference between the ASM and APG kernels for CID sentence-level relation extraction is not significant. In our evaluation of ASM for the PPI task, ASM performed better than the APG kernel on the BioInfer dataset in the Area Under Curve (AUC) measure (74% vs 69%). However, on all the other PPI datasets, namely AIMed, HPRD50, IEPA and LLL, ASM is substantially outperformed by the APG kernel in both F-score and AUC.

CONCLUSIONS: We demonstrate high-performance Chemical-Induced Disease relation extraction without employing external knowledge sources or task-specific heuristics. Our work shows that graph kernels are effective in extracting relations that are expressed across multiple sentences. We also show that the graph kernels, namely the ASM and APG kernels, substantially outperform the tree kernels. Among the graph kernels, we show that the ASM kernel is effective for biomedical relation extraction, with performance comparable to the APG kernel on datasets such as CID sentence-level relation extraction and BioInfer in PPI. Overall, the APG kernel is shown to be significantly more accurate than the ASM kernel, achieving better performance on most datasets.
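The F-scores reported above combine precision and recall into a single number. As a minimal sketch, assuming the standard balanced F-measure (the abstract does not state which variant is used), it can be computed from prediction counts as follows. The function name `f_score` and the counts in the usage lines are hypothetical illustrations, not figures from the study.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true-positive, false-positive and false-negative counts.

    Assumption: the standard definition, with beta=1 giving the harmonic mean
    of precision and recall. Returns 0.0 when there are no true positives,
    since precision and recall are then zero or undefined.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted relations that are correct
    recall = tp / (tp + fn)     # fraction of gold relations that were predicted
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only for illustration (not taken from the paper):
# 60 correct relations, 40 spurious predictions, 40 missed relations.
print(f"F1 = {f_score(tp=60, fp=40, fn=40):.2f}")
```

Because the balanced F-score is a harmonic mean, it stays low unless both precision and recall are high, which is why it is a common single-number summary for relation extraction systems.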