Özgür Arzucan, Hur Junguk, He Yongqun
Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey.
Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202 USA.
BioData Min. 2016 Dec 19;9:41. doi: 10.1186/s13040-016-0118-0. eCollection 2016.
The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination.
This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords that co-occur in one sentence to represent specific INO interaction classes. Such keyword combinations and the related INO interaction type information can be obtained automatically via SPARQL queries, exported in Excel format, and used in INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used Learning Logic in Language (LLL) benchmark dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type.
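The co-occurrence idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the SciMiner implementation: the `KEYWORD_COMBINATIONS` table, its labels, and the function names are hypothetical placeholders, and a real table would be derived from the INO annotations rather than written by hand.

```python
import re

# Hypothetical keyword combinations mapped to placeholder labels.
# These are NOT actual INO annotations; they only illustrate the lookup.
KEYWORD_COMBINATIONS = {
    ("transcription", "dependent"): "example: dependent transcription type",
    ("binds", "promoter"): "example: promoter binding type",
}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def match_combinations(sentence, combinations=KEYWORD_COMBINATIONS):
    """Return labels whose keywords all co-occur in the sentence."""
    tokens = set(tokenize(sentence))
    return sorted(
        label
        for keywords, label in combinations.items()
        if all(keyword in tokens for keyword in keywords)
    )
```

A combination matches only when every one of its keywords appears in the sentence, so a sentence containing "binds" without "promoter" yields no match for the second entry. Co-occurrence alone does not show that the keywords are syntactically related, which is why the dependency patterns are needed as a further filter.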
INO currently has 575 terms, including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated by running the Stanford Parser to obtain dependency relation types. Of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 (about 80%) were identified using direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which was associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. Multi-keyword interaction types were also frequently observed in the vaccine dataset.
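To make the notion of a direct dependency relation between two keywords concrete, the following sketch checks whether two keywords are joined by a single edge of a dependency parse. It assumes the parse is already available as (relation, head, dependent) triples; the `EXAMPLE_EDGES` list is hand-written for illustration and is not actual parser output, and the function name is hypothetical.

```python
# Hand-written, illustrative dependency triples (relation, head, dependent)
# for the sentence "geneB activates transcription of geneA".
# These are an assumption for the example, not real Stanford Parser output.
EXAMPLE_EDGES = [
    ("nsubj", "activates", "geneB"),
    ("dobj", "activates", "transcription"),
    ("prep_of", "transcription", "geneA"),
]


def direct_relation(edges, keyword_a, keyword_b):
    """Return the relation label if one edge links the two keywords.

    The edge direction is ignored: either keyword may be the head.
    Returns None when no single edge connects the pair.
    """
    pair = {keyword_a, keyword_b}
    for relation, head, dependent in edges:
        if {head, dependent} == pair:
            return relation
    return None
```

Here "activates" and "transcription" are linked by one edge, so they would count as a directly related keyword pair, whereas "activates" and "geneA" are connected only through an intermediate word and would not.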
By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.