Xing Guangming, Zhang Guo-Qiang, Cui Licong
Department of Computer Science, Western Kentucky University, Bowling Green, 42101 KY USA.
Institute of Biomedical Informatics, University of Kentucky, Lexington, 40536 KY USA.
BioData Min. 2016 Oct 10;9:31. doi: 10.1186/s13040-016-0110-8. eCollection 2016.
Redundant hierarchical relations refer to such patterns as two paths from one concept to another, one with length one (direct) and the other with length greater than one (indirect). Each redundant relation represents a possibly unintended defect that needs to be corrected in the ontology quality assurance process. Detecting and eliminating redundant relations would help improve the results of all methods relying on the relevant ontological systems as knowledge source, such as the computation of semantic distance between concepts and for ontology matching and alignment.
This paper introduces a novel and scalable approach, called FEDRR - Fast, Exhaustive Detection of Redundant Relations - for quality assurance work during ontological evolution. FEDRR combines the algorithm ideas of Dynamic Programming with Topological Sort, for exhaustive mining of all redundant hierarchical relations in ontological hierarchies, in (·||+||) time, where || is the number of concepts, || is the number of the relations, and is a constant in practice. Using FEDRR, we performed exhaustive search of all redundant is-a relations in two of the largest ontological systems in biomedicine: SNOMED CT and Gene Ontology (GO). 372 and 1609 redundant is-a relations were found in the 2015-09-01 version of SNOMED CT and 2015-05-01 version of GO, respectively. We have also performed FEDRR on over 190 source vocabularies in the UMLS - a large integrated repository of biomedical ontologies, and identified six sources containing redundant is-a relations. Randomly generated ontologies have also been used to further validate the efficiency of FEDRR.
FEDRR provides a generally applicable, effective tool for systematic detecting redundant relations in large ontological systems for quality improvement.
冗余层次关系指的是从一个概念到另一个概念存在两条路径的模式,一条路径长度为一(直接路径),另一条路径长度大于一(间接路径)。每个冗余关系都代表了一个可能意外出现的缺陷,需要在本体质量保证过程中加以纠正。检测并消除冗余关系将有助于提升所有依赖相关本体系统作为知识源的方法的结果,比如概念间语义距离的计算以及本体匹配与对齐。
本文介绍了一种新颖且可扩展的方法,名为FEDRR(快速、详尽的冗余关系检测),用于本体演化过程中的质量保证工作。FEDRR将动态规划的算法思想与拓扑排序相结合,以详尽挖掘本体层次结构中所有冗余层次关系,时间复杂度为(·||+||) ,其中||为概念数量,||为关系数量, 在实际应用中是一个常数。使用FEDRR,我们对生物医学领域两个最大的本体系统:SNOMED CT和基因本体(GO)中的所有冗余“是一种”关系进行了详尽搜索。在SNOMED CT的2015 - 09 - 01版本和GO的2015 - 05 - 01版本中,分别发现了372个和1609个冗余“是一种”关系。我们还对统一医学语言系统(UMLS,一个大型生物医学本体集成库)中的190多个源词汇表执行了FEDRR,并识别出六个包含冗余“是一种”关系的源。随机生成的本体也被用于进一步验证FEDRR的效率。
FEDRR为系统检测大型本体系统中的冗余关系以提高质量提供了一个普遍适用且有效的工具。