Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
CIS Department, Borough of Manhattan Community College, CUNY, NY 10007, United States.
J Biomed Inform. 2018 Jul;83:135-149. doi: 10.1016/j.jbi.2018.05.015. Epub 2018 May 28.
In previous research, we demonstrated for a number of ontologies that structurally complex concepts (for different definitions of "complex") are more likely to exhibit errors than other concepts. Such complex concepts are therefore fertile ground for quality assurance (QA) in ontologies and should be audited first. One example of complex concepts is "overlapping concepts" (defined below). Historically, a different auditing methodology had to be developed for every single ontology. For better scalability and efficiency, it is desirable to identify family-wide QA methodologies, each applicable to a whole family of similar ontologies. In past research, we divided the 685 ontologies of BioPortal into families of structurally similar ontologies. We showed for four ontologies of the same large family in BioPortal that "overlapping concepts" are indeed statistically significantly more likely to exhibit errors. To make an authoritative statement about the success of "overlapping concepts" as a methodology for a whole family of similar ontologies (or of large subhierarchies of ontologies), it is necessary to show that "overlapping concepts" have a higher likelihood of errors for six out of six ontologies of the family. In this paper, we demonstrate for two more ontologies that "overlapping concepts" can successfully predict groups of concepts with a higher error rate than concepts from a control group. The fifth ontology is the Neoplasm subhierarchy of the National Cancer Institute thesaurus (NCIt). The sixth ontology is the Infectious Disease subhierarchy of SNOMED CT. We present quality assurance results for both. Furthermore, we observe two novel, important, and useful phenomena during quality assurance of "overlapping concepts." First, an erroneous "overlapping concept" can help in discovering other erroneous "non-overlapping concepts" in its vicinity. Second, correcting erroneous "overlapping concepts" may turn them into "non-overlapping concepts." We demonstrate that this may reduce the complexity of parts of the ontology, which in turn makes the ontology more comprehensible and simplifies its maintenance and use.