Hao Xubing, Abeysinghe Rashmie, Zheng Fengbo, Cui Licong
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Department of Neurology, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2021 Dec;2021:1805-1812. doi: 10.1109/bibm52615.2021.9669407.
Missing hierarchical relations and missing concepts are common quality issues in biomedical ontologies. Non-lattice subgraphs have been extensively studied for automatically identifying missing relations in biomedical ontologies such as SNOMED CT. However, little is known about the capability of non-lattice subgraphs to uncover new or missing concepts. In this work, we investigate a lexical intersection approach built on non-lattice subgraphs to identify potential missing concepts in SNOMED CT. We first construct lexical features of concepts from their fully specified names. We then generate hierarchically unrelated concept pairs in non-lattice subgraphs as candidates from which to derive new concepts. For each candidate pair, we perform an order-preserving intersection of the two concepts' lexical features, and the intersection result serves as the suggested name of a potential new concept. We further perform automatic validation against terminologies in the Unified Medical Language System (UMLS) and literature in PubMed. Applying this approach to the March 2021 release of the SNOMED CT US Edition, we obtained 7,702 potential missing concepts, of which 1,288 were validated through UMLS and 1,309 were validated through PubMed. The results show that non-lattice subgraphs have the potential to facilitate the suggestion of new concepts for SNOMED CT.
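The order-preserving intersection step can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: lexical features are the lower-cased word tokens of a fully specified name with its trailing semantic tag removed, and the intersection keeps the words of the first name that also occur in the second, in their original order. The function names (`lexical_features`, `order_preserving_intersection`, `suggest_name`) and the example names are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import re
from typing import List


def lexical_features(fsn: str) -> List[str]:
    """Lower-cased word tokens of a fully specified name.

    Assumption: the trailing parenthesized semantic tag, e.g. "(disorder)",
    is dropped before tokenizing.
    """
    name = re.sub(r"\s*\([^()]*\)\s*$", "", fsn)
    return re.findall(r"[a-z0-9]+", name.lower())


def order_preserving_intersection(a: List[str], b: List[str]) -> List[str]:
    """Tokens of `a` that also occur in `b`, kept in the order they appear in `a`."""
    b_tokens = set(b)
    seen = set()
    result = []
    for token in a:
        if token in b_tokens and token not in seen:
            seen.add(token)
            result.append(token)
    return result


def suggest_name(fsn_a: str, fsn_b: str) -> str:
    """Candidate concept name derived from a pair of fully specified names."""
    shared = order_preserving_intersection(
        lexical_features(fsn_a), lexical_features(fsn_b)
    )
    return " ".join(shared)


# Illustrative name strings only; not a candidate pair reported in the paper.
print(suggest_name(
    "Open fracture of left femur (disorder)",
    "Closed fracture of right femur (disorder)",
))  # fracture of femur
```

In the approach described above, such a function would be applied only to hierarchically unrelated concept pairs drawn from non-lattice subgraphs, and each resulting name would then be checked against UMLS terminologies and PubMed; those steps are outside the scope of this sketch.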