Abeysinghe Rashmie, Brooks Michael A, Cui Licong
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX.
AMIA Annu Symp Proc. 2020 Mar 4;2019:982-991. eCollection 2019.
Auditing the National Cancer Institute (NCI) Thesaurus is essential to ensure that it provides accurate terminology for cancer-related clinical care as well as for translational and basic research. We use a structural-lexical approach to identify missing hierarchical IS-A relations in the NCI Thesaurus, based on non-lattice subgraphs and derived lexical attributes of concepts. For each concept in a non-lattice subgraph, we derive its lexical attributes in two ways: (1) by inheriting lexical attributes from its ancestors within the subgraph; and (2) by inheriting lexical attributes from all of its ancestors. For a pair of concepts with no hierarchical relation between them, if the lexical attributes of one concept are a subset of those of the other, we suggest a potential missing IS-A relation between the two concepts. Our approach identified 547 non-lattice subgraphs in the 19.01d release of the NCI Thesaurus, which revealed a total of 1,022 unique potential missing IS-A relations. A random sample of 100 relations was evaluated by a domain expert. Of these, 90 could be obtained by inheriting lexical attributes from ancestors within the non-lattice subgraph, and 76 of those were confirmed as valid (a precision of 84.44%); 82 could be obtained by inheriting lexical attributes from all ancestors, and 73 of those were confirmed as valid (a precision of 89.02%). The results show that our structural-lexical approach based on non-lattice subgraphs is effective for auditing the NCI Thesaurus.
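The subset test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a concept's lexical attributes are simply the lowercased words of its name plus those of its ancestors, that the concept with the larger attribute set is the proposed child, and that only proper subsets count. The function names, the toy concept identifiers, and the toy hierarchy are all hypothetical and are not taken from the NCI Thesaurus.

```python
from itertools import combinations

def ancestors(concept, parents, scope=None):
    """Collect ancestors of `concept`; if `scope` is given, only follow nodes inside it."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node in seen or (scope is not None and node not in scope):
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return seen

def lexical_attributes(concept, names, parents, scope=None):
    """Assumption: attributes = words of the concept's name plus words of its ancestors' names."""
    words = set(names[concept].lower().split())
    for anc in ancestors(concept, parents, scope):
        words |= set(names[anc].lower().split())
    return words

def suggest_missing_is_a(subgraph, names, parents, within_subgraph=True):
    """Return (child, parent) suggestions for unrelated pairs whose attributes nest."""
    scope = set(subgraph) if within_subgraph else None
    attrs = {c: lexical_attributes(c, names, parents, scope) for c in subgraph}
    all_anc = {c: ancestors(c, parents) for c in subgraph}
    suggestions = set()
    for a, b in combinations(sorted(subgraph), 2):
        if a in all_anc[b] or b in all_anc[a]:
            continue  # already hierarchically related
        if attrs[a] < attrs[b]:      # proper subset: b is assumed more specific
            suggestions.add((b, a))
        elif attrs[b] < attrs[a]:
            suggestions.add((a, b))
    return suggestions

# Hypothetical toy hierarchy (child -> list of parents).
names = {
    "disease": "disease",
    "neoplasm": "neoplasm",
    "skin_neoplasm": "skin neoplasm",
    "benign_neoplasm": "benign neoplasm",
    "benign_skin_neoplasm": "benign skin neoplasm",
    "benign_skin_neoplasm_of_face": "benign skin neoplasm of face",
}
parents = {
    "neoplasm": ["disease"],
    "skin_neoplasm": ["neoplasm"],
    "benign_neoplasm": ["neoplasm"],
    "benign_skin_neoplasm": ["skin_neoplasm", "benign_neoplasm"],
    "benign_skin_neoplasm_of_face": ["skin_neoplasm", "benign_neoplasm"],
}
# Two concepts sharing two maximal common descendants form a non-lattice subgraph.
subgraph = ["skin_neoplasm", "benign_neoplasm",
            "benign_skin_neoplasm", "benign_skin_neoplasm_of_face"]

within = suggest_missing_is_a(subgraph, names, parents, within_subgraph=True)
overall = suggest_missing_is_a(subgraph, names, parents, within_subgraph=False)
print(within)
print(overall)
```

In this toy case both inheritance modes propose the same relation, with the "of face" concept as a child of the more general benign skin neoplasm concept. On real data the two modes can differ, which is consistent with the different counts (90 versus 82) reported in the abstract.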