Cui Licong, Zhu Wei, Tao Shiqiang, Case James T, Bodenreider Olivier, Zhang Guo-Qiang
Department of Computer Science, University of Kentucky, Lexington, KY, USA.
Institute for Biomedical Informatics, University of Kentucky.
J Am Med Inform Assoc. 2017 Jul 1;24(4):788-798. doi: 10.1093/jamia/ocw175.
Quality assurance of large ontological systems such as SNOMED CT is an indispensable part of the terminology management lifecycle. We introduce a hybrid structural-lexical method for scalable and systematic discovery of missing hierarchical relations and concepts in SNOMED CT.
All non-lattice subgraphs (the structural part) in SNOMED CT are exhaustively extracted using a scalable MapReduce algorithm. Four lexical patterns (the lexical part) are identified among the extracted non-lattice subgraphs. Non-lattice subgraphs exhibiting such lexical patterns are often indicative of missing hierarchical relations or concepts. Each lexical pattern is associated with a potential specific type of error.
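To make the structural part concrete, here is a minimal sketch of how non-lattice pairs might be detected in a small is-a hierarchy. It assumes the usual definition from the non-lattice literature: two concepts form a non-lattice pair when they share more than one maximal common descendant. The function names, the `children` adjacency map, and the toy hierarchy are hypothetical illustrations. The brute-force pairwise scan merely stands in for the scalable MapReduce algorithm, which the abstract does not detail.

```python
from itertools import combinations


def descendants(children, node):
    """Return all strict descendants of `node` in the is-a DAG."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, ()))
    return seen


def maximal_common_descendants(children, a, b):
    """Maximal elements among the concepts lying at or below both a and b."""
    # Reflexive down-sets, so that a pair where one concept subsumes
    # the other yields the lower concept as the single maximal element.
    down_a = descendants(children, a) | {a}
    down_b = descendants(children, b) | {b}
    common = down_a & down_b
    # A common descendant is not maximal if it lies below another one.
    dominated = set()
    for c in common:
        dominated |= descendants(children, c) & common
    return common - dominated


def non_lattice_pairs(children):
    """Brute-force scan over all concept pairs (illustrative only)."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    result = {}
    for a, b in combinations(sorted(nodes), 2):
        mcd = maximal_common_descendants(children, a, b)
        if len(mcd) > 1:
            result[(a, b)] = mcd
    return result


# Hypothetical toy hierarchy: parent -> list of direct is-a children.
toy = {"A": ["C", "D"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
pairs = non_lattice_pairs(toy)
```

In this toy hierarchy, `A` and `B` share the two maximal common descendants `C` and `D`, so `(A, B)` is reported as a non-lattice pair. The corresponding non-lattice subgraph would consist of the pair, its maximal common descendants, and any concepts lying between them; the lexical patterns would then be checked against the names of the concepts in that subgraph.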
Applying the structural-lexical method to SNOMED CT (September 2015 US edition), we found 6801 non-lattice subgraphs that matched these lexical patterns, of which 2046 were amenable to visual inspection. We evaluated a random sample of 100 small subgraphs, of which 59 were reviewed in detail by domain experts. All the subgraphs reviewed contained errors confirmed by the experts. The most frequent type of error was missing is-a relations due to incomplete or inconsistent modeling of the concepts.
Our hybrid structural-lexical method is innovative and proved effective not only in detecting errors in SNOMED CT, but also in suggesting remediation for these errors.