Gu H, Chen Y, He Z, Halper M, Chen L
Dr. Huanying (Helen) Gu, Computer Science Department, New York Institute of Technology, 1855 Broadway, New York, NY 10023-7692, USA, E-mail:
Methods Inf Med. 2016;55(2):158-65. doi: 10.3414/ME14-01-0104. Epub 2015 Apr 30.
The Unified Medical Language System (UMLS) is one of the largest biomedical terminological systems, with over 2.5 million concepts in its Metathesaurus repository. The UMLS's Semantic Network (SN) with its collection of 133 high-level semantic types serves as an abstraction layer on top of the Metathesaurus. In particular, the SN elaborates an aspect of the Metathesaurus's concepts via the assignment of one or more types to each concept. Due to the scope and complexity of the Metathesaurus, errors are all but inevitable in this semantic-type assignment process.
This study aims to develop a semi-automated methodology to help assure the quality of semantic-type assignments within the UMLS.
The methodology uses a cross-validation strategy involving SNOMED CT's hierarchies in combination with UMLS semantic types. Semantically uniform, disjoint concept groups are generated programmatically by partitioning the collection of all concepts in the same SNOMED CT hierarchy according to their respective semantic-type assignments in the UMLS. Domain experts are then called upon to review the concepts in any group having a small number of concepts. It is our hypothesis that a semantic-type assignment combination applicable only to a very small number of concepts in a SNOMED CT hierarchy is an indicator of potential problems.
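The partitioning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the concept identifiers, the toy records, the function names, and the size threshold `max_size` are all assumptions made for illustration, since the abstract does not state how "a small number of concepts" is defined. Concepts are grouped by their SNOMED CT hierarchy together with their exact combination of UMLS semantic types, so the groups are disjoint by construction, and groups at or below the threshold are set aside for expert review.

```python
from collections import defaultdict


def partition_by_semantic_types(concepts):
    """Group concept IDs by (hierarchy, exact set of semantic types).

    Each concept lands in exactly one group, so the groups are disjoint.
    """
    groups = defaultdict(list)
    for concept_id, hierarchy, semantic_types in concepts:
        key = (hierarchy, frozenset(semantic_types))
        groups[key].append(concept_id)
    return dict(groups)


def small_groups(groups, max_size):
    """Return the groups with at most max_size concepts (review candidates)."""
    return {key: ids for key, ids in groups.items() if len(ids) <= max_size}


# Hypothetical records: (concept ID, SNOMED CT hierarchy, UMLS semantic types).
# The IDs and assignments are invented for illustration only.
concepts = [
    ("C001", "Procedure", ["Therapeutic or Preventive Procedure"]),
    ("C002", "Procedure", ["Therapeutic or Preventive Procedure"]),
    ("C003", "Procedure", ["Therapeutic or Preventive Procedure"]),
    ("C004", "Procedure", ["Diagnostic Procedure"]),
    ("C005", "Procedure", ["Diagnostic Procedure", "Laboratory Procedure"]),
    ("C006", "Procedure", ["Finding"]),
    ("C007", "Procedure", ["Diagnostic Procedure"]),
]

groups = partition_by_semantic_types(concepts)
flagged = small_groups(groups, max_size=1)  # threshold is an assumed value

for (hierarchy, types), ids in flagged.items():
    print(hierarchy, sorted(types), ids)
```

Using a `frozenset` as part of the key treats a semantic-type combination as unordered, so a concept assigned two types is grouped separately from concepts assigned only one of them. The flagged groups are only candidates; in the methodology, domain experts decide whether an assignment is actually erroneous.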
The methodology was applied to the UMLS 2013AA release together with the January 2013 release of SNOMED CT. An overall error rate of 33% was found for concepts proposed by the quality-assurance methodology. Supporting our hypothesis, that number was four times higher than the error rate found in control samples.
The results show that the quality-assurance methodology can aid in effective and efficient identification of UMLS semantic-type assignment errors.