Using distributional analysis to semantically classify UMLS concepts.

Author information

Fan Jung-Wei, Xu Hua, Friedman Carol

Affiliation

Department of Biomedical Informatics, Columbia University, USA.

Publication information

Stud Health Technol Inform. 2007;129(Pt 1):519-23.

PMID:17911771

Abstract

The UMLS is a widely used and comprehensive knowledge source in the biomedical domain. It specifies biomedical concepts and their semantic categories, and is therefore valuable for Natural Language Processing (NLP) and other knowledge-based systems. However, the UMLS semantic classification is not always accurate, which adversely affects the performance of these systems. It is therefore desirable to automatically validate or, when necessary, semantically reclassify UMLS concepts. We applied a distributional similarity method based on syntactic dependencies and α-skew divergence to classify concepts in the T033 Finding class in order to determine which ones were biologic functions or disorders. A gold standard of 100 randomly sampled concepts was created based on the majority annotation of three experts. The top prediction achieved a precision of 0.54 and a recall of 0.654; the top 2 predictions achieved a precision of 0.64 and a recall of 0.769. Error analysis revealed problems in the current method and provided insight into future improvements.
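The distributional similarity step mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used definition of skew divergence, KL(r || αq + (1 − α)r), with α = 0.99; the abstract does not state the α value, the argument order, or the feature representation, so those choices, the function names, and the toy dependency features below are all hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Normalize raw feature counts into a probability distribution."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def skew_divergence(q, r, alpha=0.99):
    """Skew divergence KL(r || alpha*q + (1-alpha)*r).

    Mixing a little of r into q keeps the value finite even when q
    assigns zero probability to a feature that r has seen.
    alpha=0.99 is an assumed value, not taken from the abstract.
    """
    total = 0.0
    for feature, r_p in r.items():
        if r_p == 0.0:
            continue
        mixed = alpha * q.get(feature, 0.0) + (1.0 - alpha) * r_p
        total += r_p * math.log(r_p / mixed)
    return total


def rank_classes(target, class_profiles, alpha=0.99, top_k=2):
    """Return the top_k class labels, lowest divergence first."""
    scores = {
        label: skew_divergence(profile, target, alpha)
        for label, profile in class_profiles.items()
    }
    return sorted(scores, key=scores.get)[:top_k]


# Toy syntactic-dependency features (hypothetical, for illustration only).
profiles = {
    "disorder": to_distribution(Counter(
        {"obj_of:treat": 5, "obj_of:diagnose": 3, "subj_of:worsen": 2})),
    "biologic function": to_distribution(Counter(
        {"subj_of:increase": 4, "subj_of:regulate": 4, "obj_of:measure": 2})),
}
target = to_distribution(Counter(
    {"obj_of:treat": 2, "subj_of:worsen": 1, "obj_of:measure": 1}))

ranked = rank_classes(target, profiles)
print(ranked)
```

In this toy setup the target concept shares most of its dependency contexts with the disorder profile, so that class has the lowest divergence and is ranked first. Returning the two lowest-divergence classes mirrors the "top prediction" and "top 2 predictions" evaluation described in the abstract.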
