Hole W T, Srinivasan S
National Library of Medicine, Bethesda, MD, USA.
Proc AMIA Symp. 2000:354-8.
The Unified Medical Language System (UMLS) [1, 2] Metathesaurus is concept-oriented; its goal is to unite all names with identical meaning in a single Concept. The names come from its constituent vocabularies, or "sources": a wide variety of biomedical terminologies, including many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic, research, full-text, and expert systems. Many sources offer little definitional information, and many are not themselves concept-oriented, so identifying synonymy is a challenging semantic task [3]. The rapidly increasing size of the Metathesaurus makes the task daunting and demands effective computational support; the January 2000 release contains more than 1.5 million names for 730,000 concepts. Vocabularies are added and updated using sophisticated lexical matching, selective algorithms, and expert review [4, 5, 6]. Yet the result is imperfect; we have discovered and corrected missed synonymy in approximately 1% of previously released concepts each year. This paper reviews general methods for finding missed synonymy and describes several specific novel approaches that we have found effective.
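As a rough illustration of how lexical matching can surface candidate missed synonymy, the sketch below groups names by a crude normalized key and reports keys shared by more than one concept. It is not the authors' method: the functions `normalize` and `candidate_missed_synonymy`, the concept identifiers, and the sample names are all hypothetical, and the normalization (lowercasing, dropping punctuation, sorting words) is only a simple stand-in for the more sophisticated lexical processing the abstract refers to. Any pair it flags would still need expert review.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Crude lexical key: lowercase, drop punctuation, sort the words.

    Assumption: this stands in for real lexical normalization, which
    would also handle inflection, stop words, and spelling variants.
    """
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(words))


def candidate_missed_synonymy(names):
    """Group (concept_id, name) pairs by normalized key.

    Returns a dict mapping each key that occurs in more than one
    concept to the sorted list of those concept ids. These are only
    candidates for review, not confirmed synonyms.
    """
    by_key = defaultdict(set)
    for concept_id, name in names:
        by_key[normalize(name)].add(concept_id)
    return {key: sorted(ids) for key, ids in by_key.items() if len(ids) > 1}


# Invented toy data; the ids are hypothetical, not real UMLS identifiers.
sample = [
    ("X001", "Failure, Renal"),
    ("X002", "renal failure"),
    ("X002", "Kidney failure"),
    ("X003", "Heart attack"),
]

if __name__ == "__main__":
    print(candidate_missed_synonymy(sample))
```

In this toy data, "Failure, Renal" and "renal failure" normalize to the same key but sit in different concepts, so the pair is flagged. "Kidney failure" is not matched, which shows why purely lexical methods miss synonymy that requires semantic knowledge.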