Humphreys B L, McCray A T, Cheh M L
National Library of Medicine, Bethesda, MD 20894, USA.
J Am Med Inform Assoc. 1997 Nov-Dec;4(6):484-500. doi: 10.1136/jamia.1997.0040484.
To determine the extent to which a combination of existing machine-readable health terminologies covers the concepts and terms needed for a comprehensive controlled vocabulary for health information systems, by carrying out a distributed national experiment using the Internet and the UMLS Knowledge Sources, lexical programs, and server.
Using a specially designed Web-based interface to the UMLS Knowledge Source Server, participants searched the more than 30 vocabularies in the 1996 UMLS Metathesaurus and three planned additions to determine whether concepts for which they desired controlled terminology were present or absent. For each term submitted, the interface presented a candidate exact match or a set of potential approximate matches, from which the participant selected the most closely related concept. The interface captured a profile of the terms submitted by the participant and, for each term searched, information about the concept (if any) selected by the participant. The term information was loaded into a database at NLM for review and analysis and was also available for download by the participant. A team of subject experts reviewed records to identify matches missed by participants and to correct any obvious errors in relationships. The editors of SNOMED International and the Read Codes were given a random sample of reviewed terms for which exact meaning matches were not found, so that they could identify exact matches that had been missed or any valid combinations of concepts that were synonymous with input terms. The 1997 UMLS Metathesaurus was used in the semantic type and vocabulary source analysis because it included most of the three planned additions.
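The exact-versus-approximate lookup described above can be illustrated with a minimal sketch. This is not the UMLS lexical programs or the Knowledge Source Server interface: the `normalize` and `lookup` functions, the toy `VOCAB` table, and the similarity cutoff are all hypothetical, and the normalization shown (lowercasing, dropping punctuation, sorting words) is only a simplified stand-in for whatever the actual tools did.

```python
import re
from difflib import get_close_matches


def normalize(term: str) -> str:
    """Simplified, assumed normalization: lowercase, drop punctuation, sort words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))


# Hypothetical toy vocabulary mapping normalized strings to concept names.
VOCAB = {
    normalize(name): name
    for name in ["Myocardial Infarction", "Fracture of femur", "Diabetes Mellitus"]
}


def lookup(term: str, n: int = 3, cutoff: float = 0.6):
    """Return ("exact", [concept]) or ("approximate", [candidate concepts])."""
    key = normalize(term)
    if key in VOCAB:
        return ("exact", [VOCAB[key]])
    # No exact normalized match: offer the closest normalized strings instead.
    close = get_close_matches(key, list(VOCAB), n=n, cutoff=cutoff)
    return ("approximate", [VOCAB[k] for k in close])


if __name__ == "__main__":
    print(lookup("infarction, myocardial"))  # word order and punctuation ignored
    print(lookup("femur fracture"))          # no exact match; candidates offered
```

In this sketch, a submitted term that differs from a vocabulary entry only in case, punctuation, or word order resolves as an exact match, while anything else falls back to a ranked candidate list, from which a participant would pick the most closely related concept.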
Sixty-three participants submitted a total of 41,127 terms, which represented 32,679 normalized strings. More than 80% of the terms submitted were wanted for parts of the patient record related to the patient's condition. Following review, 58% of all submitted terms had exact meaning matches in the controlled vocabularies in the test, 41% had related concepts, and 1% were not found. Of the 28% of the terms that were narrower in meaning than a concept in the controlled vocabularies, 86% shared lexical items with the broader concept but had additional modification. The percentage of exact meaning matches varied by specialty from 45% to 71%. Twenty-nine different vocabularies contained meanings for some of the 23,837 terms (a maximum of 12,707 discrete concepts) with exact meaning matches. Based on preliminary data and analysis, individual vocabularies contained < 1% to 63% of the terms and < 1% to 54% of the concepts. Only SNOMED International and the Read Codes had more than 60% of the terms and more than 50% of the concepts.
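As a quick consistency check on the counts reported above, the short sketch below recomputes two proportions from the stated totals. It assumes that the 23,837 terms with exact meaning matches are counted against all 41,127 submitted terms, which is how the 58% figure appears to be derived; the variable names are illustrative only.

```python
# Counts taken from the results reported above.
submitted_terms = 41_127
normalized_strings = 32_679
exact_match_terms = 23_837

# Share of submitted terms with an exact meaning match (reported as 58%).
exact_share = 100 * exact_match_terms / submitted_terms

# Share of submitted terms that remain distinct after normalization.
distinct_share = 100 * normalized_strings / submitted_terms

print(f"exact meaning matches: {exact_share:.1f}% of submitted terms")
print(f"distinct normalized strings: {distinct_share:.1f}% of submitted terms")
```

Under that assumption the exact-match share rounds to the reported 58%, and roughly one in five submitted terms collapses onto another term's normalized string.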
The combination of existing controlled vocabularies included in the test represents the meanings of the majority of the terminology needed to record patient conditions, providing substantially more exact matches than any individual vocabulary in the set. From a technical and organizational perspective, the test was successful and should serve as a useful model, both for distributed input to the enhancement of controlled vocabularies and for other kinds of collaborative informatics research.