Xu Rong, Musen Mark A, Shah Nigam H
Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA 94305, USA.
AMIA Annu Symp Proc. 2010 Nov 13;2010:907-11.
The Unified Medical Language System (UMLS) Metathesaurus is widely used for biomedical natural language processing (NLP) tasks. In this study, we systematically analyzed UMLS Metathesaurus terms by examining their occurrences in over 18 million MEDLINE abstracts. Our goals were to: (1) analyze the frequency and syntactic distribution of Metathesaurus terms in MEDLINE; (2) create a filtered UMLS Metathesaurus based on the MEDLINE analysis; and (3) augment the UMLS Metathesaurus so that each term is associated with metadata on its MEDLINE frequency and syntactic distribution. After MEDLINE frequency-based filtering, the augmented UMLS Metathesaurus contains 518,835 terms, roughly 13% of its original size. We show that the syntactic and frequency information is useful for identifying errors in the Metathesaurus. This filtered and augmented UMLS Metathesaurus can potentially be used to improve the efficiency and precision of UMLS-based information retrieval and NLP tasks.
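The frequency-based filtering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the matching strategy or the frequency threshold, so the case-insensitive whole-phrase matching, the `min_count` cutoff, the function names, and the toy terms and abstracts below are all assumptions made for illustration. The syntactic-distribution analysis is not covered here.

```python
import re
from collections import Counter


def count_term_frequencies(terms, abstracts):
    """Count occurrences of each term across a collection of abstracts.

    Assumption: matching is case-insensitive and respects word boundaries.
    The abstract does not state how terms were actually matched.
    """
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for text in abstracts:
        for term, pattern in patterns.items():
            # Adding 0 still creates the key, so unseen terms are recorded too.
            counts[term] += len(pattern.findall(text))
    return counts


def filter_terms(counts, min_count=1):
    """Keep terms seen at least min_count times (hypothetical threshold)."""
    return {term: n for term, n in counts.items() if n >= min_count}


# Toy data for illustration only; not taken from the UMLS or MEDLINE.
toy_terms = [
    "myocardial infarction",
    "heart attack",
    "infarction of myocardium, NOS",
]
toy_abstracts = [
    "Myocardial infarction remains a leading cause of death.",
    "Patients with a prior heart attack or myocardial infarction were enrolled.",
]

counts = count_term_frequencies(toy_terms, toy_abstracts)
kept = filter_terms(counts)

for term in toy_terms:
    status = "kept" if term in kept else "filtered out"
    print(f"{term!r}: {counts[term]} occurrence(s), {status}")
```

In this toy run, the term that never appears in the sample text is dropped, while each retained term carries its count as metadata, which loosely mirrors the filtered and augmented Metathesaurus described above. Compiling one regular expression per term would not scale to millions of terms and abstracts; a dictionary-based matcher would be a more realistic choice at that size.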