Luo Zhihui, Duffy Robert, Johnson Stephen, Weng Chunhua
Department of Biomedical Informatics, Columbia University.
Summit Transl Bioinform. 2010 Mar 1;2010:26-30.
We describe a corpus-based approach to creating a semantic lexicon using UMLS knowledge sources. We extracted 10,000 sentences from the eligibility criteria sections of clinical trial summaries contained in ClinicalTrials.gov. The UMLS Metathesaurus and SPECIALIST Lexical Tools were used to extract and normalize UMLS-recognizable terms. When annotated with Semantic Network types, the corpus had a lexical ambiguity of 1.57 (= total types for unique lexemes / total unique lexemes) and a word occurrence ambiguity of 1.96 (= total type occurrences / total word occurrences). A set of semantic preference rules was developed and applied to completely eliminate ambiguity in semantic type assignment. The lexicon covered 95.95% of the UMLS-recognizable terms in our corpus. A total of 20 UMLS semantic types, representing about 17% of all the distinct semantic types assigned to corpus lexemes, covered about 80% of the vocabulary of our corpus.
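The two ambiguity ratios defined in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexemes, their type assignments, the token sequence, the function names, and the `PREFERENCE` ordering are hypothetical and are not taken from the study's corpus or its actual preference rules. The sketch only shows how the ratios are computed and why assigning exactly one type per lexeme brings both of them to 1.

```python
# Hypothetical toy lexicon: lexeme -> set of candidate semantic types.
# The type names are real UMLS Semantic Network types, but these
# assignments are illustrative only.
LEXEME_TYPES = {
    "cold": {"Disease or Syndrome", "Natural Phenomenon or Process"},
    "pregnancy": {"Organism Function"},
    "aspirin": {"Pharmacologic Substance", "Organic Chemical"},
}

# Hypothetical token stream (word occurrences in a tiny corpus).
TOKENS = ["cold", "aspirin", "cold", "pregnancy", "cold"]

# Hypothetical preference order; the study's real rules are not given here.
PREFERENCE = [
    "Disease or Syndrome",
    "Pharmacologic Substance",
    "Organism Function",
]


def lexical_ambiguity(lexeme_types):
    """Total types over unique lexemes / total unique lexemes."""
    total_types = sum(len(types) for types in lexeme_types.values())
    return total_types / len(lexeme_types)


def occurrence_ambiguity(tokens, lexeme_types):
    """Total type occurrences / total word occurrences."""
    total_type_occurrences = sum(len(lexeme_types[tok]) for tok in tokens)
    return total_type_occurrences / len(tokens)


def resolve(types, preference):
    """Pick the first type in the preference order, else fall back alphabetically."""
    for candidate in preference:
        if candidate in types:
            return candidate
    return sorted(types)[0]


# Apply the preference order so every lexeme keeps exactly one type.
RESOLVED = {lex: {resolve(types, PREFERENCE)} for lex, types in LEXEME_TYPES.items()}

if __name__ == "__main__":
    print(lexical_ambiguity(LEXEME_TYPES))             # 5 types / 3 lexemes
    print(occurrence_ambiguity(TOKENS, LEXEME_TYPES))  # 9 type occurrences / 5 tokens
    print(lexical_ambiguity(RESOLVED))                 # one type per lexeme
```

In this toy data the frequent word "cold" carries two types, so frequent ambiguous words contribute more type occurrences per token than they add to the per-lexeme count. This is one way the occurrence-level figure can differ from the lexeme-level figure in a real corpus.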