Liu Hongfang, Wu Stephen T, Li Dingcheng, Jonnalagadda Siddhartha, Sohn Sunghwan, Wagholikar Kavishwar, Haug Peter J, Huff Stanley M, Chute Christopher G
Mayo Clinic College of Medicine, Rochester, MN, USA.
AMIA Annu Symp Proc. 2012;2012:568-76. Epub 2012 Nov 3.
A semantic lexicon that associates words and phrases in text with concepts is critical for extracting and encoding clinical information from free text, and therefore for achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Existing standard terminologies, when used directly, may offer limited coverage of concepts and of the ways those concepts are mentioned in text. In this paper, we analyze how tokens and phrases are distributed in a large corpus and how well the Unified Medical Language System (UMLS) captures their semantics. We constructed a corpus-driven semantic lexicon, MedLex, whose semantics are based on the UMLS, supplemented with variants mined from clinical text and usage information gathered from it. The detailed corpus analysis of tokens, chunks, and concept mentions shows that the UMLS is an invaluable source for natural language processing (NLP). Increasing the semantic coverage of tokens provides a good foundation for capturing clinical information comprehensively. The study also yields some insights into developing practical NLP systems.
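To make the notion of token-level coverage concrete, the following is a minimal sketch in Python and not the authors' method. It assumes a toy set of strings standing in for a UMLS-derived lexicon and a naive regular-expression tokenizer. The names `tokenize`, `coverage`, `corpus`, and `lexicon` are hypothetical and do not come from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive lowercase tokenizer (an assumption; real clinical NLP uses more careful tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def coverage(corpus, lexicon):
    """Return (type coverage, token coverage) of `lexicon` over `corpus`.

    Type coverage: fraction of distinct tokens found in the lexicon.
    Token coverage: fraction of token occurrences found in the lexicon.
    """
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0  # empty corpus: nothing to cover
    covered_types = sum(1 for tok in counts if tok in lexicon)
    covered_tokens = sum(n for tok, n in counts.items() if tok in lexicon)
    return covered_types / len(counts), covered_tokens / total_tokens


# Hypothetical two-note corpus and a toy lexicon standing in for a terminology.
corpus = ["Patient denies chest pain.", "Pt reports chest pain and sob."]
lexicon = {"patient", "chest", "pain"}

type_cov, token_cov = coverage(corpus, lexicon)
print(f"type coverage: {type_cov:.3f}, token coverage: {token_cov:.3f}")
```

In this toy example the abbreviations "pt" and "sob" are not covered, which illustrates the kind of corpus-specific variant that a standard terminology alone may miss.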