Henriksson Aron, Conway Mike, Duneld Martin, Chapman Wendy W
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden.
Division of Behavioral Medicine, Department of Family & Preventive Medicine, University of California, San Diego, USA.
AMIA Annu Symp Proc. 2013 Nov 16;2013:600-9. eCollection 2013.
Medical terminologies and ontologies are important tools for natural language processing of health record narratives. To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result in low coverage. To facilitate and expedite the process of lexical resource development, distributional analysis of large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic health records (the MIMIC-II database) can be employed to extract synonyms of SNOMED CT preferred terms. A distinctive feature of our method is its ability to identify synonymous relations between terms of varying length.
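The general idea of distributional analysis, that terms occurring in similar contexts tend to have similar meanings, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple count-based co-occurrence model with cosine similarity, and it handles terms of varying length by joining known multiword terms into single tokens before counting. The function names (`merge_terms`, `context_vectors`, `rank_candidates`), the window size, and the toy sentences are all hypothetical and are not drawn from MIMIC-II or SNOMED CT.

```python
from collections import Counter, defaultdict
from math import sqrt


def merge_terms(tokens, multiword_terms):
    """Join known multiword terms into single tokens, longest match first."""
    terms = sorted((t.split() for t in multiword_terms), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for parts in terms:
            if tokens[i:i + len(parts)] == parts:
                out.append("_".join(parts))
                i += len(parts)
                break
        else:
            # No multiword term starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


def context_vectors(sentences, window=2):
    """Count, for each term, the terms seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, term in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][sent[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(target, vectors):
    """Rank all other terms by contextual similarity to `target`."""
    target_vec = vectors.get(target, Counter())
    scores = [(cosine(target_vec, vec), term)
              for term, vec in vectors.items() if term != target]
    return [term for _, term in sorted(scores, reverse=True)]


# Invented toy sentences, for illustration only.
raw = [
    "patient had a heart attack last year",
    "patient had a myocardial infarction last year",
    "history of heart attack and diabetes",
    "history of myocardial infarction and diabetes",
    "patient denies chest pain today",
]
known_terms = ["heart attack", "myocardial infarction", "chest pain"]
sentences = [merge_terms(s.split(), known_terms) for s in raw]
vectors = context_vectors(sentences)
ranked = rank_candidates("myocardial_infarction", vectors)
print(ranked[:3])
```

In this toy corpus, `heart_attack` shares every context with `myocardial_infarction` and therefore ranks first. On a real corpus, a ranked candidate list of this kind would still need manual review, which is in keeping with the (semi-)automatic framing above.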