Wermter Joachim, Hahn Udo
Jena University Language and Information Engineering (JULIE) Lab. http://www.coling.uni-jena.de
AMIA Annu Symp Proc. 2005;2005:809-13.
The ever-increasing amount of textual information in biomedicine calls for effective automatic terminology extraction procedures that help biomedical researchers and professionals gather and organize the terminological knowledge encoded in text documents. In this study, we propose a new, linguistically grounded measure for automatically identifying multi-word terms in the biomedical literature. Our approach is based on the limited paradigmatic modifiability of terms and is tested on bigram, trigram, and quadgram noun phrases extracted from a 104-million-word corpus of Medline abstracts. Using the UMLS Metathesaurus as a gold standard, we show that our algorithm substantially outperforms the standard term identification measures and therefore qualifies as a high-performing building block for any biomedical terminology mining system.
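The abstract does not spell out the measure, so the following is only a minimal sketch of one way to read "limited paradigmatic modifiability": a candidate n-gram scores high when replacing its tokens with other words rarely yields attested phrases of the same length. The function names (`modifiability_score`, `rank_candidates`, `precision_at_k`), the product-of-ratios formula, and the toy counts are all illustrative assumptions, not the authors' published definition or data.

```python
from collections import Counter
from itertools import combinations


def modifiability_score(ngram, counts):
    """Assumed scoring scheme (not taken from the paper).

    For every non-empty subset of token slots, divide the frequency of
    `ngram` by the summed frequency of all same-length n-grams that agree
    with it outside those slots, then multiply the ratios together.
    The result lies in (0, 1]; higher means the phrase is harder to modify.
    """
    n = len(ngram)
    score = 1.0
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            fixed = [i for i in range(n) if i not in slots]
            total = sum(
                freq
                for other, freq in counts.items()
                if len(other) == n and all(other[i] == ngram[i] for i in fixed)
            )
            score *= counts[ngram] / total
    return score


def rank_candidates(counts):
    """Return candidate n-grams sorted from least to most modifiable."""
    return sorted(counts, key=lambda g: modifiability_score(g, counts), reverse=True)


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked candidates found in the gold-standard set."""
    top = ranked[:k]
    return sum(1 for g in top if g in gold) / len(top)


# Hypothetical bigram counts, invented purely for illustration.
toy_counts = Counter({
    ("blood", "pressure"): 6,
    ("high", "pressure"): 1,
    ("blood", "sample"): 1,
    ("large", "number"): 2,
    ("small", "number"): 2,
    ("large", "sample"): 2,
})
toy_gold = {("blood", "pressure")}  # stand-in for a gold-standard term list

ranked = rank_candidates(toy_counts)
print(ranked[0], precision_at_k(ranked, toy_gold, 1))
```

In this toy setting, "blood pressure" dominates the variants obtained by swapping either of its words, so it ranks above freely combinable phrases such as "large number". Checking the top of the ranking against a term list mirrors, in miniature, the kind of gold-standard evaluation the abstract describes.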