Bennett N A, He Q, Powell K, Schatz B R
CANIS (Community Architectures for Network Information Systems), Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign 61820, USA.
Proc AMIA Symp. 1999:671-5.
A natural language parser that could extract noun phrases from all medical texts would be of great utility in analyzing content for information retrieval. We discuss the extraction of noun phrases from MEDLINE using a general parser that is not tuned to any specific medical domain. The noun phrase extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. Using our program, we extracted noun phrases from the entire MEDLINE collection of 9.3 million abstracts. Over 270 million noun phrases were generated, of which 45 million were unique. We evaluated the quality of these phrases by examining all phrases extracted from a sample collection of abstracts. The precision and recall of the phrases from our general parser compared favorably with those from three other parsers we had previously evaluated. We are continuing to improve our parser and to evaluate our claim that a generic parser can effectively extract the different phrases found across the entire medical literature.
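The three-module pipeline named in the abstract can be illustrated with a minimal sketch. The abstract does not describe how each module works internally, so everything below is an assumption made for illustration: the toy lexicon, the coarse tag set, the rule that unknown words default to nouns, and the chunk pattern "zero or more adjectives followed by one or more nouns". The function names and sample text are hypothetical, and the precision and recall helper uses the standard set-based definitions rather than the authors' evaluation procedure.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (assumption); a real tagger would be trained on annotated text.
LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "of": "PREP", "in": "PREP", "with": "PREP",
    "was": "VERB", "were": "VERB", "is": "VERB", "treated": "VERB",
    "acute": "ADJ", "chronic": "ADJ", "renal": "ADJ", "severe": "ADJ",
}

# One-letter codes used for chunking; every other tag maps to "O".
TAG_CODE = {"ADJ": "A", "NOUN": "N"}


def tokenize(text):
    """Module 1: split text into word, number, and punctuation tokens."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+(?:\.\d+)?|[^\sA-Za-z\d]", text)


def tag(tokens):
    """Module 2: assign one coarse part-of-speech tag to each token."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in LEXICON:
            pos = LEXICON[low]
        elif tok[0].isdigit():
            pos = "NUM"
        elif tok[0].isalpha():
            pos = "NOUN"  # assumption: unknown words default to noun
        else:
            pos = "PUNCT"
        tagged.append((low, pos))
    return tagged


def noun_phrases(tagged):
    """Module 3: emit each maximal run matching ADJ* NOUN+ as a phrase."""
    codes = "".join(TAG_CODE.get(pos, "O") for _, pos in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"A*N+", codes)
    ]


def precision_recall(extracted, gold):
    """Standard set-based precision and recall against a reference list."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample text and reference phrases, not taken from MEDLINE.
text = ("The patient was treated with chronic renal failure. "
        "Acute renal failure was severe in the patient.")
phrases = noun_phrases(tag(tokenize(text)))
counts = Counter(phrases)
gold = ["patient", "chronic renal failure", "acute renal failure", "renal failure"]

print(len(phrases), "phrases,", len(counts), "unique")
print(precision_recall(phrases, gold))
```

Counting total versus unique phrases with a `Counter` mirrors, at toy scale, the distinction the abstract draws between the 270 million generated phrases and the 45 million unique ones.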