Kim W, Wilbur W J
National Library of Medicine, Bethesda, Maryland 20894, USA.
J Am Med Inform Assoc. 2000 Sep-Oct;7(5):499-511. doi: 10.1136/jamia.2000.0070499.
The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material.
The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database.
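To make "statistical properties of word pairs" concrete, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not name the six scoring methods, so PMI is an illustrative stand-in and not one of the authors' methods; the function name `pmi_scores`, the whitespace tokenization, and the `min_count` threshold are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    Illustrative stand-in only: the abstract does not say which
    statistics the six scoring methods use.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # Assumption: lowercase whitespace tokenization is good enough here.
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `pmi_scores` on a list of sentences returns `((word1, word2), score)` tuples, best first. Extending the same counting to word triples would follow the same pattern with three-token windows.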
The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.
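The evaluation measure can be sketched as follows: a minimal implementation of 11-point interpolated average precision for a ranked phrase list against a gold-standard set. The function name and the choice to compute recall over gold phrases that actually occur in the ranked list are assumptions of this sketch, not details taken from the paper.

```python
def eleven_point_average_precision(ranked, gold):
    """11-point interpolated average precision.

    ranked: candidate phrases ordered best first (assumed unique).
    gold:   set of accepted phrases (here standing in for the UMLS list).
    """
    # Assumption: recall is measured against gold phrases present in `ranked`.
    relevant_total = sum(1 for phrase in ranked if phrase in gold)
    if relevant_total == 0:
        return 0.0

    # Recall and precision after each rank position.
    points = []
    hits = 0
    for rank, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
        points.append((hits / relevant_total, hits / rank))

    # Interpolated precision at recall level r is the maximum precision
    # observed at any recall >= r; average over r = 0.0, 0.1, ..., 1.0.
    interpolated = []
    for i in range(11):
        level = i / 10
        candidates = [prec for rec, prec in points if rec >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```

A scoring method that places every gold phrase ahead of every non-gold phrase scores 1.0 under this measure, which matches the abstract's criterion that a good method ranks the UMLS phrases high.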
The authors find that each of six different scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable to both word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.
Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.