Sandler Ted, Schein Andrew I, Ungar Lyle H
Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104, USA.
Bioinformatics. 2006 Mar 15;22(6):651-7. doi: 10.1093/bioinformatics/bti733. Epub 2005 Oct 25.
Many entity taggers and information extraction systems rely on lists of terms for entities such as people, places, genes or chemicals. These lists have traditionally been constructed manually. We show that distributional clustering methods, which group words based on the contexts in which they appear (including neighboring words and syntactic relations extracted with a shallow parser), can aid the construction of term lists.
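To make the distributional idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: it uses only neighboring-word contexts (no shallow-parser relations), and it replaces clustering with a simpler ranking of words by cosine similarity to a seed term. The function names, the toy corpus, and the `window` and `top_k` parameters are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to counts of (relative position, neighbor word) features."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[feat] for feat, count in a.items() if feat in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_term_list(sentences, seeds, top_k=3, window=1):
    """Rank non-seed words by similarity to the summed seed context vector."""
    vectors = context_vectors(sentences, window)
    centroid = Counter()
    for seed in seeds:
        centroid.update(vectors.get(seed, Counter()))
    scored = [
        (cosine(vec, centroid), word)
        for word, vec in vectors.items()
        if word not in seeds
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_k] if score > 0]


# Toy corpus (hypothetical): gene names share the context "the _ gene".
sentences = [
    "the brca1 gene is expressed".split(),
    "the tp53 gene is expressed".split(),
    "the myc gene is mutated".split(),
    "the patient is treated".split(),
]
candidates = expand_term_list(sentences, seeds={"brca1"}, top_k=2)
```

On this toy corpus, the words that occur in the same slot as the seed rank above a word such as `patient`, which shares only the left-hand neighbor. A real system would need a much larger corpus and some weighting of context features before the rankings become useful.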
We learn term lists and use them as part of a gene tagger on a corpus of abstracts from the scientific literature. In these experiments, our automatically generated term lists significantly boost the precision of a state-of-the-art CRF-based gene tagger to a degree that is competitive with hand-curated lists, and boost recall to a degree that surpasses that of the hand-curated lists. Our results also show that these distributional clustering methods do not generate lists as helpful as those generated by supervised techniques, but that they can complement supervised techniques to obtain better performance.
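One common way a term list enters a sequence tagger is as a per-token membership feature. The sketch below illustrates that general pattern in Python; the abstract does not say how the authors encoded their lists, so the `lexicon_features` function, the `B-LEX`/`I-LEX` labels, and the `max_len` parameter are hypothetical choices made for illustration.

```python
def lexicon_features(tokens, term_list, max_len=3):
    """Return one feature dict per token, flagging term-list matches.

    Matching is case-insensitive and greedy: at each position the longest
    phrase (up to max_len tokens) found in term_list wins.
    """
    lowered = [token.lower() for token in tokens]
    flags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in term_list:
                matched = n
                break
        if matched:
            flags[i] = "B-LEX"  # first token of a matched term
            for j in range(i + 1, i + matched):
                flags[j] = "I-LEX"  # continuation of the matched term
            i += matched
        else:
            i += 1
    return [{"word": w, "in_term_list": f} for w, f in zip(lowered, flags)]


# Hypothetical term list and sentence.
term_list = {"tumor necrosis factor", "brca1"}
tokens = "Mutations in tumor necrosis factor and BRCA1".split()
features = lexicon_features(tokens, term_list)
```

Feature dicts of this kind could then be passed to a CRF toolkit together with the usual word-shape and context features, so the tagger learns how much to trust the list rather than treating a match as a hard decision.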
The code used in this paper is available from http://www.cis.upenn.edu/datamining/software_dist/autoterm/