Divita Guy, Browne Allen C, Loane Russell
National Library of Medicine, Bethesda, Maryland, USA.
AMIA Annu Symp Proc. 2006;2006:200-3.
The Lexical Systems Group at the National Library of Medicine (NLM) has developed a part-of-speech (POS) tagger, dTagger, to be freely distributed with the SPECIALIST NLP Tools. dTagger is designed specifically for use with the SPECIALIST lexicon, but it can be used with an arbitrary tag set. It supports single-word and multi-word chunking. It is trainable with previously annotated text, and a version that can be tuned with untagged text is in development. The tagger allows users to add local lexicon content, and it can report a likelihood for each sentence tagged. New words seen while tagging (the unknowns) are handled by shape identification, including heuristics based on suffix statistics gleaned during training. With supervised training, performance is reported to be 95% on a modified version of the MedPost hand-annotated Medline abstracts. Eight percent of the terms in this corpus were multi-word entities.
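The abstract notes that unknown words are handled by shape identification together with heuristics based on suffix statistics gathered during training. The following is a minimal Python sketch of that general idea, not dTagger's actual implementation: the function names, the maximum suffix length, the numeric shape check, the tag labels, and the fallback tag are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 4  # assumed longest suffix to track; not taken from dTagger


def train_suffix_stats(tagged_words):
    """Count how often each word-final suffix co-occurs with each tag."""
    stats = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(MAX_SUFFIX, len(w)) + 1):
            stats[w[-n:]][tag] += 1
    return stats


def guess_tag(word, stats, default="noun"):
    """Guess a tag for a word that was not seen in training.

    A simple shape check runs first (hypothetical rule: digit strings are
    tagged "num"). Otherwise the longest suffix seen in training wins,
    and its most frequent tag is returned.
    """
    w = word.lower()
    if w.replace(".", "", 1).isdigit():
        return "num"
    for n in range(min(MAX_SUFFIX, len(w)), 0, -1):
        counts = stats.get(w[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default  # assumed fallback when no suffix matches


# Toy training data with illustrative tag labels.
training = [
    ("tagging", "verb"), ("training", "verb"),
    ("annotation", "noun"), ("identification", "noun"),
    ("lexical", "adj"), ("medical", "adj"),
]
stats = train_suffix_stats(training)
print(guess_tag("chunking", stats))
print(guess_tag("tokenization", stats))
print(guess_tag("clinical", stats))
```

Backing off from the longest matching suffix to shorter ones is one common design choice for this kind of heuristic; a production tagger would more likely combine suffix probabilities with contextual evidence rather than pick a single most frequent tag.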