Kinoshita Shuhei, Cohen K Bretonnel, Ogren Philip V, Hunter Lawrence
Center for Computational Pharmacology, University of Colorado School of Medicine, Denver, Colorado, USA.
BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S4. doi: 10.1186/1471-2105-6-S1-S4. Epub 2005 May 24.
Our approach to Task 1A was inspired by Tanabe and Wilbur's ABGene system. Like Tanabe and Wilbur, we approached the problem as one of part-of-speech tagging, adding a GENE tag to the standard tag set. Where their system uses the Brill tagger, we used TnT, the Trigrams 'n' Tags HMM-based part-of-speech tagger. Based on careful error analysis, we implemented a set of post-processing rules to correct both false positives and false negatives. We participated in both the open and the closed divisions; for the open division, we made use of data from NCBI.
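The tagging formulation above implies a simple read-out step: once every token carries a part-of-speech tag, gene mentions are the runs of adjacent tokens tagged GENE. The sketch below illustrates only that read-out, assuming the tagger output is already available as (token, tag) pairs. The function name `gene_mentions`, the example sentence, and its tags are made up for illustration; this is not TnT's interface or the authors' code.

```python
from typing import List, Tuple


def gene_mentions(tagged: List[Tuple[str, str]], gene_tag: str = "GENE") -> List[str]:
    """Join each run of consecutive GENE-tagged tokens into one mention string."""
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == gene_tag:
            # Still inside (or starting) a gene mention.
            current.append(token)
        elif current:
            # A non-GENE tag closes the mention collected so far.
            mentions.append(" ".join(current))
            current = []
    if current:
        # Flush a mention that runs to the end of the sentence.
        mentions.append(" ".join(current))
    return mentions


# Hypothetical tagger output for one sentence (tags invented for illustration).
example = [
    ("The", "DT"), ("tumor", "GENE"), ("necrosis", "GENE"), ("factor", "GENE"),
    ("gene", "NN"), ("activates", "VBZ"), ("p53", "GENE"), (".", "."),
]
print(gene_mentions(example))
```

Post-processing rules of the kind described would then operate on these mentions, for example by extending or trimming their boundaries.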
Our base system without post-processing achieved a precision and recall of 68.0% and 77.2%, respectively, giving an F-measure of 72.3%. The full system with post-processing achieved a precision and recall of 80.3% and 80.5%, respectively, giving an F-measure of 80.4%. For the open division, adding a dictionary-based post-processing step yielded a slight further improvement (F-measure = 80.9%). We placed third in both the open and the closed divisions.
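The reported F-measures can be checked from the precision and recall figures. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly but which reproduces the reported values. The function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported, in percent.
base = f_measure(68.0, 77.2)   # base system, no post-processing
full = f_measure(80.3, 80.5)   # full system, with post-processing

print(round(base, 1), round(full, 1))
```

Only the closed-division figures can be checked this way, since the abstract gives no precision or recall for the open-division result.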
Our results show that a part-of-speech tagger augmented with post-processing rules yields an entity identification system that competes well with other approaches.