Yamamoto Kaoru, Kudo Taku, Konagaya Akihiko, Matsumoto Yuji
CREST, Japan Science and Technology Agency, Japan.
J Biomed Inform. 2004 Dec;37(6):471-82. doi: 10.1016/j.jbi.2004.08.001.
Protein name recognition aims to detect every protein name appearing in a PubMed abstract. The task is not simple, as the graphic word boundary (the space separator) assumed in conventional preprocessing does not necessarily coincide with the protein name boundary. Such boundary disagreement, caused by tokenization ambiguity, has usually been ignored in conventional preprocessing of general English. In this paper, we argue that boundary disagreement poses serious limitations in biomedical English text processing, and all the more so in protein name recognition. Our key idea for dealing with the boundary disagreement is to apply techniques used in Japanese morphological analysis, where text has no explicit word boundaries. Evaluating the proposed method on the GENIA corpus 3.02, we obtain an F-measure of 69.01 under a strict criterion and 79.32 under a relaxed criterion. The result is comparable to other published work in protein name recognition, without resorting to manually prepared ad hoc feature engineering. Further, compared with conventional preprocessing, the use of morphological analysis as preprocessing improves the performance of protein name recognition and reduces the execution time.
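The boundary disagreement described above can be illustrated with a small sketch. It is not the authors' method: the example sentence, the list of names, and the helper functions `whitespace_token_spans`, `aligns_with_tokens`, and `alignment_report` are all hypothetical and chosen for illustration only. The sketch checks whether each protein name starts and ends exactly on a space-delimited token boundary.

```python
import re


def whitespace_token_spans(text):
    """Return (start, end) character offsets of space-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def aligns_with_tokens(span, token_spans):
    """True if the span starts at a token start and ends at a token end."""
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    return span[0] in starts and span[1] in ends


def alignment_report(text, names):
    """For each name, report its character span and whether it is token-aligned."""
    token_spans = whitespace_token_spans(text)
    report = []
    for name in names:
        start = text.index(name)  # first occurrence; enough for this toy case
        span = (start, start + len(name))
        report.append((name, span, aligns_with_tokens(span, token_spans)))
    return report


# Hypothetical sentence and protein names, not taken from the GENIA corpus.
text = "anti-CD28 antibody blocks IL-2-induced STAT5 signaling"
names = ["CD28", "IL-2", "STAT5"]

report = alignment_report(text, names)
for name, span, aligned in report:
    print(name, span, "aligned" if aligned else "NOT aligned")
```

In this toy sentence only "STAT5" coincides with a whitespace token. "CD28" and "IL-2" are embedded in the larger tokens "anti-CD28" and "IL-2-induced", so a tagger that labels whole space-delimited tokens cannot mark them exactly.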