Collier Nigel, Takeuchi Koichi
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.
J Biomed Inform. 2004 Dec;37(6):423-35. doi: 10.1016/j.jbi.2004.08.008.
The immense volume of data now available from experiments in molecular biology has led to an explosion in reported results, most of which are available only as unstructured text. For this reason there has been great interest in text mining to aid fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular, the named entity (NE) task has been intensively investigated as a core technology for all of these tasks, driven by the availability of high-volume training sets such as the GENIA v3.02 corpus. Despite such large training sets, accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain, where F scores above 90 are commonly reported and can be considered close to human performance. We argue that a more rigorous analysis of the factors that contribute to model performance is crucial for discovering where the underlying limitations lie and what our future research direction should be. This paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and compares how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines, using a high-quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best performing features are orthographic features, with an F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination, these two feature types appear to interfere with each other and degrade performance slightly, to an F score of 72.3.
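To make "character-level orthographic features" concrete, the following is a minimal Python sketch of how a token might be mapped to a single orthographic class before being passed to a classifier. The class names, their ordering, and the Greek-letter list are illustrative assumptions; they are not the feature set used in the paper.

```python
import re

# Hypothetical orthographic classes (assumed for illustration only).
# Patterns are checked in order and the first match wins.
ORTHO_PATTERNS = [
    ("DIGITS", re.compile(r"^\d+$")),
    ("GREEK", re.compile(r"^(alpha|beta|gamma|delta|kappa)$", re.IGNORECASE)),
    ("ALL_CAPS", re.compile(r"^[A-Z]+$")),
    ("CAPS_AND_DIGITS", re.compile(r"^(?=.*[A-Z])(?=.*\d)[A-Z\d]+$")),
    ("HAS_HYPHEN", re.compile(r"^.*-.*$")),
    ("INIT_CAP", re.compile(r"^[A-Z][a-z]+$")),
    ("LOWERCASE", re.compile(r"^[a-z]+$")),
]


def ortho_class(token: str) -> str:
    """Return the name of the first orthographic class the token matches."""
    for name, pattern in ORTHO_PATTERNS:
        if pattern.match(token):
            return name
    return "OTHER"


def ortho_features(tokens: list[str]) -> list[str]:
    """Map each token in a sentence to its orthographic class."""
    return [ortho_class(tok) for tok in tokens]


if __name__ == "__main__":
    sentence = ["IL-2", "gene", "expression", "in", "CD28", "T", "cells"]
    for tok, cls in zip(sentence, ortho_features(sentence)):
        print(f"{tok}\t{cls}")
```

In a setup like the one the abstract describes, a class label of this kind would be one input feature per token, alongside or instead of a POS tag; how the features are encoded for the support vector machine is not specified here.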