University of Aveiro, IEETA/DETI, Campus Universitário de Santiago, Aveiro, Portugal.
Bioinformatics. 2012 May 1;28(9):1253-61. doi: 10.1093/bioinformatics/bts125. Epub 2012 Mar 13.
Named entity recognition (NER) is a fundamental task in biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of available annotated corpora, terminological resources and machine-learning techniques. Currently, the best-performing solutions combine the outputs of selected annotation solutions and are measured against a single corpus. However, little effort has been spent on systematically analysing methods that harmonize the annotation results and on measuring them against a combination of Gold Standard Corpora (GSCs).
We present Totum, a machine-learning solution that harmonizes gene/protein annotations provided by heterogeneous NER solutions. It has been optimized and measured against a combination of manually curated GSCs. Our experiments show that the approach improves the F-measure of state-of-the-art solutions by up to 10% (achieving ≈70%) in exact alignment and by up to 22% (achieving ≈82%) in nested alignment. We demonstrate that our solution delivers reliable annotation results across the GSCs and that it is an important contribution towards a homogeneous annotation of MEDLINE abstracts.
Totum is implemented in Java, and its resources are available at http://bioinformatics.ua.pt/totum.
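To make the two evaluation settings concrete, the following minimal sketch computes an F-measure over annotation spans under exact and nested alignment. It is an illustration only and is not Totum's code: the function names, the representation of spans as `(start, end)` character offsets, and the definition of nested alignment (one span fully contains the other) are assumptions made for this example, and the abstract does not specify how the authors define or count matches.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def exact_match(pred: Span, gold: Span) -> bool:
    """Exact alignment: both boundaries must coincide."""
    return pred == gold


def nested_match(pred: Span, gold: Span) -> bool:
    """Nested alignment (assumed definition): one span fully contains the other."""
    pred_in_gold = gold[0] <= pred[0] and pred[1] <= gold[1]
    gold_in_pred = pred[0] <= gold[0] and gold[1] <= pred[1]
    return pred_in_gold or gold_in_pred


def f_measure(predicted: List[Span], gold: List[Span],
              match: Callable[[Span, Span], bool]) -> float:
    """Balanced F-measure (F1) of predicted spans against gold spans."""
    # Precision: fraction of predicted spans that match some gold span.
    hits = sum(1 for p in predicted if any(match(p, g) for g in gold))
    # Recall: fraction of gold spans matched by some predicted span.
    found = sum(1 for g in gold if any(match(p, g) for p in predicted))
    precision = hits / len(predicted) if predicted else 0.0
    recall = found / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans for illustration; they are not taken from any corpus.
gold_spans = [(0, 5), (10, 20)]
predicted_spans = [(0, 5), (12, 20), (30, 35)]

f_exact = f_measure(predicted_spans, gold_spans, exact_match)
f_nested = f_measure(predicted_spans, gold_spans, nested_match)
```

In this toy case the prediction `(12, 20)` lies inside the gold span `(10, 20)`, so it counts as a match only under nested alignment. This is why nested scores are at least as high as exact scores for the same output, consistent with the higher nested figure reported in the abstract.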