Furrer Lenz, Jancso Anna, Colic Nicola, Rinaldi Fabio
Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland.
Fondazione Bruno Kessler, Via Sommarive, 18, 38123, Trento, Italy.
J Cheminform. 2019 Jan 21;11(1):7. doi: 10.1186/s13321-018-0326-3.
We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step.
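The look-up and normalization step described above can be illustrated with a minimal sketch. This is not the OGER++ implementation: the tokenization rule (splitting at letter/digit boundaries), the lowercasing, the function names, and the `HYPOTHETICAL:0001` identifier are all assumptions chosen for illustration. The neural postfilter is not shown.

```python
import re

# Assumed normalization: split into runs of letters or digits and lowercase,
# so that "IL-2", "IL2" and "il 2" all map to the same key ("il", "2").
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+")


def tokenize(text):
    """Return (normalized_token, start, end) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]


def build_index(terminology):
    """Map each normalized token sequence to the set of concept IDs it names."""
    index = {}
    for term, concept_id in terminology:
        key = tuple(tok for tok, _, _ in tokenize(term))
        index.setdefault(key, set()).add(concept_id)
    return index


def annotate(text, index, max_len=5):
    """Look up every token n-gram (up to max_len tokens) in the index."""
    tokens = tokenize(text)
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            key = tuple(tok for tok, _, _ in tokens[i:j])
            for concept_id in sorted(index.get(key, ())):
                start, end = tokens[i][1], tokens[j - 1][2]
                hits.append((start, end, text[start:end], concept_id))
    return hits


# Hypothetical terminology entries, for illustration only.
terminology = [
    ("IL-2", "HYPOTHETICAL:0001"),
    ("interleukin 2", "HYPOTHETICAL:0001"),
]
index = build_index(terminology)
hits = annotate("Il2 and Interleukin-2 levels", index)
for hit in hits:
    print(hit)
```

A hash-based look-up of this kind keeps annotation time roughly linear in the text length for a fixed `max_len`. Every candidate it produces would still need to pass through a disambiguation filter before being accepted.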
We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively.
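For readers unfamiliar with the metric, the following sketch shows how precision, recall, and F1 are commonly computed from sets of annotations. It assumes exact matching of `(start, end, concept_id)` tuples; the matching criteria actually used in the CRAFT evaluation are not stated in this abstract, and the example data are invented.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations: (start, end, concept_id).
gold = [(0, 3, "A"), (8, 21, "A"), (30, 35, "B")]
predicted = [(0, 3, "A"), (8, 21, "C")]

# Concept-level comparison: span and identifier must both match.
concept_scores = precision_recall_f1(gold, predicted)

# Span-level comparison: drop the identifier before comparing.
span_scores = precision_recall_f1(
    [(s, e) for s, e, _ in gold],
    [(s, e) for s, e, _ in predicted],
)
```

Under these assumptions, a span found with the wrong identifier counts as correct at the span level but as an error at the concept level, which is one reason concept recognition scores tend to be lower than entity recognition scores on the same output.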
Combining knowledge-based and data-driven components makes it possible to build a system with competitive performance in biomedical text mining.