Basaldella Marco, Furrer Lenz, Tasso Carlo, Rinaldi Fabio
Università degli Studi di Udine, Via delle Scienze 208, Udine, 33100, Italy.
University of Zurich, Institute of Computational Linguistics and Swiss Institute of Bioinformatics, Andreasstrasse 15, Zürich, CH-8050, Switzerland.
J Biomed Semantics. 2017 Nov 9;8(1):51. doi: 10.1186/s13326-017-0157-6.
This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles.
The approach uses a two-stage pipeline that combines a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which is biased towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine-learning algorithm that selects only the relevant entities. For this step, we compare two supervised machine-learning algorithms: conditional random fields and neural networks.
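The two-stage idea can be illustrated with a minimal sketch. This is not the OGER or Distiller API: the dictionary, the `Candidate` class, the feature names, and the hand-set weights are all hypothetical, and a fixed linear scorer stands in for the trained CRF or neural-network classifier. It only shows how a recall-oriented dictionary lookup can feed a precision-oriented filter.

```python
from dataclasses import dataclass

# Hypothetical toy dictionary; stands in for terms drawn from domain ontologies.
TOY_DICTIONARY = {
    "insulin": "protein",
    "lead": "chemical",
    "cell": "cell_type",
}

# Hypothetical hand-set weights; a real system would learn these from
# annotated data (e.g. with a CRF or a neural network).
WEIGHTS = {"dict_hit": 1.0, "prev_is_modal": -2.0}
MODALS = {"may", "can", "will", "might", "could"}


@dataclass
class Candidate:
    token: str
    position: int
    entity_type: str


def generate_candidates(tokens):
    """Stage 1: recall-oriented, case-insensitive dictionary lookup."""
    return [
        Candidate(tok, i, TOY_DICTIONARY[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in TOY_DICTIONARY
    ]


def extract_features(cand, tokens):
    """Turn a candidate into features; the dictionary hit is one of them."""
    prev_tok = tokens[cand.position - 1].lower() if cand.position > 0 else ""
    return {
        "dict_hit": 1.0,
        "prev_is_modal": 1.0 if prev_tok in MODALS else 0.0,
    }


def filter_candidates(candidates, tokens, threshold=0.0):
    """Stage 2: keep only candidates whose score exceeds the threshold."""
    kept = []
    for cand in candidates:
        feats = extract_features(cand, tokens)
        score = sum(WEIGHTS[name] * value for name, value in feats.items())
        if score > threshold:
            kept.append(cand)
    return kept


tokens = "Insulin may lead to changes in cell growth".split()
candidates = generate_candidates(tokens)
relevant = filter_candidates(candidates, tokens)
print([c.token for c in candidates])  # all dictionary matches
print([c.token for c in relevant])    # matches that survive filtering
```

In this toy sentence the dictionary stage also matches the verb "lead" as a chemical, and the filter discards it because it follows a modal verb. That mirrors the division of labour described above: the first stage over-generates, and the second stage removes false positives.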
In an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task.
These results are, to our knowledge, the best reported so far for this particular task.
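For readers who want a single figure per task, the reported precision and recall can be combined into a balanced F-score, the harmonic mean of the two. The sketch below is an illustration only: the function name `f1_score` is an assumption, and because the inputs are the rounded percentages quoted above, the values it produces are approximate and may differ slightly from F-scores computed on unrounded results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the abstract.
ner_f1 = f1_score(0.86, 0.60)  # named entity recognition
cr_f1 = f1_score(0.51, 0.49)   # concept recognition

print(f"NER F1 = {ner_f1:.3f}")  # prints "NER F1 = 0.707"
print(f"CR  F1 = {cr_f1:.3f}")   # prints "CR  F1 = 0.500"
```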