Entity recognition in the biomedical domain using a hybrid approach.

Author information

Basaldella Marco, Furrer Lenz, Tasso Carlo, Rinaldi Fabio

Affiliations

Università degli Studi di Udine, Via delle Scienze 208, Udine, 33100, Italy.

University of Zurich, Institute of Computational Linguistics and Swiss Institute of Bioinformatics, Andreasstrasse 15, Zürich, CH-8050, Switzerland.

Publication information

J Biomed Semantics. 2017 Nov 9;8(1):51. doi: 10.1186/s13326-017-0157-6.

Abstract

BACKGROUND

This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles.

METHOD

The approach uses a two-stage pipeline that combines a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which is biased towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine-learning algorithm that selects only the relevant entities. For this step, we compare two supervised machine-learning algorithms: Conditional Random Fields and Neural Networks.
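The two-stage idea can be illustrated with a minimal sketch. It does not use OGER or Distiller, and every name in it (`TOY_DICTIONARY`, `generate_candidates`, `featurize`, `train_perceptron`, `recognize`, the training sentences) is hypothetical. A simple perceptron with a single hand-picked context feature stands in for the Conditional Random Fields and Neural Networks compared in the article, purely to show how a recall-oriented dictionary lookup can be followed by a learned filter.

```python
import re

# Hypothetical toy dictionary standing in for terms taken from domain ontologies.
TOY_DICTIONARY = {"cat": "organism", "lead": "chemical", "insulin": "protein"}


def generate_candidates(text, dictionary):
    """Stage 1: recall-oriented, case-insensitive whole-word dictionary lookup."""
    hits = []
    for term, entity_type in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), entity_type))
    return sorted(hits)  # order candidates by position in the text


def featurize(text, candidate):
    """Toy feature vector: [bias, next word is 'to'] (e.g. the verb in 'lead to')."""
    end = candidate[1]
    nxt = re.match(r"\s+(\w+)", text[end:])
    next_is_to = nxt is not None and nxt.group(1).lower() == "to"
    return [1.0, 1.0 if next_is_to else 0.0]


def score(weights, x):
    """Linear score of a feature vector under the current weights."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(examples, epochs=10):
    """Stage 2 training: a perceptron as a stand-in for the CRF / neural network."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for text, gold in examples:
            for cand in generate_candidates(text, TOY_DICTIONARY):
                x = featurize(text, cand)
                label = 1 if cand[2].lower() in gold else 0
                pred = 1 if score(weights, x) > 0 else 0
                # Standard perceptron update; no change when the prediction is right.
                weights = [w + (label - pred) * xi for w, xi in zip(weights, x)]
    return weights


def recognize(text, weights):
    """Full pipeline: generate candidates, then keep those the filter accepts."""
    return [c for c in generate_candidates(text, TOY_DICTIONARY)
            if score(weights, featurize(text, c)) > 0]


# Invented training sentences with the lowercased surface forms that are true entities.
TRAINING = [
    ("Insulin regulates glucose.", {"insulin"}),
    ("This may lead to cancer.", set()),
    ("Lead poisoning affects the cat.", {"lead", "cat"}),
]
WEIGHTS = train_perceptron(TRAINING)

SENTENCE = "These findings lead to the idea that insulin acts in the cat."
print([c[2] for c in generate_candidates(SENTENCE, TOY_DICTIONARY)])
print([c[2] for c in recognize(SENTENCE, WEIGHTS)])
```

In this toy setting the dictionary stage proposes the verb "lead" as a chemical, and the filter learns to reject it from context. A realistic system would use far richer features and a sequence model.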

RESULTS

In an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task.
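The abstract reports precision and recall but no F1 score. As a reading aid only, the harmonic mean of the two can be computed from the rounded figures above. The helper name `f1_score` is an assumption of this sketch, and the values it yields are derived approximations, not numbers reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures taken from the abstract.
ner_f1 = f1_score(0.86, 0.60)  # named entity recognition
cr_f1 = f1_score(0.51, 0.49)   # concept recognition

print(f"NER F1 (derived): {ner_f1:.3f}")
print(f"CR  F1 (derived): {cr_f1:.3f}")
```

Because the inputs are rounded to whole percentages, the derived values are only accurate to about one percentage point.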

CONCLUSION

These results are, to our knowledge, the best reported so far for this particular task.
