CRFVoter: gene and protein related object recognition using a conglomerate of CRF-based tools.

Author information

Hemati Wahed, Mehler Alexander

Affiliation

Text Technology Lab, Goethe-University Frankfurt, Robert-Mayer-Straße 10, 60325, Frankfurt am Main, Germany.

Publication information

J Cheminform. 2019 Mar 14;11(1):21. doi: 10.1186/s13321-019-0343-x.

Abstract

BACKGROUND

Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present a series of sequence labeling systems that we used and adapted in our experiments for solving this task. Our experiments show how to optimize the hyperparameters of the classifiers involved. To this end, we utilize various algorithms for hyperparameter optimization. Finally, we present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier.
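To make the reformulation as sequence labeling concrete, the following is a minimal sketch of how token-level entity annotations can be turned into one tag per token. The BIO tagging scheme, the `GPRO` label name, the example sentence, and the helper `spans_to_bio` are all illustrative assumptions; the abstract does not state which encoding or tokenization the authors used.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str],
                 entities: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans into one BIO tag per token.

    Each entity is (start_token, end_token_exclusive, label).
    The BIO scheme is an assumption made for this illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return tags


# Hypothetical sentence with two gene/protein mentions.
tokens = ["The", "human", "insulin", "receptor", "binds", "insulin", "."]
entities = [(1, 4, "GPRO"), (5, 6, "GPRO")]

bio_tags = spans_to_bio(tokens, entities)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Once the data is in this form, any sequence labeler that predicts one tag per token can be trained on it, which is what allows several such labelers to be compared and combined.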

RESULTS

We analyze the impact of hyperparameter optimization on named entity recognition in biomedical research and show that this optimization results in a performance increase of up to 60%. In our evaluation, our ensemble classifier based on multiple sequence labelers, called CRFVoter, outperforms each individual extractor. On the blinded test set provided by the BioCreative organizers, CRFVoter achieves an F-score of 75%, a recall of 71% and a precision of 80%. In the GPRO type 1 evaluation, CRFVoter achieves an F-score of 73%, a recall of 70% and the best precision (77%) among all task participants.
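The reported scores can be cross-checked with a short calculation. This sketch assumes the F-score is the balanced F1 measure, the harmonic mean of precision and recall; the abstract does not name the variant, and the inputs are the rounded percentages given above, so only agreement to the nearest percent can be expected.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the abstract.
overall = f_score(precision=0.80, recall=0.71)
gpro_type1 = f_score(precision=0.77, recall=0.70)

print(f"Blinded test set: {overall:.1%}")
print(f"GPRO type 1:      {gpro_type1:.1%}")
```

Under this assumption, both values round to the F-scores stated in the abstract (75% and 73%).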

CONCLUSION

CRFVoter is effective when multiple sequence labeling systems are to be used, and it performs better than the individual systems it combines.
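One way to picture a two-stage combination of this kind is to feed the tags predicted by the first-stage labelers into a second-stage model as per-token features. The sketch below only builds those features. The labeler names, the feature names, and the choice to include the previous tag are hypothetical, and this is not the authors' actual feature set or implementation.

```python
from typing import Dict, List


def stack_features(tokens: List[str],
                   base_predictions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Build one feature dict per token from first-stage tag predictions.

    base_predictions maps a (hypothetical) labeler name to its tag sequence.
    """
    for name, tags in base_predictions.items():
        if len(tags) != len(tokens):
            raise ValueError(f"{name} returned {len(tags)} tags for {len(tokens)} tokens")

    features = []
    for i, token in enumerate(tokens):
        feats = {"token": token.lower()}
        for name, tags in base_predictions.items():
            feats[f"{name}_tag"] = tags[i]
            # Previous tag gives the second stage some label context.
            feats[f"{name}_prev_tag"] = tags[i - 1] if i > 0 else "BOS"
        features.append(feats)
    return features


# Hypothetical outputs of two first-stage labelers that disagree on one token.
tokens = ["The", "insulin", "receptor", "binds"]
base_predictions = {
    "labeler_a": ["O", "B-GPRO", "I-GPRO", "O"],
    "labeler_b": ["O", "B-GPRO", "O", "O"],
}

stacked = stack_features(tokens, base_predictions)
print(stacked[2])
```

Feature dictionaries like these could then be used to train a second-stage sequence model on gold tags, so that it learns which first-stage labeler to trust in which context.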

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bf65/6419804/e2a78a3a3dd4/13321_2019_343_Fig1_HTML.jpg
