Hemati Wahed, Mehler Alexander
Text Technology Lab, Goethe-University Frankfurt, Robert-Mayer-Straße 10, 60325, Frankfurt am Main, Germany.
J Cheminform. 2019 Mar 14;11(1):21. doi: 10.1186/s13321-019-0343-x.
Gene and protein related objects are an important class of entities in biomedical research, and their identification and extraction from scientific articles are attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present a series of sequence labeling systems that we used and adapted in our experiments for solving this task. Our experiments show how to optimize the hyperparameters of the classifiers involved, using various algorithms for hyperparameter optimization. Finally, we present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier.
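As a rough illustration of the two-stage idea, the sketch below turns the per-token labels produced by several first-stage sequence labelers into feature dictionaries that a second-stage CRF could be trained on. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the feature set, so the function name `stage_two_features`, the labeler names, the `GPRO` tag, and the BIO tagging scheme are all hypothetical.

```python
from typing import Dict, List


def stage_two_features(
    tokens: List[str],
    first_stage_labels: Dict[str, List[str]],
) -> List[Dict[str, str]]:
    """Build one feature dict per token from first-stage label sequences.

    Assumption: the second-stage CRF sees each labeler's predicted tag for
    the current token, plus the lowercased token itself.
    """
    for name, labels in first_stage_labels.items():
        if len(labels) != len(tokens):
            raise ValueError(f"{name} produced {len(labels)} labels "
                             f"for {len(tokens)} tokens")

    features = []
    for i, token in enumerate(tokens):
        feats = {"token.lower": token.lower()}
        # One feature per first-stage labeler: its vote for this token.
        for name, labels in first_stage_labels.items():
            feats[f"{name}.label"] = labels[i]
        features.append(feats)
    return features


# Hypothetical example sentence and first-stage outputs (BIO scheme assumed).
tokens = ["The", "BRCA1", "gene"]
first_stage = {
    "labeler_a": ["O", "B-GPRO", "O"],
    "labeler_b": ["O", "B-GPRO", "I-GPRO"],
}
feats = stage_two_features(tokens, first_stage)
```

Feature dictionaries of this shape are what common CRF toolkits accept as input, so the second stage can learn which labeler to trust in which context instead of using a fixed majority vote.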
We analyze the impact of hyperparameter optimization on named entity recognition in biomedical research and show that this optimization results in a performance increase of up to 60%. In our evaluation, our ensemble classifier based on multiple sequence labelers, called CRFVoter, outperforms each individual extractor. For the blinded test set provided by the BioCreative organizers, CRFVoter achieves an F-score of 75%, a recall of 71% and a precision of 80%. For the GPRO type 1 evaluation, CRFVoter achieves an F-score of 73%, a recall of 70% and the best precision (77%) among all task participants.
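The reported scores can be cross-checked with a short calculation. The sketch below assumes that "F-score" means the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state this explicitly, and the function name `f_score` is chosen here for illustration only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
blinded = f_score(precision=0.80, recall=0.71)  # blinded test set
gpro1 = f_score(precision=0.77, recall=0.70)    # GPRO type 1 evaluation

print(f"blinded test set: {blinded:.3f}")
print(f"GPRO type 1:      {gpro1:.3f}")
```

Under this assumption, both values round to the F-scores given in the abstract (75% and 73%).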
CRFVoter is effective when multiple sequence labeling systems are to be used, and it performs better than the individual systems it combines.