Akhondi Saber A, Pons Ewoud, Afzal Zubair, van Haagen Herman, Becker Benedikt F H, Hettne Kristina M, van Mulligen Erik M, Kors Jan A
Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam.
Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands.
Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks with an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose, the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system alone were small. Database URL: http://biosemantics.org/chemdner-patents.
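As an illustration of how dictionary-based and statistical mentions might be merged and scored, the following is a minimal sketch only. The abstract does not state the combination rule, so `combine_union` (keep every statistical span and add dictionary spans that do not overlap one) is an assumption, as are all function names, the span representation and the toy offsets. It does not use Peregrine or tmChem.

```python
from typing import Set, Tuple

# A mention is a (start, end) pair of character offsets, end exclusive.
Span = Tuple[int, int]


def combine_union(dictionary_spans: Set[Span], statistical_spans: Set[Span]) -> Set[Span]:
    """Hypothetical ensemble rule (an assumption, not the authors' method):
    keep all statistical spans, then add dictionary spans that overlap none of them."""
    combined = set(statistical_spans)
    for d_start, d_end in dictionary_spans:
        overlaps = any(d_start < s_end and s_start < d_end
                       for s_start, s_end in statistical_spans)
        if not overlaps:
            combined.add((d_start, d_end))
    return combined


def precision_recall_f(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-score over span sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


# Toy data, invented for illustration only.
gold = {(0, 7), (12, 19), (25, 31)}
dictionary_spans = {(0, 7), (13, 19), (40, 45)}
statistical_spans = {(12, 19), (25, 31)}

combined = combine_union(dictionary_spans, statistical_spans)
p, r, f = precision_recall_f(combined, gold)
print(sorted(combined), round(p, 4), round(r, 4), round(f, 4))
```

On this toy input the dictionary contributes one correct mention the statistical spans miss and one false positive, so recall rises while precision drops. That trade-off is one way an ensemble can end up only slightly different from the statistical system alone.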