Habibi Maryam, Wiegandt David Luis, Schmedding Florian, Leser Ulf
Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, 12489 Berlin, Germany.
Averbis GmbH, 79106 Freiburg, Germany.
J Cheminform. 2016 Oct 28;8:59. doi: 10.1186/s13321-016-0172-0. eCollection 2016.
Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is especially important given the high economic value of pharmaceutical findings. However, NER on patents has long been neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings because the training and test data are relatively homogeneous. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts, and they clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.
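The abstract does not say how the high-recall and high-precision ensembles are built. A common construction, assumed here purely for illustration, takes the union of the two tools' predicted mentions to favor recall and their intersection to favor precision. The minimal Python sketch below uses hypothetical toy spans (character offsets) and exact-match scoring; the function names and data are illustrative and are not taken from tmChem, ChemSpot, or the paper.

```python
from typing import Set, Tuple

# A mention is represented by its (start, end) character offsets (assumption).
Span = Tuple[int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def high_recall_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep a mention if either tool predicts it (set union)."""
    return a | b


def high_precision_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep a mention only if both tools predict it (set intersection)."""
    return a & b


# Hypothetical toy data: gold annotations and the output of two tools.
gold = {(0, 7), (12, 20), (25, 31), (40, 48)}
tool_a = {(0, 7), (12, 20), (33, 38)}
tool_b = {(0, 7), (25, 31), (50, 55)}

union = high_recall_ensemble(tool_a, tool_b)
intersection = high_precision_ensemble(tool_a, tool_b)

for name, spans in [("tool A", tool_a), ("tool B", tool_b),
                    ("union", union), ("intersection", intersection)]:
    p, r = precision_recall(spans, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

On this toy data the union reaches higher recall than either tool alone and the intersection reaches higher precision, which is the trade-off such ensembles are meant to expose. Real evaluations would also need to decide how to treat partially overlapping spans, which this sketch ignores.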