

HUNER: improving biomedical NER with pretraining.

Affiliations

Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Seismology Section, Helmholtzzentrum Potsdam, Deutsches GeoForschungsZentrum GFZ, Potsdam 14473, Germany.

Publication information

Bioinformatics. 2020 Jan 1;36(1):295-302. doi: 10.1093/bioinformatics/btz528.

DOI: 10.1093/bioinformatics/btz528
PMID: 31243432
Abstract

MOTIVATION

Several recent studies have shown that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of the improvements crucially depend on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora.

RESULTS

We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes.
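The comparisons above (the ∼2 pp average gain and the 5-13 pp margin over GNormPlus and tmChem) are stated in terms of entity-level F1, where a predicted entity counts as correct only if its span and type both match the gold annotation exactly. A minimal sketch of that metric follows; the function name and toy spans are illustrative, not taken from the paper:

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall and F1 for one document.

    gold, pred: sets of (start, end, entity_type) tuples; an exact
    match of span boundaries and type counts as a true positive.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                         # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with three gold entities; two are recovered, one is spurious.
gold = {(0, 4, "Gene"), (10, 17, "Chemical"), (20, 25, "Species")}
pred = {(0, 4, "Gene"), (10, 17, "Chemical"), (30, 33, "Gene")}
p, r, f = entity_f1(gold, pred)
# p = r = f = 2/3
```

Because both a missed entity (lower recall) and a spurious one (lower precision) pull F1 down, the metric makes the supervised vs. semi-supervised precision/recall trade-off mentioned above directly visible.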

AVAILABILITY AND IMPLEMENTATION

HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


Similar articles

1. HUNER: improving biomedical NER with pretraining.
   Bioinformatics. 2020 Jan 1;36(1):295-302. doi: 10.1093/bioinformatics/btz528.
2. Transfer learning for biomedical named entity recognition with neural networks.
   Bioinformatics. 2018 Dec 1;34(23):4087-4094. doi: 10.1093/bioinformatics/bty449.
3. Deep learning with word embeddings improves biomedical named entity recognition.
   Bioinformatics. 2017 Jul 15;33(14):i37-i48. doi: 10.1093/bioinformatics/btx228.
4. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition.
   Bioinformatics. 2021 Sep 9;37(17):2792-2794. doi: 10.1093/bioinformatics/btab042.
5. Cross-type biomedical named entity recognition with deep multi-task learning.
   Bioinformatics. 2019 May 15;35(10):1745-1752. doi: 10.1093/bioinformatics/bty869.
6. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information.
   Bioinformatics. 2018 Oct 15;34(20):3539-3546. doi: 10.1093/bioinformatics/bty356.
7. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text.
   Bioinformatics. 2018 May 1;34(9):1547-1554. doi: 10.1093/bioinformatics/btx815.
8. Dataset-aware multi-task learning approaches for biomedical named entity recognition.
   Bioinformatics. 2020 Aug 1;36(15):4331-4338. doi: 10.1093/bioinformatics/btaa515.
9. Biomedical named entity recognition using deep neural networks with contextual information.
   BMC Bioinformatics. 2019 Dec 27;20(1):735. doi: 10.1186/s12859-019-3321-4.
10. DTranNER: biomedical named entity recognition with deep learning-based label-label transition model.
    BMC Bioinformatics. 2020 Feb 11;21(1):53. doi: 10.1186/s12859-020-3393-1.

Cited by

1. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.
   JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.
2. Advancing entity recognition in biomedicine via instruction tuning of large language models.
   Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae163.
3. A Review on Electronic Health Record Text-Mining for Biomedical Name Entity Recognition in Healthcare Domain.
   Healthcare (Basel). 2023 Apr 28;11(9):1268. doi: 10.3390/healthcare11091268.
4. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition.
   Int J Mol Sci. 2022 Nov 29;23(23):14934. doi: 10.3390/ijms232314934.
5. We are not ready yet: limitations of state-of-the-art disease named entity recognizers.
   J Biomed Semantics. 2022 Oct 27;13(1):26. doi: 10.1186/s13326-022-00280-6.
6. Assigning species information to corresponding genes by a sequence labeling framework.
   Database (Oxford). 2022 Oct 13;2022. doi: 10.1093/database/baac090.
7. The Construction Model of the TCM Clinical Knowledge Coding Database Based on Knowledge Organization.
   Biomed Res Int. 2022 Jan 17;2022:2503779. doi: 10.1155/2022/2503779. eCollection 2022.
8. OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition.
   Genomics Inform. 2021 Sep;19(3):e27. doi: 10.5808/gi.21015. Epub 2021 Sep 30.
9. Reconstruction of the Cytokine Signaling in Lysosomal Storage Diseases by Literature Mining and Network Analysis.
   Front Cell Dev Biol. 2021 Aug 20;9:703489. doi: 10.3389/fcell.2021.703489. eCollection 2021.
10. A pre-training and self-training approach for biomedical named entity recognition.
    PLoS One. 2021 Feb 9;16(2):e0246310. doi: 10.1371/journal.pone.0246310. eCollection 2021.