HUNER: improving biomedical NER with pretraining.

Affiliations

Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Seismology Section, Helmholtzzentrum Potsdam, Deutsches GeoForschungsZentrum GFZ, Potsdam 14473, Germany.

Publication information

Bioinformatics. 2020 Jan 1;36(1):295-302. doi: 10.1093/bioinformatics/btz528.

Abstract

MOTIVATION

Several recent studies have shown that the application of deep neural networks advanced the state of the art in named entity recognition (NER), including biomedical NER. However, both the impact on performance and the robustness of the improvements crucially depend on the availability of sufficiently large training corpora, which is a problem in the biomedical domain, where gold standard corpora are often rather small.

RESULTS

We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focused on a particular target corpus. Experiments were performed on 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 percentage points (pp) compared to learning without pretraining. We experimented with both supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER, which incorporates fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes.
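
The recipe described here is two-phase: pretrain a sequence tagger (supervised, on auxiliary annotated corpora, or semi-supervised, on unlabeled text), then briefly fine-tune it on the target corpus. The sketch below is a minimal PyTorch rendition of the supervised variant, not the authors' code: a plain BiLSTM tagger with a softmax output stands in for the paper's LSTM-CRF (the CRF output layer is omitted for brevity), and the toy random data loader is a hypothetical placeholder for real IOB-encoded corpora.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class BiLSTMTagger(nn.Module):
    """Stand-in for the paper's LSTM-CRF: BiLSTM encoder plus a per-token
    softmax head (the CRF output layer is omitted to keep the sketch short)."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                      # (batch, seq_len, num_tags)

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, tags in loader:
            opt.zero_grad()
            scores = model(tokens)
            loss_fn(scores.view(-1, scores.size(-1)), tags.view(-1)).backward()
            opt.step()

def toy_loader(n_sents, seq_len=20, vocab=5000, tags=3):
    """Hypothetical placeholder for a real IOB-encoded corpus loader."""
    x = torch.randint(0, vocab, (n_sents, seq_len))
    y = torch.randint(0, tags, (n_sents, seq_len))
    return DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = BiLSTMTagger(vocab_size=5000, num_tags=3)   # IOB tags: O, B, I
# Phase 1: supervised pretraining on auxiliary corpora for the entity type.
train(model, toy_loader(2000), epochs=5, lr=1e-3)
# Phase 2: a rather short fine-tuning pass on the target corpus alone.
train(model, toy_loader(200), epochs=2, lr=1e-4)
```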

AVAILABILITY AND IMPLEMENTATION

HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future.
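
The fixed train/development/test splits shipped with the corpus tool make the reported F1 comparisons reproducible. For reference, below is a minimal, self-contained sketch (not from the paper's codebase) of the standard exact-match, entity-level precision/recall/F1 computation used to score NER systems; the gold and predicted span lists are hypothetical examples.

```python
# Exact-match entity-level P/R/F1: a prediction counts as a true positive
# only if its document, span boundaries and entity type all match a gold span.
def entity_f1(gold, pred):
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical spans: (doc_id, start_offset, end_offset, entity_type)
p, r, f = entity_f1(
    gold=[("doc1", 0, 4, "Gene"), ("doc1", 10, 17, "Chemical")],
    pred=[("doc1", 0, 4, "Gene"), ("doc1", 20, 25, "Chemical")],
)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.50 R=0.50 F1=0.50
```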

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

