

HUNER: improving biomedical NER with pretraining.

Affiliations

Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Seismology Section, Helmholtzzentrum Potsdam, Deutsches GeoForschungsZentrum GFZ, Potsdam 14473, Germany.

Publication information

Bioinformatics. 2020 Jan 1;36(1):295-302. doi: 10.1093/bioinformatics/btz528.

DOI: 10.1093/bioinformatics/btz528
PMID: 31243432
Abstract

MOTIVATION

Several recent studies have shown that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of the improvements crucially depend on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora.

RESULTS

We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes.
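The comparisons above (the ∼2 pp average gain and the 5-13 pp margin over GNormPlus and tmChem) are stated in terms of entity-level F1, where a predicted entity counts as correct only if its span and type both match the gold annotation exactly. A minimal sketch of that metric follows; the function name and toy spans are illustrative, not taken from the paper:

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall and F1 for one document.

    gold, pred: sets of (start, end, entity_type) tuples; an exact
    match of span boundaries and type counts as a true positive.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                         # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with three gold entities; two are recovered, one is spurious.
gold = {(0, 4, "Gene"), (10, 17, "Chemical"), (20, 25, "Species")}
pred = {(0, 4, "Gene"), (10, 17, "Chemical"), (30, 33, "Gene")}
p, r, f = entity_f1(gold, pred)
# p = r = f = 2/3
```

Because both a missed entity (lower recall) and a spurious one (lower precision) pull F1 down, the metric makes the supervised vs. semi-supervised precision/recall trade-off mentioned above directly visible.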

AVAILABILITY AND IMPLEMENTATION

HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


Similar articles

1. HUNER: improving biomedical NER with pretraining.
   Bioinformatics. 2020 Jan 1;36(1):295-302. doi: 10.1093/bioinformatics/btz528.
2. Transfer learning for biomedical named entity recognition with neural networks.
   Bioinformatics. 2018 Dec 1;34(23):4087-4094. doi: 10.1093/bioinformatics/bty449.
3. Deep learning with word embeddings improves biomedical named entity recognition.
   Bioinformatics. 2017 Jul 15;33(14):i37-i48. doi: 10.1093/bioinformatics/btx228.
4. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition.
   Bioinformatics. 2021 Sep 9;37(17):2792-2794. doi: 10.1093/bioinformatics/btab042.
5. Cross-type biomedical named entity recognition with deep multi-task learning.
   Bioinformatics. 2019 May 15;35(10):1745-1752. doi: 10.1093/bioinformatics/bty869.
6. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information.
   Bioinformatics. 2018 Oct 15;34(20):3539-3546. doi: 10.1093/bioinformatics/bty356.
7. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text.
   Bioinformatics. 2018 May 1;34(9):1547-1554. doi: 10.1093/bioinformatics/btx815.
8. Dataset-aware multi-task learning approaches for biomedical named entity recognition.
   Bioinformatics. 2020 Aug 1;36(15):4331-4338. doi: 10.1093/bioinformatics/btaa515.
9. Biomedical named entity recognition using deep neural networks with contextual information.
   BMC Bioinformatics. 2019 Dec 27;20(1):735. doi: 10.1186/s12859-019-3321-4.
10. DTranNER: biomedical named entity recognition with deep learning-based label-label transition model.
    BMC Bioinformatics. 2020 Feb 11;21(1):53. doi: 10.1186/s12859-020-3393-1.

Cited by

1. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.
   JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.
2. Advancing entity recognition in biomedicine via instruction tuning of large language models.
   Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae163.
3. A Review on Electronic Health Record Text-Mining for Biomedical Name Entity Recognition in Healthcare Domain.
   Healthcare (Basel). 2023 Apr 28;11(9):1268. doi: 10.3390/healthcare11091268.
4. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition.
   Int J Mol Sci. 2022 Nov 29;23(23):14934. doi: 10.3390/ijms232314934.
5. We are not ready yet: limitations of state-of-the-art disease named entity recognizers.
   J Biomed Semantics. 2022 Oct 27;13(1):26. doi: 10.1186/s13326-022-00280-6.
6. Assigning species information to corresponding genes by a sequence labeling framework.
   Database (Oxford). 2022 Oct 13;2022. doi: 10.1093/database/baac090.
7. The Construction Model of the TCM Clinical Knowledge Coding Database Based on Knowledge Organization.
   Biomed Res Int. 2022 Jan 17;2022:2503779. doi: 10.1155/2022/2503779. eCollection 2022.
8. OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition.
   Genomics Inform. 2021 Sep;19(3):e27. doi: 10.5808/gi.21015. Epub 2021 Sep 30.
9. Reconstruction of the Cytokine Signaling in Lysosomal Storage Diseases by Literature Mining and Network Analysis.
   Front Cell Dev Biol. 2021 Aug 20;9:703489. doi: 10.3389/fcell.2021.703489. eCollection 2021.
10. A pre-training and self-training approach for biomedical named entity recognition.
    PLoS One. 2021 Feb 9;16(2):e0246310. doi: 10.1371/journal.pone.0246310. eCollection 2021.