HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition.

Authors

Weber Leon, Sänger Mario, Münchmeyer Jannes, Habibi Maryam, Leser Ulf, Akbik Alan

Affiliations

Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany.

Publication information

Bioinformatics. 2021 Sep 9;37(17):2792-2794. doi: 10.1093/bioinformatics/btab042.

Abstract

SUMMARY

Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust to variations in text genre and style. We present HunFlair, a NER tagger that fulfills these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, matches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools, with an average gain of 7.26 percentage points (pp) over the next best tool in a cross-corpus setting, and achieves results on par with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora.
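To make the unit of the reported gain concrete, the following minimal sketch shows one plausible way an average gain in percentage points could be computed: as the mean of per-corpus F1 differences. The function name, corpus names and scores are hypothetical placeholders, not values or procedures taken from the paper.

```python
from statistics import mean


def average_pp_gain(f1_tool, f1_baseline):
    """Mean per-corpus F1 difference, in percentage points (pp).

    Both arguments map a corpus name to an F1 score given in percent.
    """
    if f1_tool.keys() != f1_baseline.keys():
        raise ValueError("both tools must be scored on the same corpora")
    return mean(f1_tool[c] - f1_baseline[c] for c in f1_tool)


# Hypothetical F1 scores in percent (illustration only, not from the paper).
tool_scores = {"corpus_a": 80.0, "corpus_b": 70.0}
baseline_scores = {"corpus_a": 74.0, "corpus_b": 62.0}

gain = average_pp_gain(tool_scores, baseline_scores)
print(f"average gain: {gain:.2f} pp")
```

A gain in percentage points is an absolute difference between two percentages, so it should not be read as a relative improvement.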

AVAILABILITY AND IMPLEMENTATION

HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems.
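As an illustration of the short usage path, the sketch below wraps the tagging calls in a helper function. It assumes Flair has been installed (typically with `pip install flair`) and that the installed version exposes `Classifier.load("hunflair")`; older releases loaded the model through a different class, so the exact call may differ by version. The helper name `tag_biomedical_entities` is hypothetical.

```python
def tag_biomedical_entities(text, model_name="hunflair"):
    """Return (entity text, label, confidence) triples predicted for `text`.

    Assumes a Flair version that provides `flair.nn.Classifier`; the model
    weights are downloaded the first time the model is loaded.
    """
    # Imported lazily so that defining this helper does not require Flair.
    from flair.data import Sentence
    from flair.nn import Classifier

    sentence = Sentence(text)             # tokenize the input text
    tagger = Classifier.load(model_name)  # load the pretrained tagger
    tagger.predict(sentence)              # annotate the sentence in place

    return [
        (label.data_point.text, label.value, label.score)
        for label in sentence.get_labels()
    ]
```

Calling `tag_biomedical_entities("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")` would return the entity mentions the model finds in that sentence. When tagging many texts, the tagger should be loaded once and reused rather than reloaded on every call as in this sketch.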

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/482f/8428609/90a8c96c2dfc/btab042f1.jpg