BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain.

Authors

Abdelmageed Nora, Löffler Felicitas, Feddoul Leila, Algergawy Alsayed, Samuel Sheeba, Gaikwad Jitendra, Kazem Anahita, König-Ries Birgitta

Affiliations

Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany.

Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany.

Publication information

Biodivers Data J. 2022 Oct 7;10:e89481. doi: 10.3897/BDJ.10.e89481. eCollection 2022.

DOI:10.3897/BDJ.10.e89481

PMID:36761617

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9836593/

Abstract

BACKGROUND

Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. With this, a large amount of textual data (publications) and metadata (e.g. dataset description) has been generated. To support the management and analysis of these data, two techniques from computer science are of interest, namely Named Entity Recognition (NER) and Relation Extraction (RE). While the former enables better content discovery and understanding, the latter fosters the analysis by detecting connections between entities and, thus, allows us to draw conclusions and answer relevant domain-specific questions. To automatically predict entities and their relations, machine/deep learning techniques could be used. The training and evaluation of those techniques require labelled corpora.

NEW INFORMATION

In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE) generated from biodiversity datasets metadata and abstracts that can be used as evaluation benchmarks for the development of new computer-supported tools that require machine learning or deep learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets. Moreover, we demonstrate the underlying ontology for the classes and relations used to annotate such corpora.
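To make the two tasks concrete, the following is a minimal sketch of how a single labelled sentence might be represented for NER and RE. The sentence, the BIO tag names (`ORGANISM`, `ENVIRONMENT`), the relation name `occurs_in`, and the helper `bio_to_entities` are illustrative assumptions only; they are not taken from the BiodivNERE corpora or their underlying ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span


def bio_to_entities(tokens, tags):
    """Collect entity spans from a BIO-tagged token sequence."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append(Entity(" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


# Hypothetical sentence and labels, not drawn from the corpora.
tokens = ["Quercus", "robur", "grows", "in", "temperate", "forest"]
tags = ["B-ORGANISM", "I-ORGANISM", "O", "O", "B-ENVIRONMENT", "I-ENVIRONMENT"]

entities = bio_to_entities(tokens, tags)

# An RE annotation links two recognized entities with a relation label.
relation = (entities[0], "occurs_in", entities[1])

for entity in entities:
    print(entity.label, "->", entity.text)
print(relation[0].text, relation[1], relation[2].text)
```

In this sketch the NER annotation is the tag sequence and the RE annotation is a (head, relation, tail) triple over the recognized spans. A gold-standard corpus supplies many such examples, verified by experts, so that predictions can be scored against them.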

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d5d3/9836593/7ea4468d4950/bdj-10-e89481-g001.jpg

Similar articles

COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature.

Biodivers Data J. 2019 Jan 22;(7):e29626. doi: 10.3897/BDJ.7.e29626. eCollection 2019.

Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora.

Front Res Metr Anal. 2021 Nov 19;6:689803. doi: 10.3389/frma.2021.689803. eCollection 2021.

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding.

NPJ Syst Biol Appl. 2021 Oct 20;7(1):38. doi: 10.1038/s41540-021-00200-x.

Boosting drug named entity recognition using an aggregate classifier.

Artif Intell Med. 2015 Oct;65(2):145-53. doi: 10.1016/j.artmed.2015.05.007. Epub 2015 Jun 17.

The CHEMDNER corpus of chemicals and drugs and its annotation principles.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S2. doi: 10.1186/1758-2946-7-S1-S2. eCollection 2015.

Exploiting and assessing multi-source data for supervised biomedical named entity recognition.

Bioinformatics. 2018 Jul 15;34(14):2474-2482. doi: 10.1093/bioinformatics/bty152.

Extracting entities with attributes in clinical text via joint deep learning.

J Am Med Inform Assoc. 2019 Dec 1;26(12):1584-1591. doi: 10.1093/jamia/ocz158.

Partial Annotation Learning for Biomedical Entity Recognition.

IEEE J Biomed Health Inform. 2025 Feb;29(2):1409-1418. doi: 10.1109/JBHI.2024.3466294. Epub 2025 Feb 10.

Transfer learning for biomedical named entity recognition with neural networks.

Bioinformatics. 2018 Dec 1;34(23):4087-4094. doi: 10.1093/bioinformatics/bty449.

Cited by

A vision of human-AI collaboration for enhanced biological collection curation and research.

Bioscience. 2025 Mar 28;75(6):457-471. doi: 10.1093/biosci/biaf021. eCollection 2025 Jun.

SpeciMate: Improving metadata extraction from digitised biological specimens.

Biodivers Data J. 2025 Jul 31;13:e160553. doi: 10.3897/BDJ.13.e160553. eCollection 2025.

Evaluating the method reproducibility of deep learning models in biodiversity research.

PeerJ Comput Sci. 2025 Feb 5;11:e2618. doi: 10.7717/peerj-cs.2618. eCollection 2025.

The changing landscape of text mining: a review of approaches for ecology and evolution.

Proc Biol Sci. 2024 Jul;291(2027):20240423. doi: 10.1098/rspb.2024.0423. Epub 2024 Jul 31.

Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species.

Front Artif Intell. 2024 May 23;7:1371411. doi: 10.3389/frai.2024.1371411. eCollection 2024.

References

Dataset search in biodiversity research: Do metadata in data repositories reflect scholarly information needs?

PLoS One. 2021 Mar 24;16(3):e0246099. doi: 10.1371/journal.pone.0246099. eCollection 2021.

COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature.

Biodivers Data J. 2019 Jan 22;(7):e29626. doi: 10.3897/BDJ.7.e29626. eCollection 2019.

A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge.

Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax061.

BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language.

Database (Oxford). 2016 Jul 9;2016. doi: 10.1093/database/baw067. Print 2016.

Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research.

BMC Bioinformatics. 2015 Feb 21;16:55. doi: 10.1186/s12859-015-0472-9.

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.

PLoS One. 2013 Jun 18;8(6):e65390. doi: 10.1371/journal.pone.0065390. Print 2013.

Applications of natural language processing in biodiversity science.

Adv Bioinformatics. 2012;2012:391574. doi: 10.1155/2012/391574. Epub 2012 May 22.

Biodiversity loss and its impact on humanity.

Nature. 2012 Jun 6;486(7401):59-67. doi: 10.1038/nature11148.

The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships.

J Biomed Inform. 2012 Oct;45(5):879-84. doi: 10.1016/j.jbi.2012.04.004. Epub 2012 Apr 25.

Global biodiversity: indicators of recent declines.

Science. 2010 May 28;328(5982):1164-8. doi: 10.1126/science.1187512. Epub 2010 Apr 29.
