BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain.

Authors

Abdelmageed Nora, Löffler Felicitas, Feddoul Leila, Algergawy Alsayed, Samuel Sheeba, Gaikwad Jitendra, Kazem Anahita, König-Ries Birgitta

Affiliations

Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany.

Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany.

Publication information

Biodivers Data J. 2022 Oct 7;10:e89481. doi: 10.3897/BDJ.10.e89481. eCollection 2022.

Abstract

BACKGROUND

Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has led to numerous publications in the field and, with them, a large amount of textual data (publications) and metadata (e.g. dataset descriptions). To support the management and analysis of these data, two techniques from computer science are of interest, namely Named Entity Recognition (NER) and Relation Extraction (RE). The former enables better content discovery and understanding; the latter supports analysis by detecting connections between entities and thus allows us to draw conclusions and answer relevant domain-specific questions. Machine learning or deep learning techniques can be used to predict entities and their relations automatically. Training and evaluating those techniques requires labelled corpora.

NEW INFORMATION

In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE), generated from biodiversity dataset metadata and abstracts. They can be used as evaluation benchmarks for the development of new computer-supported tools that rely on machine learning or deep learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets and describe the underlying ontology for the classes and relations used to annotate them.
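
To make the notion of a labelled corpus concrete, the following is a minimal sketch of how token-level NER labels and a relation label might be represented and read. It assumes the widely used BIO tagging scheme; the sentence, the entity labels `ORGANISM` and `ENVIRONMENT`, the relation name `occur_in`, and the helper `bio_to_spans` are all hypothetical illustrations and are not taken from the corpora themselves.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) ends the entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical annotated sentence (labels are illustrative assumptions).
tokens = ["Earthworms", "were", "sampled", "in", "grassland", "plots", "."]
tags = ["B-ORGANISM", "O", "O", "O", "B-ENVIRONMENT", "O", "O"]

entities = bio_to_spans(tokens, tags)

# A relation annotation links two entities with a hypothetical relation type.
relation = (entities[0][0], "occur_in", entities[1][0])

print(entities)
print(relation)
```

An NER model is trained to predict the `tags` sequence from `tokens`, while an RE model is trained to predict the relation type for a given pair of entities; gold-standard labels such as these serve as the reference for evaluating both.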

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d5d3/9836593/7ea4468d4950/bdj-10-e89481-g001.jpg
