Abdelmageed Nora, Löffler Felicitas, Feddoul Leila, Algergawy Alsayed, Samuel Sheeba, Gaikwad Jitendra, Kazem Anahita, König-Ries Birgitta
Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany.
Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany.
Biodivers Data J. 2022 Oct 7;10:e89481. doi: 10.3897/BDJ.10.e89481. eCollection 2022.
Biodiversity is the variety of life on Earth, covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time, and to understand the forces driving that change. This need has led to numerous publications in the field and, with them, a large amount of textual data (publications) and metadata (e.g. dataset descriptions). Two techniques from computer science are of interest for managing and analysing these data: Named Entity Recognition (NER) and Relation Extraction (RE). The former enables better content discovery and understanding. The latter supports analysis by detecting connections between entities, which allows us to draw conclusions and answer relevant domain-specific questions. Machine learning or deep learning techniques can be used to predict entities and their relations automatically, but training and evaluating such techniques requires labelled corpora.
In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE), generated from biodiversity dataset metadata and abstracts. They can serve as evaluation benchmarks for the development of new computer-supported tools that rely on machine learning or deep learning techniques. The corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets and present the underlying ontology for the classes and relations used to annotate them.