Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia.

Authors

Göpfert Jan, Kuckertz Patrick, Weinand Jann M, Stolten Detlef

Affiliations

Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis, 52425, Jülich, Germany.

RWTH Aachen University, Chair for Fuel Cells, Faculty of Mechanical Engineering, 52062, Aachen, Germany.

Publication information

Sci Data. 2025 Jul 22;12(1):1277. doi: 10.1038/s41597-025-05499-3.

DOI:10.1038/s41597-025-05499-3

PMID:40695850

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12284226/

Abstract

To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.
