Porter John H, O'Brien Margaret, Frants Marina, Earl Stevan, Martin Mary, Laney Christine M
University of Virginia, Charlottesville, Virginia, USA.
University of California, Santa Barbara, Santa Barbara, California, USA.
Sci Data. 2025 Feb 20;12(1):304. doi: 10.1038/s41597-025-04587-8.
Automated processing of environmental data is hindered by the wide array of unit representations provided in the metadata of digital datasets. For example, gm/m2, g/m2, gm-2, g/m^2, g.m-2 and gramPerMeterSquared all represent a single complex unit; they may be human-readable, but they are not machine-interpretable. Connecting ad hoc units to a single unit concept in an ontology permits the identification of datasets that share units, and gives access to the labels, definitions, dimensions and transformations provided in the ontology. Here we use successive string transformations to link ad hoc unit representations to units in the QUDT ontology (e.g., unit:GM-PER-M2). Although only 896 of the 7,110 distinct units in a corpus of ecological metadata from DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network were matched, these accounted for 324,811 of the 355,057 total unit uses (instances), or 91%. The resulting lookup table was used to enable a web service and R functions for adding annotation elements to Ecological Metadata Language documents.
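The idea of successive string transformations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rewrite rules in `STEPS`, the one-entry `QUDT_LOOKUP` table and the function name `to_qudt` are all hypothetical, chosen only so that the six example spellings from the abstract resolve to the same QUDT identifier. Each rule is applied in turn, and the lookup is retried after every step.

```python
import re

# Hypothetical one-entry lookup table: normalized string -> QUDT unit.
QUDT_LOOKUP = {"g/m2": "unit:GM-PER-M2"}

# Hypothetical rewrite rules, applied in order (assumptions for this sketch,
# not the rules used in the paper).
STEPS = [
    (r"\^", ""),                   # g/m^2 -> g/m2
    (r"[.\s*]?m-(\d)", r"/m\1"),   # g.m-2, gm-2 -> g/m2
    (r"^gm(?=/)", "g"),            # gm/m2 -> g/m2 ("gm" used for gram)
    (r"Per", "/"),                 # gramPerMeterSquared -> gram/MeterSquared
    (r"Meter", "m"),               # -> gram/mSquared
    (r"Squared", "2"),             # -> gram/m2
    (r"^gram", "g"),               # -> g/m2
]


def to_qudt(raw):
    """Return the QUDT identifier for an ad hoc unit string, or None."""
    s = raw.strip()
    if s in QUDT_LOOKUP:
        return QUDT_LOOKUP[s]
    for pattern, replacement in STEPS:
        s = re.sub(pattern, replacement, s)
        # Retry the lookup after every transformation step.
        if s in QUDT_LOOKUP:
            return QUDT_LOOKUP[s]
    return None  # unmatched units are left for manual review


if __name__ == "__main__":
    for unit in ["gm/m2", "g/m2", "gm-2", "g/m^2", "g.m-2", "gramPerMeterSquared"]:
        print(unit, "->", to_qudt(unit))
```

A real rule set would need to be much larger and to handle ambiguity more carefully. In this sketch, for instance, "gm" is read as gram in gm/m2 but as gram times metre in gm-2.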