Petra Führding-Potschkat, Holger Kreft, Stefanie M. Ickert-Bond
Biodiversity, Macroecology and Conservation Biogeography, Faculty of Forest Sciences, University of Göttingen, Göttingen, Germany.
Department of Biology and Wildlife & UA Museum of the North, University of Alaska Fairbanks, Fairbanks, Alaska, USA.
Ecol Evol. 2022 Aug 4;12(8):e9168. doi: 10.1002/ece3.9168. eCollection 2022 Aug.
Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper their immediate use. Manual data cleaning is time-consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North America as a model region, we examined how different data cleaning pipelines (using, e.g., the GBIF web application and four different packages) affect downstream species distribution models (SDMs). We also assessed how the pipeline-cleaned data differed from expert data. From 13,889 North American observations in GBIF, the pipelines removed 31.7% to 62.7% of the records as false positives, invalid coordinates, or duplicates, leading to datasets of between 9484 records (GBIF application) and 5196 records (manually guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although the differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S-SDMs) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: .9986; vs. the expert data: .9173). Our results suggest that all package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. A major drawback emerges from the fact that no pipeline fully discovered misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high-quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.
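The kinds of checks the package-based pipelines reliably perform, such as flagging invalid coordinates, (0, 0) placeholder points, and duplicate records, can be sketched in a few lines. This is a minimal illustration, not any of the pipelines compared in the study: the sample records, column names (modeled on the Darwin Core `decimalLatitude`/`decimalLongitude` fields used by GBIF), and the `clean_occurrences` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical point-occurrence records; column names follow the Darwin Core
# fields used in GBIF downloads, but the values are invented for illustration.
records = pd.DataFrame({
    "species":          ["A",    "A",    "B",    "B",     "C"],
    "decimalLatitude":  [45.0,   45.0,   91.0,   40.0,    0.0],
    "decimalLongitude": [-100.0, -100.0, -100.0, -181.0,  0.0],
})

def clean_occurrences(df: pd.DataFrame) -> pd.DataFrame:
    """Drop out-of-range coordinates, (0, 0) placeholders, and duplicates."""
    valid = (
        df["decimalLatitude"].between(-90, 90)      # latitude must be valid
        & df["decimalLongitude"].between(-180, 180) # longitude must be valid
        & ~((df["decimalLatitude"] == 0)            # (0, 0) is a common
            & (df["decimalLongitude"] == 0))        # missing-data placeholder
    )
    # One record per species per coordinate pair.
    return df[valid].drop_duplicates(
        subset=["species", "decimalLatitude", "decimalLongitude"]
    )

cleaned = clean_occurrences(records)
```

Of the five sample rows, only the first survives: the second is a duplicate, the third and fourth have out-of-range coordinates, and the fifth sits at the (0, 0) placeholder. Note that, as the abstract stresses, no coordinate check of this kind can catch taxonomically misidentified specimens; that step still requires expert review.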