Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models.

Authors

Führding-Potschkat Petra, Kreft Holger, Ickert-Bond Stefanie M

Affiliations

Biodiversity, Macroecology and Conservation Biogeography, Faculty of Forest Sciences, University of Göttingen, Göttingen, Germany.

Department of Biology and Wildlife & UA Museum of the North, University of Alaska Fairbanks, Fairbanks, Alaska, USA.

Publication information

Ecol Evol. 2022 Aug 4;12(8):e9168. doi: 10.1002/ece3.9168. eCollection 2022 Aug.

DOI:10.1002/ece3.9168
PMID:35949539
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9351331/
Abstract

Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often infeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North American Ephedra as a model, we examined how different data cleaning pipelines (using, e.g., the GBIF web application and four different R packages) affect downstream species distribution models (SDMs). We also assessed how the pipeline data differed from expert data. From 13,889 North American observations in GBIF, the pipelines removed 31.7% to 62.7% of records as false positives, invalid coordinates, and duplicates, leading to datasets of between 9484 (GBIF application) and 5196 records (manual-guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S-SDM) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: 0.9986; vs. the expert data: 0.9173). Our results suggest that all package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. A major drawback is that no pipeline fully detected misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high-quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.
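The kind of automated cleaning and comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not reproduce any of the packages they tested: the record layout (species, latitude, longitude), the function names `clean_records`, `percent_removed`, and `pearson_r`, and the specific validity checks are all assumptions made for illustration. It covers only invalid coordinates and exact duplicates; misidentified specimens, which the abstract says no pipeline fully detected, are out of scope.

```python
import math


def clean_records(records):
    """Drop records with invalid coordinates, then drop exact duplicates.

    Each record is assumed to be a (species, latitude, longitude) tuple;
    this layout is hypothetical and chosen only for the example.
    """
    seen = set()
    kept = []
    for species, lat, lon in records:
        # Missing or non-numeric coordinates count as invalid.
        if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
            continue
        if math.isnan(lat) or math.isnan(lon):
            continue
        # Coordinates outside the valid latitude/longitude range.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        # (0, 0) is treated here as a placeholder rather than a real locality.
        if lat == 0 and lon == 0:
            continue
        key = (species, lat, lon)
        if key in seen:  # exact duplicate of a record already kept
            continue
        seen.add(key)
        kept.append(key)
    return kept


def percent_removed(n_before, n_after):
    """Percentage of records removed by a cleaning step."""
    return 100.0 * (n_before - n_after) / n_before


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences,
    e.g. per-cell richness values from two stacked-SDM maps."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Record counts reported in the abstract for the GBIF web application.
print(round(percent_removed(13889, 9484), 1))  # 31.7
```

Applied to the counts in the abstract, `percent_removed(13889, 9484)` gives the lower bound of the reported removal range. In practice the richness values passed to `pearson_r` would come from raster cells of the S-SDM outputs, which this sketch does not model.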

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/572f/9351331/306edf4bb915/ECE3-12-e9168-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/572f/9351331/72fc1403bbb0/ECE3-12-e9168-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/572f/9351331/ccfb0406df0e/ECE3-12-e9168-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/572f/9351331/4ddb4a5e0ba4/ECE3-12-e9168-g001.jpg

Similar articles

1
Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models.
Ecol Evol. 2022 Aug 4;12(8):e9168. doi: 10.1002/ece3.9168. eCollection 2022 Aug.
2
No one-size-fits-all solution to clean GBIF.
PeerJ. 2020 Sep 28;8:e9916. doi: 10.7717/peerj.9916. eCollection 2020.
3
Geographic And Taxonomic Occurrence R-based Scrubbing (gatoRs): An R package and workflow for processing biodiversity data.
Appl Plant Sci. 2024 Mar 21;12(2):e11575. doi: 10.1002/aps3.11575. eCollection 2024 Mar-Apr.
4
Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases?
Glob Ecol Biogeogr. 2015 Aug;24(8):973-984. doi: 10.1111/geb.12326. Epub 2015 May 25.
5
New occurrence records on the rodent species inhabiting Vietnam, based on Joint Russian-Vietnamese Tropical Research and Test Center genetic samples collection.
Biodivers Data J. 2022 Nov 23;10:e96062. doi: 10.3897/BDJ.10.e96062. eCollection 2022.
6
From GenBank to GBIF: Phylogeny-Based Predictive Niche Modeling Tests Accuracy of Taxonomic Identifications in Large Occurrence Data Repositories.
PLoS One. 2016 Mar 11;11(3):e0151232. doi: 10.1371/journal.pone.0151232. eCollection 2016.
7
Global patterns of fern species diversity: An evaluation of fern data in GBIF.
Plant Divers. 2021 Oct 27;44(2):135-140. doi: 10.1016/j.pld.2021.10.001. eCollection 2022 Mar.
8
Filling in the GAPS: evaluating completeness and coverage of open-access biodiversity databases in the United States.
Ecol Evol. 2016 Jun 12;6(14):4654-69. doi: 10.1002/ece3.2225. eCollection 2016 Jul.
9
Exploring snake occurrence records: Spatial biases and marginal gains from accessible social media.
PeerJ. 2019 Dec 17;7:e8059. doi: 10.7717/peerj.8059. eCollection 2019.
10
A new R package to parse plant species occurrence records into unique collection events efficiently reduces data redundancy.
Sci Rep. 2024 Mar 5;14(1):5450. doi: 10.1038/s41598-024-56158-3.

Cited by

1
World of Crayfish™: a web platform towards real-time global mapping of freshwater crayfish and their pathogens.
PeerJ. 2024 Oct 14;12:e18229. doi: 10.7717/peerj.18229. eCollection 2024.
2
A dataset of cold-water coral distribution records.
Data Brief. 2023 May 11;48:109223. doi: 10.1016/j.dib.2023.109223. eCollection 2023 Jun.
