Zizka Alexander, Antunes Carvalho Fernanda, Calvente Alice, Rocio Baez-Lizarazo Mabel, Cabral Andressa, Coelho Jéssica Fernanda Ramos, Colli-Silva Matheus, Fantinati Mariana Ramos, Fernandes Moabe F, Ferreira-Araújo Thais, Gondim Lambert Moreira Fernanda, Santos Nathália Michelly da Cunha, Santos Tiago Andrade Borges, Dos Santos-Costa Renata Clicia, Serrano Filipe C, Alves da Silva Ana Paula, de Souza Soares Arthur, Cavalcante de Souza Paolla Gabryelle, Calisto Tomaz Eduardo, Vale Valéria Fonseca, Vieira Tiago Luiz, Antonelli Alexandre
sDiv, German Centre for Integrative Biodiversity Research Halle-Jena-Leipzig (iDiv), Leipzig, Germany.
Naturalis Biodiversity Center, Leiden, The Netherlands.
PeerJ. 2020 Sep 28;8:e9916. doi: 10.7717/peerj.9916. eCollection 2020.
Species occurrence records provide the basis for many biodiversity studies. They derive from georeferenced specimens deposited in natural history collections and visual observations, such as those obtained through various mobile applications. Given the rapid increase in availability of such data, the control of quality and accuracy constitutes a particular concern. Automatic filtering is a scalable and reproducible means to identify potentially problematic records and tailor datasets from public databases such as the Global Biodiversity Information Facility (GBIF; http://www.gbif.org) for biodiversity analyses. However, it is unclear how much data may be lost by filtering, whether the same filters should be applied across all taxonomic groups, and what the effect of filtering is on common downstream analyses. Here, we evaluate the effect of 13 recently proposed filters on the inference of species richness patterns and automated conservation assessments for 18 Neotropical taxa, including terrestrial and marine animals, fungi, and plants downloaded from GBIF. We find that a total of 44.3% of the records are potentially problematic, with large variation across taxonomic groups (25-90%). A small fraction of records was identified as erroneous in the strict sense (4.2%), and a much larger proportion as unfit for most downstream analyses (41.7%). Filters of duplicated information, collection year, and basis of record, as well as coordinates in urban areas, or for terrestrial taxa in the sea or marine taxa on land, have the greatest effect. Automated filtering can help in identifying problematic records, but requires customization of which tests and thresholds should be applied to the taxonomic group and geographic area under focus. Our results stress the importance of thorough recording and exploration of the metadata associated with species records for biodiversity research.
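To make the idea of automated record filtering concrete, the following is a minimal Python sketch of four simple checks of the kind the abstract names (coordinate validity, duplicated information, collection year, and basis of record). It is not the authors' implementation and does not cover all 13 filters evaluated in the study. The field names follow Darwin Core terms as used in GBIF downloads; the function names, the year cutoff of 1945, and the excluded basis-of-record value are assumptions chosen only for illustration. Spatial filters (urban areas, land versus sea) are omitted because they require reference geometries.

```python
from collections import Counter

# Assumed, illustrative thresholds; the abstract stresses that tests and
# thresholds need to be customized per taxonomic group and region.
MIN_YEAR = 1945
EXCLUDED_BASIS = {"FOSSIL_SPECIMEN"}


def flag_records(records, min_year=MIN_YEAR, excluded_basis=EXCLUDED_BASIS):
    """Return a list of (record, flags) pairs; records are flagged, not dropped."""
    seen = set()
    flagged = []
    for rec in records:
        flags = set()
        lat = rec.get("decimalLatitude")
        lon = rec.get("decimalLongitude")

        # Coordinates missing or outside the valid latitude/longitude range.
        if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            flags.add("invalid_coordinates")

        # Duplicated information: same species, coordinates, and year seen before.
        key = (rec.get("species"), lat, lon, rec.get("year"))
        if key in seen:
            flags.add("duplicate")
        seen.add(key)

        # Collection year missing or older than the chosen cutoff.
        year = rec.get("year")
        if year is None or year < min_year:
            flags.add("collection_year")

        # Basis of record not suitable for the analysis at hand.
        if rec.get("basisOfRecord") in excluded_basis:
            flags.add("basis_of_record")

        flagged.append((rec, flags))
    return flagged


def summarize(flagged):
    """Report the fraction of flagged records and the count per filter."""
    total = len(flagged)
    per_filter = Counter(flag for _, flags in flagged for flag in flags)
    n_flagged = sum(1 for _, flags in flagged if flags)
    return {
        "total": total,
        "flagged_fraction": n_flagged / total if total else 0.0,
        "per_filter": dict(per_filter),
    }
```

Flagging rather than deleting records keeps the per-filter counts available, which is what allows a comparison of how much each filter removes for a given taxonomic group before deciding which filters to apply.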