A Standardized Reference Data Set for Vertebrate Taxon Name Resolution.

Authors

Zermoglio Paula F, Guralnick Robert P, Wieczorek John R

Affiliations

Departamento de Ecología, Genética y Evolución, Instituto IEGEBA (CONICET-UBA), Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina.

Institut de Recherche sur la Biologie de l'Insecte, UMR 7261 CNRS, Université François Rabelais, Tours, France.

Publication information

PLoS One. 2016 Jan 13;11(1):e0146894. doi: 10.1371/journal.pone.0146894. eCollection 2016.

DOI:10.1371/journal.pone.0146894

PMID:26760296

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4711887/

Abstract

Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
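To make the distinction between "currently valid" and "resolvable to a currently valid name" concrete, here is a minimal Python sketch. It is not the authors' procedure (the abstract describes human vetting), and it is not the API of any existing name-resolution tool. The reference sets, the `resolve` and `summarize` helpers, and the 0.9 similarity cutoff are all assumptions chosen for illustration, with the standard-library `difflib` standing in for a real fuzzy matcher.

```python
from difflib import get_close_matches

# Toy reference data for illustration only; NOT taken from the paper's data set.
VALID_NAMES = {"Puma concolor", "Peromyscus maniculatus"}
SYNONYMS = {"Felis concolor": "Puma concolor"}  # synonym -> currently valid name


def resolve(verbatim, cutoff=0.9):
    """Return (status, resolved_name) for one verbatim name string.

    Hypothetical helper: status is one of "valid", "synonym",
    "misspelling", or "unresolved".
    """
    name = " ".join(verbatim.split())  # collapse stray whitespace
    if name in VALID_NAMES:
        return "valid", name
    if name in SYNONYMS:
        return "synonym", SYNONYMS[name]
    # Fuzzy step: look for a near match among all known names.
    known = sorted(VALID_NAMES | set(SYNONYMS))
    close = get_close_matches(name, known, n=1, cutoff=cutoff)
    if close:
        match = close[0]
        # A misspelled synonym still resolves to the valid name.
        return "misspelling", SYNONYMS.get(match, match)
    return "unresolved", None


def summarize(names):
    """Return (fraction already valid, fraction resolvable) for a list of names."""
    results = [resolve(n) for n in names]
    total = len(results)
    valid = sum(1 for status, _ in results if status == "valid")
    resolvable = sum(1 for status, _ in results if status != "unresolved")
    return valid / total, resolvable / total


if __name__ == "__main__":
    sample = ["Puma concolor", "Felis  concolor", "Peromyscus maniculatis", "Xus yus"]
    for s in sample:
        print(s, "->", resolve(s))
    print(summarize(sample))
```

In this toy sample one of four strings is already valid while three of four can be resolved, which mirrors in miniature the gap the abstract reports between valid name strings and resolvable name combinations. A reference data set like the one described would let such a tool's output be scored against human-vetted answers.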

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b39/4711887/713ab7ac8e81/pone.0146894.g001.jpg
