A Standardized Reference Data Set for Vertebrate Taxon Name Resolution.

Authors

Zermoglio Paula F, Guralnick Robert P, Wieczorek John R

Affiliations

Departamento de Ecología, Genética y Evolución, Instituto IEGEBA (CONICET-UBA), Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina.

Institut de Recherche sur la Biologie de l'Insecte, UMR 7261 CNRS, Université François Rabelais, Tours, France.

Publication information

PLoS One. 2016 Jan 13;11(1):e0146894. doi: 10.1371/journal.pone.0146894. eCollection 2016.

Abstract

Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
