Department of Terrestrial Zoology, Understanding Evolution group, Naturalis Biodiversity Center, Darwinweg 2, 2333CR, Leiden, The Netherlands.
Institute of Biology Leiden (IBL), Leiden University, Sylviusweg 72, 2333BE, Leiden, The Netherlands.
Sci Rep. 2020 Sep 25;10(1):15787. doi: 10.1038/s41598-020-72549-8.
Taxonomic literature contains information about virtually every known species on Earth. In many cases, all that is known about a taxon is contained in this kind of literature, particularly for the most diverse and understudied groups. Taxonomic publications in the aggregate have documented a vast amount of specimen data. Among other things, these data constitute evidence of the existence of a particular taxon within a spatial and temporal context. When knowledge about a particular taxonomic group is rudimentary, investigators motivated to contribute new knowledge can use legacy records to guide their search for new specimens in the field. However, these legacy data are in the form of unstructured text, making them difficult to extract and analyze without a human interpreter. Here, we used a combination of semi-automatic tools to extract and categorize specimen data from the taxonomic literature of one family of ground spiders (Liocranidae). We tested the application of these data to fieldwork optimization, using the relative abundance of adult specimens reported in the literature as a proxy to find the best times and places for collecting the species Teutamus politus and its relatives (Teutamus group, TG) within Southeast Asia. Based on these analyses, we decided to collect in three provinces in Thailand during the months of June and August. With our approach, we were able to collect more specimens of T. politus (188 specimens, 95 adults) than all the previous records in the literature combined (102 specimens). Our approach was also effective for sampling other representatives of the TG, yielding at least one representative of every TG genus previously reported for Thailand. In total, our samples contributed 231 specimens (134 adults) to the 351 specimens previously reported in the literature for this country.
Our results exemplify one application of mined literature data that allows investigators to allocate effort and resources more efficiently for the study of neglected, endangered, or interesting taxa and geographic areas. Furthermore, the integrative workflow demonstrated here shares specimen data with global online resources like Plazi and GBIF, meaning that others can freely reuse these data and contribute to them in the future. The contributions of the present study represent an increase of more than 35% in the taxonomic coverage of the TG in GBIF, based on the number of species. Also, our extracted data represent 72% of the occurrences now available through GBIF for the TG and more than 85% of the occurrences of T. politus. Taxonomic literature is a key source of undigitized biodiversity data for taxonomic groups that are underrepresented in the current biodiversity data sphere. Mobilizing these data is key to understanding and protecting some of the less well-known domains of biodiversity.
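The idea of using the relative abundance of adult specimens as a proxy for the best collecting times and places can be illustrated with a minimal Python sketch. It is not the authors' workflow: the record layout, the function names `relative_abundance` and `rank`, and all values below are hypothetical placeholders, assuming only that each extracted record carries a locality, a month, and an adult count.

```python
from collections import Counter

# Hypothetical literature records: (province, month, adult_count).
# Names and numbers are illustrative placeholders, not data from the study.
records = [
    ("Province A", 6, 12),
    ("Province A", 8, 7),
    ("Province B", 6, 3),
    ("Province B", 1, 1),
    ("Province C", 8, 9),
]

def relative_abundance(records, key_index):
    """Share of all adult specimens per group (key_index 0 = province, 1 = month)."""
    totals = Counter()
    for rec in records:
        totals[rec[key_index]] += rec[2]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: count / grand_total for key, count in totals.items()}

def rank(records, key_index):
    """Groups sorted from highest to lowest share of adult specimens."""
    shares = relative_abundance(records, key_index)
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

by_month = rank(records, 1)
by_province = rank(records, 0)

for month, share in by_month:
    print(f"month {month}: {share:.2f}")
for province, share in by_province:
    print(f"{province}: {share:.2f}")
```

Under these assumptions, the highest-ranked months and provinces would be the candidate times and places for fieldwork. Such a ranking reflects past collecting effort as much as true abundance, so it is best read as a guide rather than an estimate.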