Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building.

Author information

Cui Hong, Xu Dongfang, Chong Steven S, Ramirez Martin, Rodenhausen Thomas, Macklin James A, Ludäscher Bertram, Morris Robert A, Soto Eduardo M, Koch Nicolás Mongiardino

Affiliations

University of Arizona, Tucson, AZ, USA.

Museo Argentino de Ciencias Naturales, CONICET, Buenos Aires, Argentina.

Publication information

BMC Bioinformatics. 2016 Nov 17;17(1):471. doi: 10.1186/s12859-016-1352-7.

Abstract

BACKGROUND

Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be directly used by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style to character data that can be reused and repurposed. This paper introduces the first semi-automated pipeline, to our knowledge, that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings.

RESULTS

From the given set of spider taxonomic publications, two versions of input (original and normalized) were generated and used by the ETC Text Capture and ETC Matrix Generation tools. The tools produced two corresponding spider body part measurement matrices, and the matrix from the normalized input was found to be much more similar to a gold standard matrix hand-curated by the scientist co-authors. The lower performance with the original input was attributed to special conventions used in the original descriptions (e.g., the omission of measurement units). The results show that simple normalization of the description text greatly increased the quality of the machine-generated matrix and reduced edit effort. The machine-generated matrix also helped identify issues in the gold standard matrix.
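To illustrate why omitted measurement units matter for automated extraction, the following is a minimal Python sketch and not the ETC implementation. Everything in it is an assumption made for illustration: the sample description text, the gold row, the default unit of mm, the regular expressions, and the cell-level agreement score. The abstract does not state how ETC parses text or how similarity to the gold standard was measured.

```python
import re

# Hypothetical description text (not from the paper's corpus):
# one value carries a unit, two omit it.
ORIGINAL = "Total length 5.20 mm. Carapace length 2.31, width 1.85."

# Hypothetical hand-curated row for the same taxon.
GOLD = {"total length": 5.20, "carapace length": 2.31, "carapace width": 1.85}

# A number, optionally followed by the assumed unit "mm".
NUMBER = re.compile(r"(\d+(?:\.\d+)?)(\s*mm\b)?")
# "<structure> length|width <value> mm"; the structure word is optional.
MEASURE = re.compile(r"(?:(\w+)\s+)?(length|width)\s+(\d+(?:\.\d+)?)\s*mm\b", re.I)


def normalize(text):
    """Append the assumed unit 'mm' to every number that lacks it."""
    def add_unit(match):
        return match.group(0) if match.group(2) else f"{match.group(1)} mm"
    return NUMBER.sub(add_unit, text)


def extract(text):
    """Return {character: value} for measurements with an explicit unit."""
    row = {}
    for sentence in re.split(r"\.\s+", text):
        structure = None  # reset the inherited body part per sentence
        for match in MEASURE.finditer(sentence):
            # "width 1.85 mm" inherits the structure named earlier in the sentence.
            structure = match.group(1) or structure
            if structure is None:
                continue
            row[f"{structure} {match.group(2)}".lower()] = float(match.group(3))
    return row


def agreement(row, gold):
    """Fraction of gold cells reproduced exactly (an assumed, simplistic score)."""
    return sum(row.get(key) == value for key, value in gold.items()) / len(gold)


if __name__ == "__main__":
    for label, text in (("original", ORIGINAL), ("normalized", normalize(ORIGINAL))):
        row = extract(text)
        print(label, row, round(agreement(row, GOLD), 2))
```

In this toy case the extractor only trusts values with an explicit unit, so the original text fills one of three gold cells while the normalized text fills all three. A real pipeline would need to handle many more conventions than this sketch does.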

CONCLUSIONS

ETC Text Capture and ETC Matrix Generation are low-barrier and effective tools for extracting measurement values from spider taxonomic descriptions and are more effective when the descriptions are self-contained. Special conventions that make the description text less self-contained challenge automated extraction of data from biodiversity descriptions and hinder the automated reuse of the published knowledge. The tools will be updated to support new requirements revealed in this case study.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0573/5114841/681dab413b25/12859_2016_1352_Fig1_HTML.jpg