Muñoz Gabriel, Kissling W Daniel, van Loon E Emiel
NASUA, Biodiversity research and conservation section, Quito, Ecuador.
Faculty of Arts and Science, Department of Biology, Concordia University, Montreal, Canada.
Biodivers Data J. 2019 Jan 16(7):e28737. doi: 10.3897/BDJ.7.e28737. eCollection 2019.
A considerable portion of primary biodiversity data is digitally locked inside published literature, which is often stored as PDF files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been used extensively for data discovery tasks in large collections of documents. However, text mining approaches for knowledge discovery and retrieval have been limited in biodiversity science compared to other disciplines.
Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present in a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
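The dictionary-based screening described above can be illustrated with a minimal sketch. This is not BOM's code (the tool is written in R and its internals are not described here); it is a Python illustration of the general idea, assuming that a sentence is flagged when a known scientific name co-occurs with at least one term from a custom dictionary. The function names, the naive sentence splitter, the example text and the term list are all hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.

    Abbreviations such as 'e.g.' would be split incorrectly; a real pipeline
    would need a proper sentence tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def screen_sentences(text, species_names, dictionary_terms):
    """Return sentences where a scientific name co-occurs with a dictionary term."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Scientific names are matched as case-insensitive substrings.
        names = [n for n in species_names if n.lower() in lowered]
        # Dictionary terms are matched as whole words only.
        terms = [
            t for t in dictionary_terms
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)
        ]
        if names and terms:
            hits.append({"sentence": sentence, "names": names, "terms": terms})
    return hits


# Made-up example text and dictionary, for illustration only.
text = (
    "Nasua nasua was observed feeding on fruits of Cecropia. "
    "The study site is located in Ecuador. "
    "Ateles belzebuth disperses seeds over long distances."
)
species = ["Nasua nasua", "Ateles belzebuth"]
interaction_terms = ["feeding", "disperses", "pollinates"]

results = screen_sentences(text, species, interaction_terms)
for hit in results:
    print(hit["names"], hit["terms"], "->", hit["sentence"])
```

In this sketch the second sentence is skipped because it contains neither a listed name nor a dictionary term. In practice the scientific names would come from a name-recognition step applied to the corpus rather than from a hand-written list.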