Singh Gurnoor, Kuzniar Arnold, van Mulligen Erik M, Gavai Anand, Bachem Christian W, Visser Richard G F, Finkers Richard
Plant Breeding, Wageningen University and Research, Wageningen, The Netherlands.
Netherlands eScience Center (NLeSC), Amsterdam, The Netherlands.
BMC Bioinformatics. 2018 May 25;19(1):183. doi: 10.1186/s12859-018-2165-7.
A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information from QTL mapping studies is reported in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature. QTM is a command line tool written in the Java programming language. The tool takes scientific articles from the Europe PMC repository as input and extracts QTL tables using keyword matching and ontology-based concept identification. The tables are then normalized using rules derived from table properties such as captions, column headers and table footers. Table columns are further classified into three categories, namely column descriptors, properties and values, based on column headers and the data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform, and the results are stored in a relational database and a text file.
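The abbreviation expansion step can be illustrated with a simplified sketch of the Schwartz and Hearst matching idea: for a short form in parentheses, scan the preceding words from right to left and find the shortest span whose characters cover the short form in order, with the first character falling at the start of a word. This is not QTM's code (QTM is written in Java); the function names below are hypothetical, only the "long form (SF)" pattern is handled, and the short-form validity check is reduced to a few basic conditions.

```python
import re


def find_best_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` that matches `short_form`, or None.

    Characters are matched right to left; the first character of the
    short form must match the first character of a word.
    """
    l = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # ignore punctuation inside the short form
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of text before matching every character
        l -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def is_valid_short_form(short_form):
    """Basic plausibility check (simplified assumption, not the full rule set)."""
    return (
        2 <= len(short_form) <= 10
        and short_form[0].isalnum()
        and any(c.isalpha() for c in short_form)
    )


def extract_abbreviations(sentence):
    """Map each parenthesized short form in `sentence` to its long form."""
    pairs = {}
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not is_valid_short_form(short_form):
            continue
        words = sentence[: match.start()].split()
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before the parenthesis.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs


if __name__ == "__main__":
    print(extract_abbreviations("A quantitative trait locus (QTL) is a genomic region."))
```

Applied to the opening sentence of this abstract, the sketch pairs the short form QTL with "quantitative trait locus". Table text is usually terser than running prose, so a real pipeline would presumably look for definitions in captions and footers as well as in cells.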
The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall and in potato with 82.82% precision and 98.94% recall.
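For reference, precision is the fraction of extracted QTL statements that are correct, and recall is the fraction of manually annotated statements that were extracted. The sketch below shows the computation on sets of statement identifiers; the identifiers and counts are made up for illustration and are not taken from the tomato or potato corpora.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers: three of four predictions are in the gold set,
    # and three of four gold statements were found.
    predicted = {"qtl_1", "qtl_2", "qtl_3", "qtl_4"}
    gold = {"qtl_1", "qtl_2", "qtl_3", "qtl_5"}
    print(precision_recall(predicted, gold))
```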
QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.