Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA.
Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA.
Database (Oxford). 2018 Jan 1;2018:bay128. doi: 10.1093/database/bay128.
Numerous text-mining tools have been developed to extract information from biomedical text automatically, and they have assisted in many biological tasks, such as database curation and hypothesis generation. These tools usually differ from one another in programming language, system dependencies and input/output format. Few previous works address the integration of different text-mining tools and of their results from large-scale text processing. In this paper, we describe the iTextMine system, which provides an automated workflow for running multiple text-mining tools on large-scale text for knowledge extraction. We run dockerized text-mining tools in parallel, standardize their output in a JSON format, and implement a text alignment algorithm that resolves text discrepancies so that results can be integrated. iTextMine presently integrates four relation extraction tools, which have been used to process all Medline abstracts and PMC open access full-length articles. The website allows users to browse the text evidence and to view integrated results for knowledge discovery through a network view. We demonstrate the utility of iTextMine with two use cases involving the gene PTEN and breast cancer, and the gene SATB1.
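The abstract does not specify how the text alignment algorithm works. As an illustration only, the following minimal sketch shows one way a text discrepancy between a tool's copy of a passage and a reference copy could be reconciled, so that character spans reported by different tools refer to the same text. It uses Python's standard `difflib.SequenceMatcher`, which is a stand-in technique and not necessarily the method used in iTextMine; the function names `build_offset_map` and `project_span` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def build_offset_map(tool_text: str, reference_text: str) -> Dict[int, int]:
    """Map character offsets in tool_text to offsets in reference_text.

    Only characters that fall inside a matching block are mapped.
    (Hypothetical helper; not taken from the iTextMine code base.)
    """
    matcher = SequenceMatcher(None, tool_text, reference_text, autojunk=False)
    offset_map: Dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            offset_map[block.a + k] = block.b + k
    return offset_map


def project_span(
    span: Tuple[int, int], offset_map: Dict[int, int]
) -> Optional[Tuple[int, int]]:
    """Project a (start, end) span, end exclusive, onto the reference text.

    Returns None when no character of the span could be aligned.
    """
    start, end = span
    mapped = [offset_map[i] for i in range(start, end) if i in offset_map]
    if not mapped:
        return None
    # Matching blocks are ordered, so the first and last mapped offsets
    # bound the projected span.
    return (mapped[0], mapped[-1] + 1)


# Assumed example: a tool re-tokenized the sentence and changed the spacing.
reference = "PTEN inhibits AKT phosphorylation."
tool_text = "PTEN  inhibits AKT phosphorylation ."

offsets = build_offset_map(tool_text, reference)
start = tool_text.index("AKT")
projected = project_span((start, start + len("AKT")), offsets)
if projected is not None:
    print(projected, reference[projected[0]:projected[1]])
```

Projecting every tool's spans onto a single reference copy of the text is what would let relations extracted by different tools be merged and displayed against the same evidence sentence. Building a per-character map is simple but memory-heavy; for full-length articles a block-level map would be a more practical choice.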