The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis.

Affiliation

Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium.

Publication information

Plant Cell. 2013 Mar;25(3):794-807. doi: 10.1105/tpc.112.108753. Epub 2013 Mar 26.

DOI:10.1105/tpc.112.108753

PMID:23532071

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3634689/

Abstract

Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
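The integration step described in the abstract, combining text mining associations with protein-protein and regulatory interactions and then inferring gene information through guilt by association, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the gene identifiers (`g1` to `g4`), the annotation labels, and the simple majority vote over neighbours are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def integrate(*sources):
    """Merge named edge lists into one undirected network.

    Each source is (name, iterable of (gene_a, gene_b)). The result maps
    every unordered gene pair to the set of source names supporting it.
    """
    evidence = defaultdict(set)
    for name, edges in sources:
        for a, b in edges:
            if a != b:  # ignore self-loops
                evidence[frozenset((a, b))].add(name)
    return dict(evidence)


def neighbours(evidence):
    """Build an adjacency map from the integrated evidence."""
    adj = defaultdict(set)
    for pair in evidence:
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)
    return adj


def guilt_by_association(gene, adj, annotations):
    """Return the most common annotation among annotated neighbours, or None."""
    votes = Counter(annotations[n] for n in adj.get(gene, ()) if n in annotations)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data, not taken from the study.
text_mining = [("g1", "g2"), ("g2", "g3")]
ppi = [("g1", "g2"), ("g3", "g4")]
regulatory = [("g2", "g4")]

evidence = integrate(
    ("text_mining", text_mining),
    ("ppi", ppi),
    ("regulatory", regulatory),
)
adj = neighbours(evidence)

# g2 is unannotated; its neighbours g1, g3 and g4 carry hypothetical labels.
annotations = {"g1": "cell cycle", "g3": "cell cycle", "g4": "stress response"}
prediction = guilt_by_association("g2", adj, annotations)
```

Keeping the set of supporting sources per edge makes it possible to see which interactions are backed by both the literature and an experimental database, which is the kind of cross-checking an integrative network allows.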
