Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium.
Plant Cell. 2013 Mar;25(3):794-807. doi: 10.1105/tpc.112.108753. Epub 2013 Mar 26.
Despite the availability of various data repositories for plant research, a wealth of information remains hidden within the biomolecular literature. Text mining provides the means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general, and for network biology in particular, using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present an extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for use in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies.
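The integrative step described in the abstract (merging text-mined interactions with protein-protein and regulatory interactions, delineating clusters, and transferring information by guilt by association) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the clustering method, so plain connected components stand in for it here, and the function names, the `min_sources` parameter, and the gene labels (`G1` to `G5`) are hypothetical placeholders, not identifiers or data from the study.

```python
from collections import Counter, defaultdict


def merge_edges(*sources):
    """Union undirected gene-gene edges from several named sources.

    Each source is a (name, edges) pair; the result maps every gene pair
    to the set of source names that report it.
    """
    support = defaultdict(set)
    for name, edges in sources:
        for a, b in edges:
            support[frozenset((a, b))].add(name)
    return support


def clusters(support, min_sources=1):
    """Group genes into connected components (a stand-in for real clustering).

    Only edges reported by at least `min_sources` sources are used.
    """
    adjacency = defaultdict(set)
    for pair, names in support.items():
        if len(pair) == 2 and len(names) >= min_sources:  # skip self-loops
            a, b = tuple(pair)
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


def guilt_by_association(cluster, annotations):
    """Suggest the cluster's most common annotation for unannotated members."""
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return {}
    top_annotation = Counter(known).most_common(1)[0][0]
    return {g: top_annotation for g in cluster if g not in annotations}


# Toy input: placeholder genes and edges, not data from the study.
text_mining = [("G1", "G2"), ("G2", "G3")]
protein_protein = [("G1", "G2"), ("G4", "G5")]
regulatory = [("G3", "G1")]

network = merge_edges(
    ("text", text_mining), ("ppi", protein_protein), ("reg", regulatory)
)
for cluster in clusters(network):
    print(cluster, guilt_by_association(cluster, {"G1": "root development",
                                                  "G2": "root development"}))
```

Recording which sources support each edge makes it possible to require agreement between text mining and experimental evidence by raising `min_sources`, at the cost of a sparser network.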