Proost Sebastian, Krawczyk Agnieszka, Mutwil Marek
Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476, Potsdam, Germany.
BMC Bioinformatics. 2017 Oct 10;18(1):444. doi: 10.1186/s12859-017-1861-z.
Because experimental elucidation of gene function is often laborious, various in silico methods have been developed to predict the function of uncharacterized genes. Functionally related genes are often expressed in the same tissues, conditions and developmental stages (that is, they are co-expressed), so the functional annotation of characterized genes can be transferred to co-expressed genes that lack annotation. With genome-wide expression data available, the construction of co-expression networks, in which genes are nodes and edges connect significantly co-expressed genes, provides unprecedented opportunities to predict gene function. However, constructing such networks requires large volumes of high-quality data, multiple processing steps and considerable computing power. While efficient tools exist to process RNA-Seq data, pipelines that combine them to construct co-expression networks efficiently are currently lacking.
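To make the network definition above concrete, the following is a minimal sketch and not LSTrAP's implementation. It assumes Pearson correlation as the co-expression measure and a fixed cutoff of 0.9 in place of a significance test; the abstract specifies neither. The gene names, expression values and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def build_network(expression, threshold=0.9):
    """Return {gene: set of neighbors}; an edge joins two genes whose
    profiles correlate at or above the (assumed) threshold."""
    network = {gene: set() for gene in expression}
    for g1, g2 in combinations(expression, 2):
        if pearson(expression[g1], expression[g2]) >= threshold:
            network[g1].add(g2)
            network[g2].add(g1)
    return network


def transfer_annotation(network, annotation):
    """Suggest labels for unannotated genes from their annotated neighbors."""
    predictions = {}
    for gene, neighbors in network.items():
        if gene in annotation:
            continue
        labels = {annotation[n] for n in neighbors if n in annotation}
        if labels:
            predictions[gene] = labels
    return predictions


# Toy expression matrix: one list of values per gene, one value per sample.
expression = {
    "geneA": [1.0, 5.0, 9.0, 2.0],
    "geneB": [2.0, 10.0, 18.5, 4.0],
    "geneC": [9.0, 1.0, 0.5, 8.0],
    "geneD": [1.5, 4.5, 9.5, 2.5],
}
annotation = {"geneA": "photosynthesis"}  # only geneA is characterized

network = build_network(expression)
predictions = transfer_annotation(network, annotation)
print(predictions)
```

In this toy data, geneB and geneD follow the same profile as geneA and so inherit its annotation as a candidate function, while the anti-correlated geneC stays unconnected. Comparing every pair of genes scales quadratically with gene count, which is one reason genome-wide networks need substantial computing power.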
LSTrAP (Large-Scale Transcriptome Analysis Pipeline), presented here, combines all essential tools to construct co-expression networks from RNA-Seq data into a single, efficient workflow. Because it supports parallel computing on computer cluster infrastructure, processing hundreds of samples becomes feasible, as shown here for Arabidopsis thaliana and Sorghum bicolor, with datasets of 876 and 215 samples, respectively. The Arabidopsis thaliana dataset was used to show how the quality control included in LSTrAP can detect spurious or low-quality samples. The Sorghum bicolor dataset was used to show how co-expression networks group known photosynthesis genes and implicate several currently uncharacterized genes in this process.
LSTrAP combines the most popular and performant methods to construct co-expression networks from RNA-Seq data into a single workflow. This allows the large amounts of expression data required to construct co-expression networks to be processed efficiently and consistently across hundreds of samples. LSTrAP is implemented in Python 3.4 (or higher) and is available under the MIT license from https://github.molgen.mpg.de/proost/LSTrAP.