Peng Shaoliang, Yang Shunyun, Bo Xiaochen, Li Fei
College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha 410082, China.
School of Computer Science, National University of Defense Technology, Changsha 410073, China.
Nucleic Acids Res. 2017 Sep 29;45(17):e155. doi: 10.1093/nar/gkx679.
A growing number of studies use gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, its significance-level estimation and multiple hypothesis testing steps carry an enormous computational overhead, so it scales poorly and runs inefficiently on large-scale datasets. We propose paraGSEA for efficient large-scale transcriptome data analysis. Algorithmic optimization reduces the overall time complexity from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, yielding a more than 100-fold increase in performance over other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. Further parallelization gives a near-linear speed-up on both workstations and clusters, with high scalability and performance on large-scale datasets. Analysis of the whole LINCS phase I dataset (GSE92742) took about half an hour on a 1000-node cluster on Tianhe-2, or under 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and is available at http://github.com/ysycloud/paraGSEA.
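The abstract does not say how the reduction from O(mn) to O(m+n) is achieved. As an illustration only, the sketch below shows one common way to reach that bound for a single unweighted enrichment score: the gene set is loaded into a hash set once (O(m)), so each of the n ranked genes is classified as a hit or a miss in constant average time rather than by scanning the gene set. The function name `enrichment_score` and the unweighted Kolmogorov-Smirnov-style running sum are assumptions made for this example, not paraGSEA's actual code.

```python
from typing import Iterable, Sequence


def enrichment_score(ranked_genes: Sequence[str], gene_set: Iterable[str]) -> float:
    """Unweighted running-sum enrichment score (illustrative sketch only).

    ranked_genes: genes ordered by expression change, most up-regulated first.
    gene_set:     genes whose enrichment along the ranking is being tested.
    """
    members = set(gene_set)  # O(m) build, O(1) average membership test
    n = len(ranked_genes)
    hits = sum(1 for gene in ranked_genes if gene in members)
    if hits == 0 or hits == n:
        return 0.0  # score is undefined without both hits and misses

    hit_step = 1.0 / hits          # increment when a ranked gene is in the set
    miss_step = 1.0 / (n - hits)   # decrement when it is not

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for gene in ranked_genes:      # single O(n) pass over the ranking
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
    print(enrichment_score(ranking, {"g1", "g2"}))  # set concentrated at the top
    print(enrichment_score(ranking, {"g5", "g6"}))  # set concentrated at the bottom
```

A naive version that searches the gene set as a list for every ranked gene costs O(mn) per score. Because significance estimation repeats this computation over many permutations, the per-score cost dominates the total run time.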