
paraGSEA: a scalable approach for large-scale gene expression profiling.

Authors

Peng Shaoliang, Yang Shunyun, Bo Xiaochen, Li Fei

Affiliations

College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha 410082, China.

School of Computer Science, National University of Defense Technology, Changsha 410073, China.

Publication information

Nucleic Acids Res. 2017 Sep 29;45(17):e155. doi: 10.1093/nar/gkx679.

DOI:10.1093/nar/gkx679
PMID:28973463
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5737394/
Abstract

A growing number of studies use gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, because its significance-level estimation and multiple-hypothesis-testing steps carry enormous computational overhead, its scalability and efficiency are poor on large-scale datasets. We propose paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which yields a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. With further parallelization, paraGSEA achieves a near-linear speed-up on both workstations and clusters, with high scalability on large-scale datasets. The analysis time for the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA.
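
The abstract does not say how the O(mn) to O(m+n) reduction is achieved. The sketch below illustrates one common way such a reduction can arise for an unweighted, Kolmogorov-Smirnov-style enrichment score. A lookup table from gene to rank is built once per profile, so a gene set of length m can be placed in the ranking in O(m) and scored with a single O(n) sweep, instead of scanning the gene set for every ranked gene. The function names, the unweighted statistic, and the lookup-table approach are assumptions made for illustration and are not taken from the paraGSEA source code.

```python
def enrichment_score_naive(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score, O(m*n).

    For every gene in the ranked profile (length n) the gene set
    (length m) is scanned linearly to test membership.
    """
    gs = list(gene_set)
    n = len(ranked_genes)
    hits = sum(1 for g in ranked_genes if g in gs)
    if hits == 0 or hits == n:
        raise ValueError("gene set must hit some, but not all, ranked genes")
    hit_step, miss_step = 1.0 / hits, 1.0 / (n - hits)
    running = best = 0.0
    for g in ranked_genes:
        running += hit_step if g in gs else -miss_step  # linear scan of gs
        if abs(running) > abs(best):
            best = running
    return best


def build_rank_index(ranked_genes):
    """Built once per profile in O(n): gene -> 0-based rank."""
    return {g: r for r, g in enumerate(ranked_genes)}


def enrichment_score_indexed(rank_of, n, gene_set):
    """Same statistic in O(m + n), using a precomputed rank lookup."""
    is_hit = [False] * n
    hits = 0
    for g in gene_set:              # O(m): place each set gene by rank
        r = rank_of.get(g)
        if r is not None and not is_hit[r]:
            is_hit[r] = True
            hits += 1
    if hits == 0 or hits == n:
        raise ValueError("gene set must hit some, but not all, ranked genes")
    hit_step, miss_step = 1.0 / hits, 1.0 / (n - hits)
    running = best = 0.0
    for flag in is_hit:             # O(n): single sweep down the ranking
        running += hit_step if flag else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical toy profile: six genes, already sorted by expression change.
profile = ["g1", "g2", "g3", "g4", "g5", "g6"]
index = build_rank_index(profile)
es_top = enrichment_score_indexed(index, len(profile), ["g1", "g2", "g5"])
es_bottom = enrichment_score_indexed(index, len(profile), ["g5", "g6"])
```

Because the rank index depends only on the profile, it can be reused across every gene set compared against that profile. Under these assumptions, that reuse is what makes the per-comparison cost additive rather than multiplicative in m and n.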


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e73/5737394/fa774a2e9061/gkx679fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e73/5737394/9b455d4f7ea2/gkx679fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e73/5737394/25ddbfd43fcb/gkx679fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e73/5737394/47a33a4210eb/gkx679fig4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e73/5737394/a210d71d40e4/gkx679fig5.jpg
