Hemandhar Kumar Shamini, Tapken Ines, Kuhn Daniela, Claus Peter, Jung Klaus
Institute for Animal Genomics, University of Veterinary Medicine, Foundation, Hannover, Germany.
Center for Systems Neuroscience (ZSN), University of Veterinary Medicine, Foundation, Hannover, Germany.
Front Bioinform. 2024 Apr 3;4:1380928. doi: 10.3389/fbinf.2024.1380928. eCollection 2024.
Gene set enrichment analysis (GSEA) subsequent to differential expression analysis is a standard step in transcriptomics and proteomics data analysis. Although many tools for this step are available, the results are often difficult to reproduce because set annotations in the databases can change: new features can be added or existing features removed. Such changes in set composition can, in turn, affect biological interpretation. We present bootGSEA, a novel computational pipeline for studying the robustness of GSEA. By repeating GSEA on bootstrap samples, the variability and robustness of the results can be assessed. In our pipeline, not all genes or proteins are included in each bootstrap replicate of the analysis, so the composition of the sets varies between replicates. We then aggregate the ranks from the bootstrap replicates to obtain a score per gene set that shows whether the set gains or loses evidence compared to its ranking in the standard GSEA. Rank aggregation is also used to combine GSEA results from different omics levels or from multiple independent studies at the same omics level. By applying our approach to six independent cancer transcriptomics datasets, we showed that bootstrap GSEA can aid in the selection of more robust enriched gene sets. Additionally, we applied our approach to paired transcriptomics and proteomics data obtained from a mouse model of spinal muscular atrophy (SMA), a neurodegenerative and neurodevelopmental disease associated with multi-system involvement. After obtaining a robust ranking at each omics level, the two ranking lists were combined to aggregate the findings from the transcriptomics and proteomics results. Furthermore, we constructed the new R package "bootGSEA," which implements the proposed methods and provides graphical views of the findings. In the example datasets, bootstrap-based GSEA was able to identify gene or protein sets that were less robust when the set composition changed during bootstrap analysis. The rank aggregation step was useful for combining bootstrap results and making them comparable to the original findings at the single-omics level, and for combining findings from multiple omics levels.
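The bootstrap-and-aggregate idea described above can be illustrated with a minimal Python sketch. It is not the bootGSEA R package or its API: all function names are hypothetical, the enrichment score is a toy mean of absolute gene-level statistics rather than a real GSEA statistic, and the mean rank is assumed as the aggregation rule because the abstract does not specify one. The example data are made up for illustration only.

```python
import random
from statistics import mean

def set_scores(gene_stats, gene_sets):
    """Toy enrichment score (assumption): mean |statistic| of the set members present."""
    scores = {}
    for name, members in gene_sets.items():
        present = [abs(gene_stats[g]) for g in members if g in gene_stats]
        scores[name] = mean(present) if present else 0.0
    return scores

def rank_sets(scores):
    """Rank 1 is the strongest set; ties are broken by set name."""
    ordered = sorted(scores, key=lambda s: (-scores[s], s))
    return {s: i + 1 for i, s in enumerate(ordered)}

def bootstrap_ranks(gene_stats, gene_sets, n_boot=200, seed=0):
    """Re-rank the sets on bootstrap samples of genes and aggregate by mean rank."""
    rng = random.Random(seed)
    genes = sorted(gene_stats)
    original = rank_sets(set_scores(gene_stats, gene_sets))
    collected = {s: [] for s in gene_sets}
    for _ in range(n_boot):
        # Sampling genes with replacement leaves some genes out of each
        # replicate, so the effective set composition changes.
        drawn = set(rng.choices(genes, k=len(genes)))
        subset = {g: gene_stats[g] for g in drawn}
        for s, r in rank_sets(set_scores(subset, gene_sets)).items():
            collected[s].append(r)
    aggregated = {s: mean(r) for s, r in collected.items()}
    # Positive shift: the set ranks better under resampling (gains evidence);
    # negative shift: it ranks worse (loses evidence).
    shift = {s: original[s] - aggregated[s] for s in gene_sets}
    return original, aggregated, shift

def combine_rankings(rankings):
    """Mean-rank aggregation over the sets shared by several rankings,
    for example one ranking per omics level or per study."""
    shared = set.intersection(*(set(r) for r in rankings))
    return {s: mean(r[s] for r in rankings) for s in sorted(shared)}

# Made-up example: "fragile" is driven by the single gene g5, "robust" is not.
gene_stats = {f"g{i}": 3.0 for i in range(1, 5)}
gene_stats.update({"g5": 14.0, "g6": 0.1, "g7": 0.1, "g8": 0.1})
gene_stats.update({f"g{i}": 1.0 for i in range(9, 13)})
gene_sets = {
    "robust": ["g1", "g2", "g3", "g4"],
    "fragile": ["g5", "g6", "g7", "g8"],
    "background": ["g9", "g10", "g11", "g12"],
}

original, aggregated, shift = bootstrap_ranks(gene_stats, gene_sets)
combined = combine_rankings([{"a": 1, "b": 3}, {"a": 2, "b": 3}])
```

In this toy setting, the set that depends on one gene ranks first in the standard analysis but drops whenever that gene is missing from a replicate, which is the kind of loss of evidence the aggregated score is meant to expose.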