Towle-Miller Lorin, Jordan William, Lockhart Alexandre, Freudenburg Johannes, Virmani Aman, Bergquist Mandy, Miecznikowski Jeffrey, Powley Will
GSK, Biostatistics, Collegeville, USA.
GSK, Computational Biology, Collegeville, USA.
BMC Genomics. 2025 Jul 1;26(1):596. doi: 10.1186/s12864-025-11769-6.
Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score bounded between -1 and 1, where negative values suggest downregulation and positive values suggest upregulation. Although this interpretation is simple in theory, it depends on unbiased estimation of the individual gene distributions. In the current version of GSVA, gene distributions are estimated from the input dataset itself (i.e., the scores are calculated from the gene distributions of the same dataset that is being scored). This becomes a major issue when the study data do not adequately represent the full distribution of the population. For example, if RNA-seq data were collected on an imbalanced sample (e.g., more disease samples than healthy controls), it would be difficult to discern abnormalities in pathway activity because the gene distributions were estimated on a biased population. Therefore, we propose reference stabilizing GSVA (rsGSVA), which addresses this commonly ignored limitation by using reference datasets to estimate the gene distributions, yielding a more stable GSVA score.
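The effect of estimating a gene distribution on an imbalanced sample can be illustrated with a minimal sketch. It is not the GSVA or rsGSVA implementation: it uses a plain empirical CDF where GSVA uses its own distribution estimate, and the simulated expression values, group sizes, and the helper name `ecdf_values` are all assumptions made for illustration.

```python
import numpy as np

def ecdf_values(train, query):
    """Empirical CDF of `train`, evaluated at each value in `query`.

    Hypothetical helper: a plain ECDF stands in for the per-gene
    distribution estimate used by GSVA.
    """
    train_sorted = np.sort(train)
    return np.searchsorted(train_sorted, query, side="right") / train_sorted.size

rng = np.random.default_rng(0)

# Simulated expression of a single gene (assumed, not real data):
# healthy samples centered at 0, disease samples centered at 2.
healthy = rng.normal(0.0, 1.0, 2000)
disease = rng.normal(2.0, 1.0, 2000)

# Balanced reference dataset: 1000 healthy + 1000 disease.
reference = np.concatenate([healthy[:1000], disease[:1000]])
# Imbalanced study cohort: 100 healthy + 900 disease.
cohort = np.concatenate([healthy[1000:1100], disease[1000:1900]])

# A typical disease-level expression value.
x = np.array([2.0])

# Percentile of x when the distribution comes from the cohort itself
# versus from the balanced reference.
self_cdf = ecdf_values(cohort, x)[0]
ref_cdf = ecdf_values(reference, x)[0]

print(f"cohort-based percentile:    {self_cdf:.2f}")
print(f"reference-based percentile: {ref_cdf:.2f}")
```

Because the cohort is dominated by disease samples, a disease-level value sits near the middle of the cohort-based distribution and looks unremarkable, whereas against the balanced reference it falls clearly in the upper part of the distribution.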
rsGSVA shows power comparable to that of classic GSVA, singscore, and ssGSEA under ideal settings while producing stable scores on sample subsets. An application to irritable bowel disease highlights the interpretational advantages of rsGSVA over other methods with respect to up- and downregulation of inflammation signatures.
The rsGSVA technique enhances the GSVA functionality by incorporating a reference dataset. This integration of a reference dataset makes the enrichment scores independent of the input distribution and ensures their stability and reproducibility, even as samples are added or removed.
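The stability property can be sketched with a toy score. The following is an illustrative simplification, not the rsGSVA statistic: the score is just the mean of 2·ECDF(x) − 1 over the genes in a set (which keeps it in [-1, 1]), and the function name `toy_scores`, the simulated data, and the gene set are all hypothetical.

```python
import numpy as np

def toy_scores(expr, gene_idx, train):
    """Toy gene-set score in [-1, 1] for every sample (column) of `expr`.

    expr, train: arrays of shape (genes, samples).
    Per-gene ECDFs are estimated from `train`; the score is the mean of
    2 * ECDF(x) - 1 over the genes in `gene_idx`. This is a stand-in for
    the real GSVA statistic, used only to illustrate stability.
    """
    scores = np.zeros(expr.shape[1])
    for g in gene_idx:
        train_sorted = np.sort(train[g])
        cdf = np.searchsorted(train_sorted, expr[g], side="right") / train_sorted.size
        scores += 2.0 * cdf - 1.0
    return scores / len(gene_idx)

rng = np.random.default_rng(1)
n_genes = 20
gene_set = list(range(10))  # hypothetical gene set: the first 10 genes

# Simulated data (assumed): a fixed reference and a 60-sample study cohort
# whose last 30 samples are shifted upward ("disease").
reference = rng.normal(0.0, 1.0, (n_genes, 400))
cohort = rng.normal(0.0, 1.0, (n_genes, 60))
cohort[:, 30:] += 1.5

# Remove the last 20 samples and rescore the remaining 40.
subset = cohort[:, :40]

# Self-referenced scoring: distributions come from the data being scored.
self_full = toy_scores(cohort, gene_set, cohort)[:40]
self_sub = toy_scores(subset, gene_set, subset)

# Reference-based scoring: distributions come from the fixed reference.
ref_full = toy_scores(cohort, gene_set, reference)[:40]
ref_sub = toy_scores(subset, gene_set, reference)

self_shift = np.max(np.abs(self_full - self_sub))
ref_shift = np.max(np.abs(ref_full - ref_sub))

print(f"max score change, self-referenced: {self_shift:.3f}")
print(f"max score change, reference-based: {ref_shift:.3f}")
```

With a fixed reference, each sample's score depends only on that sample and the reference, so removing other samples leaves it unchanged; the self-referenced scores move because the estimated distributions themselves change.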