Rahmatallah Yasir, Glazko Galina
Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.
BMC Bioinformatics. 2025 Apr 14;26(1):103. doi: 10.1186/s12859-025-06117-0.
Gene set analysis methods have played a major role in generating biological interpretations of omics data such as gene expression datasets. However, most methods focus on detecting homogeneous pattern changes in mean expression, while methods that detect pattern changes in variance remain poorly explored. Although a few studies have attempted gene-level variance analysis, this approach remains under-utilized. When two phenotypes are compared, gene sets with distinct changes in subgroups under one phenotype are overlooked by available methods, even though they reflect meaningful biological differences between the phenotypes. Multivariate sample-level variance analysis methods are needed to detect such pattern changes.
We used ranking schemes based on the minimum spanning tree to generalize the Cramér-von Mises and Anderson-Darling univariate statistics into multivariate gene set analysis methods that detect differential sample variance or mean. Using simulations with different parameter settings, we characterized the detection power and Type I error rate of these methods together with two methods developed earlier. We applied the developed methods to a microarray gene expression dataset of prednisolone-resistant and prednisolone-sensitive children diagnosed with B-lineage acute lymphoblastic leukemia and to a bulk RNA-sequencing gene expression dataset of benign hyperplastic polyps and potentially malignant sessile serrated adenoma/polyps. In each of these datasets, one or both of the compared phenotypes have distinct molecular subtypes that contribute to within-phenotype variability and to heterogeneous differences between the phenotypes. Our results show that methods designed to detect differential sample variance provide meaningful biological interpretations by detecting specific hallmark gene sets that the available literature associates with the two compared phenotypes.
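The idea of ranking samples along a minimum spanning tree and then applying a univariate two-sample statistic to those ranks can be illustrated with a minimal Python sketch. It is not the GSAR implementation. The function names (`radial_order`, `cvm_statistic`, `mst_cvm_test`), the choice of Euclidean distance, the choice of root (the tree vertex with the smallest maximum tree distance to any other vertex), and the unweighted Cramér-von Mises-type statistic are all assumptions made for illustration. Samples are rows and genes are columns, and significance is assessed by permuting the phenotype labels.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def radial_order(X):
    """Order samples (rows of X) by tree distance from the MST centre.

    Assumption: a simplified radial ranking, not the scheme used in GSAR.
    Duplicate samples (zero distances) are not handled, because SciPy
    treats zero entries of a dense graph as missing edges.
    """
    dist = squareform(pdist(X))            # pooled pairwise Euclidean distances
    mst = minimum_spanning_tree(dist)      # sparse tree with N - 1 edges
    tree_dist = shortest_path(mst, directed=False)  # path lengths along the tree
    root = np.argmin(tree_dist.max(axis=1))         # most central vertex
    return np.argsort(tree_dist[root], kind="stable")


def cvm_statistic(labels_in_order):
    """Two-sample Cramer-von Mises-type statistic along a given ordering."""
    g = np.asarray(labels_in_order, dtype=bool)
    n1, n2 = g.sum(), (~g).sum()
    f1 = np.cumsum(g) / n1                 # empirical CDF of group 1 over ranks
    f2 = np.cumsum(~g) / n2                # empirical CDF of group 2 over ranks
    return n1 * n2 / (n1 + n2) ** 2 * np.sum((f1 - f2) ** 2)


def mst_cvm_test(X, group, n_perm=1000, seed=0):
    """Permutation test of the statistic computed on MST radial ranks."""
    rng = np.random.default_rng(seed)
    group = np.asarray(group, dtype=bool)
    order = radial_order(X)                # ranking ignores the labels
    observed = cvm_statistic(group[order])
    null = np.array([
        cvm_statistic(rng.permutation(group)[order]) for _ in range(n_perm)
    ])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```

Calling `mst_cvm_test(X, group)` with a samples-by-genes matrix for one gene set and a boolean phenotype vector returns the observed statistic and a permutation p-value. Because samples from a more variable phenotype tend to sit at the periphery of the tree, a radial ordering of this kind is sensitive to differences in sample variance. An ordering that follows the tree from one end to the other would be the natural counterpart for differences in mean.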
The results of this study demonstrate that methods designed to detect differential sample variance are useful for biological interpretation when biologically relevant but heterogeneous changes between two phenotypes are prevalent in specific signaling pathways. A software implementation of the methods is available, with detailed documentation, in the Bioconductor package GSAR. The methods apply to gene expression datasets in normalized matrix form and could be used with other omics datasets in normalized matrix form for which a collection of feature sets is available.