Subirana-Granés Marc, Nandi Sutanu, Zhang Haoyu, Chikina Maria, Pividori Milton
Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
bioRxiv. 2025 Jun 8:2025.06.05.658122. doi: 10.1101/2025.06.05.658122.
Gene expression analysis has long been fundamental for elucidating molecular pathways and gene-disease relationships, but traditional single-gene approaches cannot capture the coordinated regulatory networks underlying complex phenotypes. Unsupervised matrix factorization methods (e.g., PCA, NMF) reveal coexpression patterns, but they cannot incorporate prior biological knowledge and often struggle with interpretability and technical noise correction. Semi-supervised strategies such as PLIER improve interpretability by integrating pathway annotations during latent variable extraction, yet the original PLIER implementation is prohibitively slow and memory-intensive, making it impractical for modern large-scale resources like ARCHS4 or recount3. Here, we introduce PLIERv2, which overcomes these constraints through three components: a two-phase algorithmic design (an unsupervised "PLIERbase" initialization followed by a "PLIERfull" regression that incorporates priors via glmnet), rigorous internal cross-validation that tunes regularization parameters for each latent variable, and efficient on-disk data handling using memory-mapped matrices from the bigstatsr package. Benchmarking on GTEx, recount2, and ARCHS4 shows that PLIERv2 achieves 7× to 41× speedups over PLIERv1, models hundreds of thousands of samples that PLIERv1 cannot handle, and maintains or improves the biological specificity of latent variables, as shown by tissue-alignment and pathway enrichment analyses. By filling the gap in scalable, biologically informed latent variable extraction, PLIERv2 enables comprehensive analysis of modern transcriptomic compendia and paves the way for deeper insights into gene regulatory networks and downstream applications in translational genomics.
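The prior-informed step described in the abstract (a regularized regression whose penalty is cross-validated separately for each latent variable) can be illustrated with a minimal, dependency-free sketch. This is not the PLIERv2 code. It assumes the regression maps each latent variable's gene loadings onto a binary gene-by-pathway prior matrix, substitutes a plain coordinate-descent lasso for glmnet's elastic net, and uses hypothetical names (`lasso_fit`, `cv_error`, `fit_priors_per_lv`) and invented toy data throughout.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t (the lasso proximal step)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def lasso_fit(C, z, lam, n_iter=100):
    """Coordinate-descent lasso: minimize (1/2n)*||z - C u||^2 + lam*||u||_1."""
    n, p = len(C), len(C[0])
    u = [0.0] * p
    resid = list(z)  # residual z - C u; equals z while u is all zeros
    col_sq = [sum(C[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # pathway has no genes in this training subset
            # Correlation of column j with the residual that excludes u[j].
            rho = sum(C[i][j] * (resid[i] + C[i][j] * u[j]) for i in range(n))
            new = soft_threshold(rho / n, lam) / (col_sq[j] / n)
            delta = new - u[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= C[i][j] * delta
                u[j] = new
    return u


def cv_error(C, z, lam, n_folds=3):
    """Mean squared held-out error for one penalty, using interleaved folds."""
    n, total = len(z), 0.0
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        held_out = [i for i in range(n) if i % n_folds == k]
        u = lasso_fit([C[i] for i in train], [z[i] for i in train], lam)
        for i in held_out:
            pred = sum(c * w for c, w in zip(C[i], u))
            total += (z[i] - pred) ** 2
    return total / n


def fit_priors_per_lv(C, Z, lambdas):
    """For each latent variable (column of Z), choose its own penalty by
    cross-validation, then refit on all genes with that penalty."""
    results = []
    for k in range(len(Z[0])):
        z = [row[k] for row in Z]
        best = min(lambdas, key=lambda lam: cv_error(C, z, lam))
        results.append((best, lasso_fit(C, z, best)))
    return results


# Invented toy data: 12 genes, 2 pathways, 2 latent variables (LVs).
# Genes 0-3 belong to pathway A, genes 4-7 to pathway B, genes 8-11 to neither.
C = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[0.0, 0.0]] * 4
# Gene loadings (rows: genes, columns: LVs), standing in for the output of an
# unsupervised first phase. LV 0 loads on pathway A, LV 1 on pathway B.
Z = [
    [1.0, 0.0], [0.9, 0.0], [1.1, 0.0], [1.0, 0.0],
    [0.0, 0.8], [0.0, 1.2], [0.0, 1.0], [0.0, 1.0],
    [0.1, 0.0], [-0.1, 0.05], [0.0, -0.05], [0.05, 0.0],
]
lambdas = [0.01, 0.05, 0.2]
fits = fit_priors_per_lv(C, Z, lambdas)
for k, (lam, u) in enumerate(fits):
    print(f"LV {k}: lambda={lam}, pathway coefficients={u}")
```

Each latent variable ends up with its own penalty and a sparse coefficient vector over pathways, which is what ties a latent variable to named biology. In practice a tuned solver such as glmnet would replace the hand-written loop, and the loadings would come from an actual factorization rather than a hard-coded table.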