Frost H Robert
Department of Biomedical Data Science, Dartmouth College, Hanover, NH 03755.
bioRxiv. 2025 May 26:2025.05.21.655279. doi: 10.1101/2025.05.21.655279.
A common approach for exploring pathway dysregulation in cancer involves gene set or pathway analysis of tumor transcriptomic data. Unfortunately, the effectiveness of cancer gene set testing is limited because most gene set collections model gene activity in normal tissue, which can differ significantly from gene activity within tumors. To address this challenge, we have developed a bioinformatics approach based on sparse principal component analysis (PCA) for optimizing existing gene set collections to reflect the pattern of gene activity in dysplastic tissue. We have used this technique to optimize the Molecular Signatures Database (MSigDB) Hallmark collection for 21 solid human cancers profiled via bulk RNA-seq by The Cancer Genome Atlas (TCGA). Demonstrating the biological utility of our approach, optimization improves the average survival association of gene set members for nearly all cancer types and Hallmark gene sets.
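The abstract does not specify the sparse PCA algorithm, so the following is only a minimal sketch of the general idea, not the author's method. It assumes a swapped-in technique (soft-thresholded power iteration on the sample covariance matrix) and treats genes with a non-zero loading on the sparse first principal component as the retained members of the set. The function names `sparse_first_pc` and `optimize_gene_set`, and the `threshold` tuning parameter, are hypothetical.

```python
import numpy as np


def sparse_first_pc(X, threshold, n_iter=200):
    """Sparse loadings for the first PC of X (samples x genes).

    Assumption: soft-thresholded power iteration stands in for whatever
    sparse PCA method the paper actually uses. `threshold` is a
    hypothetical tuning parameter on the scale of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance
    _, vecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    v = vecs[:, -1]                              # start from the ordinary first PC
    for _ in range(n_iter):
        w = cov @ v                              # power iteration step
        # Soft-threshold: shrink loadings toward zero, zeroing small ones.
        w = np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)
        norm = np.linalg.norm(w)
        if norm == 0.0:                          # threshold removed every gene
            return np.zeros_like(v)
        v = w / norm                             # renormalize to unit length
    return v


def optimize_gene_set(expr, genes, gene_set, threshold):
    """Keep gene set members with a non-zero sparse PC loading.

    expr: samples x genes array; genes: list of column names;
    gene_set: list of member gene names (hypothetical inputs).
    """
    idx = [genes.index(g) for g in gene_set if g in genes]
    loadings = sparse_first_pc(expr[:, idx], threshold)
    return [genes[i] for i, load in zip(idx, loadings) if load != 0.0]
```

In this sketch, calling `optimize_gene_set(expr, genes, gene_set, threshold)` on a tumor expression matrix returns the subset of `gene_set` whose members co-vary strongly enough to survive thresholding. A larger `threshold` yields a smaller, more tumor-specific set.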