Systems Computational Biology Lab, Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, India.
Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
PeerJ. 2024 Oct 28;12:e18347. doi: 10.7717/peerj.18347. eCollection 2024.
Colorectal cancer is a common condition with an uncommon burden of disease, heterogeneity in manifestation, and no definitive treatment in the advanced stages. Renewed efforts to unravel the genetic drivers of colorectal cancer progression are paramount. Early-stage detection contributes to the success of cancer therapy and increases the likelihood of a favorable prognosis. Here, we have executed a comprehensive computational workflow aimed at uncovering the discrete stagewise genomic drivers of colorectal cancer progression.
Using the TCGA COADREAD expression data and clinical metadata, we constructed stage-specific linear models as well as contrast models to identify stage-salient differentially expressed genes. Stage-salient differentially expressed genes with a significant monotone trend of expression across the stages were identified as progression-significant biomarkers. The stage-salient genes were benchmarked using a normals-augmented dataset and cross-referenced with existing knowledge. The candidate biomarkers were used to construct the feature space for learning an optimal model for the digital screening of early-stage colorectal cancers. The candidate biomarkers were also examined for constructing a prognostic model based on survival analysis.
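To make the idea of a monotone trend of expression across stages concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not state the statistical test used, so this example only checks whether the per-stage mean expression strictly rises or falls from stage I to stage IV. The function names `stage_means` and `monotone_trend` and the toy values are hypothetical.

```python
from statistics import mean

# Stage labels in order of progression (assumed encoding for this sketch).
STAGES = ["I", "II", "III", "IV"]


def stage_means(expression, stages):
    """Return the mean expression per stage, ordered I to IV.

    Raises statistics.StatisticsError if any stage has no samples.
    """
    by_stage = {s: [] for s in STAGES}
    for value, stage in zip(expression, stages):
        by_stage[stage].append(value)
    return [mean(by_stage[s]) for s in STAGES]


def monotone_trend(expression, stages):
    """Return 'up', 'down', or None based on the per-stage means.

    This is a descriptive check only; it does not assess significance.
    """
    means = stage_means(expression, stages)
    diffs = [later - earlier for earlier, later in zip(means, means[1:])]
    if all(d > 0 for d in diffs):
        return "up"
    if all(d < 0 for d in diffs):
        return "down"
    return None


# Toy data: two samples per stage for one hypothetical gene.
labels = ["I", "I", "II", "II", "III", "III", "IV", "IV"]
rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
mixed = [1.0, 2.0, 6.0, 7.0, 3.0, 4.0, 8.0, 9.0]

print(monotone_trend(rising, labels))
print(monotone_trend(mixed, labels))
```

A real analysis would pair a check like this with a significance test and multiple-testing correction, since a strict ordering of four means can arise by chance.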
The biomarkers identified include: CRLF1, CALB2, STAC2, UCHL1, KCNG1 (stage-I salient); KLHL34, LPHN3, GREM2, ADCY5, PLAC2, DMRT3 (stage-II salient); PIGR, HABP2, SLC26A9 (stage-III salient); GABRD, DKK1, DLX3, CST6, HOTAIR (stage-IV salient); and CDH3, KRT80, AADACL2, OTOP2, FAM135B, HSP90AB1 (top linear model genes). In particular, the study yielded 31 progression-significant genes, including ESM1, DKK1, SPDYC, IGFBP1, BIRC7, NKD1, CXCL13, VGLL1, PLAC1, SPERT, UPK2, and, interestingly, three members of the LY6G6 family. Significant monotonic linear model genes included HIGD1A, ACADS, PEX26, and SPIB. A feature space of just seven biomarkers, namely ESM1, DHRS7C, OTOP3, AADACL2, LPHN3, GABRD, and LPAR1, was sufficient to optimize a RandomForest model that achieved > 98% balanced accuracy (with strong recall) in distinguishing cancer from normal samples on external validation. Design of an optimal multivariate model based on survival analysis yielded a prognostic panel of three stage-IV salient genes, namely HOTAIR, GABRD, and DKK1. Based on these sparse signatures, we have developed COADREADx, a web server for potentially assisting colorectal cancer screening and patient risk stratification. COADREADx provides uncertainty measures for its predictions but still requires clinical validation. It has been deployed for experimental non-commercial use at: https://apalanialab.shinyapps.io/coadreadx/.
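Because the classifier is reported in terms of balanced accuracy rather than plain accuracy, a short worked example may help. Balanced accuracy is the average of recall (sensitivity) and specificity, so it is not inflated when one class dominates the data. The sketch below is illustrative only: the function name `balanced_accuracy`, the label strings, and the toy predictions are assumptions, not the study's data or code.

```python
def balanced_accuracy(y_true, y_pred, positive="cancer"):
    """Return (recall + specificity) / 2 for a binary classification.

    Assumes both classes occur at least once in y_true.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # cancer correctly called cancer
            else:
                fn += 1  # cancer missed
        else:
            if pred == positive:
                fp += 1  # normal wrongly called cancer
            else:
                tn += 1  # normal correctly called normal
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


# Toy imbalanced example: 8 cancer samples, 2 normal samples.
y_true = ["cancer"] * 8 + ["normal"] * 2
# Every cancer sample is found, but one of the two normals is misclassified.
y_pred = ["cancer"] * 8 + ["cancer", "normal"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)
print(balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because half of the normal samples were misclassified. A balanced accuracy above 98% therefore implies that both classes are predicted well.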