School of Life Sciences, Shanghai University, Shanghai, 200444, China.
College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China.
Comb Chem High Throughput Screen. 2024;27(19):2921-2934. doi: 10.2174/0113862073266300231026103844.
Colorectal cancer (CRC) has very high incidence and mortality and is one of the most dangerous cancer types. Timely diagnosis can effectively reduce the incidence of CRC. Changes in para-cancerous tissues may serve as an early signal of tumorigenesis. Comparing gene expression between para-cancerous and normal mucosa can aid the diagnosis of CRC and the understanding of the mechanisms of its development.
This study aimed to identify genes whose expression in normal mucosa may be predictive of CRC risk.
A machine learning approach was used to analyze transcriptomic data from 459 samples of normal colonic mucosal tissue (322 from CRC cases and 137 from non-CRC individuals), with each sample containing expression levels for 28,706 genes. The genes were ranked using four ranking methods based on importance estimation (LASSO, LightGBM, MCFS, and mRMR). The resulting rankings were then combined with incremental feature selection (IFS) and four classification algorithms (decision tree [DT], K-nearest neighbor [KNN], random forest [RF], and support vector machine [SVM]) to construct a prediction model with excellent performance.
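The ranking-plus-IFS procedure described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the ranking score is a simple class-separation statistic standing in for LASSO, LightGBM, MCFS, or mRMR; the classifier is a plain 1-nearest-neighbor evaluated by leave-one-out; and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def rank_features(X, y):
    """Rank feature indices by a two-class separation score.
    Assumption: a simple stand-in for the ranking methods named in the abstract."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def loo_accuracy_1nn(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on the given features."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def incremental_feature_selection(X, y, ranked, step=1):
    """Score the top-k ranked features for k = step, 2*step, ... and keep the best k."""
    curve = [(k, loo_accuracy_1nn(X, y, ranked[:k]))
             for k in range(step, len(ranked) + 1, step)]
    best_k, best_score = max(curve, key=lambda t: t[1])  # ties resolve to the smallest k
    return best_k, best_score, curve

# Synthetic data (illustrative only): 40 samples, 10 features, features 0 and 1 informative.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(4.0 * label if j < 2 else 0.0, 1.0) for j in range(10)] for label in y]

ranked = rank_features(X, y)
best_k, best_score, curve = incremental_feature_selection(X, y, ranked)
print("best number of features:", best_k, "accuracy:", best_score)
```

In this sketch the ranking is computed on all samples before evaluation, which keeps the example short. A stricter setup would rank the features inside each cross-validation fold.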
The top-ranked genes have been associated with tumorigenesis in previous studies.
This study summarized four sets of quantitative classification rules based on the DT algorithm, providing clues for understanding the microenvironmental changes caused by CRC. According to the rules, the effect of CRC on normal mucosa can be determined.
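To show what quantitative classification rules derived from a decision tree can look like, the sketch below grows a small Gini-based tree and emits one rule per leaf. It is an illustration under stated assumptions, not the study's rules: the gene names `GENE_A` and `GENE_B`, the expression values, and the depth limit are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold) with the lowest weighted Gini, or None if nothing improves."""
    best, best_impurity = None, gini(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighboring observed values
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if impurity < best_impurity - 1e-12:
                best, best_impurity = (j, t), impurity
    return best

def extract_rules(X, y, names, depth=2, conditions=()):
    """Grow a tree up to `depth` and return (conditions, predicted_label, n_samples) per leaf."""
    split = best_split(X, y) if depth > 0 else None
    if split is None:
        label = Counter(y).most_common(1)[0][0]
        return [(" AND ".join(conditions) or "TRUE", label, len(y))]
    j, t = split
    left = [(row, label) for row, label in zip(X, y) if row[j] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[j] > t]
    rules = []
    for part, op in ((left, "<="), (right, ">")):
        rows, labels = zip(*part)
        rules += extract_rules(list(rows), list(labels), names, depth - 1,
                               conditions + (f"{names[j]} {op} {t:g}",))
    return rules

# Hypothetical expression values for two made-up genes (illustrative only).
names = ["GENE_A", "GENE_B"]
X = [[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5], [8.5, 0.5], [1.5, 0.8], [9.5, 4.0]]
y = ["non-CRC", "non-CRC", "CRC", "CRC", "non-CRC", "non-CRC", "CRC"]

rules = extract_rules(X, y, names)
for conditions, label, n in rules:
    print(f"IF {conditions} THEN {label} (n={n})")
```

Each printed rule is a conjunction of expression thresholds along one root-to-leaf path, which is the general shape of a quantitative rule read off a tree.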