Department of Biotechnology, Motilal Nehru National Institute of Technology Allahabad, Prayagraj 211004, India.
National Institute of Animal Biotechnology, Hyderabad 500032, India.
Genes (Basel). 2023 Sep 22;14(10):1836. doi: 10.3390/genes14101836.
Colorectal cancer (CRC) affects the colon or rectum and is a common global health issue, with 1.1 million new cases occurring yearly. The study aimed to identify gene signatures for the early detection of CRC by applying machine learning (ML) algorithms to gene expression data. The TCGA-CRC and GSE50760 datasets were pre-processed and subjected to feature selection using the LASSO method in combination with five ML algorithms: AdaBoost, Random Forest (RF), Logistic Regression (LR), Gaussian Naive Bayes (GNB), and Support Vector Machine (SVM). The important features were further examined through gene expression, correlation, and survival analyses. Validation on the external dataset GSE142279 was also performed. The RF model had the best classification accuracy for both datasets. Feature selection identified 12 candidate genes, which were subsequently reduced to 3 (CA2, CA7, and ITM2C) through gene expression and correlation analyses. These three genes achieved 100% accuracy on the external dataset. The AUC values for these genes were 99.24%, 100%, and 99.5%, respectively. The survival analysis showed a significant log-rank p-value of 0.044 for the final gene signatures. The analysis of tumor immunocyte infiltration showed a weak correlation with the expression of the gene signatures. CA2, CA7, and ITM2C can serve as gene signatures for the early detection of CRC and may provide valuable information for prognostic and therapeutic decision making. Further research is needed to fully understand the potential of these genes in the context of CRC.
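The abstract reports an AUC for each gene but does not say how it was computed. As a minimal sketch, assuming a rank-based (Mann-Whitney) AUC on one gene's expression values, the calculation could look like the following. The function names and the expression values are hypothetical and for illustration only; they are not taken from the study or its datasets.

```python
from typing import Sequence


def single_gene_auc(tumor: Sequence[float], normal: Sequence[float]) -> float:
    """Rank-based AUC (Mann-Whitney U divided by n_tumor * n_normal).

    Returns the fraction of (tumor, normal) pairs in which the tumor sample
    has the higher expression value; ties count as half a pair.
    """
    if not tumor or not normal:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for t in tumor:
        for n in normal:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor) * len(normal))


def direction_free_auc(tumor: Sequence[float], normal: Sequence[float]) -> float:
    """AUC that ignores whether the gene is up- or down-regulated in tumors.

    Assumption of this sketch: a gene that is consistently LOWER in tumors
    separates the groups just as well as one that is consistently higher.
    """
    auc = single_gene_auc(tumor, normal)
    return max(auc, 1.0 - auc)


# Made-up expression values (hypothetical, not from TCGA-CRC or any GEO dataset).
toy_normal = [8.1, 7.9, 8.4, 8.0]
toy_tumor = [5.2, 6.0, 8.05, 4.9]

auc = single_gene_auc(toy_tumor, toy_normal)
print(f"raw AUC: {auc:.4f}")
print(f"direction-free AUC: {direction_free_auc(toy_tumor, toy_normal):.4f}")
```

With these toy values only 2 of the 16 tumor/normal pairs have the tumor sample higher, so the raw AUC is 0.125 and the direction-free AUC is 0.875. A value near 100%, as reported for the three genes, would mean almost every pair is ordered the same way.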