Department of Biotechnology, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, 211004, India.
National Institute of Animal Biotechnology, Hyderabad, 500032, India.
Sci Rep. 2021 Jul 12;11(1):14304. doi: 10.1038/s41598-021-92692-0.
Colorectal cancer (CRC) is a common cause of cancer-related death worldwide. A CRC mRNA gene expression dataset from The Cancer Genome Atlas (TCGA), containing 644 CRC tumor samples and 51 normal samples, was pre-processed to identify significant differentially expressed genes (DEGs). Two feature selection techniques, least absolute shrinkage and selection operator (LASSO) and Relief, were applied together with class balancing to obtain the features (genes) of highest importance. The dataset was then classified with three machine learning (ML) algorithms: random forest (RF), K-nearest neighbour (KNN), and artificial neural networks (ANN). The CRC gene expression dataset had 23,186 features, of which 2933 were significant DEGs (1832 upregulated and 1101 downregulated). LASSO performed better than Relief for classifying tumor and normal samples: with LASSO-selected features, RF, KNN, and ANN each reached an accuracy of 100%, whereas with Relief-selected features they reached 79.5%, 85.05%, and 100%, respectively. LASSO and the DEG set shared 38 features, of which only 5 genes (VSTM2A, NR5A2, TMEM236, GDLN, and ETFDH) showed statistically significant results in survival analysis. Functional review and analysis of these genes narrowed the 5 candidates to 2, VSTM2A and TMEM236. TMEM236 expression was markedly and statistically significantly reduced in the dataset, which warrants its evaluation as a novel biomarker for CRC diagnosis.
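Two steps of the workflow described above, class balancing and the overlap between LASSO-selected features and DEGs, can be illustrated with a minimal Python sketch. The abstract does not state which balancing method was used, so random oversampling of the minority class is an assumption here. The function names, the sample identifiers, and the placeholder genes GENE_X and GENE_Y are hypothetical. Only the class sizes (644 tumor, 51 normal) and the gene names VSTM2A and TMEM236 come from the abstract.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest class.

    Assumption: random oversampling stands in for the unspecified
    class-balancing method mentioned in the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


def common_genes(selected_features, degs):
    """Return the genes present in both lists, sorted by name."""
    return sorted(set(selected_features) & set(degs))


# Toy labels using the class sizes reported in the abstract.
labels = ["tumor"] * 644 + ["normal"] * 51
samples = [f"sample_{i}" for i in range(len(labels))]  # hypothetical IDs

balanced_samples, balanced_labels = oversample_minority(samples, labels)
balanced_counts = Counter(balanced_labels)

# Hypothetical gene lists; GENE_X and GENE_Y are placeholders.
lasso_selected = ["VSTM2A", "TMEM236", "GENE_X"]
degs = ["TMEM236", "VSTM2A", "GENE_Y"]
overlap = common_genes(lasso_selected, degs)
```

In a real analysis, balancing would normally be applied to the training split only, so that duplicated samples do not also appear in the test data.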