Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes.

Author information

Koppad Saraswati, Basava Annappa, Nash Katrina, Gkoutos Georgios V, Acharjee Animesh

Affiliations

Department of Computer Science and Engineering, National Institute of Technology Karnataka, Mangalore 575025, India.

College of Medical and Dental Sciences, University of Birmingham, Birmingham B15 2TT, UK.

Publication information

Biology (Basel). 2022 Feb 25;11(3):365. doi: 10.3390/biology11030365.

DOI:10.3390/biology11030365

PMID:35336739

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8944988/

Abstract

BACKGROUND

Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most commonly diagnosed cancer worldwide. Owing to a lack of diagnostic biomarkers and a limited understanding of the underlying molecular mechanisms, CRC mortality continues to rise. CRC occurrence and progression are dynamic processes: the expression levels of specific molecules vary across the stages of CRC, which makes early detection and diagnosis challenging and makes the identification of accurate and meaningful CRC biomarkers more pressing. Advances in high-throughput sequencing technologies have been used to explore novel gene expression, targeted treatments, and colon cancer pathogenesis. Such approaches are now routinely applied and produce large datasets whose analysis increasingly depends on machine learning (ML) algorithms, which have been shown to be computationally efficient platforms for identifying variables in such high-dimensional datasets.

METHODS

We developed a novel ML-based experimental design to study CRC gene associations. Six machine learning methods were employed as classifiers to identify genes that could serve as diagnostic markers for CRC, using gene expression and clinical datasets. Accuracy, sensitivity, specificity, F1 score, and the area under the receiver operating characteristic curve (AUROC) were computed to explore the differentially expressed genes (DEGs) for CRC diagnosis. Gene ontology enrichment analyses of these DEGs were performed, and the predicted gene signatures were linked with miRNAs.
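As an illustration of the evaluation metrics named above, the following is a minimal sketch in plain Python, not the authors' code. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for this example; label 1 stands for a tumour sample and 0 for a control.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (the rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and classifier scores for six samples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))
print(auroc(y_true, scores))
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a small illustration; a sorting-based implementation would be preferable on full expression cohorts.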

RESULTS

We evaluated six machine learning classification methods (AdaBoost, ExtraTrees, logistic regression, naïve Bayes classifier, random forest, and XGBoost) across different combinations of training and test datasets drawn from the Gene Expression Omnibus (GEO). The accuracy and AUROC obtained by each algorithm on each training/test combination were used as comparison metrics. Random forest (RF) models consistently performed better than the other models. In total, 34 genes were identified and used for pathway and gene set enrichment analysis. Further mapping of the 34 genes to miRNAs identified miRNA hub genes of interest.
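The idea of scoring a model on every combination of training and test dataset can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the cohort names and expression values are invented, and a simple nearest-centroid classifier stands in for the six methods listed above so that the example runs with the standard library alone.

```python
from itertools import permutations


def fit_centroids(X, y):
    """Stand-in model: the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(x, centroids[lab])) for x in X]


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical cohorts: (expression matrix, labels), two genes per sample,
# label 1 = tumour, 0 = control. All values are made up for illustration.
datasets = {
    "cohort_A": ([[5.1, 1.0], [4.8, 1.2], [1.1, 4.9], [0.9, 5.2]], [1, 1, 0, 0]),
    "cohort_B": ([[5.5, 0.7], [4.6, 1.5], [1.4, 4.4], [0.6, 5.0]], [1, 1, 0, 0]),
    "cohort_C": ([[4.9, 1.1], [5.2, 0.8], [1.0, 5.1], [1.3, 4.7]], [1, 1, 0, 0]),
}


def evaluate_pairs(datasets):
    """Train on one cohort, test on another, for every ordered pair."""
    results = {}
    for train_name, test_name in permutations(datasets, 2):
        X_train, y_train = datasets[train_name]
        X_test, y_test = datasets[test_name]
        model = fit_centroids(X_train, y_train)
        results[(train_name, test_name)] = accuracy(y_test, predict(model, X_test))
    return results


results = evaluate_pairs(datasets)
for (train_name, test_name), acc in sorted(results.items()):
    print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

Testing on a cohort the model has never seen gives a more honest estimate of how a gene panel generalises than a random split within a single dataset, which is the motivation for comparing models over all training/test combinations.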

CONCLUSIONS

We identified 34 genes that classify CRC with high accuracy and could be used as a diagnostic panel for CRC.
