Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes.

Author information

Koppad Saraswati, Basava Annappa, Nash Katrina, Gkoutos Georgios V, Acharjee Animesh

Affiliations

Department of Computer Science and Engineering, National Institute of Technology Karnataka, Mangalore 575025, India.

College of Medical and Dental Sciences, University of Birmingham, Birmingham B15 2TT, UK.

Publication information

Biology (Basel). 2022 Feb 25;11(3):365. doi: 10.3390/biology11030365.

Abstract

BACKGROUND

Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most commonly diagnosed cancer worldwide. Because diagnostic biomarkers are lacking and the underlying molecular mechanisms are poorly understood, CRC mortality continues to rise. CRC occurrence and progression are dynamic processes: the expression levels of specific molecules vary across the stages of the disease, which makes early detection and diagnosis challenging and makes the need for accurate, meaningful CRC biomarkers more pressing. Advances in high-throughput sequencing technologies have been used to explore novel gene expression, targeted treatments, and colon cancer pathogenesis. Such approaches are now applied routinely and produce large datasets whose analysis increasingly depends on machine learning (ML) algorithms, which have been shown to be computationally efficient tools for identifying variables in such high-dimensional datasets.

METHODS

We developed a novel ML-based experimental design to study CRC gene associations. Six machine learning methods were employed as classifiers to identify genes that can serve as diagnostic markers for CRC, using gene expression and clinical datasets. Accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC) were computed to explore the differentially expressed genes (DEGs) for CRC diagnosis. Gene ontology enrichment analyses of these DEGs were performed, and the predicted gene signatures were linked with miRNAs.
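The five evaluation metrics named in the methods can be illustrated with a minimal, standard-library-only sketch. The function names (`confusion_counts`, `classification_metrics`, `auroc`) and the toy labels and scores are assumptions made for illustration; they are not the authors' code or data, and the abstract does not say which software computed these values.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = tumour, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and F1 score as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the negative class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): predicted tumour probabilities for eight samples.
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # 0.5 threshold is an assumption

print(classification_metrics(toy_labels, toy_preds))
print("AUROC:", auroc(toy_labels, toy_scores))
```

Accuracy, sensitivity, specificity, and F1 depend on the decision threshold, whereas AUROC is computed from the ranking of the scores and is therefore threshold-independent.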

RESULTS

We evaluated six machine learning classification methods (AdaBoost, ExtraTrees, logistic regression, naïve Bayes, random forest, and XGBoost) across different combinations of training and test datasets drawn from Gene Expression Omnibus (GEO) datasets. The accuracy and AUROC obtained by each algorithm on each training/test combination were used as comparison metrics. Random forest (RF) models consistently outperformed the other models. In total, 34 genes were identified and used for pathway and gene set enrichment analysis. Further mapping of the 34 genes to miRNAs identified miRNA hub genes of interest.
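One plausible reading of "different combinations of training and test datasets" is cross-dataset evaluation: fit a model on one cohort and score it on another, for every ordered pair. The sketch below illustrates that loop only. The cohort names, the expression values, and the nearest-centroid classifier are all made-up stand-ins (assumptions), not the GEO datasets or the six classifiers used in the paper, and the abstract does not state how the combinations were actually formed.

```python
from itertools import permutations
from statistics import mean

# Toy, made-up cohorts: (expression values of a single gene, labels).
# 1 = tumour, 0 = normal. These are NOT the GEO datasets from the paper.
COHORTS = {
    "cohort_A": ([2.1, 2.4, 1.9, 5.0, 5.3, 4.8], [0, 0, 0, 1, 1, 1]),
    "cohort_B": ([2.0, 2.6, 3.1, 4.9, 5.5, 4.2], [0, 0, 0, 1, 1, 1]),
    "cohort_C": ([1.8, 2.2, 4.1, 5.1, 3.0, 5.6], [0, 0, 1, 1, 0, 1]),
}


def fit_centroids(values, labels):
    """Stand-in model: the mean expression value of each class."""
    return {c: mean(v for v, l in zip(values, labels) if l == c) for c in set(labels)}


def predict(centroids, values):
    """Assign each sample to the class whose centroid is closest."""
    return [min(centroids, key=lambda c: abs(v - centroids[c])) for v in values]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_cohort_accuracy(cohorts):
    """Train on one cohort and test on another, for every ordered pair of distinct cohorts."""
    results = {}
    for train_name, test_name in permutations(cohorts, 2):
        x_train, y_train = cohorts[train_name]
        x_test, y_test = cohorts[test_name]
        model = fit_centroids(x_train, y_train)
        results[(train_name, test_name)] = accuracy(y_test, predict(model, x_test))
    return results


for (train_name, test_name), acc in cross_cohort_accuracy(COHORTS).items():
    print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

Comparing several classifiers would mean repeating this loop once per model and tabulating accuracy and AUROC per (training, test) pair, which is the kind of comparison the results describe.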

CONCLUSIONS

We identified 34 genes that, with high accuracy, can be used as a diagnostic panel for CRC.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1bd/8944988/d8b3a2630072/biology-11-00365-g001.jpg