

A Hybrid Machine Learning Approach to Screen Optimal Predictors for the Classification of Primary Breast Tumors from Gene Expression Microarray Data.

Authors

Alromema Nashwan, Syed Asif Hassan, Khan Tabrej

Affiliations

Department of Computer Science, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia.

Department of Information Systems, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia.

Publication

Diagnostics (Basel). 2023 Feb 13;13(4):708. doi: 10.3390/diagnostics13040708.

Abstract

The high dimensionality and sparsity of microarray gene expression data make it challenging to analyze the data and screen the optimal subset of genes as predictors of breast cancer (BC). The authors of the present study propose a novel sequential hybrid Feature Selection (FS) framework that combines minimum Redundancy-Maximum Relevance (mRMR), a two-tailed unpaired t-test, and meta-heuristics to screen the optimal set of gene biomarkers as predictors for BC. The proposed framework identified a set of three optimal gene biomarkers, namely MAPK1, APOBEC3B, and ENAH. In addition, state-of-the-art supervised Machine Learning (ML) algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Net (NN), Naïve Bayes (NB), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR), were used to test the predictive capability of the selected gene biomarkers and to select the most effective breast cancer diagnostic model, judged by its performance metrics. The XGBoost-based model performed best, with an accuracy of 0.976 ± 0.027, an F1-Score of 0.974 ± 0.030, and an AUC of 0.961 ± 0.035 when tested on an independent test dataset. The classification system based on the screened gene biomarkers efficiently distinguishes primary breast tumors from normal breast samples.
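As a rough illustration of the t-test filtering stage only, the sketch below ranks genes by the absolute value of an unpaired t-statistic between tumor and normal samples. It is not the authors' implementation: the abstract does not say whether equal variances were assumed, so Welch's variant is used here as an assumption, genes are ranked by |t| instead of by two-tailed p-values to stay within the Python standard library, and the gene names and expression values are synthetic placeholders. The mRMR and meta-heuristic stages are not shown.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unpaired t-statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels):
    """Rank genes by |t| between tumor (label 1) and normal (label 0) samples.

    expr maps a gene name to a list of expression values, one per sample.
    """
    scores = {}
    for gene, values in expr.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(tumor, normal))
    # Largest |t| first: the strongest tumor/normal separation leads the list.
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic data (hypothetical): 20 tumor samples followed by 20 normal samples.
rng = random.Random(0)
labels = [1] * 20 + [0] * 20
expr = {
    # Mean expression is shifted upward in tumor samples only for this gene.
    "gene_shifted": [rng.gauss(3.0 if y else 0.0, 1.0) for y in labels],
    "gene_noise_a": [rng.gauss(0.0, 1.0) for _ in labels],
    "gene_noise_b": [rng.gauss(0.0, 1.0) for _ in labels],
}
ranking = rank_genes(expr, labels)
print(ranking)
```

In a sequential framework like the one described, a filter of this kind would typically pass only the top-ranked genes on to the next selection stage, which keeps the later, more expensive search over gene subsets tractable.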


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/43f0/9955903/e616bb7ee19c/diagnostics-13-00708-g001.jpg
