

Microarray Gene Expression Data Classification via Wilcoxon Sign Rank Sum and Novel Grey Wolf Optimized Ensemble Learning Models.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2023 Nov-Dec;20(6):3575-3587. doi: 10.1109/TCBB.2023.3305429. Epub 2023 Dec 25.

Abstract

Cancer is a deadly disease that affects people all over the world. Identifying a small set of genes relevant to a specific cancer can lead to effective treatments. The difficulty with microarray datasets is their high dimensionality: they contain a large number of features relative to a small number of samples. Microarray data also typically exhibit significant asymmetry in dimensionality as well as high levels of redundancy and noise, and it is widely held that the majority of genes carry no useful information about the classes under study. Recent research has attempted to reduce this dimensionality with various feature selection techniques. This paper presents new ensemble feature selection techniques based on the Wilcoxon Sign Rank Sum test (WCSRS) and Fisher's test (F-test). In the first phase of the experiment, the data were preprocessed; feature selection was then performed with the WCSRS and the F-test, using the p-values of each test to identify cancerous genes. The selected gene set was used to classify cancer patients with four ensemble learning models (ELMs): random forest (RF), extreme gradient boosting (XGBoost), CatBoost, and AdaBoost. To boost performance, the parameters of all the ELMs were optimized with the Grey Wolf optimizer (GWO). The experimental analysis was performed on a colon cancer dataset comprising 2000 genes from 62 patients (40 malignant and 22 benign). With the WCSRS test for feature selection, the optimized XGBoost achieved 100% accuracy; with the F-test for feature selection, the optimized CatBoost also achieved 100% accuracy. This represents a 15% improvement over previously reported values in the literature.
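The p-value-based gene selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two-sample Wilcoxon rank-sum variant of the test with a normal approximation and no tie correction, a cutoff of 0.05 (the abstract gives no threshold), and synthetic data that only mimics the 40/22 sample split. All function names are hypothetical, and the F-test, classifier, and GWO stages are omitted.

```python
import math
import random


def midranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def ranksum_pvalue(group_a, group_b):
    """Two-sided rank-sum p-value via the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])  # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(expression, labels, alpha=0.05):
    """Return (gene_index, p_value) pairs with p < alpha, smallest p first.

    expression: list of samples, each a list of per-gene values.
    labels: one 0/1 class label per sample.
    alpha: assumed cutoff, not taken from the paper.
    """
    selected = []
    for g in range(len(expression[0])):
        a = [row[g] for row, y in zip(expression, labels) if y == 1]
        b = [row[g] for row, y in zip(expression, labels) if y == 0]
        p = ranksum_pvalue(a, b)
        if p < alpha:
            selected.append((g, p))
    return sorted(selected, key=lambda pair: pair[1])


# Synthetic stand-in data: 62 samples (40 vs 22) and 50 genes,
# of which genes 0, 1 and 2 are shifted upward in class 1.
random.seed(0)
labels = [1] * 40 + [0] * 22
informative = {0, 1, 2}
expression = [
    [random.gauss(3.0 if (y == 1 and g in informative) else 0.0, 1.0)
     for g in range(50)]
    for y in labels
]

ranked = select_genes(expression, labels)
print(ranked[:5])
```

Because one test is run per gene, a fixed cutoff will usually let a few uninformative genes through by chance; a multiple-testing correction or a top-k rule would be a natural refinement, though the abstract does not say which was used.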

