
A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification.

Affiliations

Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India.

Department of Information Technology, Jadavpur University, Kolkata 700106, India.

Publication information

Sensors (Basel). 2021 Aug 18;21(16):5571. doi: 10.3390/s21165571.

DOI:10.3390/s21165571
PMID:34451013
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8402295/
Abstract

In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly to classification or clustering, learning algorithms sometimes perform poorly. One possible reason is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Feature selection methods are therefore used to identify the subset of relevant features that maximizes model performance. Reducing the feature dimension also reduces both the training time and the storage required by the model. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for medical report-based disease detection. In the first stage, an ensemble was formed from four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance), and each feature from the union set was then assessed by three classification algorithms (support vector machine, naïve Bayes, and k-nearest neighbors), for which an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In these two stages, the XGBoost classification algorithm was applied to obtain the most contributing features, which in turn provide the best subset. In the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, to further reduce the feature set and achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets taken from the UCI machine learning repository: arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can perform better than many state-of-the-art methods and can also detect important features. Fewer features mean fewer medical tests are needed for a correct diagnosis, saving both time and cost.
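
The filter-union and correlation-pruning steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`top_k_union`, `drop_correlated`), the per-filter cutoff `k`, the 0.9 correlation threshold, and the toy scores are all assumptions made for illustration, since the abstract gives none of these values. The classifier-based scoring, the XGBoost ranking, and the whale optimization stage are omitted.

```python
from math import sqrt

def top_k_union(scores_by_filter, k):
    """Union of the k best-scoring feature indices from each filter ranking."""
    selected = set()
    for scores in scores_by_filter.values():
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated (an assumption)
    return cov / (sx * sy)

def drop_correlated(columns, candidates, threshold=0.9):
    """Greedily keep a candidate unless |r| with an already kept feature reaches the threshold."""
    kept = []
    for j in candidates:
        if all(abs(pearson(columns[j], columns[i])) < threshold for i in kept):
            kept.append(j)
    return kept

# Toy data: four features stored column-wise over five samples.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of feature 0
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],   # constant feature
]

# Hypothetical per-feature scores from two filter methods (higher is better).
scores_by_filter = {
    "filter_a": [0.9, 0.8, 0.1, 0.0],
    "filter_b": [0.2, 0.7, 0.9, 0.0],
}

union = top_k_union(scores_by_filter, k=2)               # features nominated by any filter
subset = drop_correlated(columns, union, threshold=0.9)  # redundancy removed
print(union, subset)
```

In this toy run, feature 1 is nominated by both filters but discarded because it is perfectly correlated with feature 0, which is the kind of redundancy the second stage targets. The greedy order (earlier candidates win ties) is a design choice of this sketch. A real pipeline would more likely keep whichever of the two correlated features ranks higher.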


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fda/8402295/01a6127fc84a/sensors-21-05571-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fda/8402295/8e591c5d0ddb/sensors-21-05571-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fda/8402295/356335396b89/sensors-21-05571-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fda/8402295/68b7dc157d2e/sensors-21-05571-g004.jpg

