Feature Selection and Feature Stability Measurement Method for High-Dimensional Small Sample Data Based on Big Data Technology.

Affiliation

School of Electricity and New Energy, China Three Gorges University, Yichang 443002, China.

Publication information

Comput Intell Neurosci. 2021 Sep 23;2021:3597051. doi: 10.1155/2021/3597051. eCollection 2021.

DOI:10.1155/2021/3597051

PMID:34603430

Full-text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8486514/

Abstract

With the rapid development of artificial intelligence in recent years, research on image processing, text mining, and genome informatics has deepened, and the mining of large-scale databases has received increasing attention. The objects of data mining have become more complex, and their dimensionality keeps growing. When the number of samples available for analysis is far smaller than the number of dimensions, the result is high-dimensional small-sample data, which exposes the mining process to a severe curse of dimensionality. Feature selection can effectively eliminate redundant and noisy features from such data, avoiding the curse of dimensionality and improving the practical efficiency of mining algorithms. However, existing feature selection methods emphasize the classification or clustering performance of the selected features and ignore the stability of the selection itself; the resulting feature subsets are unstable, which makes it difficult to obtain real and interpretable features. Building on traditional feature selection, this paper proposes an ensemble feature selection method, the Random Bits Forest Recursive Clustering Eliminate (RBF-RCE) method, which trains multiple groups of base classifiers in parallel and screens out the best feature classification results. The method improves on the classification performance of traditional feature selection methods and also improves the stability of feature selection. The paper then analyzes the causes of instability in feature selection and introduces a stability measure, the Intersection Measurement (IM), to evaluate whether the feature selection process is stable. The effectiveness of the proposed method is verified by experiments on several groups of high-dimensional small-sample data sets.
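
To make the idea of selection stability concrete, the sketch below scores how much the feature subsets chosen on different resamples of the same data overlap. The abstract does not give the formula for IM, so this is not the paper's measure: it substitutes the average pairwise Jaccard index, a common intersection-based stand-in. The function names and the example feature indices are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Iterable, List, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Intersection size divided by union size; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(runs: Iterable[Iterable[Hashable]]) -> float:
    """Average Jaccard overlap over all pairs of selected-feature subsets."""
    subsets: List[Set[Hashable]] = [set(r) for r in runs]
    if len(subsets) < 2:
        raise ValueError("need at least two feature subsets to compare")
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature indices chosen on three resamples of the same data set.
runs = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 6, 7},
]
stability = selection_stability(runs)
print(f"stability = {stability:.3f}")
```

A score near 1 means the selector returns almost the same features on every resample, while a score near 0 means the subsets barely overlap. Under this stand-in, a stable method would keep the score high as the training samples are perturbed.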
