Random KNN feature selection - a fast and stable alternative to Random Forests.

Affiliation

The Department of Statistics, West Virginia University, Morgantown, WV 26506, USA.

Publication

BMC Bioinformatics. 2011 Nov 18;12:450. doi: 10.1186/1471-2105-12-450.

DOI:10.1186/1471-2105-12-450
PMID:22093447
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3281073/
Abstract

BACKGROUND

Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p problems." However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs.

RESULTS

We present RKNN-FS, an innovative feature selection procedure for "small n, large p problems." RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large scale problems, involving thousands of variables and multiple classes.
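The ensemble described above (many base kNN models, each on a random feature subset, combined by majority vote, with per-feature importance scores used for backward elimination) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `knn_predict` and `random_knn` are hypothetical names, and the importance score here (mean training accuracy of the base models that used a feature) is a simplified stand-in for the paper's support criterion.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest training points."""
    preds = []
    for x in X_test:
        d = np.sum((X_train - x) ** 2, axis=1)        # squared Euclidean distances
        nearest = y_train[np.argsort(d)[:k]]          # labels of the k closest points
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def random_knn(X_train, y_train, X_test, r=100, m=None, k=3, seed=0):
    """Ensemble of r base kNN models, each built on a random subset of m features.

    Returns majority-vote predictions and a per-feature importance score:
    the mean training accuracy of the base models that used the feature
    (a hypothetical proxy for the paper's support criterion).
    Assumes integer class labels 0..C-1.
    """
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    if m is None:
        m = max(1, int(np.sqrt(p)))                   # subset size ~ sqrt(p)
    votes = np.zeros((X_test.shape[0], r), dtype=int)
    acc_sum, used = np.zeros(p), np.zeros(p)
    for j in range(r):
        feats = rng.choice(p, size=m, replace=False)  # random feature subset
        votes[:, j] = knn_predict(X_train[:, feats], y_train, X_test[:, feats], k)
        # Training accuracy of this base model (optimistic: each point is
        # its own nearest neighbor), credited to the features it used.
        acc = np.mean(knn_predict(X_train[:, feats], y_train,
                                  X_train[:, feats], k) == y_train)
        acc_sum[feats] += acc
        used[feats] += 1
    majority = np.array([np.bincount(row).argmax() for row in votes])
    importance = np.divide(acc_sum, used, out=np.zeros(p), where=used > 0)
    return majority, importance
```

A backward feature-selection pass in the spirit of RKNN-FS would then repeatedly drop the lowest-ranked fraction of features by this importance score and re-fit the ensemble on the survivors; that outer loop is omitted here for brevity.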

CONCLUSIONS

Given the superiority of Random KNN in classification performance when compared with Random Forests, RKNN-FS's simplicity and ease of implementation, and its superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets.


Figures 1-7 (full-size images):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/c49185dfa086/1471-2105-12-450-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/a53c73ad422b/1471-2105-12-450-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/8bf9abead92f/1471-2105-12-450-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/233a458214a8/1471-2105-12-450-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/6cf4faec8c76/1471-2105-12-450-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/0601ed19d295/1471-2105-12-450-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b26/3281073/e847b4288ed3/1471-2105-12-450-7.jpg

Similar Articles

1
Random KNN feature selection - a fast and stable alternative to Random Forests.
BMC Bioinformatics. 2011 Nov 18;12:450. doi: 10.1186/1471-2105-12-450.
2
Feature weight estimation for gene selection: a local hyperlinear learning approach.
BMC Bioinformatics. 2014 Mar 14;15:70. doi: 10.1186/1471-2105-15-70.
3
Gene selection and classification of microarray data using random forest.
BMC Bioinformatics. 2006 Jan 6;7:3. doi: 10.1186/1471-2105-7-3.
4
Optimal combination of feature selection and classification via local hyperplane based learning strategy.
BMC Bioinformatics. 2015 Jul 10;16:219. doi: 10.1186/s12859-015-0629-6.
5
EKNN: Ensemble classifier incorporating connectivity and density into kNN with application to cancer diagnosis.
Artif Intell Med. 2021 Jan;111:101985. doi: 10.1016/j.artmed.2020.101985. Epub 2020 Nov 8.
6
A novel random forests-based feature selection method for microarray expression data analysis.
Int J Data Min Bioinform. 2015;13(1):84-101. doi: 10.1504/ijdmb.2015.070852.
7
Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods.
BMC Med Inform Decis Mak. 2022 Nov 23;22(1):304. doi: 10.1186/s12911-022-02051-w.
8
Gene selection using iterative feature elimination random forests for survival outcomes.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Sep-Oct;9(5):1422-31. doi: 10.1109/TCBB.2012.63.
9
Rotation of random forests for genomic and proteomic classification problems.
Adv Exp Med Biol. 2011;696:211-21. doi: 10.1007/978-1-4419-7046-6_21.
10
Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.
BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.

Cited By

1
Machine learning-based risk factor analysis and prediction model construction for mortality in chronic heart failure.
J Glob Health. 2025 Sep 12;15:04242. doi: 10.7189/jogh.15.04242.
2
A flexible framework for minimal biomarker signature discovery from clinical omics studies without library size normalisation.
PLOS Digit Health. 2025 Mar 26;4(3):e0000780. doi: 10.1371/journal.pdig.0000780. eCollection 2025 Mar.
3
Current status and future direction of cancer research using artificial intelligence for clinical application.
Cancer Sci. 2025 Feb;116(2):297-307. doi: 10.1111/cas.16395. Epub 2024 Nov 18.
4
Development of an expert system for the classification of myalgic encephalomyelitis/chronic fatigue syndrome.
PeerJ Comput Sci. 2024 Mar 20;10:e1857. doi: 10.7717/peerj-cs.1857. eCollection 2024.
5
Predicting MCI to AD Conversation Using Integrated sMRI and rs-fMRI: Machine Learning and Graph Theory Approach.
Front Aging Neurosci. 2021 Jul 30;13:688926. doi: 10.3389/fnagi.2021.688926. eCollection 2021.
6
Using Embedded Feature Selection and CNN for Classification on CCD-INID-V1-A New IoT Dataset.
Sensors (Basel). 2021 Jul 15;21(14):4834. doi: 10.3390/s21144834.
7
Predictive Models May Complement or Provide an Alternative to Existing Strategies for Assessing the Enteric Pathogen Contamination Status of Northeastern Streams Used to Provide Water for Produce Production.
Front Sustain Food Syst. 2020 Oct;4. doi: 10.3389/fsufs.2020.561517. Epub 2020 Oct 6.
8
The Microbiome Composition of a Man's Penis Predicts Incident Bacterial Vaginosis in His Female Sex Partner With High Accuracy.
Front Cell Infect Microbiol. 2020 Aug 4;10:433. doi: 10.3389/fcimb.2020.00433. eCollection 2020.
9
Feature selection with the Fisher score followed by the Maximal Clique Centrality algorithm can accurately identify the hub genes of hepatocellular carcinoma.
Sci Rep. 2019 Nov 21;9(1):17283. doi: 10.1038/s41598-019-53471-0.
10
Predicting protein-ligand interactions based on bow-pharmacological space and Bayesian additive regression trees.
Sci Rep. 2019 May 22;9(1):7703. doi: 10.1038/s41598-019-43125-6.