
Unbiased feature selection in learning random forests for high-dimensional data.

Authors

Nguyen Thanh-Tung, Huang Joshua Zhexue, Nguyen Thuy Thi

Affiliations

Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China ; University of Chinese Academy of Sciences, Beijing 100049, China ; School of Computer Science and Engineering, Water Resources University, Hanoi 10000, Vietnam.

Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China ; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

Publication details

ScientificWorldJournal. 2015;2015:471371. doi: 10.1155/2015/471371. Epub 2015 Mar 24.

DOI:10.1155/2015/471371
PMID:25879059
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4387916/
Abstract

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
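
The pipeline the abstract outlines (screen features by p-value, split the survivors into two subsets, then draw node features by weighted sampling) can be illustrated with a minimal sketch. This is not the authors' xRF implementation: the abstract does not say which test, partition rule, or weights are used, so the permutation test on the difference of class means, the thresholds `alpha` and `strong_cut`, the weight `1 - p`, and every function name below are assumptions chosen only for illustration.

```python
import random
from statistics import mean


def score(column, labels):
    """Absolute difference of class means (an assumed relevance statistic)."""
    ones = [x for x, y in zip(column, labels) if y == 1]
    zeros = [x for x, y in zip(column, labels) if y == 0]
    return abs(mean(ones) - mean(zeros))


def permutation_p_value(column, labels, rng, n_perm=200):
    """Estimate how often shuffled labels score at least as high as the real ones."""
    observed = score(column, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(column, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def screen_and_partition(columns, labels, rng, alpha=0.05, strong_cut=0.01):
    """Drop features with p >= alpha; split the survivors at strong_cut.

    Returns two dicts mapping feature index -> sampling weight (1 - p).
    Both thresholds and the weight formula are illustrative assumptions.
    """
    strong, weak = {}, {}
    for j, column in enumerate(columns):
        p = permutation_p_value(column, labels, rng)
        if p >= alpha:
            continue  # treated as uninformative and removed
        (strong if p < strong_cut else weak)[j] = 1.0 - p
    return strong, weak


def weighted_sample(weights, k, rng):
    """Draw up to k distinct keys with probability proportional to weight."""
    pool = dict(weights)
    chosen = []
    while pool and len(chosen) < k:
        keys = list(pool)
        pick = rng.choices(keys, weights=[pool[key] for key in keys])[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


def sample_node_features(strong, weak, mtry, rng):
    """Take about half of mtry from the strong subset, the rest from the weak one."""
    from_strong = weighted_sample(strong, (mtry + 1) // 2, rng)
    from_weak = weighted_sample(weak, mtry - len(from_strong), rng)
    return from_strong + from_weak


# Toy data: feature 0 tracks the label, features 1-4 are pure noise.
rng = random.Random(0)
labels = [0] * 20 + [1] * 20
columns = [[5.0 * y + rng.gauss(0, 1) for y in labels]]
columns += [[rng.gauss(0, 1) for _ in labels] for _ in range(4)]

strong, weak = screen_and_partition(columns, labels, rng)
node_features = sample_node_features(strong, weak, mtry=2, rng=rng)
print(sorted(strong), sorted(weak), node_features)
```

In a full forest, `sample_node_features` would be called at each node in place of the uniform feature draw of a standard random forest, so that trees rarely split on features that failed the screening step.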

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/139e4166e4ab/TSWJ2015-471371.alg.002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/f558f6953614/TSWJ2015-471371.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/63bfeb5a017e/TSWJ2015-471371.002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/2cc23b0ff384/TSWJ2015-471371.003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/fab3d4d626b1/TSWJ2015-471371.004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/01dde0f9bd70/TSWJ2015-471371.005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/f3bf31785b3a/TSWJ2015-471371.006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/715318c4998e/TSWJ2015-471371.007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/78ed296a986f/TSWJ2015-471371.008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/d5d3f5877f05/TSWJ2015-471371.009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f797/4387916/57cc94989025/TSWJ2015-471371.alg.001.jpg

Similar articles

1
Unbiased feature selection in learning random forests for high-dimensional data.
ScientificWorldJournal. 2015;2015:471371. doi: 10.1155/2015/471371. Epub 2015 Mar 24.
2
Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.
BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.
3
Effective hybrid feature selection using different bootstrap enhances cancers classification performance.
BioData Min. 2022 Sep 30;15(1):24. doi: 10.1186/s13040-022-00304-y.
4
Unbiased split variable selection for random survival forests using maximally selected rank statistics.
Stat Med. 2017 Apr 15;36(8):1272-1284. doi: 10.1002/sim.7212. Epub 2017 Jan 15.
5
Random Shapley Forests: Cooperative Game-Based Random Forests With Consistency.
IEEE Trans Cybern. 2022 Jan;52(1):205-214. doi: 10.1109/TCYB.2020.2972956. Epub 2022 Jan 11.
6
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests.
BMC Bioinformatics. 2017 Mar 14;18(1):169. doi: 10.1186/s12859-017-1578-z.
7
A Novel Consistent Random Forest Framework: Bernoulli Random Forests.
IEEE Trans Neural Netw Learn Syst. 2018 Aug;29(8):3510-3523. doi: 10.1109/TNNLS.2017.2729778. Epub 2017 Aug 15.
8
Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach.
Comput Struct Biotechnol J. 2023 Sep 1;21:4354-4360. doi: 10.1016/j.csbj.2023.08.033. eCollection 2023.
9
Plausibility of Individual Decisions from Random Forests in Clinical Predictive Modelling Applications.
Stud Health Technol Inform. 2017;236:328-335.
10
Cluster ensemble based on Random Forests for genetic data.
BioData Min. 2017 Dec 15;10:37. doi: 10.1186/s13040-017-0156-2. eCollection 2017.

Cited by

1
The Machine Learning Models in Major Cardiovascular Adverse Events Prediction Based on Coronary Computed Tomography Angiography: Systematic Review.
J Med Internet Res. 2025 Jun 13;27:e68872. doi: 10.2196/68872.
2
Unbiased identification of cell identity in dense mixed neural cultures.
Elife. 2025 Jan 17;13:RP95273. doi: 10.7554/eLife.95273.
3
Deriving Automated Device Metadata From Intracranial Pressure Waveforms: A Transforming Research and Clinical Knowledge in Traumatic Brain Injury ICU Physiology Cohort Analysis.
Crit Care Explor. 2024 Jul 16;6(7):e1118. doi: 10.1097/CCE.0000000000001118. eCollection 2024 Jul 1.
4
Predictive modeling of antibiotic eradication therapy success for new-onset Pseudomonas aeruginosa pulmonary infections in children with cystic fibrosis.
PLoS Comput Biol. 2023 Sep 6;19(9):e1011424. doi: 10.1371/journal.pcbi.1011424. eCollection 2023 Sep.
5
Heart Failure Emergency Readmission Prediction Using Stacking Machine Learning Model.
Diagnostics (Basel). 2023 Jun 2;13(11):1948. doi: 10.3390/diagnostics13111948.
6
How to Effectively Collect and Process Network Data for Intrusion Detection?
Entropy (Basel). 2021 Nov 18;23(11):1532. doi: 10.3390/e23111532.
7
DNA methylation-based classifier and gene expression signatures detect BRCAness in osteosarcoma.
PLoS Comput Biol. 2021 Nov 11;17(11):e1009562. doi: 10.1371/journal.pcbi.1009562. eCollection 2021 Nov.
8
Graph Embedding Deep Learning Guides Microbial Biomarkers' Identification.
Front Genet. 2019 Nov 22;10:1182. doi: 10.3389/fgene.2019.01182. eCollection 2019.
9
Detecting the long non-coding RNA signature related to spinal cord ependymal tumor subtype using a genome-wide methylome analysis approach.
Mol Med Rep. 2019 Aug;20(2):1531-1540. doi: 10.3892/mmr.2019.10388. Epub 2019 Jun 18.
10
Using Decision Tree Aggregation with Random Forest Model to Identify Gut Microbes Associated with Colorectal Cancer.
Genes (Basel). 2019 Feb 1;10(2):112. doi: 10.3390/genes10020112.

References

1
Eigenfaces for recognition.
J Cogn Neurosci. 1991 Winter;3(1):71-86. doi: 10.1162/jocn.1991.3.1.71.
2
Enriched random forests.
Bioinformatics. 2008 Sep 15;24(18):2010-4. doi: 10.1093/bioinformatics/btn356. Epub 2008 Jul 22.
3
Conditional variable importance for random forests.
BMC Bioinformatics. 2008 Jul 11;9:307. doi: 10.1186/1471-2105-9-307.
4
Bias in random forest variable importance measures: illustrations, sources and a solution.
BMC Bioinformatics. 2007 Jan 25;8:25. doi: 10.1186/1471-2105-8-25.
5
Gene selection and classification of microarray data using random forest.
BMC Bioinformatics. 2006 Jan 6;7:3. doi: 10.1186/1471-2105-7-3.