
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests.

Authors

Ma Li, Fan Suohai

Affiliation

School of Information Science and Technology, Jinan University, Guangzhou, 510632, China.

Publication

BMC Bioinformatics. 2017 Mar 14;18(1):169. doi: 10.1186/s12859-017-1578-z.

DOI: 10.1186/s12859-017-1578-z
PMID: 28292263
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5351181/
Abstract

BACKGROUND

The random forests algorithm is a classifier with prominent universality, a wide application range, and robustness against overfitting. However, random forests still have some drawbacks. To improve their performance, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization.

RESULTS

We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data show that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) algorithm is effective compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, we propose a hybrid random forests (RF) algorithm for feature selection and parameter optimization, which uses the minimum out-of-bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (the hybrid genetic-random forests, hybrid particle swarm-random forests, and hybrid fish swarm-random forests algorithms) can achieve the minimum OOB error and show the best generalization ability.
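
As a rough illustration of the oversampling step that CURE-SMOTE builds on, the sketch below implements plain SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It is not the authors' CURE-SMOTE: the CURE clustering stage is omitted entirely, and the function name `smote`, its parameters, and the toy data are hypothetical choices made only for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Plain SMOTE-style interpolation only; no clustering or noise removal.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (x itself excluded by identity)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        # new point lies on the segment between x and the chosen neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical toy minority class in two dimensions
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, noisy or outlying minority samples can spread into the synthetic data, which is the kind of problem a clustering step before interpolation is meant to reduce.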

CONCLUSION

The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. This feasible and effective algorithm therefore produces better classification results. Moreover, the F-value, G-mean, AUC and OOB scores of the hybrid algorithms show that they outperform the original RF algorithm. Hence, the hybrid algorithms provide a new way to perform feature selection and parameter optimization.
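
For reference, the F-value and G-mean named above are commonly computed from a binary confusion matrix as follows. This is a minimal sketch under the usual definitions (F-value as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); the abstract does not state the exact formulas the authors used, and the function name and example counts are hypothetical.

```python
import math


def f_value_and_g_mean(tp, fn, fp, tn):
    """Return (F-value, G-mean) with the minority class treated as positive."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity (TPR)
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0   # TNR
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean


# Hypothetical counts: 10 minority samples, 90 majority samples
f_value, g_mean = f_value_and_g_mean(tp=8, fn=2, fp=4, tn=86)
```

Both scores depend on how well the minority class is recovered, so unlike plain accuracy they do not look good when a classifier simply predicts the majority class for every sample.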

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/429a6869c7b1/12859_2017_1578_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/17d0c627bd0d/12859_2017_1578_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/9d8751b454fc/12859_2017_1578_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/343e80c59f69/12859_2017_1578_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/284fc6213c8d/12859_2017_1578_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/63e6055741e7/12859_2017_1578_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/f1719ab9a2d3/12859_2017_1578_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/0e922f5e42f4/12859_2017_1578_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/d035a1e6af78/12859_2017_1578_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/b3209a671b1e/12859_2017_1578_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/826e/5351181/f80868833f49/12859_2017_1578_Fig11_HTML.jpg
