Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data.

Affiliations

School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China.

School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK.

Publication information

BMC Bioinformatics. 2020 Mar 23;21(1):121. doi: 10.1186/s12859-020-3411-3.

DOI: 10.1186/s12859-020-3411-3
PMID: 32293252
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7092448/
Abstract

BACKGROUND

Feature selection in class-imbalance learning has gained increasing attention in recent years due to the massive growth of high-dimensional class-imbalanced data across many scientific fields. In addition to reducing model complexity and discovering key biomarkers, feature selection is an effective way to combat the class overlap that may arise in such data and that becomes crucial in determining classification performance. However, ordinary feature selection techniques for classification cannot simply be applied to class-imbalanced data without adjustment. Thus, more efficient feature selection techniques must be developed for complicated class-imbalanced data, especially in the high-dimensional setting.

RESULTS

We proposed an algorithm called sssHD to achieve stable sparse feature selection and applied it to complicated class-imbalanced data. sssHD is based on the Hellinger distance (HD) coupled with sparse regularization techniques. We stated that the Hellinger distance is not only class-insensitive but also translation-invariant. Simulation results indicate that the HD-based selection algorithm is effective in recognizing key features and controlling false discoveries in class-imbalance learning. Five gene expression datasets were also used to test the performance of sssHD, and it was compared with several existing selection procedures. The results show that sssHD is highly competitive in terms of five assessment metrics. In addition, sssHD shows only limited differences between performing and not performing re-balance preprocessing.
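
To make the role of the Hellinger distance concrete, the following is a minimal sketch of a univariate HD filter, not the sssHD algorithm itself (which couples HD with sparse regularization, a step not reproduced here). It assumes each feature is roughly normal within each class, so that the closed form for two normal distributions, H^2 = 1 - sqrt(2*s0*s1 / (s0^2 + s1^2)) * exp(-(m0 - m1)^2 / (4*(s0^2 + s1^2))), can be used. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def hellinger_normal(mu0, sd0, mu1, sd1):
    """Hellinger distance between N(mu0, sd0^2) and N(mu1, sd1^2)."""
    var_sum = sd0 ** 2 + sd1 ** 2
    # Bhattacharyya coefficient of the two normal densities.
    bc = math.sqrt(2.0 * sd0 * sd1 / var_sum) * math.exp(
        -((mu0 - mu1) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


def feature_hd(values, labels):
    """Plug-in HD for one feature: fit a normal per class, then compare."""
    x0 = [v for v, y in zip(values, labels) if y == 0]
    x1 = [v for v, y in zip(values, labels) if y == 1]
    return hellinger_normal(
        statistics.fmean(x0), statistics.stdev(x0),
        statistics.fmean(x1), statistics.stdev(x1),
    )


def rank_features(columns, labels):
    """Return feature indices sorted by decreasing HD, plus the scores."""
    scores = [feature_hd(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy imbalanced data (an assumption for illustration): 200 majority and
# 20 minority samples, 5 features, only feature 0 differs between classes.
rng = random.Random(0)
labels = [0] * 200 + [1] * 20
columns = []
for j in range(5):
    shift = 2.0 if j == 0 else 0.0
    columns.append([rng.gauss(shift * y, 1.0) for y in labels])

order, scores = rank_features(columns, labels)
print("top-ranked feature:", order[0])

# Translation invariance: shifting both class distributions by the same
# constant leaves the distance unchanged.
print(math.isclose(hellinger_normal(0, 1, 2, 1), hellinger_normal(5, 1, 7, 1)))
```

The score depends only on the two class-conditional distributions, not on how many samples each class contributes, which is one way to read the class-insensitivity property mentioned above. In practice the small minority class still makes its estimated mean and standard deviation noisier.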

CONCLUSIONS

sssHD is a simple, practical feature selection method for high-dimensional class-imbalanced data and can serve as an alternative for feature selection on such data. sssHD can be easily extended by connecting it with different re-balance preprocessing, different sparse regularization structures as well as different classifiers. As such, the algorithm is extremely general and has a wide range of applicability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe9/7092448/929771ad9fef/12859_2020_3411_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe9/7092448/a491cace8063/12859_2020_3411_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe9/7092448/8be81e2bcfc1/12859_2020_3411_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe9/7092448/e2797e3e1734/12859_2020_3411_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe9/7092448/a8f0b08faf6b/12859_2020_3411_Fig5_HTML.jpg

Similar articles

1
Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data.
BMC Bioinformatics. 2020 Mar 23;21(1):121. doi: 10.1186/s12859-020-3411-3.
2
Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance.
Metabolites. 2021 Jun 14;11(6):389. doi: 10.3390/metabo11060389.
3
Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm.
Genes (Basel). 2020 Jun 27;11(7):717. doi: 10.3390/genes11070717.
4
Imbalanced biomedical data classification using self-adaptive multilayer ELM combined with dynamic GAN.
Biomed Eng Online. 2018 Dec 4;17(1):181. doi: 10.1186/s12938-018-0604-3.
5
Class-imbalanced classifiers for high-dimensional data.
Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.
6
A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data.
Comput Biol Chem. 2019 Jun;80:121-127. doi: 10.1016/j.compbiolchem.2019.03.017. Epub 2019 Mar 24.
7
A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets.
Int J Neural Syst. 2017 Sep;27(6):1750028. doi: 10.1142/S0129065717500289. Epub 2017 Apr 11.
8
Experimental Study and Comparison of Imbalance Ensemble Classifiers with Dynamic Selection Strategy.
Entropy (Basel). 2021 Jun 28;23(7):822. doi: 10.3390/e23070822.
9
An experimental comparison of feature selection methods on two-class biomedical datasets.
Comput Biol Med. 2015 Nov 1;66:1-10. doi: 10.1016/j.compbiomed.2015.08.010. Epub 2015 Aug 24.
10
Class prediction for high-dimensional class-imbalanced data.
BMC Bioinformatics. 2010 Oct 20;11:523. doi: 10.1186/1471-2105-11-523.

Cited by

1
Processing imbalanced medical data at the data level with assisted-reproduction data as an example.
BioData Min. 2024 Sep 4;17(1):29. doi: 10.1186/s13040-024-00384-y.
2
A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection.
PeerJ Comput Sci. 2023 Feb 13;9:e1229. doi: 10.7717/peerj-cs.1229. eCollection 2023.
3
Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study.
PeerJ Comput Sci. 2021 Dec 24;7:e832. doi: 10.7717/peerj-cs.832. eCollection 2021.
4
ACP-DA: Improving the Prediction of Anticancer Peptides Using Data Augmentation.
Front Genet. 2021 Jun 30;12:698477. doi: 10.3389/fgene.2021.698477. eCollection 2021.
5
Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance.
Metabolites. 2021 Jun 14;11(6):389. doi: 10.3390/metabo11060389.
6
Identifying Robust Risk Factors for Knee Osteoarthritis Progression: An Evolutionary Machine Learning Approach.
Healthcare (Basel). 2021 Mar 1;9(3):260. doi: 10.3390/healthcare9030260.
7
The Use of Hellinger Distance Undersampling Model to Improve the Classification of Disease Class in Imbalanced Medical Datasets.
Appl Bionics Biomech. 2020 Nov 4;2020:8824625. doi: 10.1155/2020/8824625. eCollection 2020.

References

1
Tuning model parameters in class-imbalanced learning with precision-recall curve.
Biom J. 2019 May;61(3):652-664. doi: 10.1002/bimj.201800148. Epub 2018 Dec 12.
2
Precrec: fast and accurate precision-recall and ROC curve calculations in R.
Bioinformatics. 2017 Jan 1;33(1):145-147. doi: 10.1093/bioinformatics/btw570. Epub 2016 Sep 1.
3
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.
PLoS One. 2015 Mar 4;10(3):e0118432. doi: 10.1371/journal.pone.0118432. eCollection 2015.
4
Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study.
Neuroimage. 2014 Feb 15;87:220-41. doi: 10.1016/j.neuroimage.2013.10.005. Epub 2013 Oct 29.
5
A Selective Review of Group Selection in High-Dimensional Models.
Stat Sci. 2012;27(4). doi: 10.1214/12-STS392.
6
Class-imbalanced classifiers for high-dimensional data.
Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.
7
Penalized classification using Fisher's linear discriminant.
J R Stat Soc Series B Stat Methodol. 2011 Nov;73(5):753-772. doi: 10.1111/j.1467-9868.2011.00783.x.
8
Class prediction for high-dimensional class-imbalanced data.
BMC Bioinformatics. 2010 Oct 20;11:523. doi: 10.1186/1471-2105-11-523.
9
Regularization Paths for Generalized Linear Models via Coordinate Descent.
J Stat Softw. 2010;33(1):1-22.
10
A group bridge approach for variable selection.
Biometrika. 2009 Jun;96(2):339-355. doi: 10.1093/biomet/asp020.