Improving classification on imbalanced genomic data via KDE-based synthetic sampling.

Authors

Taccaliti Edoardo, Aguilar-Ruiz Jesus S

Affiliations

Department of Biology, University of Naples Federico II, Naples, Italy.

School of Engineering, Pablo de Olavide University, Sevilla, 41013, Spain.

Publication information

BioData Min. 2025 Aug 29;18(1):60. doi: 10.1186/s13040-025-00474-5.

Abstract

Class imbalance poses a serious challenge in biomedical machine learning, particularly in genomics, where datasets are characterized by extremely high dimensionality and very limited sample sizes. In such settings, standard classifiers tend to favor the majority class, leading to biased predictions, an especially problematic issue in clinical diagnostics, where rare conditions must not be overlooked. In this study, we introduce a Kernel Density Estimation (KDE)-based oversampling approach to rebalance imbalanced genomic datasets by generating synthetic minority class samples. Unlike conventional methods such as SMOTE, KDE estimates the global probability distribution of the minority class and resamples accordingly, avoiding local interpolation pitfalls. We evaluate our method on 15 real-world genomic datasets using three classifiers (Naïve Bayes, Decision Trees, and Random Forests) and compare it to SMOTE and baseline training. Experimental results demonstrate that KDE oversampling consistently improves classification performance, especially in metrics robust to imbalance, such as AUC of the IMCP curve. Notably, KDE achieves superior results in tree-based models while dramatically simplifying the sampling process. This approach offers a statistically grounded and effective solution for balancing genomic datasets, with strong potential for improving fairness and accuracy in high-stakes medical decision-making.
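
To make the resampling idea concrete, here is a minimal sketch of KDE-based oversampling for a binary problem. It is not the authors' implementation: the abstract does not state the kernel, the bandwidth rule, or the target class ratio, so the Gaussian kernel, the per-feature Scott's-rule bandwidth, the choice to match the majority class size, and the function names `kde_oversample` and `balance` are all assumptions made for illustration. The sketch relies on the fact that drawing from a Gaussian KDE is equivalent to picking one observed minority sample at random and adding Gaussian noise scaled by the bandwidth.

```python
import numpy as np


def kde_oversample(X_min, n_new, bandwidth=None, seed=0):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to X_min.

    X_min: (n, d) array of minority-class samples, n >= 2.
    bandwidth: scalar or (d,) array; if None, an assumed default of
    Scott's rule scaled by each feature's standard deviation is used.
    """
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    rng = np.random.default_rng(seed)
    if bandwidth is None:
        # Assumed default: h_j = sigma_j * n^(-1 / (d + 4)) (Scott's rule).
        bandwidth = n ** (-1.0 / (d + 4)) * X_min.std(axis=0, ddof=1)
    # Sampling from a Gaussian KDE: choose a kernel centre uniformly
    # among the observed samples, then perturb it with Gaussian noise.
    centres = X_min[rng.integers(0, n, size=n_new)]
    noise = rng.normal(size=(n_new, d)) * bandwidth
    return centres + noise


def balance(X, y, minority_label, seed=0):
    """Append synthetic minority rows until both classes have equal size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    n_needed = int((~is_min).sum() - is_min.sum())
    if n_needed <= 0:
        return X, y  # already balanced, nothing to add
    X_new = kde_oversample(X[is_min], n_needed, seed=seed)
    y_new = np.full(n_needed, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```

Calling `balance(X_train, y_train, minority_label=1)` on the training split only (never on the test split) returns the original rows followed by the synthetic ones. Unlike SMOTE, no nearest-neighbour search is involved; the only tuning knob in this sketch is the bandwidth. With thousands of features and few samples, the Scott factor is close to 1, so a smaller hand-chosen bandwidth may be preferable.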

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7282/12395650/73993aa42a7a/13040_2025_474_Figa_HTML.jpg
