Taccaliti Edoardo, Aguilar-Ruiz Jesus S
Department of Biology, University of Naples Federico II, Naples, Italy.
School of Engineering, Pablo de Olavide University, Sevilla, 41013, Spain.
BioData Min. 2025 Aug 29;18(1):60. doi: 10.1186/s13040-025-00474-5.
Class imbalance poses a serious challenge in biomedical machine learning, particularly in genomics, where datasets combine extremely high dimensionality with very limited sample sizes. In such settings, standard classifiers tend to favor the majority class, leading to biased predictions. This is especially problematic in clinical diagnostics, where rare conditions must not be overlooked. In this study, we introduce a Kernel Density Estimation (KDE)-based oversampling approach that rebalances imbalanced genomic datasets by generating synthetic minority-class samples. Unlike conventional methods such as SMOTE, which interpolate locally between neighboring minority samples, KDE estimates the global probability distribution of the minority class and draws new samples from it, avoiding the pitfalls of local interpolation. We evaluate our method on 15 real-world genomic datasets using three classifiers (Naïve Bayes, Decision Trees, and Random Forests) and compare it with SMOTE and with baseline training without resampling. Experimental results demonstrate that KDE oversampling consistently improves classification performance, especially on metrics robust to imbalance, such as the AUC of the IMCP curve. Notably, KDE achieves superior results with tree-based models while dramatically simplifying the sampling process. This approach offers a statistically grounded and effective solution for balancing genomic datasets, with strong potential for improving fairness and accuracy in high-stakes medical decision-making.
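The resampling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the kernel, the bandwidth rule, or any implementation details, so everything below is an assumption made for illustration: a Gaussian kernel, a Scott-style bandwidth factor scaled by each feature's standard deviation, and the hypothetical function names `kde_oversample` and `rebalance`. Sampling from a Gaussian KDE amounts to picking an observed minority sample at random and adding Gaussian noise around it.

```python
import random
import statistics


def kde_oversample(minority, n_new, bandwidth=None, seed=0):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to `minority`.

    Illustrative sketch only; the kernel and bandwidth rule are assumptions,
    not the settings used in the paper.
    """
    rng = random.Random(seed)
    n, d = len(minority), len(minority[0])
    # Per-feature spread; fall back to 1.0 for constant features.
    stds = [statistics.pstdev(col) or 1.0 for col in zip(*minority)]
    if bandwidth is None:
        # Scott-style factor n^(-1/(d+4)) (assumed default).
        bandwidth = n ** (-1.0 / (d + 4))
    synthetic = []
    for _ in range(n_new):
        # A KDE is a mixture with one Gaussian per observed sample:
        # choose a component, then draw from that Gaussian.
        center = rng.choice(minority)
        synthetic.append(
            [rng.gauss(x, bandwidth * s) for x, s in zip(center, stds)]
        )
    return synthetic


def rebalance(X, y, minority_label, seed=0):
    """Append synthetic minority rows until both classes have equal size."""
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_needed = max(len(y) - 2 * len(minority), 0)
    new_rows = kde_oversample(minority, n_needed, seed=seed)
    return X + new_rows, y + [minority_label] * len(new_rows)


# Toy data: four majority rows (label 0), two minority rows (label 1).
X = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [0.3, 1.2], [2.0, 3.0], [2.2, 3.1]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = rebalance(X, y, minority_label=1)
```

In contrast to SMOTE, no nearest-neighbor search or pairwise interpolation is involved here. In practice the oversampling would be applied to the training split only, so that synthetic samples never leak into evaluation data.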