Gurcan Fatih, Soylu Ahmet
Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Karadeniz Technical University, 61080 Trabzon, Turkey.
Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, 2815 Gjøvik, Norway.
Cancers (Basel). 2024 Dec 2;16(23):4046. doi: 10.3390/cancers16234046.
BACKGROUND/OBJECTIVES: This study examines the effectiveness of different resampling methods and classifier models for handling imbalanced datasets, with a specific focus on critical healthcare applications such as cancer diagnosis and prognosis.
METHODS: To address class imbalance, Generative Adversarial Networks (GANs), which use deep neural network architectures to generate high-quality synthetic data, were used in place of traditional sampling methods such as SMOTE and ADASYN. The study highlights the advantage of GANs in creating realistic, diverse, and homogeneous samples for the minority class, which helps mitigate the diagnostic challenges posed by imbalanced data. Four types of classifiers (Boosting, Bagging, Linear, and Non-linear) were evaluated using accuracy, precision, recall, F1 score, and ROC AUC.
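As an illustration of why several metrics are reported rather than accuracy alone, the following minimal Python sketch computes accuracy, precision, recall, F1 score, and ROC AUC from scratch on a small imbalanced toy example. The labels, scores, and function names here are hypothetical and are not taken from the study; the ROC AUC is computed with the standard pairwise (Mann-Whitney) formulation, which is an assumption about presentation, not a description of the authors' code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 18 majority (0) and 2 minority (1) samples.
y_true = [0] * 18 + [1] * 2

# A degenerate model that always predicts the majority class.
majority_pred = [0] * 20
majority_metrics = classification_metrics(y_true, majority_pred)

# A hypothetical scoring model, thresholded at 0.5.
scores = [0.1] * 16 + [0.6, 0.7] + [0.4, 0.9]
scored_pred = [1 if s >= 0.5 else 0 for s in scores]
scored_metrics = classification_metrics(y_true, scored_pred)
scored_auc = roc_auc(y_true, scores)

print("always-majority:", majority_metrics)
print("scoring model:  ", scored_metrics, "roc_auc =", round(scored_auc, 4))
```

On this toy data the always-majority model reaches 90% accuracy while detecting no minority samples (recall 0), which is the kind of limitation that recall, F1 score, and ROC AUC are meant to expose.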
RESULTS: Baseline performance without resampling showed significant limitations, underscoring the need for resampling strategies. Using GAN-generated data notably improved both the detection of minority instances and overall classification performance. The average ROC AUC rose from a baseline of approximately 0.8276 to over 0.9734, indicating more balanced detection across classes. With GAN-based resampling, the GradientBoosting classifier achieved a ROC AUC of 0.9890, the highest among all models.
CONCLUSIONS: The findings indicate that advanced models such as Boosting and Bagging, when paired with effective resampling strategies such as GANs, are better suited to handling imbalanced datasets and improving predictive accuracy in healthcare applications.