RSMOTE: improving classification performance over imbalanced medical datasets.

Authors

Naseriparsa Mehdi, Al-Shammari Ahmed, Sheng Ming, Zhang Yong, Zhou Rui

Affiliations

Swinburne University of Technology, Hawthorn, Australia.

University of Al-Qadisiyah, Al Diwaniyah, Iraq.

Publication information

Health Inf Sci Syst. 2020 Jun 12;8(1):22. doi: 10.1007/s13755-020-00112-w. eCollection 2020 Dec.

Abstract

INTRODUCTION

Medical diagnosis is a crucial step in patient treatment. However, diagnosis is prone to bias due to imbalanced datasets. To address this problem, the synthetic minority oversampling technique (SMOTE) was proposed; it generates new synthetic samples at the data level to balance the minority and majority classes. However, the synthetic samples are generated at random, which causes a class mixture problem and, in turn, degraded classification performance and biased diagnosis.
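The following is a minimal sketch of the classic SMOTE interpolation step, not the authors' code. It assumes the standard formulation in which a synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are illustrative assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative only: assumes at least two minority samples, each a tuple
    of numeric features.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (hypothetical values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because the base sample and neighbour are chosen without regard to nearby majority samples, a synthetic point can land in a region dominated by the majority class, which is the class mixture risk described above.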

PURPOSE

To overcome the shortcomings of SMOTE, several modified methods have been proposed that generate synthetic samples along the line segments between selected minority samples. Most of these methods adopt one of two policies for selecting the minority samples: borderline region sampling or safe region sampling. However, both suffer from an over-generalisation problem. We propose a modified SMOTE-based resampling method, RSMOTE, to alleviate the imbalanced medical dataset problem. We provide an in-depth analysis and verify the performance of RSMOTE on imbalanced medical datasets.

METHODS

The proposed RSMOTE divides the minority sample domain into four regions (normal, semi-normal, semi-critical, and critical) based on minority sample density analysis. RSMOTE identifies the minority sample regions globally and applies resampling near a specific group of samples.
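The abstract does not state how minority sample density is measured or where the region boundaries lie, so the sketch below is only a stand-in to make the four-region idea concrete. It assumes density is the fraction of minority samples among a point's k nearest neighbours, that the four regions split the range [0, 1] into equal quarters, and that "normal" denotes the highest density and "critical" the lowest. The names `minority_density` and `assign_region`, the cut points, and the toy data are all hypothetical.

```python
import math

# Assumed ordering from lowest to highest minority density
REGIONS = ("critical", "semi-critical", "semi-normal", "normal")


def minority_density(sample, minority, majority, k=4):
    """Fraction of the k nearest neighbours of `sample` that are minority.

    Assumed density measure, not taken from the paper. `sample` must be an
    element of `minority` so that it can be excluded by identity.
    """
    labelled = [(math.dist(sample, p), 1) for p in minority if p is not sample]
    labelled += [(math.dist(sample, p), 0) for p in majority]
    nearest = sorted(labelled)[:k]
    return sum(label for _, label in nearest) / len(nearest)


def assign_region(density):
    """Map a density in [0, 1] to one of four regions (assumed equal-width bins)."""
    index = min(int(density * 4), 3)  # density 1.0 falls into the top bin
    return REGIONS[index]


# Toy data (hypothetical): a dense minority cluster plus one minority
# sample surrounded by majority samples
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]
majority = [(3.0, 3.1), (3.1, 3.0), (2.9, 3.0), (3.0, 2.9), (5.0, 5.0)]

regions = [assign_region(minority_density(s, minority, majority)) for s in minority]
```

Under these assumptions, the samples in the dense cluster are labelled "normal" and the isolated sample at (3.0, 3.0) is labelled "critical". A resampler could then restrict interpolation to samples from one chosen region.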

RESULTS

Our analysis and experiments verify that generating synthetic samples in regions with high minority sample density improves classification performance, because the risk of class mixture is low. Unlike some safe region methods, RSMOTE determines the region of minority samples on a global basis, thus removing the over-generalisation problem. Both classic and additional evaluation metrics are used to measure the effectiveness of the modified method: Recall, FP Rate, Precision, F-Measure, ROC area, and Average Aggregated Metric. We carried out experiments on various imbalanced medical datasets.
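As a quick reference, four of the listed metrics can be computed directly from confusion-matrix counts using their standard definitions. The sketch below treats the minority class as the positive class. The function name `binary_metrics` and the toy labels are assumptions. ROC area needs classifier scores rather than hard labels, and the Average Aggregated Metric is not defined in the abstract, so neither is shown.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Recall, FP rate, precision and F-measure from hard binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard against empty denominators by reporting 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "recall": recall,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Toy example (hypothetical): 3 minority (positive) and 7 majority samples
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

On imbalanced data, accuracy alone can look high even when most minority cases are missed, which is why recall and FP rate on the minority class are reported separately.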

CONCLUSION

Based on minority sample density analysis, we propose the RSMOTE method, which divides the minority sample domain into four regions. RSMOTE comprises four resampling methods, each of which resamples a specific region. In the experiments, resampling in regions with high minority sample density produced better results, while resampling in regions with lower minority sample density produced inferior results. We therefore conclude that RSMOTE is a more flexible resampling method for imbalanced medical datasets, capable of generating samples in regions of varying minority sample density.
