Vuttipittayamongkol Pattaramon, Elyan Eyad
School of Computing Science and Digital Media, Robert Gordon University, Aberdeen, AB10 7GJ, UK.
Int J Neural Syst. 2020 Aug;30(8):2050043. doi: 10.1142/S0129065720500434. Epub 2020 Jul 17.
Classification of imbalanced datasets has attracted substantial research interest over the past decades. Imbalanced datasets are common in domains such as health, finance and security. A wide range of solutions for handling imbalanced datasets focus mainly on the class distribution problem and aim to provide more balanced datasets by means of resampling. However, existing literature shows that class overlap has a greater negative impact on the learning process than class distribution. In this paper, we propose overlap-based undersampling methods that maximize the visibility of minority class instances in the overlapping region. This is achieved by using soft clustering together with an elimination threshold, adaptable to the degree of overlap, to identify and eliminate negative instances in the overlapping region. For more accurate clustering and detection of overlapped negative instances, the presence of the minority class in borderline areas is emphasized by means of oversampling. Extensive experiments were carried out using simulated and real-world datasets covering a wide range of imbalance and overlap scenarios, including extreme cases. Results show significant improvement in sensitivity and performance competitive with well-established and state-of-the-art methods.
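The general idea of removing negative instances from the overlapping region with soft clustering can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes fuzzy c-means as the soft clustering step, two clusters, and a fixed threshold passed in by the caller; the abstract does not specify how the threshold adapts to the overlap degree, and the borderline oversampling step is omitted. The function names `fuzzy_c_means` and `overlap_undersample` and all parameters are hypothetical.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means. Returns (centers, U), where U[i, k] is the
    membership of instance i in cluster k and each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                         # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # standard FCM update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def overlap_undersample(X, y, threshold=0.5, minority_label=1, seed=0):
    """Drop majority instances whose membership in the minority-dominated
    cluster exceeds `threshold` (assumed fixed here, not adaptive)."""
    _, U = fuzzy_c_means(X, n_clusters=2, seed=seed)
    is_min = y == minority_label
    # Assumption: the cluster with the highest mean membership among
    # minority instances is treated as the minority cluster.
    minority_cluster = U[is_min].mean(axis=0).argmax()
    overlapped = (~is_min) & (U[:, minority_cluster] > threshold)
    keep = ~overlapped
    return X[keep], y[keep]


# Small synthetic example: two overlapping Gaussian classes.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                    rng.normal(2.0, 1.0, (20, 2))])
y_demo = np.array([0] * 200 + [1] * 20)
X_res, y_res = overlap_undersample(X_demo, y_demo, threshold=0.5)
```

In this sketch a lower `threshold` removes more negative instances, which favours sensitivity at the cost of discarding more majority class data; every minority instance is kept regardless of the threshold.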