Mahmood Zafar, Jamel Leila, Salem Dina Ahmed, Ashraf Imran
Department of Computer Science, University of Gujrat, Gujrat, Pakistan.
Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, 11671, Riyadh, Saudi Arabia.
Sci Rep. 2025 Aug 25;15(1):31245. doi: 10.1038/s41598-025-13929-w.
Several issues prevent traditional classifiers from reaching an acceptable level of performance when learning from multi-class problems. One of the main problems is the unequal distribution of samples, which significantly reduces the efficiency of the underlying classifier when combined with incompatible optimization benchmarks and data overlap. The combined effects of imbalanced distribution and sample overlap around the class boundaries compromise classifier performance beyond the expected level, and the problem worsens as the number of classes increases. Although imbalanced data and class overlap together have a more significant effect on classifier performance, their combined effect has received the least attention in the research. To improve learning from imbalanced multi-class data with overlapping shared attributes, this work introduces SVM++, a modified version of the support vector machine (SVM). The method comprises three steps. Algorithm-1 finds the overlapping samples and splits the training set into overlapping and non-overlapping samples. Algorithm-2 then separates the overlapped data into the Critical-1 and Critical-2 regions. The Critical-1 region consists of overlapped samples that share similar characteristics, which is the main cause of degraded classification performance. In the third step, an algorithm based on the mean of the maximum and minimum distances of the Critical-1 region samples is proposed, which improves the traditional SVM kernel function that maps to a higher dimension. Thirty real datasets with various degrees of imbalance and overlap are used to compare the proposed algorithms with state-of-the-art classifiers.
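The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how overlap is detected, how Critical-1 is distinguished from Critical-2, or how the distance statistic enters the kernel, so every rule below is an assumption made for illustration and not the authors' SVM++ method: a sample is treated as overlapping if any of its k nearest neighbours carries a different label, as Critical-1 if at least half of them do, and the mean of the largest and smallest pairwise distance among Critical-1 samples is used as the width of a standard RBF kernel. The function names (`split_regions`, `mean_of_extreme_distances`, `rbf_kernel`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbour_labels(i, X, y, k):
    """Labels of the k nearest neighbours of sample i (excluding itself)."""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: euclidean(X[i], X[j]))
    return [y[j] for j in order[:k]]


def split_regions(X, y, k=3):
    """Return index lists (non_overlapping, critical_1, critical_2).

    Assumed rule (not from the paper): a sample overlaps if any of its
    k neighbours has another label; it is Critical-1 if at least half do.
    """
    non_overlapping, critical_1, critical_2 = [], [], []
    for i in range(len(X)):
        foreign = sum(1 for lab in neighbour_labels(i, X, y, k) if lab != y[i])
        if foreign == 0:
            non_overlapping.append(i)
        elif 2 * foreign >= k:
            critical_1.append(i)
        else:
            critical_2.append(i)
    return non_overlapping, critical_1, critical_2


def mean_of_extreme_distances(points):
    """Mean of the largest and smallest pairwise distance among points."""
    dists = [euclidean(a, b) for a, b in combinations(points, 2)]
    if not dists:
        raise ValueError("need at least two points")
    return (max(dists) + min(dists)) / 2.0


def rbf_kernel(a, b, width):
    """Standard RBF kernel; taking `width` from Critical-1 is an assumption."""
    return math.exp(-euclidean(a, b) ** 2 / (2.0 * width ** 2))


# Toy 1-D data: two classes whose samples interleave between x = 5 and x = 6.5.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (5.0,), (5.5,),
     (6.0,), (6.5,), (8.2,), (9.5,), (10.5,), (11.5,)]
y = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

non_overlapping, critical_1, critical_2 = split_regions(X, y, k=3)
width = mean_of_extreme_distances([X[i] for i in critical_1])
print("non-overlapping:", non_overlapping)
print("Critical-1:", critical_1, "Critical-2:", critical_2)
print("kernel width from Critical-1:", width)
```

On this toy data the interleaved samples near the class boundary fall into Critical-1, and the kernel width is derived from those samples only. A function like `rbf_kernel` could then be evaluated pairwise to build a precomputed Gram matrix for an SVM library, though whether SVM++ uses the statistic in this way is not stated in the abstract.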