School of Information Technologies (J12), The University of Sydney, NSW 2006, Australia.
BMC Genomics. 2009 Dec 3;10 Suppl 3(Suppl 3):S34. doi: 10.1186/1471-2164-10-S3-S34.
Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. The hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and are then combined with the minority class to form a balanced dataset.
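The final ranking-and-balancing step can be illustrated with a minimal sketch. It is not the authors' implementation: the PSO search itself is omitted, the per-sample merit scores are assumed to be given already (one list per evaluator), and the fusion rule (a plain average) and the names `fuse_scores` and `balance_by_rank` are hypothetical choices made for illustration only.

```python
from statistics import mean


def fuse_scores(score_lists):
    """Average per-sample merit scores from several evaluators.

    Assumption: plain averaging stands in for the paper's fusion step.
    """
    return [mean(values) for values in zip(*score_lists)]


def balance_by_rank(majority, minority, merit):
    """Keep the highest-merit majority samples, one per minority sample,
    then append the minority class to form a balanced dataset."""
    if len(majority) != len(merit):
        raise ValueError("need exactly one merit score per majority sample")
    k = min(len(minority), len(majority))
    # Indices of majority samples, best merit first.
    order = sorted(range(len(majority)), key=lambda i: merit[i], reverse=True)
    # Restore the original ordering among the k selected samples.
    kept = [majority[i] for i in sorted(order[:k])]
    return kept + list(minority)


# Toy data: five majority samples, two minority samples.
majority = ["m0", "m1", "m2", "m3", "m4"]
minority = ["p0", "p1"]

# Hypothetical merit scores from two evaluators (not real results).
scores_a = [0.2, 0.9, 0.4, 0.8, 0.1]
scores_b = [0.3, 0.7, 0.5, 0.9, 0.2]

fused = fuse_scores([scores_a, scores_b])
balanced = balance_by_rank(majority, minority, fused)
print(balanced)
```

In this toy run the two majority samples with the highest fused merit are retained alongside both minority samples, giving a balanced set of four.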
One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods across three different metrics. The sampling results also generalize well across different types of classification algorithms, indicating the advantage of the information fusion applied in the hybrid system.
The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly across datasets, the proposed hybrid system generalizes better and thereby alleviates the method-data dependency problem. From a biological perspective, the system singles out highly ranked samples for further investigation, which may lead to the discovery of new conditions or disease subtypes.