Mazurowski Maciej A, Habas Piotr A, Zurada Jacek M, Lo Joseph Y, Baker Jay A, Tourassi Georgia D
Computational Intelligence Lab, Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY 40292, USA.
Neural Netw. 2008 Mar-Apr;21(2-3):427-36. doi: 10.1016/j.neunet.2007.12.031. Epub 2007 Dec 27.
This study investigates the effect of class imbalance in training data when developing neural network classifiers for computer-aided medical diagnosis. The investigation is performed in the presence of other characteristics typical of medical data, namely small training sample size, a large number of features, and correlations between features. Two methods of neural network training are explored: classical backpropagation (BP) and particle swarm optimization (PSO) with clinically relevant training criteria. An experimental study is performed using simulated data, and the conclusions are further validated on real clinical data for breast cancer diagnosis. The results show that classifier performance deteriorates with even modest class imbalance in the training data. Further, BP is shown to be generally preferable to PSO for imbalanced training data, especially with a small sample size and a large number of features. Finally, neither oversampling nor the no-compensation approach is clearly preferable, and some guidance is provided on choosing between them.
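The abstract does not describe how oversampling was carried out, so the following is only a minimal sketch of one common variant: random oversampling with replacement, which duplicates minority-class cases until all classes are the same size. The function name `oversample_minority`, the `seed` parameter, and the toy data are illustrative assumptions, not taken from the study.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate cases of smaller classes until every class
    has as many cases as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the feature vectors by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest one.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Draw the missing cases with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy imbalanced training set: four negative cases, one positive case.
samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

The "no compensation" alternative mentioned in the abstract corresponds to training on `samples` and `labels` unchanged. Oversampling should be applied to the training set only, so that evaluation data keep their original class proportions.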