Schlieker Laura, Telaar Anna, Lueking Angelika, Schulz-Knappe Peter, Theek Carmen, Ickstadt Katja
ClinStat GmbH, Max-Planck-Str. 22a, 50858 Cologne, formerly Protagen AG, Otto-Hahn-Str. 15, 44227, Dortmund, Germany.
Berufskolleg am Wasserturm, 46399 Bocholt, formerly Protagen AG, Otto-Hahn-Str. 15, 44227, Dortmund, Germany.
Biom J. 2017 Sep;59(5):948-966. doi: 10.1002/bimj.201600207. Epub 2017 Jun 19.
Classifying a population by a specific trait is a major task in medicine, for example in a diagnostic setting, where groups of patients with specific diseases are identified, and in predictive medicine, where patients are assigned to disease severity classes that might profit from different treatments. When those subgroups are small, as in rare diseases, imbalance between the classes is the rule rather than the exception. Imbalance makes statistical classification problematic: many observations are assigned to the majority class, so the error rate of the minority class is high while the error rate of the majority class is low. This case study investigates class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods, cost-sensitive learning approaches). We evaluate all approaches with a scoring system based on the classification results. The case study uses one high-dimensional multiplex autoimmune assay dataset that describes immune response to antigens and consists of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings on other datasets or in large-scale simulation studies, we claim that this work has the potential to raise practitioners' awareness of the class imbalance problem, and it stresses the importance of considering methods that compensate for class imbalance.
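The design described above, reducing the RA class step by step and watching the per-class error rates diverge, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The class sizes, the function names (`reduce_class`, `per_class_error`, `majority_baseline`), and the trivial majority-vote baseline are assumptions made for illustration only. The sketch does not implement Random Forests, PPLS-DA, or any of the compensation methods.

```python
import random
from collections import Counter


def reduce_class(labels, target, keep, seed=0):
    """Return sorted indices of a dataset in which class `target`
    is randomly cut down to `keep` observations (hypothetical helper)."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    other_idx = [i for i, y in enumerate(labels) if y != target]
    if keep > len(target_idx):
        raise ValueError("cannot keep more observations than the class has")
    return sorted(other_idx + rng.sample(target_idx, keep))


def per_class_error(y_true, y_pred):
    """Error rate computed separately for each true class."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: errors[c] / totals[c] for c in totals}


def majority_baseline(y_train):
    """Label of the most frequent class; a stand-in for a classifier
    that is dominated by the majority class."""
    return Counter(y_train).most_common(1)[0][0]


# Assumed class sizes, NOT taken from the study.
labels = ["RA"] * 100 + ["SLE"] * 100

for keep in (100, 50, 25, 10):
    idx = reduce_class(labels, "RA", keep)
    y = [labels[i] for i in idx]
    pred = [majority_baseline(y)] * len(y)
    print(keep, per_class_error(y, pred))
```

Once RA is the smaller class, the baseline misclassifies every RA observation while its SLE error stays at zero. This is the extreme form of the pattern the abstract describes, and it shows why per-class error rates are more informative than overall accuracy under imbalance.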