Enterprise Ireland Medical and Engineering Technologies Gateway, GMIT, Galway, Ireland.
Marine and Freshwater Research Centre, GMIT, Galway, Ireland.
Stat Methods Med Res. 2021 Mar;30(3):916-925. doi: 10.1177/0962280220980484. Epub 2020 Dec 28.
Imbalance between positive and negative outcomes, known as class imbalance, is common in medical data. Imbalanced data hinder the performance of conventional classification methods, which aim to improve the overall accuracy of the model without accounting for the uneven distribution of the classes. To rectify this, the data can be resampled by oversampling the positive (minority) class until the classes are approximately equally represented. A prediction model such as a gradient boosting algorithm can then be fitted with greater confidence. This classification method allows for non-linear relationships and deep interaction effects, while concentrating on difficult regions by iteratively shifting towards problematic observations. In this study, we demonstrate the application of these methods to medical data and develop a practical framework for evaluating the features that contribute to the probability of stroke.
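The two steps described in the abstract, oversampling the minority class and then fitting a gradient boosting model, can be illustrated with a minimal sketch. The abstract does not say which oversampling scheme or boosting implementation the study used, so everything here is an assumption for illustration: plain random oversampling with replacement, boosting on the log-loss with one-split regression stumps, hypothetical function names (`oversample_minority`, `fit_boosting`, `predict_proba`) and a synthetic one-feature toy dataset. It is not the authors' code, and it assumes both classes are present and at least one feature varies.

```python
import math
import random


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Least-squares regression stump: returns (feature, threshold, left_value, right_value)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must leave observations on both sides
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]


def fit_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting on the log-loss with stumps as base learners."""
    scores = [0.0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log-loss: largest for badly predicted rows,
        # so each new stump concentrates on the currently problematic observations.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        lv, rv = learning_rate * lv, learning_rate * rv
        stumps.append((j, t, lv, rv))
        scores = [s + (lv if row[j] <= t else rv) for row, s in zip(X, scores)]
    return stumps


def predict_proba(stumps, row):
    """Predicted probability of the positive class for one observation."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))


# Synthetic toy data (hypothetical): one feature, 18 negatives and 2 positives.
X = [[float(v)] for v in range(18)] + [[20.0], [22.0]]
y = [0] * 18 + [1] * 2

X_bal, y_bal = oversample_minority(X, y)  # classes are now 18 vs 18
model = fit_boosting(X_bal, y_bal)
```

Calling `predict_proba(model, [21.0])` then returns the fitted probability of the positive class for a new observation. In practice the resampling should be applied to the training split only, so that duplicated minority rows do not leak into the data used for evaluation.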