Beinecke Jacqueline, Heider Dominik
Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Str. 6, 35043, Marburg, Germany.
BioData Min. 2021 Nov 29;14(1):49. doi: 10.1186/s13040-021-00283-6.
Clinical data sets have distinctive properties that pose several challenges for machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and contain missing values. While feature selection approaches and imputation techniques address the high dimensionality and the missing values, class imbalance is typically addressed using augmentation techniques. However, these techniques were developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and in subsequent machine learning-based classification. Gaussian Noise Up-Sampling (GNUS) is generally, though not always, as good as SMOTE and ADASYN, and even outperforms them on some data sets. However, in some cases augmentation did not improve classification at all.
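The abstract does not spell out how GNUS works, so the following is only a minimal sketch of one plausible reading of "Gaussian noise up-sampling": minority-class rows are resampled with replacement and perturbed with zero-mean Gaussian noise. The function name `gaussian_noise_upsample`, the `noise_scale` parameter, the scaling of the noise by the per-feature standard deviation of the minority class, and the toy data are all assumptions made for illustration, not details taken from the study.

```python
import random
import statistics


def gaussian_noise_upsample(rows, labels, minority_label, noise_scale=0.1, seed=0):
    """Hypothetical helper: balance a binary data set with noisy minority copies.

    Assumption (not from the paper): noise for each feature is drawn from
    N(0, (noise_scale * std)^2), where std is that feature's standard
    deviation within the minority class.
    """
    rng = random.Random(seed)
    minority = [r for r, lab in zip(rows, labels) if lab == minority_label]

    # Number of synthetic rows needed so both classes end up the same size.
    n_needed = len(rows) - 2 * len(minority)
    if not minority or n_needed <= 0:
        return list(rows), list(labels)

    # Per-feature standard deviation of the minority class.
    stds = [statistics.pstdev(col) for col in zip(*minority)]

    new_rows = []
    for _ in range(n_needed):
        base = rng.choice(minority)  # resample an existing minority row
        new_rows.append(
            [v + rng.gauss(0.0, noise_scale * s) for v, s in zip(base, stds)]
        )

    return list(rows) + new_rows, list(labels) + [minority_label] * n_needed


# Toy data: four majority rows (label 0) and two minority rows (label 1).
rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5], [3.0, 20.0], [3.2, 21.0]]
labels = [0, 0, 0, 0, 1, 1]
rows_bal, labels_bal = gaussian_noise_upsample(rows, labels, minority_label=1)
print(len(rows_bal), labels_bal.count(0), labels_bal.count(1))
```

Unlike SMOTE and ADASYN, which interpolate between neighboring minority samples, this sketch only jitters existing samples, so it needs no nearest-neighbor search. That may matter when the minority class has very few members, as is common in clinical data sets.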