Materials, Devices and Systems Division, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, England, United Kingdom.
School of Biological Sciences, The University of Manchester, Manchester, England, United Kingdom.
PLoS One. 2019 Nov 7;14(11):e0224365. doi: 10.1371/journal.pone.0224365. eCollection 2019.
Advances in neuroimaging, genomics, motion tracking, eye tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that smaller sample sizes are associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, when performed on pooled training and testing data, contributes considerably more to this bias than parameter tuning does. In addition, we explored the contributions of data dimensionality, hyper-parameter space and the number of CV folds to the bias, and compared the validation methods on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on the validation method used.
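To make the pooled-selection bias concrete, the following is a minimal sketch, not the authors' code: it assumes scikit-learn and pure-noise data with no real class signal, so an honest performance estimate should sit near chance (about 0.5). It contrasts the biased protocol (feature selection on the pooled data before CV) with an unbiased one (selection refit within each training fold, with hyper-parameter tuning in an inner loop, i.e. nested CV); the sample and feature counts are illustrative choices.

```python
# Minimal sketch of the biased vs. unbiased validation protocols discussed in
# the abstract. Data are pure noise, so accuracy should be ~0.5 on an honest
# estimate. Sample/feature sizes below are assumptions for illustration.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 50, 5000                  # small-n, high-dimensional
X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, 2, n_samples)                 # labels carry no signal

# Biased protocol: select the 10 "best" features using ALL samples (training
# and testing pooled), then run plain K-fold CV on the reduced data.
X_pooled = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_pooled, y, cv=5).mean()

# Unbiased protocol: the pipeline refits the feature selector on each
# training fold only, and GridSearchCV tunes C in an inner loop, so the
# outer folds give a nested-CV performance estimate.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
inner = GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1, 10]}, cv=5)
unbiased = cross_val_score(inner, X, y, cv=5).mean()

print(f"pooled selection + K-fold CV: {biased:.2f}")   # typically well above 0.5
print(f"nested CV:                    {unbiased:.2f}") # typically near 0.5
```

On noise data the pooled-selection run typically reports accuracy far above chance, purely from overfitting the selection step, while the nested-CV run stays near 0.5, which is the pattern of bias the simulations in the paper quantify.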