Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:2314-2319. doi: 10.1109/EMBC46164.2021.9629697.
In early-stage biomedical studies, small datasets are common because collecting samples from human subjects is costly and difficult. This complicates the validation of machine learning models, which are best suited to large datasets. In this work, we examined feature selection techniques, validation frameworks, and learning curve fitting on small simulated datasets with known underlying discriminability, with the aim of identifying a protocol for estimating and interpreting early-stage model performance and for planning future studies. Among the validation configurations examined, a nested cross-validation framework most accurately reflected the discriminability of the selected features, but at small sample sizes the relevant features were often not correctly identified during the feature selection stage. Ultimately, we recommend that: (1) filter-based feature selection methods be used to minimize overfitting to noise features, (2) statistical exploration be conducted on each dataset as a whole to estimate the level of discriminability and the feasibility of the classification problem, and (3) learning curves be fitted to nested cross-validation performance estimates to forecast accuracy at larger sample sizes and to estimate the number of samples required to converge towards best performance. This work should serve as a guideline for researchers incorporating machine learning in small-scale pilot studies.
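As a rough illustration of the nested cross-validation framework with filter-based feature selection described above, the following is a minimal sketch and not the authors' actual protocol. It assumes scikit-learn; the simulated dataset, the ANOVA F-test filter, the logistic regression classifier, the fold counts, and the hyperparameter grid are all illustrative choices that do not come from the paper. The key point it tries to show is that feature selection and tuning happen only inside the inner loop, so each outer test fold never influences which features are chosen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_accuracy(X, y, seed=0):
    """Return one accuracy estimate per outer fold (hypothetical helper)."""
    # Scaling, filter-based selection, and the classifier live in one pipeline,
    # so every step is refit on the training portion of each split only.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif)),  # assumed filter method
        ("clf", LogisticRegression(max_iter=1000)),     # assumed classifier
    ])
    # Assumed grid: number of features kept and regularization strength.
    param_grid = {"select__k": [2, 5, 10], "clf__C": [0.1, 1.0, 10.0]}

    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)

    # Inner loop: choose k and C. Outer loop: score the tuned pipeline on
    # held-out folds that played no part in selection or tuning.
    search = GridSearchCV(pipe, param_grid, cv=inner, scoring="accuracy")
    return cross_val_score(search, X, y, cv=outer, scoring="accuracy")


# Small simulated dataset: 60 samples, 50 features, 5 of them informative.
X, y = make_classification(
    n_samples=60, n_features=50, n_informative=5, n_redundant=0, random_state=0
)
scores = nested_cv_accuracy(X, y)
print("outer-fold accuracies:", np.round(scores, 2))
print("mean nested CV accuracy:", round(float(scores.mean()), 2))
```

Repeating this at several sample sizes would give the nested cross-validation estimates to which a learning curve could then be fitted, in the spirit of recommendation (3).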