University of California at Los Angeles, Los Angeles, CA, USA.
Division of Artificial Intelligence in Medicine, Departments of Medicine and Cardiology, Cedars-Sinai Medical Center, 8700 Beverly Boulevard, Ste. A047N, Los Angeles, CA, USA.
Sci Rep. 2021 Jul 14;11(1):14490. doi: 10.1038/s41598-021-93651-5.
As machine learning research in cardiovascular imaging continues to grow, reliable model performance estimates are critical for establishing sound baselines and comparing algorithms. Although the machine learning community generally regards methods such as stratified k-fold cross-validation (CV) as more rigorous than single-split validation, single-split validation remains standard practice in medical research. This is especially concerning given the relatively small sample sizes of cardiovascular imaging datasets. We examine how train-test split variation affects the stability of machine learning (ML) model performance estimates under several validation techniques on two real-world cardiovascular imaging datasets: stratified split-sample validation (70/30 and 50/50 train-test splits), tenfold stratified CV, 10 × repeated tenfold stratified CV, bootstrapping (500 × repeated), and leave-one-out (LOO) validation. We demonstrate that split-sample validation produces the widest range in AUC and statistically significant differences between ROC curves, unlike the other approaches. When predictive models are built on relatively small datasets, as is often the case in medical imaging, split-sample validation can produce unstable performance estimates, with AUC values ranging over more than 0.15; any of the alternative validation methods is therefore recommended.
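The contrast between split-sample validation and stratified k-fold CV can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic 100-sample dataset, the nearest-centroid classifier, the 50 repetitions, and all function names are hypothetical. It covers only two of the validation schemes listed above (a 70/30 stratified split and tenfold stratified CV with pooled out-of-fold scores) and reports how far the AUC moves when only the random partition changes.

```python
import random


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney statistic); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n=100, seed=0):
    """Hypothetical small dataset: two Gaussian features, balanced classes."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2
        data.append(([rng.gauss(0.8 * y, 1.0), rng.gauss(0.4 * y, 1.0)], y))
    return data


def fit_centroids(train):
    """Stand-in model (an assumption): per-class mean feature vector."""
    cents = {}
    for c in (0, 1):
        rows = [x for x, y in train if y == c]
        cents[c] = [sum(col) / len(col) for col in zip(*rows)]
    return cents


def score(cents, x):
    """Higher score means x lies closer to the class-1 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
    return d0 - d1


def split_auc(data, test_frac, rng):
    """One stratified train-test split; AUC on the held-out part."""
    train, test = [], []
    for c in (0, 1):
        rows = [d for d in data if d[1] == c]
        rng.shuffle(rows)
        n_test = round(test_frac * len(rows))
        test += rows[:n_test]
        train += rows[n_test:]
    cents = fit_centroids(train)
    return auc([y for _, y in test], [score(cents, x) for x, _ in test])


def cv_auc(data, k, rng):
    """Stratified k-fold CV; AUC on the pooled out-of-fold scores."""
    folds = [[] for _ in range(k)]
    for c in (0, 1):
        rows = [d for d in data if d[1] == c]
        rng.shuffle(rows)
        for i, d in enumerate(rows):
            folds[i % k].append(d)  # deal each class round-robin
    labels, scores = [], []
    for i, test in enumerate(folds):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        cents = fit_centroids(train)
        labels += [y for _, y in test]
        scores += [score(cents, x) for x, _ in test]
    return auc(labels, scores)


def auc_range(estimator, repeats=50):
    """Spread (max - min) of the estimate over different random partitions."""
    values = [estimator(random.Random(seed)) for seed in range(repeats)]
    return max(values) - min(values)


data = make_data()
range_split = auc_range(lambda rng: split_auc(data, 0.3, rng))
range_cv = auc_range(lambda rng: cv_auc(data, 10, rng))
print(f"AUC range, 70/30 split: {range_split:.3f}")
print(f"AUC range, tenfold CV:  {range_cv:.3f}")
```

The dataset is held fixed and only the partitioning seed changes, so the two printed ranges isolate the variability that comes from the choice of split. The exact numbers depend on the synthetic data and should not be read as reproducing the paper's results.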