School of Project Management, Faculty of Engineering, The University of Sydney, Forest Lodge, NSW, Australia.
PLoS One. 2024 Apr 18;19(4):e0301541. doi: 10.1371/journal.pone.0301541. eCollection 2024.
Many individual studies have reported that tree-based machine learning (ML) algorithms outperform other algorithms. However, the current body of literature lacks statistical validation of this superiority. This study addresses this gap by applying five ML algorithms to 200 open-access datasets from a wide range of research contexts to test statistically whether tree-based ML algorithms outperform their counterparts. Specifically, it examines two tree-based ML algorithms (Decision tree and Random forest) and three non-tree-based ML algorithms (Support vector machine, Logistic regression and k-nearest neighbour). Results from paired-sample t-tests show that both tree-based ML algorithms perform better than each non-tree-based ML algorithm on all four ML performance measures considered in this study (accuracy, precision, recall and F1 score), each at the p<0.001 significance level. This performance superiority is consistent across both the model development and test phases. For further validation, this study also applied paired-sample t-tests to two subsets of the research datasets, drawn from the disease prediction (66 datasets) and university-ranking (50 datasets) research contexts. The observed superiority of the tree-based ML algorithms holds for these subsets: tree-based ML algorithms significantly outperformed non-tree-based algorithms in both research contexts on all four performance measures. We discuss the research implications of these findings in detail in this article.
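To illustrate the kind of comparison described above, the following is a minimal sketch of a paired-sample t statistic computed over per-dataset scores for two algorithms. It is not the authors' code: the function name `paired_t_statistic` and the accuracy values are hypothetical and chosen only for illustration. The sketch returns the t statistic and its degrees of freedom; a p-value would then be read from a t distribution with those degrees of freedom (for example with a statistics library).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    scores_a[i] and scores_b[i] must be the scores of two algorithms
    on the same dataset i (hypothetical helper, for illustration only).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two pairs are required")

    # Per-dataset differences are the quantity the paired test analyses.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up accuracies, one value per dataset (not results from the study).
random_forest_acc = [0.91, 0.88, 0.95, 0.80, 0.86, 0.93]
logistic_reg_acc = [0.87, 0.85, 0.90, 0.78, 0.85, 0.88]

t_stat, dof = paired_t_statistic(random_forest_acc, logistic_reg_acc)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A positive t statistic here means the first algorithm scored higher on average across the paired datasets. In a study like this one, the same calculation would be repeated for each pair of algorithms and each performance measure.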