Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data.

Affiliation

School of Project Management, Faculty of Engineering, The University of Sydney, Forest Lodge, NSW, Australia.

Publication information

PLoS One. 2024 Apr 18;19(4):e0301541. doi: 10.1371/journal.pone.0301541. eCollection 2024.

Abstract

Many individual studies in the literature have observed the superiority of tree-based machine learning (ML) algorithms. However, the current body of literature lacks statistical validation of this superiority. This study addresses this gap by applying five ML algorithms to 200 open-access datasets from a wide range of research contexts to statistically confirm the superiority of tree-based ML algorithms over their counterparts. Specifically, it examines two tree-based ML algorithms (Decision tree and Random forest) and three non-tree-based ML algorithms (Support vector machine, Logistic regression and k-nearest neighbour). Results from paired-sample t-tests show that both tree-based ML algorithms perform better than each non-tree-based ML algorithm on the four ML performance measures considered in this study (accuracy, precision, recall and F1 score), each at the p<0.001 significance level. This performance superiority is consistent across both the model development and test phases. For further validation, this study also applied paired-sample t-tests to subsets of the research datasets from the disease prediction (66) and university-ranking (50) research contexts. The observed superiority of the tree-based ML algorithms remains valid for these subsets: tree-based ML algorithms significantly outperformed non-tree-based algorithms in both research contexts on all four performance measures. We discuss the research implications of these findings in detail in this article.
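To make the paired-sample t-test mentioned in the abstract concrete, the following is a minimal sketch in Python, not the authors' code. It assumes that each pair consists of the scores two algorithms achieve on the same dataset, which the abstract implies but does not spell out. The function name `paired_t_statistic` and the accuracy values are hypothetical, invented purely for illustration.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    scores_a[i] and scores_b[i] must be measured on the same dataset i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")

    # The test works on the per-dataset differences, not the raw scores.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-dataset accuracies, invented for illustration only.
random_forest_acc = [0.91, 0.85, 0.78, 0.88, 0.93, 0.81]
logistic_reg_acc = [0.87, 0.80, 0.79, 0.83, 0.90, 0.76]

t_stat, dof = paired_t_statistic(random_forest_acc, logistic_reg_acc)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The p-value follows from Student's t distribution with `dof` degrees of freedom. The sketch stops at the statistic because the Python standard library has no t-distribution CDF; where SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value. Under the per-dataset pairing assumed here, the full study would have 200 pairs for each algorithm comparison and performance measure.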

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2da1/11025817/1bc70832a69a/pone.0301541.g001.jpg
