Hancock John T, Khoshgoftaar Taghi M
Florida Atlantic University, 777 Glades Road, Boca Raton, FL, USA.
J Big Data. 2020;7(1):94. doi: 10.1186/s40537-020-00369-8. Epub 2020 Nov 4.
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current GBDT implementations in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to draw best practices both from studies that cast CatBoost in a positive light and from studies where CatBoost does not outshine other techniques, since both types of scenarios offer lessons. Furthermore, as a Decision Tree-based algorithm, CatBoost is well suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we identify in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach that covers studies related to CatBoost in a single work. This gives researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.