
CatBoost for big data: an interdisciplinary review.

Authors

Hancock John T, Khoshgoftaar Taghi M

Affiliation

Florida Atlantic University, 777 Glades Road, Boca Raton, FL, USA.

Publication

J Big Data. 2020;7(1):94. doi: 10.1186/s40537-020-00369-8. Epub 2020 Nov 4.

DOI: 10.1186/s40537-020-00369-8
PMID: 33169094
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7610170/
Abstract

Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree-based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/ec31a6e7e5b0/40537_2020_369_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/e0bd58ad84eb/40537_2020_369_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/05924e43cb5c/40537_2020_369_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/42189e3aeb66/40537_2020_369_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/ec6e38786ee6/40537_2020_369_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/5e66ceacc726/40537_2020_369_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/4bbbbd9dd73c/40537_2020_369_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/57142e9b2f61/40537_2020_369_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7158/7610170/833b3fafd4db/40537_2020_369_Fig9_HTML.jpg

Similar articles

1
CatBoost for big data: an interdisciplinary review.
J Big Data. 2020;7(1):94. doi: 10.1186/s40537-020-00369-8. Epub 2020 Nov 4.
2
Estimation of tetracycline antibiotic photodegradation from wastewater by heterogeneous metal-organic frameworks photocatalysts.
Chemosphere. 2022 Jan;287(Pt 2):132135. doi: 10.1016/j.chemosphere.2021.132135. Epub 2021 Sep 2.
3
Enhancing Intrusion Detection in Wireless Sensor Networks Using a GSWO-CatBoost Approach.
Sensors (Basel). 2024 May 23;24(11):3339. doi: 10.3390/s24113339.
4
An Improved CatBoost-Based Classification Model for Ecological Suitability of Blueberries.
Sensors (Basel). 2023 Feb 6;23(4):1811. doi: 10.3390/s23041811.
5
Machine Learning-Derived Prenatal Predictive Risk Model to Guide Intervention and Prevent the Progression of Gestational Diabetes Mellitus to Type 2 Diabetes: Prediction Model Development Study.
JMIR Diabetes. 2022 Jul 5;7(3):e32366. doi: 10.2196/32366.
6
Research on an identification model for mine water inrush sources based on the HBA-CatBoost algorithm.
Sci Rep. 2024 Oct 9;14(1):23508. doi: 10.1038/s41598-024-74417-1.
7
Prediction of Gestational Diabetes Mellitus under Cascade and Ensemble Learning Algorithm.
Comput Intell Neurosci. 2022 Jul 14;2022:3212738. doi: 10.1155/2022/3212738. eCollection 2022.
8
Diurnal Pain Classification in Critically Ill Patients using Machine Learning on Accelerometry and Analgesic Data.
IEEE Int Conf Bioinform Biomed Workshops. 2023 Dec;2023:2207-2212. doi: 10.1109/bibm58861.2023.10385764. Epub 2024 Jan 18.
9
Predicting Fetal Alcohol Spectrum Disorders Using Machine Learning Techniques: Multisite Retrospective Cohort Study.
J Med Internet Res. 2023 Jul 18;25:e45041. doi: 10.2196/45041.
10
Research on Wind Turbine Fault Detection Based on the Fusion of ASL-CatBoost and TtRSA.
Sensors (Basel). 2023 Jul 28;23(15):6741. doi: 10.3390/s23156741.

Cited by

1
Multi-Condition Classification of Oil Spill in Ice Areas Based on Laser-Induced Fluorescence.
J Fluoresc. 2025 Sep 16. doi: 10.1007/s10895-025-04547-w.
2
Artificial Intelligence in Ocular Transcriptomics: Applications of Unsupervised and Supervised Learning.
Cells. 2025 Aug 26;14(17):1315. doi: 10.3390/cells14171315.
3
An interpretable LightGBM model for predicting coronary heart disease: Enhancing clinical decision-making with machine learning.
PLoS One. 2025 Sep 12;20(9):e0330377. doi: 10.1371/journal.pone.0330377. eCollection 2025.
4
Interpretable Machine Learning for Predicting Neoadjuvant Chemotherapy Response in Breast Cancer Using the Baseline Clinical and Pathological Characteristics.
Cancer Med. 2025 Sep;14(17):e71221. doi: 10.1002/cam4.71221.
5
Identification of sepsis biomarkers through glutamine metabolism-mediated immune regulation: a comprehensive analysis employing mendelian randomization, multi-omics integration, and machine learning.
Front Immunol. 2025 Aug 20;16:1640425. doi: 10.3389/fimmu.2025.1640425. eCollection 2025.
6
Improving attachment style clustering with ROCKET and CatBoost: Insights from EEG analysis.
PLoS One. 2025 Sep 2;20(9):e0331112. doi: 10.1371/journal.pone.0331112. eCollection 2025.
7
Machine learning techniques in hepatic encephalopathy: a scoping review.
BMC Med Inform Decis Mak. 2025 Sep 1;25(1):323. doi: 10.1186/s12911-025-03168-4.
8
Decoding the adolescent non-suicidal self-injury: understanding with interpretable machine learning insights.
BMC Public Health. 2025 Sep 1;25(1):2994. doi: 10.1186/s12889-025-24354-z.
9
Prediction of QTc Prolongation in Acute Poisoning with Atypical Antipsychotics Using Machine Learning Techniques: A Study from Poison Control Center.
Cardiovasc Toxicol. 2025 Aug 30. doi: 10.1007/s12012-025-10055-x.
10
Leveraging advanced ensemble learning techniques for methane uptake prediction in metal organic frameworks.
Sci Rep. 2025 Aug 29;15(1):31832. doi: 10.1038/s41598-025-17028-8.

References

1
Machine learning identifies the dynamics and influencing factors in an auditory category learning experiment.
Sci Rep. 2020 Apr 16;10(1):6548. doi: 10.1038/s41598-020-61703-x.
2
Prediction Model of Aryl Hydrocarbon Receptor Activation by a Novel QSAR Approach, DeepSnap-Deep Learning.
Molecules. 2020 Mar 13;25(6):1317. doi: 10.3390/molecules25061317.
3
A Novel Fracture Prediction Model Using Machine Learning in a Community-Based Cohort.
JBMR Plus. 2020 Feb 10;4(3):e10337. doi: 10.1002/jbm4.10337. eCollection 2020 Mar.
4
Performance Analysis of Boosting Classifiers in Recognizing Activities of Daily Living.
Int J Environ Res Public Health. 2020 Feb 8;17(3):1082. doi: 10.3390/ijerph17031082.
5
Construction and Analysis of Molecular Association Network by Combining Behavior Representation and Node Attributes.
Front Genet. 2019 Nov 7;10:1106. doi: 10.3389/fgene.2019.01106. eCollection 2019.
6
Reconstructing commuters network using machine learning and urban indicators.
Sci Rep. 2019 Aug 13;9(1):11801. doi: 10.1038/s41598-019-48295-x.
7
The Use of Data Mining Methods for the Prediction of Dementia: Evidence From the English Longitudinal Study of Aging.
IEEE J Biomed Health Inform. 2020 Feb;24(2):345-353. doi: 10.1109/JBHI.2019.2921418. Epub 2019 Jun 6.
8
CT-based machine learning model to predict the Fuhrman nuclear grade of clear cell renal cell carcinoma.
Abdom Radiol (NY). 2019 Jul;44(7):2528-2534. doi: 10.1007/s00261-019-01992-7.
9
Roles, Functions, and Mechanisms of Long Non-coding RNAs in Cancer.
Genomics Proteomics Bioinformatics. 2016 Feb;14(1):42-54. doi: 10.1016/j.gpb.2015.09.006. Epub 2016 Feb 12.
10
Learning Nonlinear Functions Using Regularized Greedy Forest.
IEEE Trans Pattern Anal Mach Intell. 2014 May;36(5):942-54. doi: 10.1109/TPAMI.2013.159.