New Splitting Criteria for Decision Trees in Stationary Data Streams.

Publication information

IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2516-2529. doi: 10.1109/TNNLS.2017.2698204. Epub 2017 May 10.

Abstract

The most popular tools for stream data mining are based on decision trees. Over the past 15 years, all the methods designed, headed by the very fast decision tree algorithm, relied on Hoeffding's inequality, and hundreds of researchers followed this scheme. Recently, we have demonstrated that although Hoeffding decision trees are an effective tool for dealing with stream data, they are a purely heuristic procedure; for example, classical decision trees such as ID3 or CART cannot be adapted to data stream mining using Hoeffding's inequality. Therefore, there is an urgent need to develop new algorithms that are both mathematically justified and characterized by good performance. In this paper, we address this problem by developing a family of new splitting criteria for classification in stationary data streams and investigating their probabilistic properties. The new criteria, derived using appropriate statistical tools, are based on the misclassification error and the Gini index impurity measures. A general division of splitting criteria into two types is proposed. Attributes chosen based on type-I splitting criteria guarantee, with high probability, the highest expected value of the split measure. Type-II criteria ensure that the chosen attribute is the same, with high probability, as the one that would be chosen based on the whole infinite data stream. Moreover, two hybrid splitting criteria are proposed, which are combinations of single criteria based on the misclassification error and the Gini index.
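As a rough illustration of the two impurity measures named in the abstract, the following minimal Python sketch computes the misclassification error and the Gini index for a set of class labels, and the impurity reduction obtained by splitting on a discrete attribute. It is not the paper's method: the function names (`gini`, `misclassification_error`, `split_gain`) are hypothetical, the data are held in memory rather than streamed, and the probabilistic bounds that define the paper's splitting criteria are not reproduced here.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def misclassification_error(labels):
    """Misclassification error: 1 minus the frequency of the majority class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / n


def split_gain(values, labels, impurity):
    """Impurity reduction from partitioning the sample by a discrete attribute.

    `values` holds the attribute value of each element, `labels` its class,
    and `impurity` is one of the two functions above (illustrative names).
    """
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average impurity of the child nodes.
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    informative = [0, 0, 1, 1]    # separates the classes perfectly
    uninformative = [0, 1, 0, 1]  # leaves each child as mixed as the parent
    for name, attribute in [("informative", informative),
                            ("uninformative", uninformative)]:
        print(name,
              split_gain(attribute, labels, gini),
              split_gain(attribute, labels, misclassification_error))
```

In a streaming setting, a tree learner would compare such split-measure values across attributes on the elements seen so far and split only once a statistical criterion (of the kind the paper derives) indicates that the best attribute is reliably ahead of the runner-up.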

Similar articles

New Splitting Criteria for Decision Trees in Stationary Data Streams.

IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2516-2529. doi: 10.1109/TNNLS.2017.2698204. Epub 2017 May 10.

A new method for data stream mining based on the misclassification error.

IEEE Trans Neural Netw Learn Syst. 2015 May;26(5):1048-59. doi: 10.1109/TNNLS.2014.2333557. Epub 2014 Jul 16.

Regularized impurity reduction: accurate decision trees with complexity guarantees.

Data Min Knowl Discov. 2023;37(1):434-475. doi: 10.1007/s10618-022-00884-7. Epub 2022 Nov 28.

Comparing Pearson, Spearman and Hoeffding's D measure for gene expression association analysis.

J Bioinform Comput Biol. 2009 Aug;7(4):663-84. doi: 10.1142/s0219720009004230.

Classifiability-based omnivariate decision trees.

IEEE Trans Neural Netw. 2005 Nov;16(6):1547-60. doi: 10.1109/TNN.2005.852864.

Splitting Choice and Computational Complexity Analysis of Decision Trees.

Entropy (Basel). 2021 Sep 24;23(10):1241. doi: 10.3390/e23101241.

Segment Based Decision Tree Induction With Continuous Valued Attributes.

IEEE Trans Cybern. 2015 Jul;45(7):1262-75. doi: 10.1109/TCYB.2014.2348012. Epub 2014 Sep 29.

ROSE: decision trees, automatic learning and their applications in cardiac medicine.

Medinfo. 1995;8 Pt 2:1688.

Hoeffding's inequality for general Markov chains with its applications to statistical learning.

J Mach Learn Res. 2021 Aug;22.

A Two-Parameter Fractional Tsallis Decision Tree.

Entropy (Basel). 2022 Apr 19;24(5):572. doi: 10.3390/e24050572.

Cited by

Quantitative Ultrasound-Based Precision Diagnosis of Papillary, Follicular, and Medullary Thyroid Carcinomas Using Morphological, Structural, and Textural Features.

Cancers (Basel). 2025 Aug 24;17(17):2761. doi: 10.3390/cancers17172761.

Analysis of anterior segment in primary angle closure suspect with deep learning models.

BMC Med Inform Decis Mak. 2024 Sep 9;24(1):251. doi: 10.1186/s12911-024-02658-1.

Unraveling the Role of Hydrogen Bonds in Thrombin via Two Machine Learning Methods.

J Chem Inf Model. 2023 Jun 26;63(12):3705-3718. doi: 10.1021/acs.jcim.3c00153. Epub 2023 Jun 7.

Risk prediction of cardiovascular disease using machine learning classifiers.

Open Med (Wars). 2022 Jun 17;17(1):1100-1113. doi: 10.1515/med-2022-0508. eCollection 2022.

Complement as Prognostic Biomarker and Potential Therapeutic Target in Renal Cell Carcinoma.

J Immunol. 2020 Dec 1;205(11):3218-3229. doi: 10.4049/jimmunol.2000511. Epub 2020 Nov 6.
