New Splitting Criteria for Decision Trees in Stationary Data Streams.

Publication information

IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2516-2529. doi: 10.1109/TNNLS.2017.2698204. Epub 2017 May 10.

Abstract

The most popular tools for stream data mining are based on decision trees. Over the previous 15 years, all designed methods, headed by the very fast decision tree (VFDT) algorithm, relied on Hoeffding's inequality, and hundreds of researchers followed this scheme. Recently, we have demonstrated that although Hoeffding decision trees are an effective tool for dealing with stream data, they are a purely heuristic procedure; for example, classical decision trees such as ID3 or CART cannot be adapted to data stream mining using Hoeffding's inequality. Therefore, there is an urgent need to develop new algorithms that are both mathematically justified and characterized by good performance. In this paper, we address this problem by developing a family of new splitting criteria for classification in stationary data streams and investigating their probabilistic properties. The new criteria, derived using appropriate statistical tools, are based on two impurity measures: the misclassification error and the Gini index. A general division of splitting criteria into two types is proposed. Attributes chosen based on type-I splitting criteria guarantee, with high probability, the highest expected value of the split measure. Type-II criteria ensure that the chosen attribute is, with high probability, the same as the one that would be chosen based on the whole infinite data stream. Moreover, two hybrid splitting criteria are proposed, which combine the single criteria based on the misclassification error and the Gini index.
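For orientation, the two impurity measures named in the abstract and the classic Hoeffding-style bound used by VFDT-like trees can be sketched as follows. This is a minimal illustration under stated assumptions, not the new criteria derived in the paper: the function names (`gini`, `misclassification_error`, `split_gain`, `hoeffding_bound`) and the binary-split setup are hypothetical, and the bound shown is the textbook form that the abstract describes as heuristic when applied to such split measures.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index impurity: 1 - sum_k p_k^2 over class frequencies p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def misclassification_error(labels):
    """Misclassification error impurity: 1 - max_k p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / n


def split_gain(labels, left, right, impurity):
    """Impurity reduction of a binary split (illustrative split measure)."""
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


def hoeffding_bound(value_range, delta, n):
    """Textbook Hoeffding bound: sqrt(R^2 * ln(1/delta) / (2n)).

    Shown only as the classic heuristic; the paper argues it is not
    rigorously applicable to split measures like these.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    left, right = [0, 0], [1, 1]  # a perfect split, for illustration
    print(split_gain(labels, left, right, gini))
    print(split_gain(labels, left, right, misclassification_error))
    print(hoeffding_bound(1.0, 0.05, 1000))
```

In a Hoeffding-style tree, a node would split on the best attribute once its gain exceeds the runner-up's gain by more than the bound; the paper's type-I and type-II criteria replace that bound with ones derived specifically for each impurity measure.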

