Influence of Explanatory Variable Distributions on the Behavior of the Impurity Measures Used in Classification Tree Learning.

Authors

Gajowniczek Krzysztof, Dudziński Marcin

Affiliation

Institute of Information Technology, Warsaw University of Life Sciences-SGGW, 02-787 Warszawa, Poland.

Publication information

Entropy (Basel). 2024 Nov 26;26(12):1020. doi: 10.3390/e26121020.

DOI:10.3390/e26121020

PMID:39766650

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11727596/

Abstract

The primary objective of our study is to analyze how the nature of explanatory variables influences the values and behavior of impurity measures, including the Shannon, Rényi, Tsallis, Sharma-Mittal, Sharma-Taneja, and Kapur entropies. We intend these measures for use in the interactive learning of decision trees, particularly in tie-breaking situations where an expert needs to make a decision. To cover a wide range of variability and properties, we simulate the values of the explanatory variables from six probability distributions: the normal, Cauchy, uniform, and exponential distributions, and two beta distributions. The binary responses are assumed to be generated from a logistic regression model. Results for all six distributions are presented in the same graphical format. The first two graphs depict histograms of the explanatory variable values and their corresponding probabilities generated by the given model. The remaining graphs present the individual impurity measures for different parameter values. To examine and discuss the behavior of the obtained results, we conduct a sensitivity analysis of the algorithms with respect to the entropy parameter values. We also demonstrate how certain explanatory variables affect the process of interactive tree learning.
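The simulation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the logistic coefficients `b0` and `b1`, the distribution parameters (including the single beta shape pair), the sample size, and the median split threshold are all assumptions chosen for illustration. Only the Shannon, Rényi, and Tsallis entropies are shown, in their standard textbook forms; the Sharma-Mittal, Sharma-Taneja, and Kapur entropies are omitted.

```python
import numpy as np


def shannon(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))


def renyi(p, alpha):
    """Renyi entropy in bits; assumes alpha > 0 and alpha != 1."""
    if alpha == 1:
        raise ValueError("alpha must differ from 1 (use shannon instead)")
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))


def tsallis(p, q):
    """Tsallis entropy; assumes q > 0 and q != 1."""
    if q == 1:
        raise ValueError("q must differ from 1")
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))


def simulate(dist, n, rng, b0=0.0, b1=1.0):
    """Draw x from the named distribution and y from a logistic model.

    The distribution parameters and the coefficients b0, b1 are
    illustrative assumptions, not values taken from the paper.
    """
    samplers = {
        "normal": lambda: rng.normal(0.0, 1.0, n),
        "cauchy": lambda: rng.standard_cauchy(n),
        "uniform": lambda: rng.uniform(-1.0, 1.0, n),
        "exponential": lambda: rng.exponential(1.0, n),
        "beta": lambda: rng.beta(2.0, 5.0, n),
    }
    x = samplers[dist]()
    # Numerically stable sigmoid: 1 / (1 + exp(-z)) == 0.5 * (1 + tanh(z / 2))
    prob = 0.5 * (1.0 + np.tanh(0.5 * (b0 + b1 * x)))
    y = rng.binomial(1, prob)
    return x, y


def class_probs(y):
    """Empirical class distribution [P(y=0), P(y=1)] of a binary vector."""
    p1 = float(np.mean(y))
    return np.array([1.0 - p1, p1])


def split_impurity(x, y, threshold, measure):
    """Size-weighted impurity of the two children of the split x <= threshold."""
    left = x <= threshold
    total = 0.0
    for mask in (left, ~left):
        if mask.any():
            total += mask.sum() / len(y) * measure(class_probs(y[mask]))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for dist in ("normal", "cauchy", "uniform", "exponential", "beta"):
        x, y = simulate(dist, 1000, rng)
        t = float(np.median(x))  # assumed candidate split point
        print(
            dist,
            round(split_impurity(x, y, t, shannon), 4),
            round(split_impurity(x, y, t, lambda p: renyi(p, 2.0)), 4),
            round(split_impurity(x, y, t, lambda p: tsallis(p, 2.0)), 4),
        )
```

Varying the `alpha` and `q` arguments over a grid, and scanning `threshold` over candidate split points, would give a rough analogue of the sensitivity analysis the abstract mentions.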