Song Yan-Yan, Lu Ying
Department of Pharmacology and Biostatistics, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Shanghai Arch Psychiatry. 2015 Apr 25;27(2):130-5. doi: 10.11919/j.issn.1002-0829.215044.
Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that form an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently handle large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset is used to decide on the tree size needed to achieve the optimal final model. This paper introduces algorithms frequently used to develop decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
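The training/validation procedure described in the abstract can be illustrated with a minimal sketch. It assumes a greedy binary tree that splits on Gini impurity (in the spirit of CART, not a faithful implementation of any of the named algorithms) and uses maximum depth as a simple stand-in for tree size. The synthetic data, the function names, and the candidate depths are all hypothetical choices made for illustration; this is not the SPSS or SAS workflow the paper describes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, max_depth):
    """Grow a tree recursively; a leaf is the majority class, a node is a tuple."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))

def predict(tree, row):
    """Walk from the root node to a leaf and return its class."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def choose_depth(train, valid, depths):
    """Fit one tree per depth on training data; score each on validation data."""
    scores = {d: accuracy(build_tree(*train, d), *valid) for d in depths}
    # Prefer the highest validation accuracy; break ties with the smaller tree.
    best = max(scores, key=lambda d: (scores[d], -d))
    return best, scores

# Hypothetical synthetic data: two covariates, a binary target, 10% label noise.
random.seed(0)
X = [(random.random(), random.random()) for _ in range(300)]
y = [int(a + b > 1.0) ^ int(random.random() < 0.1) for a, b in X]
train, valid = (X[:200], y[:200]), (X[200:], y[200:])

depths = list(range(0, 7))
best_depth, scores = choose_depth(train, valid, depths)
print("validation accuracy by depth:", scores)
print("selected depth:", best_depth)
```

Breaking ties in favor of the shallower tree reflects the usual preference for the simplest model that performs as well on the validation data. In practice, the same selection is often done through pruning rather than by refitting at each depth.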