Czajkowski Marcin, Grześ Marek, Kretowski Marek
Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, 15-351 Bialystok, Poland.
School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.
Artif Intell Med. 2014 May;61(1):35-44. doi: 10.1016/j.artmed.2014.01.005. Epub 2014 Feb 10.
A desirable property of tools used to investigate biological data is that their models and predictive decisions are easy to understand. Decision trees are particularly promising in this regard because their comprehensible structure resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees tend to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.
We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.
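The idea of placing several univariate tests in one non-terminal node can be illustrated with a minimal sketch. The abstract does not say how the tests in a node are combined, so the simple majority vote below is an assumption, and the names `UnivariateTest` and `multi_test_goes_left`, along with all thresholds and values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UnivariateTest:
    """A single-feature threshold test (hypothetical helper, not from the paper)."""
    feature: int      # column index of the gene being tested
    threshold: float  # illustrative cut-point on its expression level

    def goes_left(self, sample: Sequence[float]) -> bool:
        # True when the sample's expression value is at or below the threshold.
        return sample[self.feature] <= self.threshold


def multi_test_goes_left(tests: Sequence[UnivariateTest],
                         sample: Sequence[float]) -> bool:
    """Route a sample at one node using several univariate tests.

    Assumption: the tests are combined by strict majority vote.
    """
    votes = sum(t.goes_left(sample) for t in tests)
    return 2 * votes > len(tests)


# One node holding three tests on three different genes (made-up values).
tests = [UnivariateTest(0, 0.5), UnivariateTest(2, 1.0), UnivariateTest(3, -0.2)]
sample = [0.4, 9.9, 1.5, -0.3]
routed_left = multi_test_goes_left(tests, sample)
```

Under this assumption, a single noisy gene cannot flip the routing decision on its own, which is one plausible reading of why additional lower-ranked features could make predictions more stable.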
Experimental validation was performed on several real-life gene expression datasets. A comparison with eight classifiers shows that MTDT achieves statistically significantly higher accuracy than popular decision tree classifiers and is highly competitive with ensemble learning algorithms. The proposed solution outperformed its baseline algorithm on 14 datasets by an average of 6%. A study performed on one of the datasets showed that the genes used in the MTDT classification model are supported by biological evidence in the literature.
This paper introduces a new type of decision tree that is better suited to solving biological problems. MTDTs are relatively easy to analyze and much more powerful at modeling high-dimensional microarray data than their popular counterparts.