Division of Thoracic Surgery, Chang Gung Memorial Hospital at Linkou, Taoyuan, Taiwan.
Department of Information Management, National Central University, Taoyuan, Taiwan.
Technol Health Care. 2024;32(1):75-87. doi: 10.3233/THC-220514.
In practice, datasets collected for analysis are often incomplete because some records contain missing attribute values. Many related works focus on constructing specific models that estimate replacements for the missing values, so that the originally incomplete datasets become complete. Another type of solution handles the incomplete datasets directly, without missing value imputation, with decision trees being the major technique for this purpose.
The objective is to introduce a novel approach, namely Deep Learning-based Decision Tree Ensembles (DLDTE), which borrows the bounding box and sliding window strategies used in deep learning techniques to divide an incomplete dataset into a number of subsets and to learn from each subset with a decision tree, resulting in decision tree ensembles.
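The abstract does not give the algorithmic details of DLDTE, so the following is only a minimal sketch of the general idea under stated assumptions: a fixed-width window slides across the feature columns, each window keeps only the rows that are complete within it, one decision tree is trained per window, and predictions are combined by majority vote. The function names, the `window`, `stride`, and `min_rows` parameters, the voting rule, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_window_ensemble(X, y, window=4, stride=2, min_rows=5):
    """Train one tree per feature window, using only rows complete in that window.

    Assumption: windows slide over feature columns; the paper may define
    its subsets differently.
    """
    n_features = X.shape[1]
    ensemble = []
    for start in range(0, max(n_features - window, 0) + 1, stride):
        cols = np.arange(start, min(start + window, n_features))
        # Rows with no missing value inside the current window.
        complete = ~np.isnan(X[:, cols]).any(axis=1)
        # Skip windows with too few usable rows or only one class.
        if complete.sum() < min_rows or len(np.unique(y[complete])) < 2:
            continue
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[complete][:, cols], y[complete])
        ensemble.append((cols, tree))
    return ensemble


def predict_window_ensemble(ensemble, X, default_label):
    """Majority vote over the trees whose window is fully observed for a row."""
    preds = []
    for row in X:
        votes = [
            tree.predict(row[cols].reshape(1, -1))[0]
            for cols, tree in ensemble
            if not np.isnan(row[cols]).any()
        ]
        if votes:
            labels, counts = np.unique(votes, return_counts=True)
            preds.append(labels[np.argmax(counts)])
        else:
            # No tree can score this row; fall back to a caller-chosen label.
            preds.append(default_label)
    return np.array(preds)


# Synthetic illustration only: 200 samples, 12 features, about 20% missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + X[:, 5] > 0).astype(int)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

ensemble = fit_window_ensemble(X_missing, y)
preds = predict_window_ensemble(ensemble, X_missing, default_label=0)
```

No imputation step appears anywhere in this sketch: each tree only ever sees observed values, which is the property the direct-handling approach relies on.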
Two medical domain datasets containing several hundred feature dimensions, with missing rates of 10% to 50%, are used for performance comparison.
The proposed DLDTE provides the highest classification accuracy when compared with the baseline decision tree method, two missing value imputation methods (mean and k-nearest neighbor), and the case deletion method.
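For readers unfamiliar with the comparison methods, the following sketch shows how mean imputation, k-nearest neighbor imputation, and case deletion could each be set up with scikit-learn before training a decision tree. The synthetic data, the 10% missing rate, `n_neighbors=5`, and all variable names are assumptions for illustration; the abstract does not state the settings the authors used.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# Synthetic illustration only: 300 samples, 6 features, about 10% missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] > 0).astype(int)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Mean imputation: replace each missing value with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# k-nearest neighbor imputation: fill from the 5 most similar rows (assumed k).
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Case deletion: drop every row that has at least one missing value.
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_del, y_del = X_missing[complete_rows], y[complete_rows]

# Train one decision tree per preprocessed dataset.
baselines = {
    "mean": (X_mean, y),
    "knn": (X_knn, y),
    "deletion": (X_del, y_del),
}
trees = {
    name: DecisionTreeClassifier(random_state=0).fit(Xb, yb)
    for name, (Xb, yb) in baselines.items()
}
```

Case deletion shrinks the training set quickly as the missing rate grows, which is one reason it tends to be a weak baseline on datasets with many feature dimensions.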
The results demonstrate the effectiveness of DLDTE for handling incomplete medical datasets with different missing rates.