IEEE/ACM Trans Comput Biol Bioinform. 2022 Sep-Oct;19(5):2817-2828. doi: 10.1109/TCBB.2021.3089417. Epub 2022 Oct 10.
Ensemble methods such as random forests work well on high-dimensional datasets. However, when the number of features is extremely large compared to the number of samples and the percentage of truly informative features is very small, the performance of the traditional random forest declines significantly. To this end, we develop a novel approach that enhances the traditional random forest by reducing the contribution of trees whose nodes are populated with less informative features. The proposed method selects eligible feature subsets at each node by weighted random sampling, as opposed to the simple random sampling used in the traditional random forest. We refer to this modified random forest algorithm as "Enriched Random Forest". Using several high-dimensional microarray datasets, we evaluate the performance of our approach in both regression and classification settings. In addition, we demonstrate the effectiveness of balanced leave-one-out cross-validation in reducing the computational load and the sample size required when computing feature weights. Overall, the results indicate that the enriched random forest improves the prediction accuracy of the traditional random forest, especially when relevant features are very few.
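The core modification described above — drawing each node's candidate feature subset by weighted rather than simple random sampling — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_feature_subset`, the toy weights, and the subset size `mtry` are all assumptions chosen for demonstration.

```python
import numpy as np

def sample_feature_subset(weights, mtry, rng):
    """Draw `mtry` candidate features without replacement, with
    probability proportional to each feature's weight, so that
    informative (high-weight) features enter node splits more often
    than under simple random sampling."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # normalize weights into a probability distribution
    return rng.choice(len(p), size=mtry, replace=False, p=p)

rng = np.random.default_rng(0)
# Hypothetical weights: features 0-2 informative, the other 97 near-noise.
w = np.array([10.0, 8.0, 6.0] + [0.1] * 97)

# Count how often each feature is offered as a split candidate
# across 1000 simulated nodes.
counts = np.zeros(100)
for _ in range(1000):
    counts[sample_feature_subset(w, mtry=10, rng=rng)] += 1
print(counts[:3], counts[3:].mean())
```

Under simple random sampling each feature would be offered in roughly `mtry / p = 10%` of nodes; here the three informative features appear in nearly every candidate subset, which is the enrichment effect the abstract describes.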