Zhang Lujun, Wang Yanshan, Chen Jingwen, Chen Jun
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, United States.
Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, China.
Front Genet. 2022 Jan 24;12:749573. doi: 10.3389/fgene.2021.749573. eCollection 2021.
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose the "Random Forest Test" (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to assess whether the microbial communities are associated with the outcome variables.
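The general shape of a permutation test built on a prediction-error statistic can be sketched as follows. This is only an illustration of the idea, not the RFtest implementation: to stay dependency-free it swaps in a leave-one-out 1-nearest-neighbour misclassification rate for the random forest generalization error, and the names `loo_1nn_error` and `permutation_test`, the toy data, and the choice of 199 permutations are all assumptions made for this example.

```python
import random


def loo_1nn_error(X, y):
    """Stand-in error estimate (NOT a random forest): leave-one-out
    1-nearest-neighbour misclassification rate."""
    n = len(X)
    errors = 0
    for i in range(n):
        best_j, best_d = None, float("inf")
        for j in range(n):
            if j == i:
                continue
            # Squared Euclidean distance between samples i and j.
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            if d < best_d:
                best_j, best_d = j, d
        errors += y[best_j] != y[i]
    return errors / n


def permutation_test(X, y, error_fn=loo_1nn_error, n_perm=199, seed=0):
    """Permutation p-value for the null of no association between X and y.

    A smaller error means stronger evidence of association, so the p-value
    counts permutations whose error is at most the observed error.
    """
    rng = random.Random(seed)
    observed = error_fn(X, y)
    y_perm = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between features and outcome
        if error_fn(X, y_perm) <= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + count) / (1 + n_perm)


# Hypothetical toy data: one feature that cleanly separates two groups.
X = [[float(i % 2) * 10 + 0.01 * i] for i in range(40)]
y = [i % 2 for i in range(40)]
observed, p_value = permutation_test(X, y)
print(f"observed error = {observed:.3f}, permutation p-value = {p_value:.3f}")
```

Passing a different `error_fn`, for example one that returns the out-of-bag error of a fitted random forest, would bring the sketch closer to the test described in the abstract while leaving the permutation logic unchanged.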