Constructing bi-plots for random forest: Tutorial.

Author information

Department of Pharmacology and Toxicology, School of Nutrition, Toxicology and Translational Research in Metabolism (NUTRIM), Maastricht University Medical Center+, Maastricht, the Netherlands.

Laboratoire de Spectrochimie Infrarouge et Raman - LASIR CNRS - UMR 8516, Université de Lille, Bâtiment C5, F-59000, Lille, France; Molecular Imaging and Photonics Unit, Department of Chemistry, Katholieke Universiteit Leuven, Celestijnenlaan 200F, B-3001, Leuven, Belgium.

Publication information

Anal Chim Acta. 2020 Sep 22;1131:146-155. doi: 10.1016/j.aca.2020.06.043. Epub 2020 Jul 11.

Abstract

Current technological developments have allowed for a significant increase in the volume and availability of data. This has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms for a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field; they can be defined as approaches in which multiple, complex, independent/uncorrelated predictive models are combined by either averaging or voting to yield higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little need for model optimization. RF provides both a predictive model and an estimate of variable importance. However, this estimate is based on thousands of trees, and therefore it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for the construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated using two simulated datasets and three different types of real data, drawn from political science, food chemistry, and the human microbiome. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, of variable importance, and of the relations among variables.
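The abstract does not spell out how pseudo-samples are built, so the following is only a rough illustration under stated assumptions, not the authors' procedure. It assumes a pseudo-sample is an artificial observation in which one variable sweeps its observed range while every other variable is held at a fixed baseline (here the column median), and that the ensemble combines its members by majority vote. The data, the three threshold rules standing in for trees, and the function names `make_pseudo_samples` and `majority_vote` are all hypothetical.

```python
from statistics import median


def make_pseudo_samples(X, var_index, n_points=5):
    """Assumed construction: vary one variable over its observed range,
    holding all other variables at their column median."""
    n_vars = len(X[0])
    baseline = [median(row[j] for row in X) for j in range(n_vars)]
    lo = min(row[var_index] for row in X)
    hi = max(row[var_index] for row in X)
    step = (hi - lo) / (n_points - 1)
    pseudo = []
    for k in range(n_points):
        sample = list(baseline)            # copy the baseline values
        sample[var_index] = lo + k * step  # sweep the chosen variable
        pseudo.append(sample)
    return pseudo


def majority_vote(trees, sample):
    """Combine the class labels of several models by majority vote."""
    votes = [tree(sample) for tree in trees]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: four samples, two variables.
X = [[0.0, 1.0], [0.2, 3.0], [0.8, 2.0], [1.0, 4.0]]

# Hypothetical stand-in for a forest: three simple threshold rules.
trees = [
    lambda s: int(s[0] > 0.5),
    lambda s: int(s[0] > 0.3),
    lambda s: int(s[1] > 2.0),
]

# Pseudo-samples for variable 0 and the ensemble's prediction for each.
pseudo_v0 = make_pseudo_samples(X, var_index=0)
preds_v0 = [majority_vote(trees, s) for s in pseudo_v0]

for sample, pred in zip(pseudo_v0, preds_v0):
    print(sample, "->", pred)
```

Tracing how the prediction changes along each variable's sweep is what ties a variable to a region of the model's output; the bi-plots described in the abstract display this kind of per-variable trajectory alongside the real samples, though the projection step is not covered by this sketch.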
