Department of Pharmacology and Toxicology, School of Nutrition, Toxicology and Translational Research in Metabolism (NUTRIM), Maastricht University Medical Center+, Maastricht, the Netherlands.
Laboratoire de Spectrochimie Infrarouge et Raman - LASIR CNRS - UMR 8516, Université de Lille, Bâtiment C5, F-59000, Lille, France; Molecular Imaging and Photonics Unit, Department of Chemistry, Katholieke Universiteit Leuven, Celestijnenlaan 200F, B-3001, Leuven, Belgium.
Anal Chim Acta. 2020 Sep 22;1131:146-155. doi: 10.1016/j.aca.2020.06.043. Epub 2020 Jul 11.
Recent technological developments have greatly increased the volume and availability of data. This has opened enormous opportunities for machine learning and data science, leading to new algorithms for a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of machine learning. They can be defined as approaches in which multiple complex, independent/uncorrelated predictive models are combined, by either averaging or voting, to yield higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains because it builds predictive models with high certainty and requires little model optimization. RF provides both a predictive model and an estimate of variable importance. However, this estimate is based on thousands of trees and therefore does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows bi-plots (i.e., spin plots) to be constructed for RF models. The pseudo-sample principle for RF is explained and demonstrated using two simulated datasets and three different types of real data, drawn from political science, food chemistry, and the human microbiome. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, of the variable importance, and of the relations among the variables.
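The pseudo-sample idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it does not reproduce their bi-plots. It assumes scikit-learn's `RandomForestClassifier`, a small simulated dataset, and one common way of building pseudo-samples: one variable is varied over its observed range while the others are held at their medians. The helper `make_pseudo_samples` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_pseudo_samples(X, var_index, n_points=25):
    """Vary one variable over its observed range; hold the others at their median.

    Hypothetical helper: the median baseline is an assumption of this sketch.
    """
    grid = np.linspace(X[:, var_index].min(), X[:, var_index].max(), n_points)
    pseudo = np.tile(np.median(X, axis=0), (n_points, 1))
    pseudo[:, var_index] = grid
    return grid, pseudo


# Simulated two-class data in which only variable 0 separates the classes.
rng = np.random.default_rng(0)
n_samples, n_vars = 200, 4
X = rng.normal(size=(n_samples, n_vars))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Global importance: one number per variable, with no link to a sample group.
print("importances:", np.round(forest.feature_importances_, 3))

# Pseudo-sample trajectories: predicted probability of class 1 along each variable.
trajectories = {}
for j in range(n_vars):
    grid, pseudo = make_pseudo_samples(X, j)
    trajectories[j] = forest.predict_proba(pseudo)[:, 1]
    span = trajectories[j].max() - trajectories[j].min()
    print(f"variable {j}: probability span = {span:.2f}")
```

In this sketch, the trajectory of a variable shows the direction in which it pushes the prediction: low values of variable 0 should lead to class 0 and high values to class 1. That is the kind of group-specific information a single importance score does not carry. Projecting such trajectories into the same score space as the real samples, as a bi-plot does, is beyond the scope of this example.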