Department of Biology, Tufts University, 200 College Avenue, Medford, MA, 02155, USA.
Oecologia. 2021 May;196(1):13-25. doi: 10.1007/s00442-020-04848-w. Epub 2021 Feb 12.
Ecologists often collect data with the aim of determining which of many variables are associated with a particular cause or consequence. Unsupervised analyses (e.g., principal components analysis, PCA) summarize variation in the data, without regard to the response. Supervised analyses (e.g., partial least squares, PLS) evaluate the variables to find the combination that best explains a causal relationship. These approaches are not interchangeable, especially when the variables most responsible for a causal relationship are not the greatest source of overall variation in the data, a situation that ecologists are likely to encounter. To illustrate the differences between unsupervised and supervised techniques, we analyze a published dataset using both PCA and PLS and compare the questions and answers associated with each method. We also use simulated datasets representing situations that further illustrate differences between unsupervised and supervised analyses. For simulated data with many correlated variables that were unrelated to the response, PLS was better than PCA at identifying which variables were associated with the response. There are many applications for both unsupervised and supervised approaches in ecology. However, PCA is currently overused, at least in part because supervised approaches, such as PLS, are less familiar.
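The contrast described above can be illustrated with a minimal sketch. It is not the authors' simulation or code: the sample size, the number of variables, the noise levels, and the function name `top_variables` are all assumptions chosen for illustration. The sketch builds a block of correlated variables that are unrelated to the response, adds a few independent variables that drive the response, and then compares the first PCA loading vector with the first PLS weight vector (for a single response, the normalized covariance between each standardized predictor and the response).

```python
import numpy as np


def top_variables(seed=0, n=500, n_nuisance=15, n_signal=3, k=3):
    """Return the k variables ranked highest by PCA and by PLS (hypothetical setup)."""
    rng = np.random.default_rng(seed)

    # Nuisance block: many variables sharing one latent factor, so they are
    # strongly correlated with each other but unrelated to the response.
    latent = rng.normal(size=(n, 1))
    nuisance = latent + 0.3 * rng.normal(size=(n, n_nuisance))

    # Signal block: a few independent variables that determine the response.
    signal = rng.normal(size=(n, n_signal))
    y = signal.sum(axis=1) + 0.5 * rng.normal(size=n)

    # Columns 0..n_nuisance-1 are nuisance; the remaining columns are signal.
    X = np.hstack([nuisance, signal])
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize predictors
    yc = y - y.mean()                          # center the response

    # Unsupervised: first principal component loadings from the SVD of Xs.
    _, _, vt = np.linalg.svd(Xs, full_matrices=False)
    pca_loadings = vt[0]

    # Supervised: first PLS weight vector for a single response,
    # proportional to the covariance of each predictor with y.
    w = Xs.T @ yc
    pls_weights = w / np.linalg.norm(w)

    top_pca = np.argsort(-np.abs(pca_loadings))[:k]
    top_pls = np.argsort(-np.abs(pls_weights))[:k]
    return sorted(top_pca.tolist()), sorted(top_pls.tolist())


top_pca, top_pls = top_variables()
print("Top variables by |PC1 loading|:", top_pca)
print("Top variables by |PLS weight| :", top_pls)
```

Under these assumed settings, the first principal component is dominated by the correlated nuisance block, because that block carries most of the overall variation, while the first PLS weight vector is dominated by the variables that actually drive the response. A full analysis would use more than one component and a validated PLS implementation; this sketch covers only the first component of each method.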