Department of Biology, Terrestrial Ecology Unit, Ghent University, Ghent, Belgium.
Plant Biol (Stuttg). 2012 Mar;14(2):271-7. doi: 10.1111/j.1438-8677.2011.00497.x. Epub 2011 Aug 23.
Selecting an appropriate variable subset in linear multivariate methods is an important methodological issue for ecologists. Interest often lies in obtaining general predictive capacity or in drawing causal inferences from predictor variables. Because solid knowledge of a studied phenomenon is often lacking, scientists explore predictor variables in order to find the most meaningful (i.e. discriminating) ones. As an example, we modelled the response of the amphibious softwater plant Eleocharis multicaulis using canonical discriminant function analysis. We asked how variables can best be selected by comparing several methods: univariate Pearson chi-square screening, principal components analysis (PCA) and step-wise analysis, as well as combinations of some of these methods. We expected PCA to perform best. The methods were evaluated through the fit and stability of the resulting discriminant functions and through correlations between these functions and the predictor variables. The chi-square subset, at P < 0.05, followed by a step-wise sub-selection, gave the best results. Contrary to expectations, PCA performed poorly, as did step-wise analysis. The different chi-square subset methods all yielded ecologically meaningful variables, whereas PCA and step-wise analysis also selected probable noise variables. We advise against the simple use of PCA or step-wise discriminant analysis to obtain an ecologically meaningful variable subset: the former because it does not take the response variable into account, the latter because noise variables are likely to be selected. We suggest that univariate screening techniques are a worthwhile alternative for variable selection in ecology.
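As an illustration of the univariate screening step only, the following is a minimal sketch, not the authors' implementation. It assumes a binary response (presence/absence of the species) and predictors that have already been dichotomized to 0/1; the abstract does not state how the predictors were coded. The function names `chi_square_2x2` and `screen`, the variable names `ph_low` and `noise`, and all data values are hypothetical. The subsequent step-wise sub-selection and the discriminant analysis itself are not shown.

```python
import math


def chi_square_2x2(x, y):
    """Pearson chi-square statistic and p-value (1 df) for two 0/1 sequences."""
    # Observed counts: table[x_value][y_value]
    table = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        table[xi][yi] += 1
    n = len(x)
    row = [sum(table[0]), sum(table[1])]
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    if 0 in row or 0 in col:
        # A constant variable cannot discriminate; treat it as non-significant.
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def screen(predictors, response, alpha=0.05):
    """Return the names of predictors whose chi-square test gives P < alpha."""
    selected = []
    for name, values in predictors.items():
        _, p_value = chi_square_2x2(values, response)
        if p_value < alpha:
            selected.append(name)
    return selected


# Hypothetical data: 20 sites, species present at the first 10.
response = [1] * 10 + [0] * 10
predictors = {
    # Strongly associated with presence (assumed for illustration).
    "ph_low": [1] * 9 + [0] + [0] * 9 + [1],
    # Evenly split within both groups, so unrelated to the response.
    "noise": [1, 0] * 10,
}

selected = screen(predictors, response)
print(selected)
```

With these made-up values only `ph_low` passes the P < 0.05 threshold. Because each predictor is tested against the response on its own, the screen uses information that PCA ignores, which is the contrast the abstract draws.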