Warton David I
School of Mathematics and Statistics and Evolution and Ecology Research Centre, The University of New South Wales, NSW 2052, Australia.
Biometrics. 2011 Mar;67(1):116-23. doi: 10.1111/j.1541-0420.2010.01438.x.
A modification of generalized estimating equations (GEEs) methodology is proposed for hypothesis testing of high-dimensional data, with particular interest in multivariate abundance data in ecology, an important application of interest in thousands of environmental science studies. Such data are typically counts characterized by high dimensionality (in the sense that cluster size exceeds number of clusters, n>K) and over-dispersion relative to the Poisson distribution. Usual GEE methods cannot be applied in this setting primarily because sandwich estimators become numerically unstable as n increases. We propose instead using a regularized sandwich estimator that assumes a common correlation matrix R, and shrinks the sample estimate of R toward the working correlation matrix to improve its numerical stability. It is shown via theory and simulation that this substantially improves the power of Wald statistics when cluster size is not small. We apply the proposed approach to study the effects of nutrient addition on nematode communities, and in doing so discuss important issues in implementation, such as using statistics that have good properties when parameter estimates approach the boundary, and using resampling to enable valid inference that is robust to high dimensionality and to possible model misspecification.
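The shrinkage step described in the abstract can be illustrated with a minimal sketch. It assumes the shrunk estimate is a convex combination of the sample correlation matrix and the working correlation matrix, with a tuning parameter `lam` in (0, 1]. The abstract does not state the exact form, so the function name `shrink_correlation`, the parameter `lam`, and the simulated residuals are all hypothetical choices made for illustration, not the paper's implementation.

```python
import numpy as np


def shrink_correlation(residuals, working_corr, lam):
    """Shrink the sample correlation of `residuals` toward `working_corr`.

    residuals    : (K, n) array, one row per cluster, one column per response
    working_corr : (n, n) working correlation matrix (e.g. the identity)
    lam          : weight on the sample estimate, 0 < lam <= 1 (assumed form)
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    # Sample correlation across the n responses, pooled over the K clusters.
    r_hat = np.corrcoef(residuals, rowvar=False)
    # Assumed convex-combination shrinkage toward the working correlation.
    return lam * r_hat + (1.0 - lam) * working_corr


# Simulated example with cluster size n exceeding number of clusters K.
rng = np.random.default_rng(0)
K, n = 10, 30
residuals = rng.standard_normal((K, n))

r_hat = np.corrcoef(residuals, rowvar=False)   # singular when n > K
r_shrunk = shrink_correlation(residuals, np.eye(n), lam=0.5)

rank_r_hat = np.linalg.matrix_rank(r_hat)
min_eig_shrunk = np.linalg.eigvalsh(r_shrunk).min()
```

When n > K the sample correlation matrix is rank deficient and cannot be inverted, which is one way the instability mentioned in the abstract shows up. Mixing in a positive definite working correlation keeps every eigenvalue of the shrunk matrix away from zero, so the matrix can be inverted. How `lam` should be chosen is not stated in the abstract and is left open here.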