Lipsitz Stuart, Fitzmaurice Garrett, Sinha Debajyoti, Hevelone Nathanael, Hu Jim, Nguyen Louis L
Brigham & Women's Hospital, Boston, MA.
Harvard Medical School, Boston, MA.
J Comput Graph Stat. 2017;26(3):734-737. doi: 10.1080/10618600.2017.1321552. Epub 2017 Jul 27.
Medical studies increasingly involve a large sample of independent clusters, where the cluster sizes are also large. Our motivating example from the 2010 Nationwide Inpatient Sample (NIS) has 8,001,068 patients and 1049 clusters, with an average cluster size of 7627. Consistent parameter estimates can be obtained by naively assuming independence, but these estimates are inefficient when the intra-cluster correlation (ICC) is high. Efficient generalized estimating equations (GEE) incorporate the ICC, and estimating the ICC requires summing over all pairs of observations within each cluster. For the 2010 NIS, there are 92.6 billion pairs of observations, making summation over pairs computationally prohibitive. We propose a one-step GEE estimator that 1) matches the asymptotic efficiency of the fully iterated GEE; 2) uses a simpler formula to estimate the ICC that avoids summing over all pairs; and 3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this paper is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.
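To illustrate why a sum over all within-cluster pairs need not be computed pair by pair, the sketch below uses the standard algebraic identity sum_{j<k} r_j r_k = ((sum_j r_j)^2 - sum_j r_j^2) / 2, which reduces the cost per cluster from quadratic to linear in the cluster size. This is an assumption-laden illustration, not the formula from the paper: the function names are hypothetical, the residuals are assumed to be already standardized, and the ICC estimate shown is a simple moment-type estimate for an exchangeable correlation with no degrees-of-freedom correction.

```python
def pair_sum_naive(r):
    """Sum r[j] * r[k] over all pairs j < k by explicit enumeration: O(n^2)."""
    total = 0.0
    n = len(r)
    for j in range(n):
        for k in range(j + 1, n):
            total += r[j] * r[k]
    return total


def pair_sum_fast(r):
    """Same quantity via ((sum r)^2 - sum r^2) / 2: O(n), no pairs enumerated."""
    s = sum(r)
    ss = sum(x * x for x in r)
    return 0.5 * (s * s - ss)


def exchangeable_icc(residuals_by_cluster):
    """Moment-type estimate of a common within-cluster correlation.

    Assumes each inner list holds standardized residuals for one cluster
    (hypothetical input layout). No degrees-of-freedom correction is applied.
    """
    numerator = 0.0
    n_pairs = 0
    for r in residuals_by_cluster:
        n = len(r)
        numerator += pair_sum_fast(r)
        n_pairs += n * (n - 1) // 2  # number of pairs in this cluster
    return numerator / n_pairs


# Both routes agree on a small cluster: pairs are 1*2 + 1*3 + 2*3 = 11.
print(pair_sum_naive([1.0, 2.0, 3.0]), pair_sum_fast([1.0, 2.0, 3.0]))
```

For a cluster of 7627 observations, the naive route touches 7627 × 7626 / 2 = 29,081,751 pairs, whereas the identity needs only two passes over the 7627 residuals.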