Takkouche B, Cadarso-Suárez C, Spiegelman D
Department of Preventive Medicine, School of Medicine, University of Santiago de Compostela, Spain.
Am J Epidemiol. 1999 Jul 15;150(2):206-15. doi: 10.1093/oxfordjournals.aje.a009981.
The identification of heterogeneity in effects between studies is a key issue in meta-analyses of observational studies, since it is critical for determining whether it is appropriate to pool the individual results into one summary measure. The result of a hypothesis test is often used as the decision criterion. In this paper, the authors use a large simulation study patterned on the key features of five published epidemiologic meta-analyses to investigate the type I error and statistical power of five previously proposed asymptotic homogeneity tests, a parametric bootstrap version of each of the tests, and tau2-bootstrap, a test proposed by the authors. The results show that the asymptotic DerSimonian and Laird Q statistic and the bootstrap versions of the other tests give the correct type I error under the null hypothesis but that all of the tests considered have low statistical power, especially when the number of studies included in the meta-analysis is small (fewer than 20). From the point of view of validity, power, and computational ease, the Q statistic is clearly the best choice. The authors found that neither the size nor the power of any of the tests considered depended appreciably upon the value of the pooled odds ratio. Because tests for heterogeneity will often be underpowered, random effects models can be used routinely, and heterogeneity can be quantified by means of R(I), the proportion of the total variance of the pooled effect measure due to between-study variance, and CV(B), the between-study coefficient of variation.
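The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the study effects are log odds ratios with known within-study variances, uses the standard DerSimonian and Laird moment estimator for tau2, and simulates under the null hypothesis to obtain a parametric bootstrap p-value for Q. The abstract does not give formulas for R(I) or CV(B); the sketch assumes R(I) = tau2 / (tau2 + s2), with s2 taken as the harmonic mean of the within-study variances, and CV(B) = sqrt(tau2) / |pooled random-effects estimate|. All function names are hypothetical.

```python
import math
import random


def q_statistic(effects, variances):
    """Return Cochran's Q and the inverse-variance (fixed-effect) pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, pooled


def dl_tau2(effects, variances):
    """DerSimonian and Laird moment estimate of the between-study variance."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    q, _ = q_statistic(effects, variances)
    sum_w = sum(weights)
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - (k - 1)) / c)


def heterogeneity_summary(effects, variances):
    """Q, tau2, and the assumed forms of R(I) and CV(B) described in the lead-in."""
    k = len(effects)
    q, _ = q_statistic(effects, variances)
    tau2 = dl_tau2(effects, variances)
    # Random-effects pooled estimate: weights include the between-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled_re = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    # Assumption: "typical" within-study variance = harmonic mean of the variances.
    s2 = k / sum(1.0 / v for v in variances)
    r_i = tau2 / (tau2 + s2)
    # Assumption: CV(B) = between-study SD relative to the pooled effect.
    cv_b = math.sqrt(tau2) / abs(pooled_re) if pooled_re != 0 else math.inf
    return {"Q": q, "tau2": tau2, "pooled_re": pooled_re, "R_I": r_i, "CV_B": cv_b}


def bootstrap_q_pvalue(effects, variances, n_boot=2000, seed=0):
    """Parametric bootstrap p-value for Q under the null of homogeneity.

    Each replicate draws study effects from a normal distribution centred on
    the fixed-effect pooled estimate with the observed within-study variances.
    """
    rng = random.Random(seed)
    q_obs, pooled = q_statistic(effects, variances)
    exceed = 0
    for _ in range(n_boot):
        simulated = [rng.gauss(pooled, math.sqrt(v)) for v in variances]
        q_sim, _ = q_statistic(simulated, variances)
        if q_sim >= q_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    # Hypothetical log odds ratios and variances, for illustration only.
    log_ors = [0.10, 0.45, -0.05, 0.60, 0.30]
    var_log_ors = [0.04, 0.09, 0.02, 0.12, 0.05]
    print(heterogeneity_summary(log_ors, var_log_ors))
    print(bootstrap_q_pvalue(log_ors, var_log_ors))
```

With few studies the p-value from either the asymptotic or the bootstrap version of Q says little, which is the abstract's point: reporting tau2 together with descriptive measures such as R(I) and CV(B) conveys the extent of heterogeneity without relying on an underpowered test.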