Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
Epidemiology. 2011 May;22(3):382-9. doi: 10.1097/EDE.0b013e3182125cff.
Studies of ecologic or aggregate data suffer from a broad range of biases when scientific interest lies with individual-level associations. To overcome these biases, epidemiologists can choose from a range of designs that combine group-level data with individual-level data. The individual-level data provide information to identify, evaluate, and control bias, whereas the group-level data are often readily accessible and provide gains in efficiency and power. Within this context, the literature on developing models, particularly multilevel models, is well established, but little work has been published to help researchers choose among competing designs and plan additional data collection.
We review recently proposed "combined" group- and individual-level designs and methods that collect and analyze data at 2 levels of aggregation. These include aggregate data designs, hierarchical related regression, two-phase designs, and hybrid designs for ecologic inference.
The various methods differ in (i) the data elements available at the group and individual levels and (ii) the statistical techniques used to combine the 2 data sources. Implementing these techniques requires care, and it may often be simpler to ignore the group-level data once the individual-level data are collected. A simulation study, based on birth-weight data from North Carolina, is used to illustrate the benefit of incorporating group-level information.
Our focus is on settings where there are individual-level data to supplement readily accessible group-level data. In this context, no single design is ideal. Choosing which design to adopt depends primarily on the model of interest and the nature of the available group-level data.
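The ecologic bias that motivates these combined designs can be illustrated with a small simulation. The sketch below is not the authors' North Carolina birth-weight simulation. It is a minimal, hypothetical example in which every parameter value (number of groups, group size, baseline risks, and a true individual-level risk difference of 0.10) is an assumption chosen for illustration. Baseline risk is made to rise with group exposure prevalence, so a regression on group-level proportions overstates the individual-level association, whereas a within-group comparison using individual-level data recovers it.

```python
import numpy as np


def simulate(n_groups=50, n_per_group=2000, beta=0.10, seed=0):
    """Simulate binary exposure x and outcome y for individuals nested in groups.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Exposure prevalence differs across groups.
    prev = rng.uniform(0.1, 0.9, size=n_groups)
    # Baseline risk rises with prevalence: group-level confounding.
    baseline = 0.05 + 0.30 * prev
    x = (rng.random((n_groups, n_per_group)) < prev[:, None]).astype(float)
    # Linear risk model: the true individual-level risk difference is beta.
    risk = baseline[:, None] + beta * x
    y = (rng.random((n_groups, n_per_group)) < risk).astype(float)
    return x, y


def ecologic_slope(x, y):
    """Slope from regressing group outcome proportions on group exposure prevalence."""
    slope, _intercept = np.polyfit(x.mean(axis=1), y.mean(axis=1), 1)
    return slope


def within_group_slope(x, y):
    """Pooled within-group slope (group fixed effects) from individual-level data."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum() / (xc ** 2).sum()


if __name__ == "__main__":
    x, y = simulate()
    print("ecologic (group-level) slope:", round(ecologic_slope(x, y), 3))
    print("within-group (individual-level) slope:", round(within_group_slope(x, y), 3))
```

Under these assumptions the group-level slope is expected to be close to 0.40 (0.30 from the confounded baseline plus the true 0.10), while the within-group estimate is expected to be close to 0.10. The combined designs reviewed here aim to use both sources together rather than either one alone, which this sketch does not attempt.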