Schaubel D, Hanley J, Collet J P, Boivin J F, Sharpe C, Morrison H I, Mao Y
Department of Epidemiology and Biostatistics, Faculty of Medicine, McGill University, Montréal, Québec, Canada.
Am J Epidemiol. 1997 Sep 1;146(5):450-8. doi: 10.1093/oxfordjournals.aje.a009298.
Preexisting computerized databases are potentially valuable sources of epidemiologic data. Since such databases are infrequently created specifically for etiologic research, data may be available for the exposure of interest and, through record linkage, for the endpoint of interest, but lacking for potential confounders. Because of the size of these databases, two-stage sampling is an efficient alternative to surveying the entire study population for confounder data. At stage 1, information on exposure and disease status is obtained for the entire study population. Confounder data are collected for probability-selected subsamples at stage 2. Logistic regression is performed on the stage 2 samples, with the parameter estimates and variances appropriately corrected to account for the stage 1 data. In this paper, the authors present methods for determining the required stage 2 sample size in the case of categorical exposure and confounding variables. Sample size tables, power curves, and a computer program have been produced to accommodate a binary exposure and a single binary confounder. With the increasing availability of preexisting yet incomplete databases, the potential for use of two-stage sampling will greatly increase in the future. This investigation provides a basis for estimating the number of participants to sample for the collection of confounder data at the second stage.
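The two-stage design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' computer program or their corrected logistic regression. As a simpler stand-in, it weights each stage 2 subject by the inverse of the sampling fraction in their exposure-by-disease cell and computes a Mantel-Haenszel odds ratio across the two confounder strata. The simulated prevalences and risks, the equal-allocation rule (`per_cell`), and all function names are assumptions made for illustration only.

```python
import random


def simulate_population(n, rng):
    """Stage 1 (hypothetical data): exposure E and disease D for everyone.

    The confounder C is generated here too, but the analysis only "sees" it
    for subjects drawn at stage 2. All probabilities are illustrative.
    """
    people = []
    for _ in range(n):
        c = rng.random() < 0.3                      # binary confounder
        e = rng.random() < (0.5 if c else 0.2)      # C raises exposure prevalence
        risk = 0.05 * (2.0 if e else 1.0) * (3.0 if c else 1.0)
        d = rng.random() < risk                     # C and E both raise disease risk
        people.append((int(e), int(d), int(c)))
    return people


def two_stage_sample(people, per_cell, rng):
    """Stage 2: draw up to per_cell subjects from each (E, D) cell.

    Returns (e, d, c, weight) tuples, where weight is the inverse of the
    sampling fraction in that cell.
    """
    cells = {}
    for e, d, c in people:
        cells.setdefault((e, d), []).append(c)
    sample = []
    for (e, d), confounders in cells.items():
        k = min(per_cell, len(confounders))
        weight = len(confounders) / k
        for c in rng.sample(confounders, k):
            sample.append((e, d, c, weight))
    return sample


def weighted_mh_odds_ratio(sample):
    """Mantel-Haenszel odds ratio over confounder strata, using weighted counts."""
    num = den = 0.0
    for stratum in (0, 1):
        n = {(e, d): 0.0 for e in (0, 1) for d in (0, 1)}
        for e, d, c, w in sample:
            if c == stratum:
                n[(e, d)] += w
        total = sum(n.values())
        if total == 0:
            continue
        num += n[(1, 1)] * n[(0, 0)] / total
        den += n[(1, 0)] * n[(0, 1)] / total
    return num / den


if __name__ == "__main__":
    rng = random.Random(1997)
    population = simulate_population(20000, rng)
    stage2 = two_stage_sample(population, per_cell=200, rng=rng)
    print("stage 2 sample size:", len(stage2))
    print("weighted MH odds ratio:", round(weighted_mh_odds_ratio(stage2), 2))
```

Repeating the run for several values of `per_cell` and different seeds gives a rough, simulation-based sense of how the precision of the adjusted estimate changes with the stage 2 sample size. The paper addresses that question analytically, through its sample size tables and power curves.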