Amorim Gustavo, Tao Ran, Lumley Thomas, Shaw Pamela A, Shepherd Bryan E
Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA.
Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA.
Stat Med. 2025 Jul;44(15-17):e70111. doi: 10.1002/sim.70111.
Data collection procedures are often time-consuming and expensive. An alternative to collecting full information from all subjects enrolled in a study is a two-phase design: Variables that are inexpensive or easy to measure are obtained for the whole study population, and more specific, expensive, or hard-to-measure variables are collected only for a well-selected sample of individuals. Often, only those subjects who provided full information are used for inference, while those who were only partially observed are discarded from the analysis. Recently, semiparametric approaches that use the entire dataset, resulting in fully efficient estimators, have been proposed. These estimators, however, have difficulty incorporating multiple covariates, are computationally expensive, and depend on tuning parameters that affect their performance. In this paper, we propose an alternative semiparametric estimator that does not impose any distributional assumptions on the covariates or the measurement error mechanism and can be applied to a wider range of settings. Although the proposed estimator is not semiparametric efficient, simulations show that the loss of efficiency in estimating the parameters associated with the partially observed covariates is minimal. We highlight the estimator's applicability to real-world problems, where data structures are complex and rich, and complicated regression models are often necessary.
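To make the two-phase setup described in the abstract concrete, the following is a minimal sketch of the complete-case baseline that the abstract contrasts with semiparametric approaches. It is not the estimator proposed in the paper. Everything in it is an illustrative assumption: the linear outcome model, the coefficient values, the simple random sampling at phase 2 (the abstract speaks of a "well-selected" sample, which may be more elaborate), and the hypothetical function names `simulate_two_phase`, `solve_linear_system`, and `complete_case_ols`.

```python
import random


def simulate_two_phase(n=5000, n2=1000, beta=(1.0, 0.5, -0.3), seed=0):
    """Simulate a two-phase study (all modelling choices are assumptions).

    Phase 1: outcome y and inexpensive covariate z are recorded for all n subjects.
    Phase 2: expensive covariate x is kept only for a simple random subsample of n2.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                # inexpensive covariate
        x = 0.5 * z + rng.gauss(0.0, 1.0)      # expensive covariate, correlated with z
        y = beta[0] + beta[1] * x + beta[2] * z + rng.gauss(0.0, 1.0)
        rows.append({"y": y, "z": z, "x": x})
    phase2_ids = set(rng.sample(range(n), n2))
    for i, row in enumerate(rows):
        row["in_phase2"] = i in phase2_ids
        if not row["in_phase2"]:
            row["x"] = None                    # x is unobserved outside phase 2
    return rows


def solve_linear_system(a, b):
    """Solve a small square system a @ beta = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the row with the largest entry into place.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def complete_case_ols(rows):
    """Least-squares fit of y on (1, x, z) using phase-2 subjects only."""
    design = [(1.0, r["x"], r["z"]) for r in rows if r["in_phase2"]]
    response = [r["y"] for r in rows if r["in_phase2"]]
    k = 3
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


if __name__ == "__main__":
    data = simulate_two_phase()
    print("complete-case estimates (intercept, x, z):", complete_case_ols(data))
```

In this sketch the 4000 subjects without `x` contribute nothing to the fit, even though their `y` and `z` values were recorded. That discarded information is what estimators using the entire dataset aim to recover.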