Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Biometrics. 2022 Dec;78(4):1674-1685. doi: 10.1111/biom.13512. Epub 2021 Aug 1.
Persons living with HIV engage in routine clinical care, generating large amounts of data in observational HIV cohorts. These data are often error-prone, and directly using them in biomedical research could bias estimation and give misleading results. A cost-effective solution is the two-phase design, under which the error-prone variables are observed for all patients during Phase I, and that information is used to select patients for data auditing during Phase II. For example, the Caribbean, Central, and South America network for HIV epidemiology (CCASAnet) selected a random sample from each site for data auditing. Herein, we consider efficient odds ratio estimation with partially audited, error-prone data. We propose a semiparametric approach that uses all information from both phases and accommodates a number of error mechanisms. We allow both the outcome and covariates to be error-prone and these errors to be correlated, and selection of the Phase II sample can depend on Phase I data in an arbitrary manner. We devise a computationally efficient, numerically stable EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate the advantages of the proposed methods over existing ones through extensive simulations. Finally, we provide applications to the CCASAnet cohort.
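The two-phase design described above can be illustrated with a small simulation. The sketch below is not the semiparametric EM estimator proposed in the paper. It only shows why the problem matters by comparing a naive odds ratio computed from error-prone Phase I data with one computed from a randomly audited Phase II subsample. All names (`odds_ratio`, `simulate_two_phase`), the sample sizes, the 15% misclassification rate, and the true odds ratio of 3 are illustrative assumptions, and the errors here are generated independently, whereas the paper also allows correlated errors.

```python
import random


def odds_ratio(pairs):
    """Odds ratio from binary (exposure, outcome) pairs.

    Adds 0.5 to every cell (Haldane-Anscombe correction) so that an
    empty cell does not cause division by zero.
    """
    a = b = c = d = 0.5
    for x, y in pairs:
        if x and y:
            a += 1          # exposed, outcome present
        elif x and not y:
            b += 1          # exposed, outcome absent
        elif not x and y:
            c += 1          # unexposed, outcome present
        else:
            d += 1          # unexposed, outcome absent
    return (a * d) / (b * c)


def simulate_two_phase(n=5000, n_audit=500, flip=0.15, seed=0):
    """Hypothetical two-phase study with misclassified exposure and outcome."""
    rng = random.Random(seed)
    true, observed = [], []
    for _ in range(n):
        x = int(rng.random() < 0.4)
        # Outcome risk 0.50 if exposed, 0.25 if not: true odds ratio is 3.
        y = int(rng.random() < (0.5 if x else 0.25))
        # Phase I: each variable is recorded incorrectly with probability `flip`.
        x_star = 1 - x if rng.random() < flip else x
        y_star = 1 - y if rng.random() < flip else y
        true.append((x, y))
        observed.append((x_star, y_star))
    # Phase II: audit a simple random sample and recover the true values.
    audited = rng.sample(range(n), n_audit)
    return {
        "naive": odds_ratio(observed),
        "audit_only": odds_ratio([true[i] for i in audited]),
        "n_audited": len(audited),
    }


if __name__ == "__main__":
    print(simulate_two_phase())
```

Under these assumptions the naive estimate is pulled toward 1, while the audit-only estimate targets the true value but discards the unaudited records. A method that combines both phases, as the abstract proposes, aims to avoid the bias of the first and the inefficiency of the second.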