Cheung Li C, Pan Qing, Hyun Noorie, Schiffman Mark, Fetterman Barbara, Castle Philip E, Lorey Thomas, Katki Hormuzd A
Department of Statistics, The George Washington University, Washington, DC, U.S.A.
Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, U.S.A.
Stat Med. 2017 Sep 30;36(22):3583-3595. doi: 10.1002/sim.7380. Epub 2017 Jun 28.
For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers that use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and that there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in the time of prevalent disease onset. We demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier provided poor estimates, while the logistic-Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic-Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
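As an illustration only, the mixture structure described above (a point mass at time zero for prevalent disease plus a Weibull distribution for incident disease) can be sketched in Python for the case without covariates. The function names, the `(left, right)` encoding of observations, and the form of each likelihood contribution are assumptions made for this sketch and are not taken from the paper; in particular, the logistic part is reduced to a single intercept.

```python
import math


def weibull_cdf(t, shape, scale):
    """Weibull CDF F(t) = 1 - exp(-(t / scale) ** shape) for t >= 0."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def cumulative_risk(t, prevalence, shape, scale):
    """Mixture cumulative risk: prevalent point mass at time zero plus incident disease."""
    return prevalence + (1.0 - prevalence) * weibull_cdf(t, shape, scale)


def log_likelihood(intercept, shape, scale, intervals):
    """Interval-censored log-likelihood under the mixture (illustrative encoding).

    Each observation is a hypothetical (left, right) pair:
      left = None       baseline disease status was never established
      left = L >= 0     known disease-free at time L
      right = R         disease found by time R
      right = math.inf  no disease found (right-censored at left)
    A pair with zero probability mass (for example (0.0, 0.0)) raises ValueError.
    """
    # Intercept-only logistic model for the probability of prevalent disease.
    prevalence = 1.0 / (1.0 + math.exp(-intercept))
    total = 0.0
    for left, right in intervals:
        upper = cumulative_risk(right, prevalence, shape, scale)
        lower = 0.0 if left is None else cumulative_risk(left, prevalence, shape, scale)
        total += math.log(upper - lower)
    return total


observations = [
    (None, 0.0),       # prevalent disease diagnosed at baseline
    (None, 2.0),       # found by year 2, baseline status unknown
    (1.0, 3.0),        # disease-free at year 1, found by year 3
    (4.0, math.inf),   # still disease-free at year 4
]
print(log_likelihood(-2.0, 1.5, 20.0, observations))
```

Under these assumptions, an observation with unknown baseline status contributes the full mixture risk at its right endpoint, so undiagnosed prevalent disease is absorbed into the point mass rather than being counted as incident disease. Maximizing this function over the intercept, shape, and scale would correspond to the direct likelihood maximization mentioned in the abstract.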