Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Biometrics. 2023 Mar;79(1):190-202. doi: 10.1111/biom.13599. Epub 2021 Nov 22.
Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses rely on poor proxies. Because detailed documentation is lacking and manual annotation is labor intensive, it is often feasible to ascertain only the current status of the disease at a follow-up time, rather than the exact onset time, and only for a small subset of patients. In this paper, we aim to develop risk prediction models for the onset time that efficiently leverage both a small number of labels on the current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for the proxy onset time, we propose a semisupervised risk prediction method that efficiently combines information from proxies and limited labels. Starting from an initial estimator based solely on the labeled subset, we perform a one-step correction using the full data, augmenting against a mean-zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semisupervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from the Mass General Brigham Healthcare Biobank.
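The data setting described above can be made concrete with a small simulation. The following is a minimal sketch only, not the authors' simulation design: the log-linear form of the onset model, the logistic error, the multiplicative proxy noise, the exponential follow-up distribution, and every name and parameter value (`simulate_current_status_data`, `beta`, `n_labeled`, and so on) are assumptions chosen for illustration. It shows the key asymmetry: the proxy is observed for everyone, while the current status indicator 1(onset <= follow-up) is available only for a small labeled subset.

```python
import numpy as np


def simulate_current_status_data(n_total=5000, n_labeled=200, seed=0):
    """Simulate proxies for all subjects and current status labels for a few.

    All distributional choices here are hypothetical illustrations.
    """
    rng = np.random.default_rng(seed)

    # One hypothetical risk factor with an assumed effect size.
    x = rng.normal(size=n_total)
    beta = 0.5

    # Assumed instance of a transformation model: log(T) = -beta * x + error.
    onset = np.exp(-beta * x + rng.logistic(size=n_total))

    # Imperfect proxy: onset time distorted by biased multiplicative noise
    # (for example, a first diagnostic code recorded late).
    proxy = onset * np.exp(rng.normal(0.3, 0.5, size=n_total))

    # Follow-up time at which the disease status would be checked.
    follow_up = rng.exponential(2.0, size=n_total)

    # Only a small random subset is manually annotated.
    labeled = np.zeros(n_total, dtype=bool)
    labeled[rng.choice(n_total, size=n_labeled, replace=False)] = True

    # Current status label: known for labeled subjects, missing otherwise.
    status = np.where(labeled, (onset <= follow_up).astype(float), np.nan)

    return {
        "x": x,
        "proxy": proxy,
        "follow_up": follow_up,
        "labeled": labeled,
        "status": status,
    }


data = simulate_current_status_data()
print("labeled subjects:", int(data["labeled"].sum()))
print("unlabeled subjects:", int((~data["labeled"]).sum()))
```

Note that the exact onset time is never returned: an estimator in this setting would see only the covariate, the proxy, the follow-up time, and the sparse status labels.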