Estes Jason P, Mukherjee Bhramar, Taylor Jeremy M G
University of Michigan, MI 48109, USA.
Stat Biosci. 2018 Dec;10(3):568-586. doi: 10.1007/s12561-018-9217-4. Epub 2018 May 14.
Large external data sources may be available to augment studies that collect data to address a specific research objective. In this article, we consider the problem of building regression models for prediction based on individual-level data from an "internal" study while incorporating summary information from an "external" big data source. We extend the work of Chatterjee et al. (2016a) by introducing an adaptive empirical Bayes shrinkage estimator that uses the external summary-level information and the internal data to trade off bias against variance, protecting against differences between the two populations in the conditional probability distribution of the outcome given a set of covariates. Using simulation studies and a real data application with external summary information from the Prostate Cancer Prevention Trial, we compare the proposed methods with maximum likelihood estimation and the constrained maximum likelihood (CML) method developed by Chatterjee et al. (2016a). Our simulation studies show that the CML method can be biased and inefficient when the assumption of a transportable covariate distribution between the external and internal populations is violated, and that our empirical Bayes estimator provides protection against bias and loss of efficiency.
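The abstract does not give the form of the shrinkage estimator, so the following is only a minimal sketch of one common way an adaptive empirical Bayes estimator can combine an unconstrained maximum likelihood estimate with a constrained one. The weighting formula, the function name `eb_shrinkage`, and its arguments are assumptions made for illustration and are not taken from the article. The idea shown is that the estimate stays at the CML value when the two estimates agree and moves toward the MLE as their disagreement grows relative to the MLE's variance.

```python
import numpy as np


def eb_shrinkage(beta_mle, beta_cml, var_mle):
    """Illustrative (assumed) empirical Bayes shrinkage of the MLE toward the CML.

    beta_mle : unconstrained estimate from the internal data, shape (p,)
    beta_cml : constrained estimate using external summary information, shape (p,)
    var_mle  : estimated covariance matrix of beta_mle, shape (p, p)
    """
    beta_mle = np.asarray(beta_mle, dtype=float)
    beta_cml = np.asarray(beta_cml, dtype=float)
    var_mle = np.asarray(var_mle, dtype=float)

    # Observed disagreement between the two estimators.
    diff = beta_mle - beta_cml

    # Rank-one estimate of the squared bias of the constrained estimator.
    bias_sq = np.outer(diff, diff)

    # Weight placed on the MLE: near 0 when the estimators agree,
    # near the identity when the disagreement dominates the variance.
    weight = bias_sq @ np.linalg.inv(var_mle + bias_sq)

    return beta_cml + weight @ diff


# Usage with made-up numbers: a small disagreement keeps the result near the CML.
beta_hat = eb_shrinkage([0.52, 1.10], [0.50, 1.00], np.diag([0.04, 0.09]))
print(beta_hat)
```

In this sketch the weight is driven entirely by the data, which is what makes the shrinkage adaptive: no tuning parameter decides how much to trust the external summary information.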