Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California.
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York.
Biometrics. 2022 Jun;78(2):679-690. doi: 10.1111/biom.13429. Epub 2021 Feb 11.
With the increasing availability of data in the public domain, there has been growing interest in exploiting information from external sources to improve the analysis of smaller-scale studies. An emerging challenge in the era of big data is that the subject-level data are high dimensional, whereas the external information is available only at an aggregate level and is of lower dimension. Moreover, heterogeneity and uncertainty in the auxiliary information are often not accounted for in information synthesis. In this paper, we propose a unified framework that summarizes various forms of aggregated information via estimating equations, and we develop a penalized empirical likelihood approach to incorporate such information in logistic regression. When the homogeneity assumption is violated, we extend the method to account for population heterogeneity among different sources of information. When the uncertainty in the external information is not negligible, we propose a variance estimator that adjusts for this uncertainty. The proposed estimators are asymptotically more efficient than the conventional penalized maximum likelihood estimator and enjoy the oracle property even with a diverging number of predictors. Simulation studies show that the proposed approaches yield higher accuracy in variable selection than competing methods. We illustrate the proposed methodologies with a pediatric kidney transplant study.
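To make the phrase "summarize aggregated information via estimating equations" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the external source reports only the coefficients theta of a reduced logistic model fitted on a subset A of the predictors, and it uses one common construction from the data-integration literature: under homogeneity, the full-model coefficients beta should satisfy E[x_A {expit(x'beta) - expit(x_A'theta)}] = 0. The abstract does not state the exact equations, so this form, the function names (`simulate`, `fit_logistic`, `mean_constraint`), and the simulated coefficients are all illustrative assumptions. The penalized empirical likelihood step itself is not shown.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_logistic(X, y, iters=25):
    """Unpenalized logistic MLE via Newton-Raphson (illustration only)."""
    d = len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        hess = [[0.0] * d for _ in range(d)]
        for xi, yi in zip(X, y):
            p = expit(dot(beta, xi))
            for j in range(d):
                grad[j] += xi[j] * (yi - p)
                for k in range(d):
                    hess[j][k] += xi[j] * xi[k] * p * (1.0 - p)
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


def simulate(n, seed):
    """Hypothetical data: columns are intercept, x1, x2; true beta is assumed."""
    rng = random.Random(seed)
    true_beta = [-0.5, 1.0, 0.8]
    X, y = [], []
    for _ in range(n):
        xi = [1.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        X.append(xi)
        y.append(1 if rng.random() < expit(dot(true_beta, xi)) else 0)
    return X, y


def mean_constraint(beta, theta, X, A):
    """Sample mean of g_i = x_{A,i} * (expit(x_i'beta) - expit(x_{A,i}'theta))."""
    g = [0.0] * len(A)
    for xi in X:
        xa = [xi[j] for j in A]
        diff = expit(dot(beta, xi)) - expit(dot(theta, xa))
        for j, v in enumerate(xa):
            g[j] += v * diff / len(X)
    return g


if __name__ == "__main__":
    A = [0, 1]  # the external model uses the intercept and x1 only
    X_ext, y_ext = simulate(5000, seed=1)  # stands in for the external study
    # Only this aggregate summary would be shared, not the subject-level data.
    theta = fit_logistic([[x[j] for j in A] for x in X_ext], y_ext)
    X_int, y_int = simulate(300, seed=2)  # the smaller internal study
    print(mean_constraint([-0.5, 1.0, 0.8], theta, X_int, A))  # at the assumed true beta
    print(mean_constraint([-0.5, 1.0, 0.0], theta, X_int, A))  # x2 effect wrongly set to 0
```

In this sketch the external summary enters only through `mean_constraint`. An empirical likelihood approach would treat that moment condition as a constraint on the observation weights while estimating beta, with a penalty added for variable selection. As a sanity check on the form of the equation, fitting both the full and the reduced model on the same data makes the sample constraint vanish up to numerical error, because both fits solve score equations that share the x_A components.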