Omar Vazquez, Bin Nan
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.
Department of Statistics, University of California, Irvine, California, U.S.A.
Can J Stat. 2025 Mar;53(1). doi: 10.1002/cjs.11827. Epub 2024 Aug 21.
We consider random sample splitting for estimation and inference in high-dimensional generalized linear models, where we first apply the lasso to select a submodel using one subsample and then apply the debiased lasso to fit the selected model using the remaining subsample. We show that a sample-splitting procedure based on the debiased lasso yields asymptotically normal estimates under mild conditions and that multiple splitting can address the loss of efficiency caused by using only part of the sample at each stage. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood method in the estimation stage can vastly reduce the bias and variance of the resulting estimates. Furthermore, our multiple-splitting debiased lasso method has better numerical performance than some existing methods for high-dimensional generalized linear models proposed in the recent literature. We illustrate the proposed multiple-splitting method with an analysis of the smoking data of the Mid-South Tobacco Case-Control Study.
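The two-stage procedure described in the abstract (lasso selection on one subsample, debiased lasso on the other, then averaging over repeated random splits) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: it uses a Gaussian linear model (a generalized linear model with identity link) rather than a general GLM, a hand-written coordinate-descent lasso, a fixed penalty `lam` instead of a tuned one, and a one-step correction based on the inverse sample Gram matrix as the form of the debiased lasso. The function names (`lasso_cd`, `debiased_lasso_linear`, `multi_split_estimate`) are hypothetical, and setting unselected coefficients to zero before averaging is a simplification of how splits might be aggregated.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()          # equals y - X @ beta at beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]      # partial residual without coordinate j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def debiased_lasso_linear(X, y, lam):
    """Lasso fit plus a one-step correction (assumed form of the debiasing).

    Requires fewer columns than rows so the Gram matrix is invertible.
    In this Gaussian case the update coincides algebraically with least
    squares on the given columns; for other GLMs it would not.
    """
    n = X.shape[0]
    beta = lasso_cd(X, y, lam)
    gram = X.T @ X / n
    score = X.T @ (y - X @ beta) / n
    return beta + np.linalg.solve(gram, score)

def multi_split_estimate(X, y, lam, n_splits=10, seed=0):
    """Average the two-stage estimates over repeated random half splits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    estimates = np.zeros((n_splits, p))
    for b in range(n_splits):
        perm = rng.permutation(n)
        half1, half2 = perm[: n // 2], perm[n // 2:]
        # Stage 1: select a submodel with the lasso on the first half.
        selected = np.flatnonzero(lasso_cd(X[half1], y[half1], lam))
        if selected.size == 0:
            continue
        # Stage 2: debiased lasso on the selected columns, second half only.
        estimates[b, selected] = debiased_lasso_linear(
            X[np.ix_(half2, selected)], y[half2], lam
        )
    return estimates.mean(axis=0)

# Simulated data (illustrative only): two nonzero coefficients out of 50.
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.5]
y = X @ beta_true + rng.standard_normal(n)

est = multi_split_estimate(X, y, lam=0.2)
print(np.round(est[:4], 2))
```

Because selection and estimation use disjoint subsamples within each split, the second-stage fit is not affected by the selection step on the same observations. Averaging over splits is what the abstract refers to as recovering the efficiency lost by fitting on only half of the data.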