Shin Sunyoung, Liu Yufeng, Cole Stephen R, Fine Jason P
Department of Mathematical Sciences, University of Texas at Dallas, 800 W. Campbell Rd., Richardson, Texas 75080, U.S.A.
Department of Statistics and Operations Research, CB# 3260, University of North Carolina, Chapel Hill, North Carolina 27599, U.S.A.
Biometrika. 2020 Jun;107(2):433-448. doi: 10.1093/biomet/asaa012. Epub 2020 Apr 15.
We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method.
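The two computational steps named in the abstract, the weighted combination of component estimators and least squares approximation selection, can be illustrated with a minimal sketch. The abstract does not give the paper's weights or penalty, so the sketch rests on assumptions: the two component estimators are uncorrelated, they are combined with matrix inverse-variance weights, and selection uses an adaptive-lasso penalty solved by coordinate descent. The function names `ensemble_combine` and `lsa_adaptive_lasso`, the tuning value `lam`, and all numeric inputs are hypothetical and chosen only for illustration.

```python
import numpy as np


def ensemble_combine(b1, V1, b2, V2):
    """Combine two estimators of the same parameter, assumed uncorrelated,
    using matrix inverse-variance weights (an illustrative assumption)."""
    P1 = np.linalg.inv(V1)          # precision of component 1
    P2 = np.linalg.inv(V2)          # precision of component 2
    V = np.linalg.inv(P1 + P2)      # covariance of the combined estimator
    b = V @ (P1 @ b1 + P2 @ b2)     # precision-weighted average
    return b, V


def lsa_adaptive_lasso(b_hat, V_hat, lam, n_iter=200, tol=1e-10):
    """Minimize 0.5*(b - b_hat)' A (b - b_hat) + sum_j w_j*|b_j|,
    with A = inv(V_hat) and w_j = lam / |b_hat_j|, by coordinate descent.
    Assumes no entry of b_hat is exactly zero."""
    A = np.linalg.inv(V_hat)
    w = lam / np.abs(b_hat)         # adaptive weights: small estimates are penalized more
    b = b_hat.astype(float).copy()
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(len(b)):
            # contribution of the other coordinates to the gradient in b_j
            r = A[j] @ (b - b_hat) - A[j, j] * (b[j] - b_hat[j])
            z = A[j, j] * b_hat[j] - r
            # soft-thresholding update for coordinate j
            b[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / A[j, j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b


# Hypothetical component estimates and covariance matrices.
b1, V1 = np.array([1.2, 0.0]), np.diag([0.01, 0.04])
b2, V2 = np.array([0.8, 0.1]), np.diag([0.03, 0.04])

b_ens, V_ens = ensemble_combine(b1, V1, b2, V2)
b_sel = lsa_adaptive_lasso(b_ens, V_ens, lam=0.5)
selected = np.flatnonzero(b_sel)    # indices of covariates kept in the model
```

The re-estimation step described in the abstract, in which each component is refitted under the selected model and the fits are combined again, requires the underlying data and is not shown here.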