Kubkowski Mariusz, Mielniczuk Jan
Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland.
Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland.
Entropy (Basel). 2020 Jan 28;22(2):153. doi: 10.3390/e22020153.
We consider the selection of random predictors in a high-dimensional regression problem with a binary response and a general loss function. An important special case is when the binary model is semi-parametric and the response function is misspecified by the fitted parametric model. When the true response function coincides with the postulated parametric response for a certain value of the parameter, we recover the standard framework for parametric inference. Both correct specification and misspecification are covered in this contribution. Variable selection in this scenario aims at recovering, with high probability, the support of the minimizer of the associated risk. We propose a two-step Screening-Selection (SS) procedure, which first screens and orders the predictors by the Lasso method and then selects the subset of predictors that minimizes the Generalized Information Criterion over the corresponding nested family of models. We prove consistency of the proposed selection method under conditions that allow the number of predictors to be much larger than the number of observations. For the semi-parametric case, when the distribution of the random predictors satisfies the linear regressions condition, the true and the estimated parameters are collinear and their common support can be consistently identified. This partly explains the robustness of selection procedures to misspecification of the response function.
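The two steps of the SS procedure can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the loss is taken to be logistic, the Lasso fit uses plain proximal gradient descent, the screened predictors are ordered by the absolute value of their Lasso coefficients, and the criterion is taken as GIC = 2n * (empirical risk) + a_n * (model size) with a BIC-like default a_n = log n. The function names, the penalty level `lam`, and the simulated data are hypothetical.

```python
import numpy as np

def logistic_loss(X, y, beta):
    """Average logistic loss for labels y in {0, 1} (no intercept)."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def logistic_grad(X, y, beta):
    """Gradient of the average logistic loss."""
    z = X @ beta
    prob = 1.0 / (1.0 + np.exp(-z))
    return X.T @ (prob - y) / len(y)

def lasso_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic fit by proximal gradient descent (ISTA).

    With lam = 0 this reduces to plain gradient descent on the loss.
    """
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4n) bounds the gradient's Lipschitz constant.
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        b = beta - step * logistic_grad(X, y, beta)
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft threshold
    return beta

def screening_selection(X, y, lam, a_n=None):
    """Sketch of a two-step screening-selection procedure.

    Step 1: the Lasso screens predictors and orders them by |coefficient|
            (assumed ordering rule).
    Step 2: among the nested models built from that ordering, pick the one
            minimizing GIC = 2n * loss + a_n * size (assumed form of the criterion).
    """
    n = len(y)
    a_n = np.log(n) if a_n is None else a_n  # BIC-like penalty, an assumption
    beta_lasso = lasso_logistic(X, y, lam)
    support = np.flatnonzero(beta_lasso)
    order = support[np.argsort(-np.abs(beta_lasso[support]))]

    best_gic, best_model = np.inf, order[:0]
    for k in range(len(order) + 1):
        idx = order[:k]
        # Refit without penalty on the first k ordered predictors.
        beta_k = lasso_logistic(X[:, idx], y, lam=0.0) if k > 0 else np.zeros(0)
        gic = 2.0 * n * logistic_loss(X[:, idx], y, beta_k) + a_n * k
        if gic < best_gic:
            best_gic, best_model = gic, idx
    return np.sort(best_model)

# Simulated example (hypothetical data): 3 active predictors out of 40.
rng = np.random.default_rng(0)
n, p = 300, 40
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 2.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

selected = screening_selection(X, y, lam=0.05)
print("selected predictors:", selected)
```

Refitting only along the Lasso ordering keeps the second step cheap: at most as many nested models are compared as there are screened predictors, rather than all subsets of them.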