Shen Xiaotong, Pan Wei, Zhu Yunzhang
School of Statistics, University of Minnesota, Minneapolis, MN 55455.
J Am Stat Assoc. 2012 Jan 1;107(497):223-232. doi: 10.1080/01621459.2011.645783. Epub 2012 Jun 11.
In high-dimensional data analysis, feature selection is one means of dimension reduction and proceeds together with parameter estimation. To assess the accuracy of selection and estimation, we study nonconvex constrained and regularized likelihoods in the presence of nuisance parameters. Theoretically, we show that the constrained L0-likelihood and its computational surrogate are optimal: they achieve feature selection consistency and sharp parameter estimation under a single condition, one that is necessary for any method to be selection consistent and to achieve sharp parameter estimation. This condition permits up to exponentially many candidate features. Computationally, we develop difference-of-convex (DC) methods that implement the computational surrogate through primal and dual subproblems. These results establish a central role for L0-constrained and regularized likelihoods in feature selection and in parameter estimation involving selection. As applications of the general method and theory, we perform feature selection in linear regression and logistic regression, and estimate a precision matrix in Gaussian graphical models. In these settings, we gain new theoretical insight and obtain favorable numerical results. Finally, we discuss an application that predicts the metastasis status of breast cancer patients from their gene expression profiles.
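The difference-of-convex idea mentioned in the abstract can be illustrated with a minimal sketch for the regularized linear regression case. The abstract does not specify the surrogate, so this sketch assumes a truncated L1 penalty, min(|b_j|/tau, 1), written as the difference |b_j|/tau - max(|b_j|/tau - 1, 0). Linearizing the subtracted term in |b_j| at the current iterate turns each step into a weighted lasso in which coefficients already larger than tau are left unpenalized. The function names (`weighted_lasso`, `tlp_dc`), the tuning parameters `lam` and `tau`, and the coordinate-descent solver are illustrative choices, not the authors' implementation, and the dual subproblem is not covered.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def weighted_lasso(X, y, weights, beta0, n_sweeps=200, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||^2 + sum_j weights[j] * |b_j|."""
    n, p = X.shape
    beta = beta0.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                   # skip all-zero columns
            old = beta[j]
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, weights[j]) / col_sq[j]
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta


def tlp_dc(X, y, lam, tau, n_outer=20, tol=1e-8):
    """DC iterations for the assumed truncated L1 penalty lam * min(|b_j|/tau, 1).

    Each outer step solves a weighted lasso whose weight is lam/tau for
    coefficients with |b_j| <= tau at the previous iterate and 0 otherwise.
    """
    p = X.shape[1]
    # Start from an ordinary lasso fit (all weights equal to lam / tau).
    beta = weighted_lasso(X, y, np.full(p, lam / tau), np.zeros(p))
    for _ in range(n_outer):
        weights = (lam / tau) * (np.abs(beta) <= tau)
        new_beta = weighted_lasso(X, y, weights, beta)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

Calling `tlp_dc(X, y, lam=0.1, tau=0.5)` on an n-by-p design matrix `X` and response vector `y` returns a coefficient vector whose nonzero entries form the selected feature set. In practice `lam` and `tau` would be chosen by cross-validation or a similar criterion.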