Wang Lifeng, Li Hongzhe, Huang Jianhua Z
Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
J Am Stat Assoc. 2008 Dec 1;103(484):1556-1569. doi: 10.1198/016214508000000788.
Nonparametric varying-coefficient models are commonly used to analyze data measured repeatedly over time, including longitudinal and functional response data. Although many procedures have been developed for estimating the varying coefficients, the problem of variable selection for such models has not been addressed. In this article, we present a regularized estimation procedure for variable selection that combines basis function approximations with the smoothly clipped absolute deviation (SCAD) penalty. The proposed procedure simultaneously selects the significant variables with time-varying effects and estimates the nonzero smooth coefficient functions. Under suitable conditions, we establish the theoretical properties of the procedure, including consistency in variable selection and the oracle property in estimation. Here the oracle property means that the asymptotic distribution of an estimated coefficient function is the same as it would be if the variables in the model were known a priori. The method is illustrated with simulations and two real data examples: one identifies risk factors in a study of AIDS, and the other uses microarray time-course gene expression data to identify transcription factors related to the yeast cell cycle process.
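To make the penalty concrete, the following is a minimal sketch of the standard SCAD penalty of Fan and Li, applied to the norm of each variable's basis coefficients. The abstract does not give the exact formulation, so the groupwise Euclidean norm, the default a = 3.7, and the function names `scad_penalty`, `group_norm`, and `total_penalty` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty evaluated at |theta|.

    lam is the tuning parameter; a > 2 controls where the penalty
    flattens out (a = 3.7 is a conventional default, assumed here).
    """
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Lasso-like linear region near zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients are not penalized further.
    return (a + 1) * lam * lam / 2


def group_norm(coefs):
    """Euclidean norm of one variable's basis coefficients (assumed norm)."""
    return math.sqrt(sum(c * c for c in coefs))


def total_penalty(groups, lam, a=3.7):
    """Sum of SCAD penalties over the per-variable coefficient groups."""
    return sum(scad_penalty(group_norm(g), lam, a) for g in groups)


if __name__ == "__main__":
    # Two hypothetical variables: one with all-zero basis coefficients
    # (dropped from the model) and one with a clearly nonzero effect.
    groups = [[0.0, 0.0, 0.0], [3.0, 4.0]]
    print(total_penalty(groups, lam=1.0))
```

Because the penalty is constant beyond a·λ, large coefficient groups are not shrunk further, which is the kind of behavior usually associated with oracle-type results, while a group whose norm is driven to zero removes the corresponding variable from the model.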