Huang Youjun, Pan Jianxin
Mathematical College, Sichuan University, Chengdu, P. R. China.
Department of Mathematics, The University of Manchester, Manchester, UK.
Biom J. 2022 Jan;64(1):57-73. doi: 10.1002/bimj.202000336. Epub 2021 Sep 29.
Variable selection and feature extraction are common problems in statistical research. Variable selection in linear models is well developed, but it has received relatively little attention for longitudinal data. Because a longitudinal study involves within-subject correlations, the likelihood function of discrete longitudinal responses generally cannot be expressed in analytically closed form, and standard variable selection methods cannot be applied directly. The penalized generalized estimating equation (PGEE) is a useful alternative, but it is very likely to select the wrong variables if the working correlation matrix is misspecified. In many circumstances, the within-subject correlations are themselves of interest and need to be modeled together with the mean. This is more challenging for longitudinal binary data, because the within-subject correlation coefficients are constrained by the so-called Fréchet-Hoeffding upper bound, which depends on the marginal means. In this paper, we propose smoothly clipped absolute deviation (SCAD)-based and least absolute shrinkage and selection operator (LASSO)-based penalized joint generalized estimating equation (PJGEE) methods that simultaneously model the mean and the correlations for longitudinal binary data and perform variable selection in the mean model. The estimated correlation coefficients satisfy the upper bound constraints. Simulation studies under different scenarios are conducted to assess the performance of the proposed methods. Compared with existing PGEE methods that specify a working correlation matrix for longitudinal binary data, the proposed PJGEE methods perform much better in terms of variable selection consistency and parameter estimation accuracy. A real data set on Clinical Global Impression is analyzed for illustration.
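To make the Fréchet-Hoeffding constraint concrete, the following minimal sketch computes the range of correlations attainable by two binary responses with given marginal success probabilities. It illustrates only the standard bound for a pair of Bernoulli variables and is not the PJGEE estimation procedure; the function name `binary_corr_bounds` and its arguments are hypothetical and chosen for this example.

```python
import math


def binary_corr_bounds(p1, p2):
    """Return (lower, upper) Frechet-Hoeffding bounds on corr(Y1, Y2)
    for Bernoulli responses with P(Y1 = 1) = p1 and P(Y2 = 1) = p2.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("marginal probabilities must lie strictly in (0, 1)")
    q1, q2 = 1.0 - p1, 1.0 - p2

    # Upper bound: attained when P(Y1 = 1, Y2 = 1) = min(p1, p2).
    ratio = (p1 * q2) / (p2 * q1)
    upper = min(math.sqrt(ratio), math.sqrt(1.0 / ratio))

    # Lower bound: attained when P(Y1 = 1, Y2 = 1) = max(0, p1 + p2 - 1).
    lower = max(-math.sqrt((p1 * p2) / (q1 * q2)),
                -math.sqrt((q1 * q2) / (p1 * p2)))
    return lower, upper


if __name__ == "__main__":
    # Equal margins allow the full range [-1, 1]; unequal margins shrink it.
    print(binary_corr_bounds(0.5, 0.5))
    print(binary_corr_bounds(0.2, 0.5))
```

With unequal margins the attainable range is strictly narrower than [-1, 1], which is why a working correlation chosen without regard to the means can fall outside the feasible region for binary data.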