Yang Sen, Yuan Lei, Lai Ying-Cheng, Shen Xiaotong, Wonka Peter, Ye Jieping
Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA.
KDD. 2012:922-930. doi: 10.1145/2339530.2339675.
High-dimensional regression and classification remain important and challenging problems, especially when features are highly correlated. Feature selection, combined with additional structural information about the features, has been considered promising for improving regression/classification performance. Graph-guided fused lasso (GFlasso) was recently proposed to perform feature selection while exploiting graph structure among the features. However, the GFlasso formulation relies on pairwise sample correlations to perform feature grouping, which can introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to address this issue. The first method applies a convex penalty to the pairwise norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves on the first by using a non-convex penalty to reduce estimation bias. The third extends the second with a truncated regularization that reduces the bias further. All three methods combine feature grouping with feature selection to improve estimation accuracy. We solve the proposed formulations using the alternating direction method of multipliers (ADMM) and difference-of-convex (DC) programming. Experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.
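To make the first (convex) formulation concrete, the following is a minimal sketch, not the authors' released code: it couples an l1 sparsity term with a pairwise penalty on coefficients of features connected in a graph and hands the convex problem to CVXPY. The max(|b_i|, |b_j|) form of the pairwise term, the edge list, and the regularization weights `lam1`/`lam2` are illustrative assumptions based on the abstract's wording ("pairwise norm of connected coefficients").

```python
# Sketch of simultaneous feature grouping and selection via a convex
# objective: least squares + l1 sparsity + a pairwise penalty over graph edges.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 50, 10                       # samples, features
X = rng.standard_normal((n, p))
beta_true = np.array([2, 2, 2, 0, 0, 0, -1.5, -1.5, 0, 0], dtype=float)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Feature graph: edges connect features expected to share coefficients.
edges = [(0, 1), (1, 2), (6, 7)]
lam1, lam2 = 0.1, 0.5               # illustrative regularization weights

b = cp.Variable(p)
pairwise = cp.sum(cp.hstack(
    [cp.maximum(cp.abs(b[i]), cp.abs(b[j])) for i, j in edges]))
objective = 0.5 * cp.sum_squares(y - X @ b) + lam1 * cp.norm1(b) + lam2 * pairwise
cp.Problem(cp.Minimize(objective)).solve()

# Connected coefficients are pulled toward equal magnitude; unconnected,
# irrelevant features are shrunk toward zero by the l1 term.
print(np.round(b.value, 3))
```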
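The abstract also names ADMM as the workhorse solver. The sketch below is not the paper's algorithm; for brevity it minimizes 0.5*||y - Xb||^2 + lam*||Db||_1 with D taking differences across graph edges (a GFlasso-style surrogate), purely to illustrate how ADMM separates the smooth loss from the nonsmooth graph penalty. The function name, step size `rho`, and edge list are hypothetical.

```python
# Generic ADMM for a graph-fused (generalized lasso) surrogate objective.
import numpy as np

def admm_graph_fused(X, y, edges, lam=0.5, rho=1.0, iters=200):
    n, p = X.shape
    # Edge-difference operator: one row per edge, +1/-1 on its endpoints.
    D = np.zeros((len(edges), p))
    for k, (i, j) in enumerate(edges):
        D[k, i], D[k, j] = 1.0, -1.0
    z = np.zeros(len(edges))
    u = np.zeros(len(edges))
    A = X.T @ X + rho * D.T @ D            # fixed system matrix for the b-update
    for _ in range(iters):
        # b-update: smooth least-squares subproblem (a linear solve).
        b = np.linalg.solve(A, X.T @ y + rho * D.T @ (z - u))
        # z-update: soft-thresholding handles the nonsmooth l1 graph penalty.
        v = D @ b + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # u-update: dual ascent on the constraint Db = z.
        u = u + D @ b - z
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    beta_true = np.array([2, 2, 2, 0, 0, 0, -1.5, -1.5, 0, 0], dtype=float)
    y = X @ beta_true + 0.1 * rng.standard_normal(50)
    print(np.round(admm_graph_fused(X, y, [(0, 1), (1, 2), (6, 7)]), 3))
```

The non-convex and truncated penalties described in the abstract would replace the l1 graph term and be handled by DC programming, i.e., by iteratively solving convex subproblems of the form above.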