Liu Jianyu, Yu Guan, Liu Yufeng
Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA.
Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA.
J Multivar Anal. 2019 May;171:250-269. doi: 10.1016/j.jmva.2018.12.007. Epub 2018 Dec 17.
Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary in order to classify high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate the structural information among predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation for penalized LDA techniques, and propose to impose a structure-based sparse penalty on the discriminant vector. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of GSLDA that effectively utilizes unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we can obtain a sparse estimate of the discriminant vector and more accurate and interpretable classifiers than many existing methods. Both the selection consistency of the estimated discriminant vector and the convergence rate of the classifier are established, and the resulting classifier has an asymptotic Bayes error rate. Finally, we demonstrate the competitive performance of the proposed GSLDA on both simulated and real data studies.
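The regularized regression formulation mentioned in the abstract can be illustrated with a minimal sketch. It is not the GSLDA estimator itself: the abstract does not specify the form of the structure-based sparse penalty, so the sketch substitutes a plainly different, simpler stand-in, namely a lasso penalty plus a graph Laplacian smoothness penalty, and solves it by proximal gradient descent. The function names `soft_threshold` and `graph_sparse_lda`, the class coding, the tuning values `lam1` and `lam2`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def graph_sparse_lda(X, labels, adjacency, lam1=0.1, lam2=0.1, n_iter=500):
    """Hypothetical two-class sketch (not the paper's penalty). Minimizes
    (1/2n)||y - Xc b||^2 + lam1 * ||b||_1 + (lam2/2) * b' L b
    where L is the Laplacian of the feature graph."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    n1 = int(np.sum(labels == 1))
    n0 = n - n1
    # Class coding that sums to zero; with it, the unpenalized least-squares
    # solution is proportional to the classical LDA direction.
    y = np.where(labels == 1, n / n1, -n / n0)
    Xc = X - X.mean(axis=0)
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # Step size from the largest eigenvalue of the smooth part's Hessian.
    H = Xc.T @ Xc / n + lam2 * L
    step = 1.0 / np.linalg.eigvalsh(H)[-1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = Xc.T @ (Xc @ beta - y) / n + lam2 * (L @ beta)
        beta = soft_threshold(beta - step * grad, step * lam1)
    return beta

# Simulated example (assumed setup): 4 informative features linked in a chain.
rng = np.random.default_rng(0)
n, p = 200, 20
labels = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[labels == 1, :4] += 1.5
adjacency = np.zeros((p, p))
for j in range(3):
    adjacency[j, j + 1] = adjacency[j + 1, j] = 1.0

beta = graph_sparse_lda(X, labels, adjacency, lam1=0.4, lam2=0.1)

# Midpoint rule, assuming balanced classes: assign class 1 when the score is positive.
midpoint = 0.5 * (X[labels == 1].mean(axis=0) + X[labels == 0].mean(axis=0))
pred = ((X - midpoint) @ beta > 0).astype(int)
accuracy = float(np.mean(pred == labels))
```

In this sketch the lasso term drives the coefficients of irrelevant features toward zero, while the Laplacian term encourages connected features to receive similar coefficients. A penalty that selects whole neighborhoods of the graph, as the phrase "structure-based sparse penalty" suggests, would require a different proximal step.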