Hannan Yang, D. Y. Lin, and Quefeng Li
Department of Biostatistics, University of North Carolina at Chapel Hill.
Stat Sin. 2023 May;33(SI):1343-1364. doi: 10.5705/ss.202021.0028.
High-dimensional classification is an important statistical problem with applications in many areas. One widely used classifier is linear discriminant analysis (LDA). In recent years, many regularized LDA classifiers have been proposed to solve the high-dimensional classification problem. However, these methods rely on inverting a large matrix or solving large-scale optimization problems to obtain the classification rule, which is computationally prohibitive when the dimension is ultra-high. With the emergence of big data, it is increasingly important to develop more efficient algorithms for the high-dimensional LDA problem. In this paper, we propose an efficient greedy search algorithm that relies solely on closed-form formulae to learn a high-dimensional LDA rule. We establish theoretical guarantees for its statistical properties in terms of variable selection and error-rate consistency; in addition, under some mild distributional assumptions, we provide an explicit interpretation of the extra information brought by an additional feature in an LDA problem. We demonstrate that the new algorithm drastically improves computational speed relative to other high-dimensional LDA methods, while maintaining comparable or even better classification performance.
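To make the idea of a greedy, closed-form search concrete, the following is a minimal Python sketch of forward feature selection for binary LDA. It is not the authors' algorithm: the function name greedy_lda_selection, the parameters max_features and ridge, and the use of the estimated squared Mahalanobis distance between class means as the selection criterion are illustrative assumptions made here; the paper derives its own closed-form update formulae, selection criterion, and stopping rule.

```python
import numpy as np

def greedy_lda_selection(X, y, max_features=10, ridge=1e-6):
    """Forward-greedy feature selection for binary LDA (illustrative sketch).

    At each step, add the feature that most increases the estimated squared
    Mahalanobis distance between the two class means, computed on the
    currently selected feature set. This is a generic sketch, not the
    paper's exact algorithm or its closed-form updates.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    delta = X1.mean(axis=0) - X0.mean(axis=0)              # mean difference
    # Pooled sample covariance of the two classes.
    S = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)

    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(min(max_features, X.shape[1])):
        best_j, best_gain = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # Small ridge term keeps the submatrix invertible.
            S_sub = S[np.ix_(idx, idx)] + ridge * np.eye(len(idx))
            d_sub = delta[idx]
            # Squared Mahalanobis distance restricted to the candidate set.
            gain = d_sub @ np.linalg.solve(S_sub, d_sub)
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Usage on synthetic data (only the first 5 features carry signal).
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :5] += 1.0
print(greedy_lda_selection(X, y, max_features=5))
```

A naive implementation like this re-solves a linear system for every candidate feature; the appeal of a closed-form greedy rule, as described in the abstract, is that each candidate can instead be evaluated with an explicit update formula, avoiding both large matrix inversion and large-scale optimization.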