Shin Seung Jun, Wu Yichao, Zhang Hao Helen, Liu Yufeng
Department of Mathematics, University of Arizona, P.O. Box 210089, Tucson, Arizona 85721-0089, U.S.A.
Biometrics. 2014 Sep;70(3):546-55. doi: 10.1111/biom.12174. Epub 2014 Apr 29.
In high-dimensional data analysis, it is of primary interest to reduce the data dimensionality without loss of information. Sufficient dimension reduction (SDR) arises in this context, and many successful SDR methods have been developed since the introduction of sliced inverse regression (SIR) [Li (1991) Journal of the American Statistical Association 86, 316-327]. Despite this rapid progress, however, most existing methods target regression problems with a continuous response. For binary classification problems, SIR suffers from the limitation of estimating at most one direction, since only two slices are available. In this article, we develop a new and flexible probability-enhanced SDR method for binary classification problems using the weighted support vector machine (WSVM). The key idea is to slice the data based on the conditional class probabilities of the observations rather than on their binary responses. We first show that the central subspace based on the conditional class probability is the same as that based on the binary response. This important result justifies the proposed slicing scheme from a theoretical perspective and ensures no loss of information. In practice, the true conditional class probability is generally unavailable, and probability estimation can be challenging for data with high-dimensional inputs. We observe that implementing the new slicing scheme does not require exact probability values; the only information needed is the relative order of the probability values. Motivated by this fact, our new SDR procedure bypasses the probability estimation step and employs the WSVM to estimate the order of the probability values directly, based on which the slicing is performed. The performance of the proposed probability-enhanced SDR scheme is evaluated on both simulated and real data examples.
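The procedure described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: the grid of class weights, the linear kernel, the vote-counting rule used to order the conditional class probabilities, and the `n_slices` parameter are all illustrative assumptions, and the WSVM is approximated here by scikit-learn's class-weighted SVM.

```python
import numpy as np
from sklearn.svm import SVC

def probability_enhanced_sir(X, y, n_slices=5, weight_grid=None):
    """Sketch of probability-enhanced slicing for binary y in {-1, +1}.

    Ranks observations by their conditional class probability using a
    family of weighted SVMs, then applies SIR on slices of that rank
    instead of on the two binary-response slices.
    """
    n, p = X.shape
    if weight_grid is None:
        weight_grid = np.linspace(0.05, 0.95, 19)  # grid of weights pi

    # For each pi, fit an SVM penalizing the two classes with weights
    # (1 - pi) and pi. Under one common WSVM convention, the sign of the
    # fitted rule estimates sign(P(Y=1|x) - pi), so the number of
    # positive "votes" over the grid recovers the relative order of
    # P(Y=1|x) without estimating the probabilities themselves.
    votes = np.zeros(n)
    for pi in weight_grid:
        clf = SVC(kernel="linear", class_weight={1: 1 - pi, -1: pi})
        clf.fit(X, y)
        votes += clf.decision_function(X) > 0

    # Slice observations by the rank of their vote counts.
    ranks = np.argsort(np.argsort(votes))
    slice_id = ranks * n_slices // n

    # Standard SIR step on the probability-ordered slices: whiten X,
    # average within slices, eigen-decompose the between-slice
    # covariance, and back-transform to the original scale.
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_sqrt

    M = np.zeros((p, p))
    for h in range(n_slices):
        idx = slice_id == h
        if idx.any():
            m_h = Z[idx].mean(axis=0)
            M += idx.mean() * np.outer(m_h, m_h)

    _, V = np.linalg.eigh(M)
    # Columns ordered by decreasing eigenvalue span the estimated
    # central subspace.
    return Sigma_inv_sqrt @ V[:, ::-1]
```

For a single-index model such as Y = sign(x1 + noise), the leading returned column should roughly align with the first coordinate axis, even though only binary labels enter the fit.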