Chen Kun
Department of Statistics, University of Connecticut, 215 Glenbrook Rd. U-4120, Storrs, CT 06269-4120
Stat Interface. 2016;9(4):469-484. doi: 10.4310/SII.2016.v9.n4.a7.
Reduced-rank methods are widely used in high-dimensional multivariate analysis to perform dimension reduction and model estimation simultaneously. However, the commonly used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few outliers. Anomalies are bound to exist in big data problems, and in some applications they may themselves be of primary interest. Naive residual analysis is often inadequate for outlier detection because of potential masking and swamping, while robust reduced-rank estimation approaches can be computationally demanding. Under Stein's unbiased risk estimation framework, we propose a set of tools, including leverage scores and generalized information scores, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom to the observation level, which leads to an exact decomposition of many commonly used information criteria; the resulting quantities are thus named the information scores of the observations. The proposed information score approach provides a principled way of combining residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with handwritten digit images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.
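The idea of decomposing degrees of freedom and an information criterion to the observation level can be illustrated in a much simpler setting than the one the paper treats. The sketch below is not the paper's reduced-rank, SURE-based procedure; it assumes ordinary full-rank multivariate least squares, where the hat matrix H makes the decomposition elementary: the degrees of freedom pq equal q times the sum of the diagonal entries of H, and a Mallows-type criterion RSS + 2σ²·df splits exactly into per-observation terms. The function name, the simulated data, and the choice of criterion are all assumptions made for illustration.

```python
import numpy as np

def leverage_and_information_scores(X, Y, sigma2):
    """Per-observation leverage and information scores for full-rank
    multivariate least squares (an illustrative analog only, not the
    reduced-rank scores proposed in the paper)."""
    q = Y.shape[1]
    # Thin QR gives the hat matrix H = Q Q^T without inverting X^T X.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)            # diagonal entries of H
    residuals = Y - Q @ (Q.T @ Y)       # Y - H Y
    rss_i = np.sum(residuals**2, axis=1)
    leverage = q * h                    # sums to the degrees of freedom p*q
    # Mallows-type criterion RSS + 2*sigma2*df, split observation by observation.
    info = rss_i + 2.0 * sigma2 * leverage
    return leverage, info

# Simulated data (hypothetical): 200 observations, 5 predictors, 3 responses.
rng = np.random.default_rng(0)
n, p, q = 200, 5, 3
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q))
Y = X @ B + rng.standard_normal((n, q))
Y[0] += 15.0                            # contaminate one response row

leverage, info = leverage_and_information_scores(X, Y, sigma2=1.0)
suspect = int(np.argmax(info))          # observation with the largest score
```

In this toy setting the leverage scores sum to pq, and the contaminated row receives by far the largest information score. The reduced-rank case is harder because the fitted values are no longer a linear function of the responses, which is why the paper works under Stein's unbiased risk estimation framework.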