Allen Genevera I, Tibshirani Robert
Department of Statistics, Stanford University, Stanford, California, 94305, USA
Ann Appl Stat. 2010 Jun;4(2):764-790. doi: 10.1214/09-AOAS314.
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
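As a rough illustration of the multivariate (not the transposable) framework mentioned in the abstract, the sketch below alternates between estimating a mean and a regularized covariance and replacing each missing entry with its conditional mean given the observed entries in the same row. It is a simplified stand-in, not the authors' algorithm: the function name `impute_mvn` and the parameter `ridge` are hypothetical, a ridge term added to the covariance replaces the paper's penalties on the inverse covariance, and the conditional-covariance correction of a full EM E-step is omitted. It assumes every column has at least one observed value.

```python
import numpy as np


def impute_mvn(X, ridge=0.1, n_iter=50, tol=1e-6):
    """Conditional-mean imputation under a multivariate normal model.

    Hypothetical helper: `ridge` is added to the diagonal of the covariance
    so that it stays non-singular even when columns outnumber rows.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Start from column-mean imputation (assumes no column is entirely missing).
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    n, p = X.shape
    for _ in range(n_iter):
        # Estimate the mean and a ridge-regularized covariance from the filled data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = centered.T @ centered / n + ridge * np.eye(p)

        previous = filled.copy()
        for i in np.nonzero(missing.any(axis=1))[0]:
            m = missing[i]   # missing columns in row i
            o = ~m           # observed columns in row i
            if not o.any():
                # Nothing observed in this row: fall back to the mean.
                filled[i, m] = mu[m]
                continue
            # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
            coef = np.linalg.solve(sigma[np.ix_(o, o)], filled[i, o] - mu[o])
            filled[i, m] = mu[m] + sigma[np.ix_(m, o)] @ coef

        # Stop once the imputed values no longer change appreciably.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled
```

Calling `impute_mvn(X)` on an array with `np.nan` in the missing positions returns an array of the same shape in which the observed entries are unchanged. In a transposable setting one would model row and column covariances jointly, which this sketch does not attempt.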