Fosdick Bailey K, Hoff Peter D
Statistical and Applied Mathematical Sciences Institute and University of Washington.
Ann Appl Stat. 2014;8(1):120-147. doi: 10.1214/13-aoas694.
Human mortality data sets can be expressed as multiway data arrays, the dimensions of which correspond to categories by which mortality rates are reported, such as age, sex, country and year. Regression models for such data typically assume an independent error distribution or an error model that allows for dependence along at most one or two dimensions of the data array. However, failing to account for other dependencies can lead to inefficient estimates of regression parameters, inaccurate standard errors and poor predictions. An alternative to assuming independent errors is to allow for dependence along each dimension of the array using a separable covariance model. However, the number of parameters in this model increases rapidly with the dimensions of the array and, for many arrays, maximum likelihood estimates of the covariance parameters do not exist. In this paper, we propose a submodel of the separable covariance model that estimates the covariance matrix for each dimension as having factor analytic structure. This model can be viewed as an extension of factor analysis to array-valued data, as it uses a factor model to estimate the covariance along each dimension of the array. We discuss properties of this model as they relate to ordinary factor analysis, describe maximum likelihood and Bayesian estimation methods, and provide a likelihood ratio testing procedure for selecting the factor model ranks. We apply this methodology to the analysis of data from the Human Mortality Database, and show in a cross-validation experiment how it outperforms simpler methods. Additionally, we use this model to impute mortality rates for countries that have no mortality data for several years. Unlike other approaches, our methodology is able to estimate similarities between the mortality rates of countries, time periods and sexes, and use this information to assist with the imputations.
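As a rough illustration of the covariance structure described above, the following sketch builds a factor-analytic covariance matrix (loadings times their transpose, plus a positive diagonal) for each dimension of an array and draws one array whose errors are dependent along every dimension. It is not the authors' implementation: the function names, the dimension sizes (20 ages, 30 countries, 10 years) and the ranks are hypothetical, and the parameter counts are raw counts that ignore the rotational identifiability constraints of factor models.

```python
import numpy as np


def factor_cov(dim, rank, rng):
    """Random covariance with factor-analytic structure: Lambda Lambda^T + D."""
    loadings = rng.normal(size=(dim, rank))          # Lambda, dim x rank
    uniquenesses = rng.uniform(0.5, 1.5, size=dim)   # diagonal of D, strictly positive
    return loadings @ loadings.T + np.diag(uniquenesses)


def sample_separable(covs, rng):
    """Draw one array with a separable covariance.

    The C-order vectorisation of the result has covariance
    kron(covs[0], covs[1], ...), but that large matrix is never formed.
    """
    dims = [c.shape[0] for c in covs]
    z = rng.normal(size=dims)                        # independent standard normals
    for mode, cov in enumerate(covs):
        chol = np.linalg.cholesky(cov)               # cov = chol @ chol.T
        # Multiply the array along `mode` by the Cholesky factor.
        z = np.moveaxis(np.tensordot(chol, z, axes=(1, mode)), 0, mode)
    return z


def n_params_unstructured(dims):
    """Free entries in one unrestricted symmetric matrix per dimension."""
    return sum(d * (d + 1) // 2 for d in dims)


def n_params_factor(dims, ranks):
    """Raw count of loadings plus uniquenesses per dimension (hypothetical ranks)."""
    return sum(d * r + d for d, r in zip(dims, ranks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dims = (20, 30, 10)    # hypothetical: ages x countries x years
    ranks = (2, 3, 2)      # hypothetical factor ranks, one per dimension
    covs = [factor_cov(d, r, rng) for d, r in zip(dims, ranks)]
    errors = sample_separable(covs, rng)
    print("array shape:", errors.shape)
    print("unstructured parameters:", n_params_unstructured(dims))
    print("factor-model parameters:", n_params_factor(dims, ranks))
```

Working mode by mode avoids building the full covariance of the vectorised array, which for these hypothetical sizes would be a 6000 x 6000 matrix. The two counting functions show why a factor submodel can be attractive when an array has long dimensions.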