Holt B, Benfer R A
Department of Anthropology, University of Missouri-Columbia, Columbia, MO, 65211, USA.
J Hum Evol. 2000 Sep;39(3):289-96. doi: 10.1006/jhev.2000.0418.
The problem of missing data is common in all fields of science. Various methods of estimating missing values in a dataset exist, such as deletion of cases, insertion of the sample mean, and linear regression. Each approach presents problems inherent in the method itself or in the nature of the pattern of missing data. We report a method, MISDAT, that (1) is more general in application and (2) provides better estimates than traditional approaches, such as one-step regression. The model is general in that it may be applied to singular matrices, such as those arising from small datasets or from datasets that contain dummy or index variables. The strength of the model is that it builds a regression equation iteratively, using a bootstrap method. The precision of the regressed estimates of a variable increases as the regressed estimates of the predictor variables improve. We illustrate this method with a set of measurements of European Upper Paleolithic and Mesolithic human postcranial remains, as well as a set of primate anthropometric data. First, in simulation tests using the primate data set, 20% of the values were randomly set to "missing". In each case, the first iteration produced significantly better estimates than other estimating techniques. Second, we applied our method to the incomplete set of human postcranial measurements. MISDAT estimates always perform better than replacement of missing data by means and better than classical multiple regression. As with classical multiple regression, MISDAT performs well when squared multiple correlation values approach the reliability of the measurement to be estimated, e.g., above about 0.8.
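The iterative idea described above (each variable's missing values are re-estimated by regression as the estimates of its predictors improve) can be sketched as follows. This is an illustration under stated assumptions, not the MISDAT program: the function name `iterative_regression_impute` and its parameters `n_iter` and `tol` are hypothetical, the starting values are column means, no resampling step is included, and a least-squares solver that returns a minimum-norm solution stands in for whatever the authors use to handle singular matrices. It also assumes every column has at least one observed value.

```python
import numpy as np

def iterative_regression_impute(data, n_iter=10, tol=1e-6):
    """Fill NaN entries by repeatedly regressing each column on the others.

    Illustrative sketch only; not the published MISDAT algorithm.
    """
    x = np.asarray(data, dtype=float).copy()
    missing = np.isnan(x)

    # Step 0: start from the mean of the observed values in each column.
    col_means = np.nanmean(x, axis=0)
    x[missing] = np.take(col_means, np.nonzero(missing)[1])

    for _ in range(n_iter):
        previous = x.copy()
        for j in range(x.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to estimate in this column
            # Predictors: intercept plus all other columns (current estimates).
            others = np.delete(x, j, axis=1)
            design = np.column_stack([np.ones(len(x)), others])
            # Fit only on rows where column j was actually observed.
            # lstsq gives a minimum-norm solution if the design is rank-deficient.
            coef, *_ = np.linalg.lstsq(design[~rows], x[~rows, j], rcond=None)
            # Replace the missing entries with the regression predictions.
            x[rows, j] = design[rows] @ coef
        # Stop once a full pass no longer changes the estimates.
        if np.max(np.abs(x - previous)) < tol:
            break
    return x
```

Observed values are never altered; only the entries that were originally missing are updated on each pass, so later passes fit their regressions on progressively better predictor estimates.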