Ryan-Despraz Jessica, Wissler Amanda
Department of Physical Anthropology, University of Bern, Bern, Switzerland.
Department of Anthropology, McMaster University, Hamilton, Canada.
Archaeol Anthropol Sci. 2024;16(11):187. doi: 10.1007/s12520-024-02078-2. Epub 2024 Oct 23.
Missing data are a prevalent problem in bioarchaeological research, and imputation could provide a promising solution. This work simulated missingness on a control dataset (481 samples × 41 variables) to explore imputation methods for mixed data (qualitative and quantitative data). The tested methods were Random Forest (RF), PCA/MCA, factorial analysis for mixed data (FAMD), hotdeck, predictive mean matching (PMM), random samples from observed values (RSOV), and a multi-method (MM) approach. These were tested under the three missingness mechanisms, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), at levels of 5%, 10%, 20%, 30%, and 40% missingness. This study also compared single imputation with an adapted multiple imputation method derived from the R package "mice". The results showed that the adapted multiple imputation technique always outperformed single imputation for the same method. The best performing methods were most often RF and MM, and other frequently successful methods were PCA/MCA and PMM multiple imputation. Across all criteria, the amount of missingness was the most important parameter for imputation accuracy. Although some imputation methods performed better than others on the control dataset, each method has advantages and disadvantages. Imputation remains a promising solution for datasets containing missingness; however, when choosing a method, it is essential to consider dataset structure and research goals.
The online version contains supplementary material available at 10.1007/s12520-024-02078-2.
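To make the simulation design described in the abstract concrete, the following is a minimal Python sketch, not the authors' code (the study itself worked in R). It masks values completely at random (MCAR) at the five missingness levels and fills them with random samples from observed values (RSOV), the simplest of the tested methods. The toy columns (`stature`, `lesion`), the function names, and the error measures (RMSE for the numeric column, proportion correct for the categorical one) are all assumptions for illustration; the abstract does not specify the evaluation criteria.

```python
import math
import random

LEVELS = [0.05, 0.10, 0.20, 0.30, 0.40]  # missingness levels listed in the abstract


def simulate_mcar(values, rate, rng):
    """Return a copy of `values` with a fraction `rate` of entries set to None.

    Positions are drawn uniformly at random, so missingness is unrelated to
    any value in the data (MCAR).
    """
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_rsov(values, rng):
    """Fill each None with a random draw from the observed values (RSOV)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    return [rng.choice(observed) if v is None else v for v in values]


def rmse(truth, masked, imputed):
    """Root mean squared error over the masked entries only (assumed metric)."""
    errors = [(t - p) ** 2 for t, m, p in zip(truth, masked, imputed) if m is None]
    return math.sqrt(sum(errors) / len(errors))


def accuracy(truth, masked, imputed):
    """Proportion of masked categorical entries recovered exactly (assumed metric)."""
    hits = [t == p for t, m, p in zip(truth, masked, imputed) if m is None]
    return sum(hits) / len(hits)


def run_demo(seed=0, n=481):
    """Mask and impute two hypothetical columns at each missingness level."""
    rng = random.Random(seed)
    # Hypothetical toy data: one quantitative and one qualitative variable.
    stature = [rng.gauss(165.0, 8.0) for _ in range(n)]
    lesion = [rng.choice(["absent", "present"]) for _ in range(n)]
    results = []
    for rate in LEVELS:
        s_masked = simulate_mcar(stature, rate, rng)
        l_masked = simulate_mcar(lesion, rate, rng)
        s_filled = impute_rsov(s_masked, rng)
        l_filled = impute_rsov(l_masked, rng)
        results.append((rate,
                        rmse(stature, s_masked, s_filled),
                        accuracy(lesion, l_masked, l_filled)))
    return results


if __name__ == "__main__":
    for rate, err, acc in run_demo():
        print(f"{rate:.0%} missing: RMSE={err:.2f}, accuracy={acc:.2f}")
```

Because the true values are known before masking, each imputed value can be scored directly, which is the point of simulating missingness on a complete control dataset. MAR and MNAR would need masking rules that depend on other variables or on the masked value itself; the sketch does not cover them.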