School of Public Health, University of Hong Kong, Pok Fu Lam, Hong Kong.
Harvard TH Chan School of Public Health, Harvard University, Boston, MA, USA.
Popul Health Metr. 2021 Nov 4;19(1):44. doi: 10.1186/s12963-021-00274-z.
Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this problem is missing values, which arise when health facilities, for a variety of reasons, fail to report to the central system.
Using data from the health management information system of the Democratic Republic of the Congo, with the advent of the COVID-19 pandemic as an illustrative case study, we implemented seven commonly used imputation methods. We evaluated how well each method minimized bias in the imputed values and in the parameter estimates produced by two subsequent analytical techniques: segmented regression, which is widely used in interrupted time series studies, and pre-post comparisons through paired Wilcoxon rank-sum tests. We also examined the performance of these imputation methods under different missingness mechanisms and tested their stability to changes in the data.
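As an illustration of the segmented regression used in interrupted time series studies, the following is a minimal sketch rather than the study's implementation. It assumes the common four-term specification (baseline level, baseline trend, level change, slope change) and fits it by ordinary least squares. The function names, the synthetic series, and the interruption month are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def segmented_regression(y, interruption):
    """Fit y_t = b0 + b1*t + b2*post_t + b3*(t - interruption)*post_t by OLS."""
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= interruption else 0.0
        rows.append([1.0, float(t), post, (t - interruption) * post])
    k = 4
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic monthly counts (hypothetical): level 100, trend +2 per month,
# then a level drop of 30 and a slope change of -1 from month 12 onward.
series = [100 + 2 * t + (-30 - (t - 12) if t >= 12 else 0) for t in range(24)]
b0, b1, b2, b3 = segmented_regression(series, interruption=12)
print(round(b0, 3), round(b1, 3), round(b2, 3), round(b3, 3))
```

Here `b2` estimates the immediate level change at the interruption and `b3` the change in slope afterwards. These are the kinds of coefficients whose bias can be compared across imputation methods.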
For regression analyses, when the data contained less than 20% missing values, the coefficient estimates did not differ substantially across methods, with two exceptions: mean imputation, and exclusion and interpolation. However, as the proportion of missing values grew, k-NN started to produce biased estimates. The machine learning algorithms, i.e. missForest and k-NN, were also found to lack robustness to small changes in the data or consecutive missingness. Multiple imputation methods, by contrast, generated the least biased estimates overall and were the most robust to all changes in the data. They also produced smaller standard errors than single imputation methods. For pre-post comparisons, all methods produced p values less than 0.01 regardless of the amount of missingness introduced, suggesting low sensitivity of Wilcoxon rank-sum tests to the imputation method used.
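The abstract does not say how estimates from the multiply imputed datasets were combined. A conventional choice is Rubin's rules, sketched below as an assumption rather than as the authors' procedure. The function name and the five example coefficients and standard errors are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the m estimates.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical level-change coefficients from five imputed datasets.
estimates = [-30.0, -28.0, -32.0, -29.0, -31.0]
std_errors = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_estimate, round(pooled_se, 4))
```

The pooled standard error reflects both the sampling uncertainty within each completed dataset and the disagreement between imputations. A single imputation cannot express the second of these.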
We recommend multiple imputation for addressing missing values in RHIS datasets, together with appropriate handling of the data structure to minimize imputation standard errors. Where the computing resources needed for multiple imputation are unavailable, seasonal decomposition may be considered as the next best method. Mean imputation and exclusion and interpolation, however, always produced biased and misleading results in the subsequent analyses, and their use for handling missing values should therefore be discouraged.
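To make the contrast between seasonal decomposition and mean imputation concrete, the sketch below uses a deliberately simplified stand-in for the seasonal decomposition method, not the implementation evaluated in the study. It assumes additive seasonality with a known period. It estimates per-season offsets from the observed values, linearly interpolates the deseasonalized series, and adds the offsets back. All names and the toy series are hypothetical.

```python
def interpolate_linear(values):
    """Fill None gaps linearly; gaps at the edges copy the nearest observed value."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


def impute_seasonal(values, period=12):
    """Impute None entries after removing additive per-season offsets."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    overall = sum(observed) / len(observed)
    seasonal = []
    for s in range(period):
        same = [v for i, v in enumerate(values) if v is not None and i % period == s]
        seasonal.append(sum(same) / len(same) - overall if same else 0.0)
    deseason = [None if v is None else v - seasonal[i % period]
                for i, v in enumerate(values)]
    smooth = interpolate_linear(deseason)
    return [v if v is not None else smooth[i] + seasonal[i % period]
            for i, v in enumerate(values)]


# Toy series with period 4; the value at index 6 (truly 30) is missing.
series = [10, 20, 30, 20, 10, 20, None, 20, 10, 20, 30, 20]
seasonal_fill = impute_seasonal(series, period=4)[6]
observed_values = [v for v in series if v is not None]
mean_fill = sum(observed_values) / len(observed_values)
print(round(seasonal_fill, 3), round(mean_fill, 3))
```

On this toy series the seasonal approach recovers the seasonal peak, whereas the mean fill pulls the missing month toward the overall average. This illustrates one way that mean imputation can bias later analyses.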