Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic.

Affiliations

School of Public Health, University of Hong Kong, Pok Fu Lam, Hong Kong.

Harvard TH Chan School of Public Health, Harvard University, Boston, MA, USA.

Publication information

Popul Health Metr. 2021 Nov 4;19(1):44. doi: 10.1186/s12963-021-00274-z.

DOI: 10.1186/s12963-021-00274-z

PMID: 34736462

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8567342/

Abstract

BACKGROUND

Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this data quality issue comes from missing values, where health facilities, for a variety of reasons, fail to report to the central system.

METHODS

Using data from the health management information system in the Democratic Republic of the Congo and the advent of the COVID-19 pandemic as an illustrative case study, we implemented seven commonly used imputation methods. We evaluated how well each method minimized bias in the imputed values and in the parameter estimates produced by subsequent analyses: segmented regression, which is widely used in interrupted time series studies, and pre-post comparisons through paired Wilcoxon rank-sum tests. We also examined the performance of these imputation methods under different missing-data mechanisms and tested their stability to changes in the data.
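As an illustration of the segmented regression step, the following is a minimal sketch under stated assumptions, not the authors' code. The abstract does not give the model specification, so the parameterization here (baseline level, baseline trend, level change, trend change) is a common interrupted time series form chosen for the example. The function names, the break index, and the synthetic series are all hypothetical.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def segmented_regression(y: Sequence[float], break_index: int) -> List[float]:
    """Fit y_t = b0 + b1*t + b2*post_t + b3*(t - break_index)*post_t by OLS.

    Returns [b0, b1, b2, b3]: baseline level, baseline trend,
    level change at the interruption, and trend change after it.
    """
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= break_index else 0.0
        rows.append([1.0, float(t), post, post * (t - break_index)])
    k = 4
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free monthly series (made up for illustration):
# 24 months before the interruption and 12 months after it.
BREAK = 24
series = [
    100.0 + 2.0 * t + (-30.0 - 1.5 * (t - BREAK) if t >= BREAK else 0.0)
    for t in range(36)
]
coefficients = segmented_regression(series, BREAK)
print([round(c, 3) for c in coefficients])
```

In an evaluation like the one described, one would fit this model to the complete series and to each imputed series, then compare the resulting coefficients. The sketch omits standard errors and autocorrelation adjustments, which a real analysis would need.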

RESULTS

For regression analyses, when the data contained less than 20% missing values, the coefficient estimates did not differ substantially across methods, with the exception of mean imputation and exclusion and interpolation. However, as the missing proportion grew, k-NN started to produce biased estimates. Machine learning algorithms, i.e. missForest and k-NN, were also found to lack robustness to small changes in the data or consecutive missingness. On the other hand, multiple imputation methods generated the least biased estimates overall and were the most robust to all changes in data. They also produced smaller standard errors than single imputations. For pre-post comparisons, all methods produced p values less than 0.01, regardless of the amount of missingness introduced, suggesting low sensitivity of Wilcoxon rank-sum tests to the imputation method used.
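The abstract does not say how estimates from the multiply imputed datasets were combined. The conventional approach is Rubin's rules, sketched below as an assumption rather than a description of the study's procedure. The function name `pool_rubin` and the five example estimates are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    Returns the pooled point estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    pooled = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the point estimates.
    between = variance(estimates)
    # Total variance inflates the between part by (1 + 1/m).
    total = within + (1.0 + 1.0 / m) * between
    return pooled, sqrt(total)


# Hypothetical level-change estimates from five imputed datasets.
level_changes = [-29.0, -31.0, -30.5, -29.5, -30.0]
level_change_ses = [2.0, 2.0, 2.0, 2.0, 2.0]
pooled_estimate, pooled_se = pool_rubin(level_changes, level_change_ses)
print(pooled_estimate, round(pooled_se, 4))
```

The pooled standard error reflects both the sampling uncertainty within each imputed dataset and the disagreement between datasets, which a single imputation cannot capture.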

CONCLUSIONS

We recommend the use of multiple imputation in addressing missing values in RHIS datasets and appropriate handling of data structure to minimize imputation standard errors. In cases where necessary computing resources are unavailable for multiple imputation, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion and interpolation, however, always produced biased and misleading results in the subsequent analyses, and thus, their use in the handling of missing values should be discouraged.
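The abstract names seasonal decomposition as a fallback but does not specify the algorithm. The sketch below is a deliberately simplified additive version: it estimates one seasonal offset per position in the cycle from the observed values, interpolates the deseasonalized series linearly, and adds the offsets back. It is not the authors' implementation or any library's algorithm, and the function names, the period, and the toy series are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def linear_interpolate(values: Sequence[Optional[float]]) -> List[float]:
    """Fill None gaps linearly; gaps at the edges copy the nearest observed value."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled.append(float(values[right]))
        elif right is None:
            filled.append(float(values[left]))
        else:
            weight = (i - left) / (right - left)
            filled.append(values[left] + weight * (values[right] - values[left]))
    return filled


def seasonal_impute(values: Sequence[Optional[float]], period: int = 12) -> List[float]:
    """Impute gaps after removing a simple additive seasonal component."""
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    # Seasonal offset per cycle position: position mean minus overall mean.
    offsets = []
    for pos in range(period):
        at_pos = [v for i, v in enumerate(values)
                  if v is not None and i % period == pos]
        offsets.append(sum(at_pos) / len(at_pos) - overall if at_pos else 0.0)
    deseasonalized = [None if v is None else v - offsets[i % period]
                      for i, v in enumerate(values)]
    smooth = linear_interpolate(deseasonalized)
    # Observed values are kept; only the gaps receive imputed values.
    return [float(v) if v is not None else smooth[i] + offsets[i % period]
            for i, v in enumerate(values)]


# Toy series with a 4-step cycle and two missing reports (None).
truth = [60.0, 45.0, 50.0, 45.0] * 3
reported: List[Optional[float]] = list(truth)
reported[5] = None
reported[10] = None
imputed = seasonal_impute(reported, period=4)
print(imputed[5], imputed[10])
```

On this trend-free toy series the seasonal pattern is recovered at the gaps, whereas filling both gaps with the observed mean would ignore the cycle entirely. Real RHIS series also carry trend and noise, so results there would be less clean.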

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ec3/8567614/f2e5d1d169ed/12963_2021_274_Fig1_HTML.jpg

Similar articles

Area-specific covid-19 effects on health services utilization in the Democratic Republic of the Congo using routine health information system data.

BMC Health Serv Res. 2023 Jun 3;23(1):575. doi: 10.1186/s12913-023-09547-9.

Multiple imputation for non-response when estimating HIV prevalence using survey data.

BMC Public Health. 2015 Oct 16;15:1059. doi: 10.1186/s12889-015-2390-1.

Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data.

Epidemiology. 2023 Mar 1;34(2):206-215. doi: 10.1097/EDE.0000000000001578. Epub 2022 Dec 9.

Handling Missing Data in COVID-19 Incidence Estimation: Secondary Data Analysis.

JMIR Public Health Surveill. 2024 Aug 20;10:e53719. doi: 10.2196/53719.

Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review.

BMC Med Res Methodol. 2024 Aug 28;24(1):188. doi: 10.1186/s12874-024-02310-6.

Impact of the COVID-19 pandemic and response on the utilisation of health services in public facilities during the first wave in Kinshasa, the Democratic Republic of the Congo.

BMJ Glob Health. 2021 Jul;6(7). doi: 10.1136/bmjgh-2021-005955.

missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data.

Genes Genomics. 2022 Jun;44(6):651-658. doi: 10.1007/s13258-022-01247-8. Epub 2022 Apr 6.

Generative adversarial networks for imputing missing data for big data clinical research.

BMC Med Res Methodol. 2021 Apr 20;21(1):78. doi: 10.1186/s12874-021-01272-3.

Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics.

BMC Bioinformatics. 2022 May 16;23(1):179. doi: 10.1186/s12859-022-04659-1.

Cited by

Comparing Multiple Imputation Methods to Address Missing Patient Demographics in Immunization Information Systems: Retrospective Cohort Study.

JMIR Public Health Surveill. 2025 Aug 26;11:e73916. doi: 10.2196/73916.

Comparative analysis of HIV data completeness in Haiti's iSanté Plus Electronic Medical Record system across children, adolescents and adults: a cross-sectional evaluation of 2016-2022 data.

BMJ Open. 2025 Jul 13;15(7):e087654. doi: 10.1136/bmjopen-2024-087654.

Integration of sentinel surveillance and climate factors to accelerate malaria elimination in a changing climate of Senegal.

Sci One Health. 2025 May 10;4:100112. doi: 10.1016/j.soh.2025.100112. eCollection 2025.

How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations.

Popul Health Metr. 2025 Feb 1;23(1):2. doi: 10.1186/s12963-025-00364-2.

Handling Missing Data in COVID-19 Incidence Estimation: Secondary Data Analysis.

JMIR Public Health Surveill. 2024 Aug 20;10:e53719. doi: 10.2196/53719.

Tracking health system performance in times of crisis using routine health data: lessons learned from a multicountry consortium.

Health Res Policy Syst. 2023 Jan 31;21(1):14. doi: 10.1186/s12961-022-00956-6.

Feasibility of establishing a core set of sexual, reproductive, maternal, newborn, child, and adolescent health indicators in humanitarian settings: results from a multi-methods assessment in the Democratic Republic of Congo.

Reprod Health. 2022 Jun 2;19(1):129. doi: 10.1186/s12978-022-01415-9.

Quantifying the indirect impact of COVID-19 pandemic on utilisation of outpatient and immunisation services in Kenya: a longitudinal study using interrupted time series analysis.

BMJ Open. 2022 Mar 10;12(3):e055815. doi: 10.1136/bmjopen-2021-055815.

Childhood immunization during the COVID-19 pandemic: experiences in Haiti, Lesotho, Liberia and Malawi.

Bull World Health Organ. 2022 Feb 1;100(2):115-126C. doi: 10.2471/BLT.21.286774. Epub 2021 Nov 17.

Identifying early-measured variables associated with APACHE IVa providing incorrect in-hospital mortality predictions for critical care patients.

Sci Rep. 2021 Nov 12;11(1):22203. doi: 10.1038/s41598-021-01290-7.
