Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study.

Author information

Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea.

Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.

Publication information

JMIR Public Health Surveill. 2021 Oct 13;7(10):e30824. doi: 10.2196/30824.

DOI:10.2196/30824

PMID:34643539

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8552097/

Abstract

BACKGROUND

When machine learning is applied to real-world data, missing values are typically the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron, k-nearest neighbors, and decision trees.
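As a rough illustration of three of the baseline families named above, the following sketch applies mean, k-nearest neighbor, and chained-equation imputation with scikit-learn. It is not the study's code: the function name `baseline_imputations` and all parameter values are assumptions, and scikit-learn's `IterativeImputer` is only inspired by MICE (it returns a single imputation rather than multiple ones).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def baseline_imputations(X, seed=0):
    """Impute NaN entries of X with three baseline methods.

    Returns a dict mapping a method name to a completed copy of X.
    The method names and hyperparameters are illustrative choices.
    """
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
        # Chained-equation style imputation (single imputation, MICE-inspired).
        "chained_equations": IterativeImputer(max_iter=10, random_state=seed),
    }
    return {name: imp.fit_transform(X) for name, imp in imputers.items()}
```

Each returned array has the same shape as the input, with observed entries left unchanged and only the NaN entries filled in.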

OBJECTIVE

The objective of this study was to impute numeric medical data, such as physical measurements and laboratory data. We aimed to impute these data effectively by using a progressive method called self-training, which suits the medical field, where training data are scarce.

METHODS

In this paper, we propose a self-training method that gradually increases the amount of available data. Models trained on complete data predict the missing values in incomplete data. Incomplete records whose missing values are judged to be validly predicted are then incorporated into the complete data. Using a predicted value as if it were the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of the process is how the accuracy of the pseudolabels is evaluated. Pseudolabels can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.
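The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it handles a single target column, holds out part of the complete data as a validation set, ranks candidates by disagreement among the forest's trees (a hypothetical confidence proxy), and accepts a batch of pseudolabels only if validation mean squared error does not get worse. The quantile-error criterion named in the title is not reproduced here, and the function name `self_train_impute` and its parameters are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def self_train_impute(X_lab, y_lab, X_unlab, X_val, y_val,
                      batch_size=20, max_rounds=10, seed=0):
    """Self-training for one target column (illustrative sketch).

    X_lab, y_lab : complete records (features, observed target)
    X_unlab      : features of records whose target is missing
    X_val, y_val : held-out complete records used to judge pseudolabels
    Returns (model, imputed, accepted), where `accepted` marks the rows
    of X_unlab that were pseudolabeled and added to the training data.
    """
    def fit(X, y):
        return RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)

    X_cur, y_cur = np.asarray(X_lab, float), np.asarray(y_lab, float)
    X_unlab = np.asarray(X_unlab, float)
    model = fit(X_cur, y_cur)
    best_mse = mean_squared_error(y_val, model.predict(X_val))
    accepted = np.zeros(len(X_unlab), dtype=bool)
    imputed = np.full(len(X_unlab), np.nan)

    for _ in range(max_rounds):
        remaining = np.flatnonzero(~accepted)
        if remaining.size == 0:
            break
        # Assumption: low spread across trees stands in for "validly predicted".
        per_tree = np.stack([t.predict(X_unlab[remaining]) for t in model.estimators_])
        batch = remaining[np.argsort(per_tree.std(axis=0))[:batch_size]]
        y_batch = model.predict(X_unlab[batch])  # pseudolabels

        # Evaluate the pseudolabels by their effect on model performance.
        X_try = np.vstack([X_cur, X_unlab[batch]])
        y_try = np.concatenate([y_cur, y_batch])
        candidate = fit(X_try, y_try)
        mse = mean_squared_error(y_val, candidate.predict(X_val))
        if mse > best_mse:
            break  # stopping condition: the batch made the model worse
        model, best_mse = candidate, mse
        X_cur, y_cur = X_try, y_try
        accepted[batch] = True
        imputed[batch] = y_batch

    # Rows never accepted are still imputed, using the final model.
    rest = ~accepted
    if rest.any():
        imputed[rest] = model.predict(X_unlab[rest])
    return model, imputed, accepted
```

The validation set must come from complete records that are never pseudolabeled. Otherwise the acceptance test would be measured against the model's own predictions.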

RESULTS

In self-training using random forest (RF), the mean squared error was up to 12% lower than with RF alone, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations.
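The two accuracy measures and the "lowest possible P value" can be made concrete with a short sketch. The helper names are invented for the example. The abstract does not state how many paired comparisons were used, so the sample sizes below are assumptions: for an exact signed-rank test with n untied, nonzero differences, the most extreme outcome has probability (1/2)^n, so 3.05e-5 is consistent with 16 pairs in a two-sided test (2/2^16) or 15 pairs in a one-sided test (1/2^15).

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between actual and imputed values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between actual and imputed values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def min_signed_rank_p(n_pairs, two_sided=True):
    """Smallest P value an exact Wilcoxon signed-rank test can return.

    Assumes n_pairs nonzero, untied differences. The extreme case is
    every difference having the same sign, with probability 0.5 ** n_pairs.
    """
    tail = 0.5 ** n_pairs
    return min(1.0, 2 * tail) if two_sided else tail


print(min_signed_rank_p(16))                   # 3.0517578125e-05
print(min_signed_rank_p(15, two_sided=False))  # 3.0517578125e-05
```

Under these assumptions, reaching the minimum P value means self-training was better in every paired comparison against mean imputation.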

CONCLUSIONS

Self-training showed significant results when predicted values were compared with actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve further depending on the pseudolabel evaluation method, which will be the main subject of our future research.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b72/8552097/e70782a15be2/publichealth_v7i10e30824_fig1.jpg
