Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea.
Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
JMIR Public Health Surveill. 2021 Oct 13;7(10):e30824. doi: 10.2196/30824.
Missing values are often the first problem encountered when machine learning is applied to real-world data. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree.
The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field where training data are scarce.
In this paper, we propose a self-training method that gradually increases the amount of available data. Models trained with complete data predict the missing values in incomplete data. Incomplete records whose missing values are validly predicted are then incorporated into the complete data. Using a predicted value as if it were the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.
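The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a small k-nearest-neighbor regressor stands in for the base model, a pseudolabel is accepted only if adding it does not increase the error on a held-out set of complete rows, and the loop stops when a full pass accepts nothing. The function names `knn_predict`, `mse`, and `self_train`, the acceptance rule, and the synthetic data are all hypothetical.

```python
import random


def knn_predict(train, x, k=3):
    """Mean target of the k training rows nearest to x (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, held_out, k=3):
    """Mean squared error of the k-NN model on held-out rows with known targets."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in held_out) / len(held_out)


def self_train(complete, incomplete, held_out, k=3, max_rounds=10):
    """Grow the training set with pseudolabeled rows that do not hurt held-out MSE.

    complete:   list of (features, target) rows with observed targets
    incomplete: list of feature tuples whose target is missing
    held_out:   complete rows used only to judge pseudolabels (assumed criterion)
    Returns (augmented training rows, feature tuples still without a label).
    """
    train, pending = list(complete), list(incomplete)
    err = mse(train, held_out, k)
    for _ in range(max_rounds):
        still_missing = []
        for x in pending:
            # Pseudolabel: treat the current model's prediction as the actual value.
            candidate = train + [(x, knn_predict(train, x, k))]
            cand_err = mse(candidate, held_out, k)
            if cand_err <= err:
                train, err = candidate, cand_err  # accept and update the model
            else:
                still_missing.append(x)  # retry in a later round
        if len(still_missing) == len(pending):
            break  # stopping condition: no pseudolabel was accepted in this pass
        pending = still_missing
    return train, pending


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in data: one feature, target roughly 2 * feature.
    rows = [((i / 10,), 2.0 * (i / 10) + random.gauss(0, 0.1)) for i in range(30)]
    random.shuffle(rows)
    complete, held_out = rows[:12], rows[12:20]
    incomplete = [x for x, _ in rows[20:]]
    train, pending = self_train(complete, incomplete, held_out)
    print(len(train) - len(complete), "rows pseudolabeled;", len(pending), "left unfilled")
```

Because each acceptance is checked against the current held-out error, that error never increases as the training set grows. Rows that are rejected in one pass are retried later, since a model updated with other pseudolabels may predict them differently.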
In self-training using random forest (RF), the mean squared error was up to 12% lower than that of pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In Friedman tests against MICE and RF, self-training showed P values between .003 and .02. A Wilcoxon signed-rank test against mean imputation showed the lowest possible P value, 3.05e-5, in all situations.
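The phrase "lowest possible P value" can be illustrated with a short calculation. In an exact Wilcoxon signed-rank test with n nonzero, untied paired differences, each of the 2^n sign patterns is equally likely under the null hypothesis, so the smallest one-sided P value is 1/2^n and the smallest two-sided P value is 2/2^n. The value 3.05e-5 is consistent with, for example, a two-sided test on 16 pairs or a one-sided test on 15 pairs. These sample sizes are an inference from the number itself, not something the abstract states, and the helper name `min_exact_wilcoxon_p` is hypothetical.

```python
def min_exact_wilcoxon_p(n_pairs: int, two_sided: bool = True) -> float:
    """Smallest P value an exact Wilcoxon signed-rank test can return.

    Assumes n_pairs nonzero differences with no tied ranks, so that all
    2**n_pairs sign patterns are equally likely under the null hypothesis.
    """
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    # Only the pattern where every difference has the same sign is this extreme.
    one_sided = 1.0 / 2 ** n_pairs
    return min(1.0, 2 * one_sided) if two_sided else one_sided


print(f"{min_exact_wilcoxon_p(16):.2e}")  # prints 3.05e-05
```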
Self-training showed significant results when predicted values were compared with actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.