School of Data Science, Fudan University, Shanghai, China.
School of Economics and Management, Beijing Forestry University, Beijing, China.
BMC Med Res Methodol. 2024 Sep 6;24(1):194. doi: 10.1186/s12874-024-02319-x.
Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.
We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with a calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and XGBoost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator (LASSO) logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate how the missingness mechanism, the degree of ME, and the importance of the affected predictors influence model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
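For readers less familiar with the two evaluation metrics, the following is a minimal sketch in plain Python of how AUROC and AUPRC can be computed from predicted risk scores. It is not the study's code: the function names `auroc` and `average_precision` and the toy labels and scores are illustrative assumptions, and AUPRC is approximated here by average precision, which is one common estimator among several.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores receive the average of their ranks (Mann-Whitney form)."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    n_pos = sum(y for _, y in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        # Find the block of items sharing the same score.
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Average precision: the mean of the precision values obtained at the
    rank of each positive case, scanning from highest to lowest score.
    Tied scores are taken in input order (a simplification)."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    true_pos = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if y_true[k] == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


# Toy data (hypothetical): 1 = developed myopia, 0 = did not.
labels = [0, 0, 1, 1]
risk_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, risk_scores), average_precision(labels, risk_scores))
```

AUPRC is often reported alongside AUROC when the outcome is imbalanced, because precision depends on the outcome prevalence whereas AUROC does not.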
Our findings indicate that in scenarios with both missing data and ME, MI-ME combined with logistic regression yields the best prediction results. In scenarios without ME, handling missing data with MI-MAR outperforms SI regardless of the missingness mechanism. When ME affects prediction more strongly than missing data does, the relative advantage of MI-MAR diminishes and MI-ME performs better. Furthermore, in our results the statistical models predict better than the machine-learning models.
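As a rough illustration of why ME in an important predictor matters for discrimination, the toy simulation below compares the AUROC of an error-free predictor with that of the same predictor observed with additive noise. This is a sketch under stated assumptions, not the paper's simulation design: the single standard-normal predictor, the logistic outcome model, the classical additive error, and the parameter names `beta` and `me_sd` are all hypothetical choices made for the example.

```python
import math
import random


def auroc(y_true, scores):
    """Rank-based AUROC; tied scores share their average rank."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    n_pos = sum(y for _, y in pairs)
    n_neg = n - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        rank_sum_pos += (i + 1 + j) / 2.0 * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate(n=2000, beta=2.0, me_sd=1.0, seed=0):
    """Draw one data set and return (AUROC with the true predictor,
    AUROC with the error-prone predictor).

    Assumed data-generating model (illustrative only):
      x ~ N(0, 1), P(y = 1 | x) = 1 / (1 + exp(-beta * x)),
      observed w = x + e with e ~ N(0, me_sd^2)  (classical ME).
    """
    rng = random.Random(seed)
    y, x_true, x_obs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-beta * x))
        y.append(1 if rng.random() < p else 0)
        x_true.append(x)
        x_obs.append(x + rng.gauss(0.0, me_sd))
    return auroc(y, x_true), auroc(y, x_obs)


auc_true, auc_noisy = simulate()
print(f"AUROC, error-free predictor:  {auc_true:.3f}")
print(f"AUROC, error-prone predictor: {auc_noisy:.3f}")
```

Increasing `me_sd` in this toy setting should push the second AUROC further toward 0.5, which is the loss that a calibration-based approach such as MI-ME aims to reduce.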
MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.