Multi-metric comparison of machine learning imputation methods with application to breast cancer survival.

Affiliations

Mohammed VI Center For Research and Innovation, Rabat, Morocco.

International School of Public Health, Mohammed VI University of Sciences and Health, Casablanca, Morocco.

Publication information

BMC Med Res Methodol. 2024 Aug 30;24(1):191. doi: 10.1186/s12874-024-02305-3.

DOI:10.1186/s12874-024-02305-3

PMID:39215245

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11363416/

Abstract

Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we used data from a real-world breast cancer survival study.

A simulated dataset with 30% of values Missing At Random (MAR) was used. Six single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and two multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics were Gower's distance, estimation bias, empirical standard error, coverage rate, confidence interval length, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index.

In terms of Gower's distance, CART and missForest were the most accurate. missMDA and CART excelled for binary covariates, while missForest and miceCART were superior for continuous covariates. For bias and accuracy of regression estimates, miceCART and miceRF exhibited the least bias. Overall, the imputation methods were more efficient than complete-case analysis (CCA), with the MICE methods providing the best confidence interval coverage. For the predictive accuracy of Cox models, missMDA and missForest had superior AUC and C-index scores.

Although SI methods offered better predictive accuracy, they introduced more bias into the regression coefficients than MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
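Two of the model-free metrics named in the abstract, NRMSE for continuous covariates and PFC for categorical ones, can be illustrated with a minimal Python sketch. The definitions below follow the convention popularized by missForest (RMSE over the imputed entries normalized by the standard deviation of the true values, and the share of imputed categories that differ from the truth). Whether the paper uses exactly this normalization is an assumption, and the variable names and toy values are hypothetical.

```python
import math


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over imputed entries: sqrt(MSE / variance of true values)."""
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    # Population variance of the true values (assumed normalization).
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among the imputed categorical values."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Hypothetical toy data: the true values that were masked, and what an imputer returned.
true_age = [50.0, 60.0, 70.0, 80.0]
imputed_age = [52.0, 58.0, 73.0, 79.0]
true_status = ["pos", "neg", "pos", "pos"]
imputed_status = ["pos", "pos", "pos", "neg"]

print(f"NRMSE = {nrmse(true_age, imputed_age):.4f}")
print(f"PFC   = {pfc(true_status, imputed_status):.2f}")
```

Both metrics are 0 for a perfect imputation and grow as imputed values drift from the truth. They measure how well the data are reconstructed, not how well downstream regression coefficients are estimated, which is why the abstract reports them separately from bias and coverage.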

Similar articles

Generative adversarial networks for imputing missing data for big data clinical research.

BMC Med Res Methodol. 2021 Apr 20;21(1):78. doi: 10.1186/s12874-021-01272-3.

Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer.

Br J Cancer. 2011 Feb 15;104(4):693-9. doi: 10.1038/sj.bjc.6606078. Epub 2011 Jan 25.

The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.

Front Public Health. 2021 Jul 5;9:680054. doi: 10.3389/fpubh.2021.680054. eCollection 2021.

Does the Presence of Missing Data Affect the Performance of the SORG Machine-learning Algorithm for Patients With Spinal Metastasis? Development of an Internet Application Algorithm.

Clin Orthop Relat Res. 2024 Jan 1;482(1):143-157. doi: 10.1097/CORR.0000000000002706. Epub 2023 Jun 12.

Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study.

BMC Med Res Methodol. 2010 Jan 19;10:7. doi: 10.1186/1471-2288-10-7.

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

BMC Med Res Methodol. 2024 Feb 16;24(1):41. doi: 10.1186/s12874-024-02173-x.

missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data.

Genes Genomics. 2022 Jun;44(6):651-658. doi: 10.1007/s13258-022-01247-8. Epub 2022 Apr 6.

Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study.

BMC Med Res Methodol. 2010 Dec 31;10:112. doi: 10.1186/1471-2288-10-112.

The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study.

J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24.

Cited by

Development and validation of a multidimensional predictive model for 28-day mortality in ICU patients with bloodstream infections: a cohort study.

Front Cell Infect Microbiol. 2025 Jul 7;15:1569748. doi: 10.3389/fcimb.2025.1569748. eCollection 2025.

Integrating Clinical and Transcriptomic Profiles Associated with Vitamin D to Enhance Disease-Free Survival in Cervical Cancer Recurrence Using the CatBoost Algorithm.

Diagnostics (Basel). 2025 Jun 21;15(13):1579. doi: 10.3390/diagnostics15131579.
