Addressing Measurement Error in Random Forests Using Quantitative Bias Analysis.

Publication information

Am J Epidemiol. 2021 Sep 1;190(9):1830-1840. doi: 10.1093/aje/kwab010.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8408353/

Abstract

Although variables are often measured with error, the impact of measurement error on machine-learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on the performance of random-forest models and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random-forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the National Comorbidity Survey Replication (2001-2003). Second, we created simulated data sets in which we knew the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the data sets. Our findings showed that measurement error in the data used to construct random forests can distort model performance and variable importance measures and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
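
As a rough illustration of what a quantitative bias analysis for misclassification can look like, the sketch below uses the standard back-calculation of expected true counts from observed counts of a binary variable, given assumed sensitivity and specificity. It is not the authors' procedure for random forests, which the abstract does not detail. The function names and the sensitivity and specificity values are hypothetical and chosen only for this example.

```python
# Minimal sketch of a simple bias analysis for a misclassified binary variable.
# Assumption: sensitivity and specificity are known (or taken from validation
# data) and misclassification is nondifferential. All names are illustrative.


def expected_observed_positives(true_pos, total, sensitivity, specificity):
    """Expected number of records classified positive under misclassification."""
    true_neg = total - true_pos
    # True positives correctly flagged plus true negatives wrongly flagged.
    return sensitivity * true_pos + (1.0 - specificity) * true_neg


def expected_true_positives(observed_pos, total, sensitivity, specificity):
    """Back-calculate the expected number of truly positive records."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        # The classification is no better than chance; the correction is undefined.
        raise ValueError("sensitivity + specificity must exceed 1")
    # Subtract the false positives expected if every record were truly negative,
    # then rescale by the discriminating power of the measurement.
    return (observed_pos - (1.0 - specificity) * total) / denom


if __name__ == "__main__":
    total, true_pos = 1000, 300          # hypothetical data set
    se, sp = 0.8, 0.9                    # assumed bias parameters
    observed = expected_observed_positives(true_pos, total, se, sp)
    corrected = expected_true_positives(observed, total, se, sp)
    print(f"observed positives:  {observed:.1f}")
    print(f"corrected positives: {corrected:.1f}")
```

In a fuller analysis, corrected counts or record-level reclassification based on such bias parameters would feed a refitted model, so that performance and variable importance can be compared between the observed and bias-adjusted data.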
