A framework for understanding label leakage in machine learning for health care.

Affiliations

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, United States.

Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States.

Publication information

J Am Med Inform Assoc. 2023 Dec 22;31(1):274-280. doi: 10.1093/jamia/ocad178.

DOI:10.1093/jamia/ocad178

PMID:37669138

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10746313/

Abstract

INTRODUCTION

The pitfalls of label leakage, the contamination of model input features with outcome information, are well established. Unfortunately, avoiding label leakage in clinical prediction models requires more nuance than the common advice to apply the "no time machine" rule.
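
As a point of reference, the "no time machine" rule is usually read as: a model may only see data recorded at or before the moment the prediction is made. The following is a minimal sketch of that reading, not code from the article. The `Observation` record, the `features_available_at` helper, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One timestamped feature value (hypothetical record layout)."""
    name: str
    value: float
    recorded_at: datetime


def features_available_at(observations, prediction_time):
    """Keep only observations recorded at or before the prediction time.

    This is the basic "no time machine" filter: anything charted after the
    moment the model would run is excluded from its inputs.
    """
    return [obs for obs in observations if obs.recorded_at <= prediction_time]


# Illustrative data only; names, values, and times are made up.
observations = [
    Observation("heart_rate", 112.0, datetime(2023, 1, 1, 8, 0)),
    Observation("lactate", 4.1, datetime(2023, 1, 1, 9, 30)),
    # Charted after the prediction time, so it must not reach the model.
    Observation("vasopressor_order", 1.0, datetime(2023, 1, 1, 14, 0)),
]
prediction_time = datetime(2023, 1, 1, 10, 0)

usable = features_available_at(observations, prediction_time)
usable_names = [obs.name for obs in usable]
```

Here `usable_names` contains only the two observations charted before 10:00. As the abstract argues, a timestamp filter of this kind is a starting point rather than a complete safeguard.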

FRAMEWORK

We provide a framework for contemplating whether and when model features pose leakage concerns by considering the cadence, perspective, and applicability of predictions. To ground these concepts, we use real-world clinical models to highlight examples of appropriate and inappropriate label leakage in practice.

RECOMMENDATIONS

Finally, we provide recommendations to support clinical and technical stakeholders as they evaluate the leakage tradeoffs associated with model design, development, and implementation decisions. By providing common language and dimensions to consider when designing models, we hope the clinical prediction community will be better prepared to develop statistically valid and clinically useful machine learning models.

Similar articles

A framework for understanding label leakage in machine learning for health care.

J Am Med Inform Assoc. 2023 Dec 22;31(1):274-280. doi: 10.1093/jamia/ocad178.

The future of Cochrane Neonatal.

Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.

Confound-leakage: confound removal in machine learning leads to leakage.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad071. Epub 2023 Sep 30.

A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques.

J Healthc Eng. 2022 Jan 11;2022:1684017. doi: 10.1155/2022/1684017. eCollection 2022.

A qualitative research framework for the design of user-centered displays of explanations for machine learning model predictions in healthcare.

BMC Med Inform Decis Mak. 2020 Oct 8;20(1):257. doi: 10.1186/s12911-020-01276-x.

An Interpretable Longitudinal Preeclampsia Risk Prediction Using Machine Learning.

medRxiv. 2023 Aug 16:2023.08.16.23293946. doi: 10.1101/2023.08.16.23293946.

Implementing Machine Learning Models for Suicide Risk Prediction in Clinical Practice: Focus Group Study With Hospital Providers.

JMIR Form Res. 2022 Mar 11;6(3):e30946. doi: 10.2196/30946.

Must-have Qualities of Clinical Research on Artificial Intelligence and Machine Learning.

Balkan Med J. 2023 Jan 23;40(1):3-12. doi: 10.4274/balkanmedj.galenos.2022.2022-11-51. Epub 2022 Dec 29.

Distilling the knowledge from large-language model for health event prediction.

Sci Rep. 2024 Dec 28;14(1):30675. doi: 10.1038/s41598-024-75331-2.

Probabilistic Machine Learning for Healthcare.

Annu Rev Biomed Data Sci. 2021 Jul 20;4:393-415. doi: 10.1146/annurev-biodatasci-092820-033938. Epub 2021 Jun 1.

Cited by

Identifying and Predicting Cognitive Decline Using Multi-Modal Sensor Data and Machine Learning Approach.

Res Sq. 2025 Jun 18:rs.3.rs-6735622. doi: 10.21203/rs.3.rs-6735622/v1.

Determining the ground truth for the prediction of delirium in adult patients in acute care: a scoping review.

JAMIA Open. 2025 May 26;8(3):ooaf037. doi: 10.1093/jamiaopen/ooaf037. eCollection 2025 Jun.

A roadmap to implementing machine learning in healthcare: from concept to practice.

Front Digit Health. 2025 Jan 20;7:1462751. doi: 10.3389/fdgth.2025.1462751. eCollection 2025.

Multimodal Deep Learning for Differentiating Bacterial and Fungal Keratitis Using Prospective Representative Data.

Ophthalmol Sci. 2024 Nov 29;5(2):100665. doi: 10.1016/j.xops.2024.100665. eCollection 2025 Mar-Apr.

Why do probabilistic clinical models fail to transport between sites.

NPJ Digit Med. 2024 Mar 1;7(1):53. doi: 10.1038/s41746-024-01037-4.
