Does Cohort Selection Affect Machine Learning from Clinical Data?

Authors

Haghighathoseini Atefehsadat, Wojtusiak Janusz, Min Hua, Leslie Timothy, Frankenfeld Cara, Menon Nirup M

Affiliations

George Mason University, Fairfax, VA, USA.

MaineHealth Institute for Research, Scarborough, ME, USA.

Publication

AMIA Annu Symp Proc. 2025 May 22;2024:473-482. eCollection 2024.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12099332/

Abstract

This study investigates cohort selection and its effects on the quality of machine learning (ML) models trained on clinical data, focusing on measurements taken within the first 48 hours of hospital admission. It discusses the potential repercussions of making arbitrary decisions during data processing prior to applying ML methods. Experiments are performed within the framework of the National COVID Cohort Collaborative (N3C) dataset. The research aims to unravel biases and assess the fairness of machine learning models used to predict outcomes for hospitalized patients. Detailed discussions cover the data, decision-making processes, and the resulting impact on model predictions regarding patient outcomes. An experiment is conducted in which four arbitrary decisions are made, resulting in 16 distinct datasets characterized by varying sizes and properties. The findings demonstrate significant differences in the obtained datasets and indicate a high potential for bias based on inclusion or exclusion decisions. The results also confirm significant differences in the performance of models constructed on different cohorts, especially when cross-compared between ones based on different inclusion criteria. The study specifically chose to analyze gender, race, and ethnicity as these social determinants of health played a significant role in COVID-19 outcomes.
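As an illustration of how four decisions can yield 16 datasets, the sketch below enumerates every combination of four include/exclude choices (2^4 = 16) and counts the ordered train/test pairs available for cross-comparison. It assumes each decision is binary, which is consistent with the count of 16 but is not stated in the abstract. The decision names are hypothetical placeholders, not the criteria used in the study.

```python
from itertools import product

# Hypothetical placeholder names: the abstract does not list the four decisions.
DECISIONS = ["decision_a", "decision_b", "decision_c", "decision_d"]


def enumerate_cohort_definitions(decisions):
    """Return one dict per combination of binary choices (assumed binary)."""
    return [
        dict(zip(decisions, choices))
        for choices in product([False, True], repeat=len(decisions))
    ]


cohort_definitions = enumerate_cohort_definitions(DECISIONS)

# Ordered (train cohort, test cohort) index pairs, excluding same-cohort pairs,
# as one possible way to organize a cross-comparison.
cross_pairs = [
    (train_idx, test_idx)
    for train_idx in range(len(cohort_definitions))
    for test_idx in range(len(cohort_definitions))
    if train_idx != test_idx
]

print(len(cohort_definitions))  # number of distinct cohort definitions
print(len(cross_pairs))         # number of cross-cohort train/test pairs
```

Each dictionary could then drive a filtering step over the raw records, so that every cohort is built by the same code path and differs only in the recorded choices.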

Similar articles

1

Does Cohort Selection Affect Machine Learning from Clinical Data?

AMIA Annu Symp Proc. 2025 May 22;2024:473-482. eCollection 2024.

2

Clinical Characterization and Prediction of Clinical Severity of SARS-CoV-2 Infection Among US Adults Using Data From the US National COVID Cohort Collaborative.

JAMA Netw Open. 2021 Jul 1;4(7):e2116901. doi: 10.1001/jamanetworkopen.2021.16901.

3

COVID-Net Biochem: an explainability-driven framework to building machine learning models for predicting survival and kidney injury of COVID-19 patients from clinical and biochemistry data.

Sci Rep. 2023 Oct 9;13(1):17001. doi: 10.1038/s41598-023-42203-0.

4

Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation.

J Med Internet Res. 2020 Nov 6;22(11):e24018. doi: 10.2196/24018.

5

Learning From Past Respiratory Infections to Predict COVID-19 Outcomes: Retrospective Study.

J Med Internet Res. 2021 Feb 22;23(2):e23026. doi: 10.2196/23026.

6

Machine learning algorithms for predicting COVID-19 mortality in Ethiopia.

BMC Public Health. 2024 Jun 28;24(1):1728. doi: 10.1186/s12889-024-19196-0.

7

The Development and Validation of Simplified Machine Learning Algorithms to Predict Prognosis of Hospitalized Patients With COVID-19: Multicenter, Retrospective Study.

J Med Internet Res. 2022 Jan 21;24(1):e31549. doi: 10.2196/31549.

8

Editorial: The National COVID Cohort Collaborative Consortium Combines Population Data with Machine Learning to Evaluate and Predict Risk Factors for the Severity of COVID-19.

Med Sci Monit. 2021 Aug 2;27:e934171. doi: 10.12659/MSM.934171.

9

Crowd-sourced machine learning prediction of long COVID using data from the National COVID Cohort Collaborative.

EBioMedicine. 2024 Oct;108:105333. doi: 10.1016/j.ebiom.2024.105333. Epub 2024 Sep 24.

10

A Machine Learning Approach for Mortality Prediction in COVID-19 Pneumonia: Development and Evaluation of the Piacenza Score.

J Med Internet Res. 2021 May 31;23(5):e29058. doi: 10.2196/29058.
