A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data.

Author information

Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK.

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Publication information

Comput Methods Programs Biomed. 2021 Nov;211:106394. doi: 10.1016/j.cmpb.2021.106394. Epub 2021 Sep 6.

DOI:10.1016/j.cmpb.2021.106394

PMID:34560604

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8420135/

Abstract

BACKGROUND AND OBJECTIVE

In response to the ongoing COVID-19 pandemic, several prediction models were rapidly developed with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models has been found to be reliable. The models are commonly assessed as being at risk of bias, often because of insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).

METHODS

We show step by step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?' We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations, and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
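As an illustration of how the prediction question above can be turned into outcome labels, the following minimal sketch marks each hospitalization in a target cohort as having the outcome (death) or not within a time-at-risk of 0 to 30 days after the index date. It is not the OHDSI implementation: the `Hospitalization` record layout, the function name `died_in_time_at_risk`, and the example records are all hypothetical, and the sketch ignores complications such as patients lost to follow-up before day 30.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Hospitalization:
    """One member of the target cohort (hypothetical record layout)."""
    person_id: int
    admission_date: date          # index date: start of the COVID-19 hospitalization
    death_date: Optional[date]    # None if no death is recorded


def died_in_time_at_risk(h: Hospitalization, start_day: int = 0, end_day: int = 30) -> bool:
    """Return True if death falls within [index + start_day, index + end_day], inclusive."""
    if h.death_date is None:
        return False
    window_start = h.admission_date + timedelta(days=start_day)
    window_end = h.admission_date + timedelta(days=end_day)
    return window_start <= h.death_date <= window_end


# Synthetic example records, not real patient data.
cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 20)),   # death on day 19
    Hospitalization(2, date(2020, 4, 1), None),                # no recorded death
    Hospitalization(3, date(2020, 4, 1), date(2020, 6, 15)),   # death after day 30
    Hospitalization(4, date(2020, 5, 10), date(2020, 6, 9)),   # death on day 30
]

# Map each person to a binary outcome label for model development.
labels = {h.person_id: died_in_time_at_risk(h) for h in cohort}
print(labels)
```

Making the target cohort, outcome, and time-at-risk explicit in this way is what lets the same question be posed identically against each external validation database.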

RESULTS

Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
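To make the two evaluation notions above concrete, the sketch below computes a discrimination measure (area under the ROC curve, AUROC) and a simple overall calibration check (mean predicted risk minus observed outcome rate) for one set of predictions on an internal and an external validation set. This is an assumption-laden illustration, not the metrics code used in the study: the function names `auroc` and `mean_calibration_gap` are hypothetical, the labels and predicted risks are synthetic, and the pairwise AUROC loop is only practical for small samples.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative outcome")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_calibration_gap(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean predicted risk minus observed outcome rate; positive means over-prediction."""
    return sum(y_score) / len(y_score) - sum(y_true) / len(y_true)


# Synthetic labels (1 = died within 30 days) and predicted risks, not study data.
internal_y = [1, 0, 1, 0, 0, 1]
internal_p = [0.8, 0.2, 0.3, 0.4, 0.1, 0.7]
external_y = [1, 0, 0, 1, 0, 0]
external_p = [0.5, 0.5, 0.2, 0.9, 0.6, 0.1]

internal_auc = auroc(internal_y, internal_p)
external_auc = auroc(external_y, external_p)
internal_gap = mean_calibration_gap(internal_y, internal_p)
external_gap = mean_calibration_gap(external_y, external_p)

print(f"internal: AUROC={internal_auc:.3f}, calibration gap={internal_gap:+.3f}")
print(f"external: AUROC={external_auc:.3f}, calibration gap={external_gap:+.3f}")
```

Reporting both numbers for every database matters because a model can rank patients well (high AUROC) while still systematically over- or under-estimating risk in a new population.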

CONCLUSION

Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable the rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.

Similar articles

Development and Validation of a Robust and Interpretable Early Triaging Support System for Patients Hospitalized With COVID-19: Predictive Algorithm Modeling and Interpretation Study.

J Med Internet Res. 2024 Jan 11;26:e52134. doi: 10.2196/52134.

Leveraging artificial intelligence and data science techniques in harmonizing, sharing, accessing and analyzing SARS-COV-2/COVID-19 data in Rwanda (LAISDAR Project): study design and rationale.

BMC Med Inform Decis Mak. 2022 Aug 12;22(1):214. doi: 10.1186/s12911-022-01965-9.

Implementation of the COVID-19 Vulnerability Index Across an International Network of Health Care Data Sets: Collaborative External Validation Study.

JMIR Med Inform. 2021 Apr 5;9(4):e21547. doi: 10.2196/21547.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

The Development and Validation of Simplified Machine Learning Algorithms to Predict Prognosis of Hospitalized Patients With COVID-19: Multicenter, Retrospective Study.

J Med Internet Res. 2022 Jan 21;24(1):e31549. doi: 10.2196/31549.

Machine learning algorithms for predicting COVID-19 mortality in Ethiopia.

BMC Public Health. 2024 Jun 28;24(1):1728. doi: 10.1186/s12889-024-19196-0.

Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation.

J Med Internet Res. 2020 Nov 6;22(11):e24018. doi: 10.2196/24018.

Developing and Validating Multi-Modal Models for Mortality Prediction in COVID-19 Patients: a Multi-center Retrospective Study.

J Digit Imaging. 2022 Dec;35(6):1514-1529. doi: 10.1007/s10278-022-00674-z. Epub 2022 Jul 5.

Development and evaluation of a machine learning-based in-hospital COVID-19 disease outcome predictor (CODOP): A multicontinental retrospective study.

Elife. 2022 May 17;11:e75985. doi: 10.7554/eLife.75985.

Cited by

Data Interoperability in Context: The Importance of Open-Source Implementations When Choosing Open Standards.

J Med Internet Res. 2025 Apr 15;27:e66616. doi: 10.2196/66616.

Conversion of Sensitive Data to the Observational Medical Outcomes Partnership Common Data Model: Protocol for the Development and Use of Carrot.

JMIR Res Protoc. 2025 Apr 2;14:e60917. doi: 10.2196/60917.

Big data analytics and machine learning in hematology: Transformative insights, applications and challenges.

Medicine (Baltimore). 2025 Mar 7;104(10):e41766. doi: 10.1097/MD.0000000000041766.

ETL: From the German Health Data Lab data formats to the OMOP Common Data Model.

PLoS One. 2025 Jan 6;20(1):e0311511. doi: 10.1371/journal.pone.0311511. eCollection 2025.

Hospital Length of Stay Prediction for Planned Admissions Using Observational Medical Outcomes Partnership Common Data Model: Retrospective Study.

J Med Internet Res. 2024 Nov 22;26:e59260. doi: 10.2196/59260.

Musculoskeletal Disorder (MSD) Health Data Collection, Personalized Management and Exchange Using Fast Healthcare Interoperability Resources (FHIR).

Sensors (Basel). 2024 Aug 10;24(16):5175. doi: 10.3390/s24165175.

Development and validation of a patient-level model to predict dementia across a network of observational databases.

BMC Med. 2024 Jul 29;22(1):308. doi: 10.1186/s12916-024-03530-9.

Comparing penalization methods for linear models on large observational health data.

J Am Med Inform Assoc. 2024 Jun 20;31(7):1514-1521. doi: 10.1093/jamia/ocae109.

Towards global model generalizability: independent cross-site feature evaluation for patient-level risk prediction models using the OHDSI network.

J Am Med Inform Assoc. 2024 Apr 19;31(5):1051-1061. doi: 10.1093/jamia/ocae028.

Development and Validation of a Prognostic Classification Model Predicting Postoperative Adverse Outcomes in Older Surgical Patients Using a Machine Learning Algorithm: Retrospective Observational Network Study.

J Med Internet Res. 2023 Nov 13;25:e42259. doi: 10.2196/42259.
