Sharafutdinov Konstantin, Bhat Jayesh S, Fritsch Sebastian Johannes, Nikulina Kateryna, Samadi Moein E, Polzin Richard, Mayer Hannah, Marx Gernot, Bickenbach Johannes, Schuppert Andreas
Institute for Computational Biomedicine, RWTH Aachen University, Aachen, Germany.
Joint Research Center for Computational Biomedicine, RWTH Aachen University, Aachen, Germany.
Front Big Data. 2022 Oct 31;5:603429. doi: 10.3389/fdata.2022.603429. eCollection 2022.
Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data, the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals, such that sampling representative learning datasets to train ML models remains a challenge. As ML models exhibit poor predictive performance over data ranges sparsely covered or not covered by the learning dataset, in this study we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find the mean CH coverage between each pair of datasets, yielding an upper bound on the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital it comes) and to estimate differences between datasets with respect to their underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another. We show that the strongest drops in performance were associated with poor intersection of the convex hulls of the corresponding hospitals' datasets and with high performance of the ML methods in discriminating between the datasets. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. We emphasize that datasets from different hospitals represent heterogeneous data sources, and transfer from one database to another should be performed with utmost care to avoid adverse effects during real-world application of the developed models. Further research is needed to develop methods for adapting ML models to new hospitals. In addition, more work should be aimed at creating gold-standard datasets that are large and diverse, with data from varied application sites.
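A minimal sketch of the first step (CH overlap) is given below, assuming the feature space has already been reduced to a few dimensions (e.g., via PCA), since exact hull-membership tests become infeasible in high dimensions. The function names, the symmetric averaging, and the synthetic cohorts are illustrative assumptions, not the authors' code.

```python
# Sketch of the convex-hull (CH) coverage step on low-dimensional data.
import numpy as np
from scipy.spatial import Delaunay

def ch_coverage(train: np.ndarray, test: np.ndarray) -> float:
    """Fraction of `test` points lying inside the convex hull of `train`."""
    hull = Delaunay(train)                 # triangulate the training point cloud
    inside = hull.find_simplex(test) >= 0  # find_simplex returns -1 outside the hull
    return float(inside.mean())

def mean_ch_overlap(a: np.ndarray, b: np.ndarray) -> float:
    # Mean of the two directed coverages, read here as a rough upper bound
    # on the range over which a model trained on one hospital's data can be
    # expected to interpolate on the other's.
    return 0.5 * (ch_coverage(a, b) + ch_coverage(b, a))

rng = np.random.default_rng(0)
hospital_a = rng.normal(0.0, 1.0, size=(500, 3))  # synthetic stand-in, hospital A
hospital_b = rng.normal(0.5, 1.2, size=(500, 3))  # shifted cohort, hospital B
print(f"mean CH overlap: {mean_ch_overlap(hospital_a, hospital_b):.2f}")
```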
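The second step can be sketched as a dataset-origin classification task: models are trained to predict which hospital a record comes from, and cross-validated AUC is read as a measure of distributional difference (AUC near 0.5 suggests similar distributions; AUC near 1.0 flags a strong dataset shift). The two model types and the synthetic cohorts below are stand-ins for the 4 model types and the clinical datasets used in the study.

```python
# Sketch of the dataset-origin classification step on synthetic cohorts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 10)),   # cohort from hospital A
               rng.normal(0.3, 1.1, (500, 10))])  # shifted cohort from hospital B
y = np.repeat([0, 1], 500)                        # origin label: 0 = A, 1 = B

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{type(model).__name__}: origin-classification AUC = {auc:.2f}")
```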