Sharafutdinov Konstantin, Bhat Jayesh S, Fritsch Sebastian Johannes, Nikulina Kateryna, Samadi Moein E, Polzin Richard, Mayer Hannah, Marx Gernot, Bickenbach Johannes, Schuppert Andreas
Institute for Computational Biomedicine, RWTH Aachen University, Aachen, Germany.
Joint Research Center for Computational Biomedicine, RWTH Aachen University, Aachen, Germany.
Front Big Data. 2022 Oct 31;5:603429. doi: 10.3389/fdata.2022.603429. eCollection 2022.
Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data, the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals, such that sampling representative learning datasets to train ML models remains a challenge. As ML models exhibit poor predictive performance over data ranges sparsely covered or not covered by the learning dataset, in this study we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find the mean CH coverage between each pair of datasets, yielding an upper bound on the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital it comes) and to estimate differences between datasets with respect to their underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another. We show that the strongest drops in performance were associated with poor intersection of the convex hulls of the corresponding hospitals' datasets and with high performance of the ML methods in discriminating between the datasets. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. We emphasize that datasets from different hospitals represent heterogeneous data sources, and transfer from one database to another should be performed with utmost care to avoid adverse effects during real-world application of the developed models. Further research is needed to develop methods for adapting ML models to new hospitals. In addition, more work should be aimed at creating gold-standard datasets that are large and diverse, with data from varied application sites.
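A minimal sketch of the first step (CH overlap) is given below, assuming the feature space has already been reduced to a few dimensions (e.g., via PCA), since exact hull-membership tests become infeasible in high dimensions. The function names, the symmetric averaging, and the synthetic cohorts are illustrative assumptions, not the authors' code.

```python
# Sketch of the convex-hull (CH) coverage step on low-dimensional data.
import numpy as np
from scipy.spatial import Delaunay

def ch_coverage(train: np.ndarray, test: np.ndarray) -> float:
    """Fraction of `test` points lying inside the convex hull of `train`."""
    hull = Delaunay(train)                 # triangulate the training point cloud
    inside = hull.find_simplex(test) >= 0  # find_simplex returns -1 outside the hull
    return float(inside.mean())

def mean_ch_overlap(a: np.ndarray, b: np.ndarray) -> float:
    # Mean of the two directed coverages, read here as a rough upper bound
    # on the range over which a model trained on one hospital's data can be
    # expected to interpolate on the other's.
    return 0.5 * (ch_coverage(a, b) + ch_coverage(b, a))

rng = np.random.default_rng(0)
hospital_a = rng.normal(0.0, 1.0, size=(500, 3))  # synthetic stand-in, hospital A
hospital_b = rng.normal(0.5, 1.2, size=(500, 3))  # shifted cohort, hospital B
print(f"mean CH overlap: {mean_ch_overlap(hospital_a, hospital_b):.2f}")
```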
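The second step can be sketched as a dataset-origin classification task: models are trained to predict which hospital a record comes from, and cross-validated AUC is read as a measure of distributional difference (AUC near 0.5 suggests similar distributions; AUC near 1.0 flags a strong dataset shift). The two model types and the synthetic cohorts below are stand-ins for the 4 model types and the clinical datasets used in the study.

```python
# Sketch of the dataset-origin classification step on synthetic cohorts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 10)),   # cohort from hospital A
               rng.normal(0.3, 1.1, (500, 10))])  # shifted cohort from hospital B
y = np.repeat([0, 1], 500)                        # origin label: 0 = A, 1 = B

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{type(model).__name__}: origin-classification AUC = {auc:.2f}")
```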