Weight-Based Framework for Predictive Modeling of Multiple Databases With Noniterative Communication Without Data Sharing: Privacy-Protecting Analytic Method for Multi-Institutional Studies.

Authors

Park Ji Ae, Sung Min Dong, Kim Ho Heon, Park Yu Rang

Affiliation

Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea.

Publication

JMIR Med Inform. 2021 Apr 5;9(4):e21043. doi: 10.2196/21043.

Abstract

BACKGROUND

Securing the representativeness of study populations is crucial in biomedical research to ensure high generalizability. In this regard, using multi-institutional data has advantages in medicine. However, physically combining data is difficult because the confidential nature of biomedical data raises privacy issues. Therefore, a methodological approach is needed to develop a model from multi-institutional medical data without sharing data between institutions.

OBJECTIVE

This study aims to develop a weight-based integrated predictive model of multi-institutional data, which does not require iterative communication between institutions, to improve average predictive performance by increasing the generalizability of the model under privacy-preserving conditions without sharing patient-level data.

METHODS

The weight-based integrated model generates a weight for each institutional model and builds an integrated model for multi-institutional data based on these weights. We performed 3 simulations to show the weight characteristics and to determine the number of repetitions of the weight required to obtain stable values. We also conducted an experiment using real multi-institutional data to verify the developed weight-based integrated model. We selected 10 hospitals (2845 intensive care unit [ICU] stays in total) from the electronic intensive care unit Collaborative Research Database to predict ICU mortality with 11 features. To evaluate the validity of our model against a centralized model, which was developed by combining all the data of the 10 hospitals, we used the proportional overlap of CIs (ie, a value of 0.5 or less indicates a significant difference at a level of .05, and a value of 2 indicates that the 2 CIs overlap completely). Standard and Firth logistic regression models were applied for the 2 simulations and the experiment.
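The integration step can be illustrated with a minimal sketch. It assumes the integrated model is a normalized weighted average of per-institution logistic regression coefficients. The abstract does not describe how the weights are generated, so the weights, the coefficients, and the function names below are hypothetical placeholders, not the authors' procedure.

```python
import math

def integrate_coefficients(site_coefs, site_weights):
    """Return the weighted average of per-institution coefficient vectors.

    site_coefs: list of coefficient lists, one per institution
                (intercept first, then one value per feature).
    site_weights: one non-negative weight per institution (hypothetical here).
    """
    if len(site_coefs) != len(site_weights):
        raise ValueError("one weight is needed per institutional model")
    total = sum(site_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in site_weights]
    n_terms = len(site_coefs[0])
    return [
        sum(w * coefs[j] for w, coefs in zip(normalized, site_coefs))
        for j in range(n_terms)
    ]

def predict_probability(coefs, features):
    """Logistic prediction from an intercept-first coefficient vector."""
    z = coefs[0] + sum(b * x for b, x in zip(coefs[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients for 3 institutions (intercept, 1 feature);
# only these vectors, not patient-level data, would leave each site.
example_coefs = [[-2.0, 0.5], [-1.0, 0.3], [-1.5, 0.4]]
example_weights = [0.5, 0.2, 0.3]  # placeholder weights

integrated_coefs = integrate_coefficients(example_coefs, example_weights)
risk = predict_probability(integrated_coefs, [1.0])
```

In this sketch each institution sends its fitted coefficients once, which matches the noniterative communication described in the abstract; how the weights themselves are derived is left open.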

RESULTS

The results of these simulations indicate that the weight of each institution is determined by 2 factors (ie, the data size of each institution and how well each institutional model fits into the overall institutional data) and that the weight must be generated 200 times per institution to obtain stable values. In the experiment, the estimated area under the receiver operating characteristic curve (AUC) and 95% CIs were 81.36% (79.37%-83.36%) and 81.95% (80.03%-83.87%) in the centralized model and weight-based integrated model, respectively. The proportional overlap of the CIs for AUC between the weight-based integrated model and the centralized model was approximately 1.70, and the proportional overlap for the 11 estimated odds ratios was over 1, except for 1 case.
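The reported overlap of approximately 1.70 can be reproduced from the two AUC intervals with a short sketch. It assumes proportional overlap is the length of the overlap between the 2 CIs divided by their average margin of error (half-width), a definition under which 2 corresponds to complete overlap of equal-width intervals. The function name is illustrative.

```python
def proportional_overlap(ci_a, ci_b):
    """Overlap of two CIs as a proportion of their average margin of error.

    Each CI is a (lower, upper) tuple. A negative result means the
    intervals are separated by a gap.
    """
    lower_a, upper_a = ci_a
    lower_b, upper_b = ci_b
    overlap = min(upper_a, upper_b) - max(lower_a, lower_b)
    average_margin = ((upper_a - lower_a) / 2 + (upper_b - lower_b) / 2) / 2
    return overlap / average_margin

# 95% CIs for AUC (%) as reported in the abstract
centralized_ci = (79.37, 83.36)
weight_based_ci = (80.03, 83.87)

auc_overlap = proportional_overlap(centralized_ci, weight_based_ci)
print(f"{auc_overlap:.2f}")  # prints 1.70
```

Here the intervals share 83.36 - 80.03 = 3.33 percentage points and the average margin of error is (1.995 + 1.92) / 2 = 1.9575, giving 3.33 / 1.9575 ≈ 1.70.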

CONCLUSIONS

In the experiment using real multi-institutional data, our model showed results similar to those of the centralized model without iterative communication between institutions. In addition, our weight-based integrated model provided a weighted average model by integrating 10 institutional models that were overfitted or underfitted compared with the centralized model. The proposed weight-based integrated model is expected to provide an efficient distributed research approach because it increases the generalizability of the model and does not require iterative communication.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/253e/8056295/d3b43bf59340/medinform_v9i4e21043_fig1.jpg
