Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
Department of Psychiatry, Harvard Medical School, Boston, MA, United States; Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States.
J Biomed Inform. 2023 Jan;137:104243. doi: 10.1016/j.jbi.2022.104243. Epub 2022 Nov 18.
We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites.
We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When some sites cannot share individual-level data, COMMUTE requests only the locally trained models from those sites and uses them to generate heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity.
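The calibration and synthetic-data steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes logistic source models, uses a ridge (squared L2) penalty as one possible distance-based regularization, and draws synthetic covariates from a standard normal distribution. All function names (`fit_calibration`, `synthesize`, `fit_logistic`) and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_calibration(X, y, w_source, lam=1.0, lr=0.1, n_iter=2000):
    """Learn a calibration term delta on target data by gradient descent.

    Minimizes mean logistic loss of X @ (w_source + delta) plus
    (lam / 2) * ||delta||^2. The ridge penalty is an assumption standing in
    for the paper's "distance-based regularizations".
    """
    delta = np.zeros_like(w_source)
    n = X.shape[0]
    for _ in range(n_iter):
        p = sigmoid(X @ (w_source + delta))
        grad = X.T @ (p - y) / n + lam * delta
        delta -= lr * grad
    return delta

def synthesize(w_calibrated, n, rng):
    """Draw synthetic (X, y) from a calibrated model.

    Assumption: covariates are standard normal; a real application would
    need a covariate distribution that resembles the target population.
    """
    X = rng.standard_normal((n, w_calibrated.shape[0]))
    y = rng.binomial(1, sigmoid(X @ w_calibrated))
    return X, y

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Unpenalized logistic regression, reusing the routine above."""
    return fit_calibration(X, y, np.zeros(X.shape[1]), lam=0.0, lr=lr, n_iter=n_iter)

rng = np.random.default_rng(0)
w_target = np.array([1.0, -1.0, 0.5])             # hypothetical true target model
w_source = w_target + np.array([0.2, 0.0, -0.2])  # source model shared once by a site

# Small target sample, as in the limited-sample setting of the abstract.
X_t = rng.standard_normal((100, 3))
y_t = rng.binomial(1, sigmoid(X_t @ w_target))

# Step 1: calibrate the source model toward the target population.
delta = fit_calibration(X_t, y_t, w_source, lam=1.0)
w_cal = w_source + delta

# Step 2: generate heterogeneity-adjusted synthetic data from the calibrated model.
X_s, y_s = synthesize(w_cal, 1000, rng)

# Step 3: train the target model on target plus synthetic data.
w_hat = fit_logistic(np.vstack([X_t, X_s]), np.concatenate([y_t, y_s]))
```

In this sketch a larger `lam` keeps the calibrated model close to the source model, while a smaller `lam` lets the target data dominate. Only `w_source` crosses site boundaries, which mirrors the one-time sharing of parameter estimates.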
Simulation studies show that COMMUTE outperforms methods that do not adjust for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) of around 0.80, outperforming other benchmark methods, whose AUCs range from 0.51 to 0.70.
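For reference, the AUC reported above equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The following is a small sketch of that pairwise definition; the function name `auc` and the example labels and scores are illustrative and are not taken from the study.

```python
import numpy as np

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC for binary labels coded 0/1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every case score with every control score.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75: three of the four case-control pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, so the benchmark value of 0.51 is close to chance.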
COMMUTE improves risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations differ substantially from the target. In a federated setting, it is highly communication efficient: each site shares its model parameter estimates only once, and no iterative communication or higher-order terms are needed.