Kragh Jørgensen Rasmus Rask, Jensen Jonas Faartoft, El-Galaly Tarec, Bøgsted Martin, Brøndum Rasmus Froberg, Simonsen Mikkel Runason, Jakobsen Lasse Hjort
Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark.
Center for Clinical Data Science, Aalborg University and Aalborg University Hospital, Aalborg, Denmark.
BMC Med Res Methodol. 2025 May 26;25(1):143. doi: 10.1186/s12874-025-02598-y.
In a wide range of diseases, it is necessary to utilize multiple data sources to obtain enough data for model training. However, performing centralized pooling of multiple data sources, while protecting each patient's sensitive data, can require a cumbersome process involving many institutional bodies. Alternatively, federated learning (FL) can be utilized to train models based on data located at multiple sites.
We propose two methods, both relying on FL algorithms, for training time-to-event prediction models based on distributed data. Both approaches incorporate steps to allow prediction of individual-level survival curves without exposing individual-level event times. For Cox proportional hazards models, this is accomplished by using a kernel smoother for the baseline hazard function. The other proposed methodology is based on general parametric likelihood theory for right-censored data. We compared these two methods in four simulation experiments and on one real-world dataset, predicting the survival probability in patients with Hodgkin lymphoma (HL).
The simulations demonstrated that the FL models performed similarly to the non-distributed case in all four experiments, with only slight deviations in predicted survival probabilities compared to the true model. Our findings were similar in the real-world advanced-stage HL example where the FL models were compared to their non-distributed versions, revealing only small deviations in performance.
The proposed procedures enable training of time-to-event models using data distributed across sites, without direct sharing of individual-level data and event times, while retaining predictive performance on par with non-distributed approaches.
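The parametric-likelihood idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the parametric family or the optimizer, so the exponential regression model, the Newton-Raphson updates, the simulated data, and all function names below (`site_summaries`, `federated_fit`, `predict_survival`, `simulate`) are assumptions made for illustration. For right-censored data with event indicator d, follow-up time t, and covariate x, the log-likelihood contribution is d(b0 + b1 x) - t exp(b0 + b1 x). Each site returns only the gradient and Hessian of its local log-likelihood, so no individual event times leave the site.

```python
import math
import random

def site_summaries(data, beta):
    """Gradient and Hessian of one site's right-censored exponential
    log-likelihood. Only these aggregates are sent to the server."""
    g = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for t, d, x in data:
        z = (1.0, x)
        mu = t * math.exp(beta[0] + beta[1] * x)  # cumulative hazard at t
        for j in range(2):
            g[j] += z[j] * (d - mu)
            for k in range(2):
                h[j][k] -= z[j] * z[k] * mu
    return g, h

def aggregate(sites, beta):
    """Server side: sum the per-site gradients and Hessians."""
    g = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for data in sites:
        gs, hs = site_summaries(data, beta)
        for j in range(2):
            g[j] += gs[j]
            for k in range(2):
                h[j][k] += hs[j][k]
    return g, h

def federated_fit(sites, n_iter=25):
    """Newton-Raphson on the summed log-likelihood (hypothetical helper)."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        g, h = aggregate(sites, beta)
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        s0 = (h[1][1] * g[0] - h[0][1] * g[1]) / det
        s1 = (h[0][0] * g[1] - h[1][0] * g[0]) / det
        m = max(abs(s0), abs(s1))
        if m > 1.0:  # cap the step length to guard against overshooting
            s0, s1 = s0 / m, s1 / m
        beta = [beta[0] - s0, beta[1] - s1]
    return beta

def predict_survival(beta, x, t):
    """Individual-level survival probability S(t | x) under the fitted model."""
    return math.exp(-t * math.exp(beta[0] + beta[1] * x))

def simulate(n):
    """Simulated site data: rows of (follow-up time, event indicator, covariate)."""
    rows = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        event_time = random.expovariate(math.exp(-1.0 + 0.5 * x))
        censor_time = random.uniform(0.0, 10.0)
        rows.append((min(event_time, censor_time),
                     int(event_time <= censor_time), x))
    return rows

random.seed(0)
sites = [simulate(200) for _ in range(3)]  # three hypothetical sites
beta_fed = federated_fit(sites)
beta_pooled = federated_fit([[row for site in sites for row in site]])
print(beta_fed, beta_pooled)
print(predict_survival(beta_fed, x=0.0, t=2.0))
```

Because the log-likelihood is a sum over patients, the summed site-level gradients and Hessians equal those of the pooled data, so in this sketch the federated and pooled fits agree up to floating-point error. A real deployment would additionally need secure communication between sites and safeguards for small sites, which are outside the scope of this example.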