Jégou Romain, Bachot Camille, Monteil Charles, Boernert Eric, Chmiel Jacek, Boucher Mathieu, Pau David
Keyrus Life Science, Nantes, France.
Roche Medical Data Center, Boulogne-Billancourt, France.
PLoS One. 2024 Nov 14;19(11):e0312697. doi: 10.1371/journal.pone.0312697. eCollection 2024.
The objective of this project was to determine whether a federated analysis approach using DataSHIELD can reproduce the results of a classical centralized analysis in a real-world setting. The research was carried out on an anonymous, synthetic, longitudinal real-world oncology cohort that was randomly split into three local databases, mimicking three healthcare organizations, and stored in a federated data platform integrating DataSHIELD. No individual-level data were transferred: statistics were calculated in parallel within each healthcare organization, and only summary statistics (aggregates) were returned to the federated data analyst. Descriptive statistics, survival analysis, regression models and correlation were first performed with the centralized approach and then reproduced with the federated approach. The results of the two approaches were then compared.
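The aggregate-only principle described above can be illustrated with a minimal Python sketch. It does not use DataSHIELD (which is R-based) and is not the authors' code. The function names `site_summary` and `pooled_mean_variance` are hypothetical, and the values are randomly generated for illustration. Only the three site sizes (157, 94 and 64) come from the study. Each site shares its count, sum and sum of squares, which is enough to recover the pooled mean and sample variance.

```python
import math
import random
import statistics


def site_summary(values):
    """Aggregates a site would share: count, sum and sum of squares (no rows)."""
    return {
        "n": len(values),
        "sum": math.fsum(values),
        "sumsq": math.fsum(v * v for v in values),
    }


def pooled_mean_variance(summaries):
    """Combine per-site aggregates into the pooled mean and sample variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    sumsq = sum(s["sumsq"] for s in summaries)
    mean = total / n
    variance = (sumsq - n * mean ** 2) / (n - 1)
    return mean, variance


# Made-up values standing in for one numeric variable; NOT the study data.
rng = random.Random(0)
cohort = [rng.gauss(60, 12) for _ in range(315)]

# Split into three "organizations" with the sample sizes reported in the abstract.
sites = [cohort[:157], cohort[157:251], cohort[251:]]

fed_mean, fed_var = pooled_mean_variance([site_summary(s) for s in sites])
central_mean = statistics.fmean(cohort)
central_var = statistics.variance(cohort)

print(math.isclose(fed_mean, central_mean, rel_tol=1e-9))
print(math.isclose(fed_var, central_var, rel_tol=1e-9))
```

For statistics that decompose into sums, such as means and variances, combining the aggregates reproduces the centralized result up to floating-point rounding.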
The cohort was split into three samples (N1 = 157 patients, N2 = 94 and N3 = 64); 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable that could not be produced because of data disclosure limitations in the federated environment, which demonstrates the good capability of DataSHIELD. For descriptive statistics, the federated and centralized approaches gave identical results, apart from some differences in position measures. Estimates from univariate regression models were similar, whereas multivariate models showed a loss of accuracy attributed to variability between the source databases.
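One plausible reason position measures can differ between the two approaches is that, unlike a mean, a median cannot be reconstructed exactly from per-site summaries. The Python sketch below uses made-up toy values and a size-weighted average of site medians as the federated approximation. That approximation is an assumption chosen for illustration, not necessarily what DataSHIELD or the authors used.

```python
import statistics

# Toy values for three hypothetical sites; NOT the study data.
site_a = [1, 2, 3, 4, 100]
site_b = [5, 6, 7]
site_c = [8, 9]
sites = [site_a, site_b, site_c]

pooled = [v for site in sites for v in site]
n_total = len(pooled)

# Centralized analysis: all individual values are available.
central_mean = statistics.fmean(pooled)
central_median = statistics.median(pooled)

# Federated analysis: each site returns only its size, mean and median.
summaries = [(len(s), statistics.fmean(s), statistics.median(s)) for s in sites]

# The size-weighted mean of site means equals the pooled mean.
fed_mean = sum(n * m for n, m, _ in summaries) / n_total

# The size-weighted mean of site medians only approximates the pooled median.
approx_median = sum(n * med for n, _, med in summaries) / n_total

print(central_mean, fed_mean)         # the two means agree
print(central_median, approx_median)  # the two medians differ
```

In this toy example the pooled median is 5.5, while the weighted combination of site medians gives 5.0, even though the means match.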
Our project demonstrated a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulations and analyses were satisfactory, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy, privacy requirements should be established, and a data quality review of the participating healthcare organizations conducted, before the analysis starts.