Aalto University, Espoo, 00076, Finland.
University of Helsinki, Helsinki, 00014, Finland.
BMC Med Inform Decis Mak. 2024 Jun 14;24(1):167. doi: 10.1186/s12911-024-02563-7.
Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and the parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy-preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.
We perform an empirical evaluation based on an existing prospective cohort study from the literature. We simulate multiple parties by splitting the UK Biobank cohort by assessment center, and generate synthetic data for each party using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis to the combined synthetic data sets and evaluate the effects of 1) the size of the local data set, 2) the number of participating parties, and 3) local shifts in distributions on the obtained likelihood scores.
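The evaluation pipeline described above can be illustrated with a toy sketch. It is not the study's implementation: the data are simulated rather than taken from the UK Biobank, a Laplace-noised histogram stands in for the differentially private generative model, and the regression has a single binary covariate so that the Poisson maximum likelihood estimate has a closed form. All names and parameter values (`Y_MAX`, `epsilon`, `simulate_party`, `dp_synthetic`, the coefficients) are assumptions made for illustration only.

```python
import math
import random

Y_MAX = 10  # assumed public clipping bound on the count outcome


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_party(rng, n, p_exposed, b0=-1.0, b1=0.7):
    """Simulate one party: binary covariate x, count y ~ Poisson(exp(b0 + b1*x))."""
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < p_exposed else 0
        rows.append((x, sample_poisson(rng, math.exp(b0 + b1 * x))))
    return rows


def laplace(rng, scale):
    """Laplace(0, scale) noise as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_synthetic(rng, rows, epsilon):
    """Stand-in synthesiser: Laplace-noised histogram over (x, clipped y).

    Each record falls in exactly one cell, so under add/remove adjacency the
    histogram has L1 sensitivity 1 and noise of scale 1/epsilon is used.
    Illustrative only: floating-point noise from `random` is not suitable
    for a real privacy-preserving release.
    """
    cells = [(x, y) for x in (0, 1) for y in range(Y_MAX + 1)]
    counts = dict.fromkeys(cells, 0)
    for x, y in rows:
        counts[(x, min(y, Y_MAX))] += 1
    # Clamping and resampling are post-processing of the noisy counts.
    noisy = [max(counts[c] + laplace(rng, 1.0 / epsilon), 0.0) for c in cells]
    n_synth = round(sum(noisy))
    if n_synth <= 0:
        return []
    return rng.choices(cells, weights=noisy, k=n_synth)


def fit_poisson(rows):
    """Closed-form Poisson MLE for log E[y] = b0 + b1*x with binary x."""
    rates = []
    for group in (0, 1):
        ys = [y for x, y in rows if x == group]
        rates.append(max(sum(ys) / max(len(ys), 1), 1e-6))
    return math.log(rates[0]), math.log(rates[1]) - math.log(rates[0])


def mean_loglik(rows, b0, b1):
    """Average Poisson log-likelihood of rows under coefficients (b0, b1)."""
    total = 0.0
    for x, y in rows:
        eta = b0 + b1 * x
        total += y * eta - math.exp(eta) - math.lgamma(y + 1)
    return total / len(rows)


rng = random.Random(0)
# Four hypothetical parties whose exposure rates differ (a local distribution shift).
parties = [simulate_party(rng, 300, p) for p in (0.1, 0.3, 0.5, 0.7)]
test_rows = simulate_party(rng, 5000, 0.4)  # held-out data for scoring

local_fit = fit_poisson(parties[0])
combined = [row for party in parties for row in dp_synthetic(rng, party, epsilon=1.0)]
shared_fit = fit_poisson(combined)

print("local only      :", mean_loglik(test_rows, *local_fit))
print("shared synthetic:", mean_loglik(test_rows, *shared_fit))
```

Varying the party size, the number of parties, and the exposure rates in this sketch mirrors the three factors the evaluation considers, although the resulting numbers say nothing about the study's actual findings.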
We find that parties engaging in collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters than they do using only their local data. This finding extends to the difficult case of small, heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become, up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analyses for those groups.
Based on our results, we conclude that sharing synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, and our study shows that privacy-preserving collaborative learning methods can alleviate it.