Halilovic Mehmed, Meurers Thierry, Otte Karen, Prasser Fabian
Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117, Berlin, Germany.
BMC Med Inform Decis Mak. 2025 Mar 12;25(1):129. doi: 10.1186/s12911-025-02959-z.
Sharing health data holds great potential for advancing medical research but also poses many challenges, including the need to protect people's privacy. One approach to address this is data anonymization, which refers to the process of altering or transforming a dataset to preserve the privacy of the individuals contributing data. To this end, privacy models have been designed to measure risks, and optimization algorithms can be used to transform data to achieve a good balance between risk reduction and the preservation of the dataset's utility. However, this process is computationally complex and challenging to apply to large datasets. Previously suggested parallel algorithms have been tailored to specific risk models, utility models and transformation methods.
We present a novel parallel algorithm that supports a wide range of methods for measuring risks, optimizing utility and transforming data. The algorithm trades data utility for parallelization by anonymizing partitions of the dataset in parallel. To ensure the correctness of the anonymization process, the algorithm carefully controls the process and, if needed, rearranges partitions and performs additional transformations.
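The general idea of partition-wise anonymization can be illustrated with a minimal Python sketch. This is not the P4 algorithm or its open-source implementation. It assumes plain k-anonymity as the privacy model, a fixed generalization of two hypothetical attributes (age and ZIP code), record suppression as the only additional transformation, and a round-robin partitioning scheme. All function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

SUPPRESSED = ("*", "*")  # placeholder for a fully suppressed record


def generalize(record):
    """Coarsen (age, zip_code) to a decade range and a 3-digit ZIP prefix."""
    age, zip_code = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")


def anonymize(records, k):
    """Generalize records, then suppress classes with fewer than k members."""
    generalized = [generalize(r) for r in records]
    counts = Counter(generalized)
    return [g if counts[g] >= k else SUPPRESSED for g in generalized]


def is_k_anonymous(records, k):
    """Check that every non-suppressed equivalence class has at least k members."""
    counts = Counter(r for r in records if r != SUPPRESSED)
    return all(c >= k for c in counts.values())


def parallel_anonymize(records, k, workers):
    """Anonymize round-robin partitions independently, then merge and verify."""
    partitions = [records[i::workers] for i in range(workers)]
    # Threads keep the sketch self-contained; a real system would more likely
    # use separate processes or distributed workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda p: anonymize(p, k), partitions))
    merged = [r for part in parts for r in part]
    # Final control step. For plain k-anonymity, merging k-anonymous partitions
    # stays k-anonymous, so this is only a safety net here. Other privacy
    # models need not compose this way and may require extra transformations.
    if not is_k_anonymous(merged, k):
        counts = Counter(merged)
        merged = [r if counts[r] >= k else SUPPRESSED for r in merged]
    return merged


# Hypothetical sample data: (age, ZIP code)
records = [(34, "10117"), (36, "10115"), (31, "10119"),
           (38, "10117"), (52, "10243"), (57, "10245")]

sequential = anonymize(records, k=2)
parallel = parallel_anonymize(records, k=2, workers=2)
print(sequential.count(SUPPRESSED), parallel.count(SUPPRESSED))
```

In this toy dataset, the two records aged 52 and 57 end up in different partitions, so each is suppressed locally even though together they would form a valid class of size two. This mirrors, in a very simplified form, the trade-off described above: partitioning enables parallel work but can cost some data utility.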
We demonstrate the effectiveness of our method through an open-source implementation. Our experiments show that our approach can reduce execution times by up to one order of magnitude with minor impacts on output data utility in a wide range of scenarios.
Our novel P4 algorithm for parallel and distributed data anonymization is, to the best of our knowledge, the first to systematically support a wide variety of privacy, transformation and utility models.