Eicher Johanna, Kuhn Klaus A, Prasser Fabian
Institute of Medical Statistics and Epidemiology, University Hospital rechts der Isar, Technical University of Munich, Germany.
Stud Health Technol Inform. 2017;245:704-708.
When individual-level health data are shared in biomedical research, the privacy of patients must be protected. This is typically achieved with data de-identification methods, which transform data so that formal privacy requirements are met. In the process, it is important to minimize the loss of information in order to maintain data quality. Although several models have been proposed for measuring this loss, it remains unclear which model is best suited to which application. We therefore performed an extensive experimental comparison. We first implemented several common quality models in the ARX de-identification tool for biomedical data. We then used each model to de-identify a patient discharge dataset covering almost 4 million cases and analyzed the outputs to measure the impact of the different quality models on real-world applications. Our results show that different models are best suited to specific applications, but that one model (Non-Uniform Entropy) is particularly well suited to general-purpose use.
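To make the idea of a quality model concrete, the following is a minimal sketch of a non-uniform entropy style information-loss measure for a single attribute. It assumes one common formulation from the literature, in which each record contributes -log2 of the probability of its original value given its generalized value in the output. The function name `non_uniform_entropy_loss` and the sample values are hypothetical, and this is not the ARX implementation, which may normalize or aggregate differently.

```python
import math
from collections import Counter


def non_uniform_entropy_loss(original, generalized):
    """Illustrative information loss for one attribute (assumed formulation).

    For every record, add -log2(P(original value | generalized value)),
    where the probability is estimated from frequencies in the dataset.
    A result of 0 means the generalization lost no information.
    """
    if len(original) != len(generalized):
        raise ValueError("columns must have the same number of records")

    # How often each (original, generalized) pair occurs.
    pair_counts = Counter(zip(original, generalized))
    # How often each generalized value occurs in the output.
    group_counts = Counter(generalized)

    return sum(
        -math.log2(pair_counts[(o, g)] / group_counts[g])
        for o, g in zip(original, generalized)
    )


# Hypothetical toy column: ages generalized to ten-year intervals.
ages = [34, 36, 38, 52]
intervals = ["30-39", "30-39", "30-39", "50-59"]
suppressed = ["*", "*", "*", "*"]

loss_intervals = non_uniform_entropy_loss(ages, intervals)
loss_suppressed = non_uniform_entropy_loss(ages, suppressed)
loss_unchanged = non_uniform_entropy_loss(ages, ages)
```

In this toy example, suppressing every value costs more than generalizing to intervals, and leaving the column unchanged costs nothing, which is the ordering a quality model is expected to produce.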