Johann Tim I, Otte Karen, Prasser Fabian, Dieterich Christoph
Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Im Neuenheimer Feld 669, 69120 Heidelberg, Germany.
Berlin Institute of Health, Charité, Universitätsmedizin Berlin, Charitéplatz 1, 10115 Berlin, Germany.
Eur Heart J Digit Health. 2024 Nov 20;6(1):147-154. doi: 10.1093/ehjdh/ztae083. eCollection 2025 Jan.
Data availability remains a critical challenge in modern, data-driven medical research. Due to the sensitive nature of patient health records, they are rightfully subject to stringent privacy protection measures. One way to overcome these restrictions is to preserve patient privacy by using anonymization and synthetization strategies. In this work, we investigate the effectiveness of these methods for protecting patient privacy using real-world cardiology health records.
We implemented anonymization and synthetization techniques for a structured data set, which was collected during the HiGHmed Use Case Cardiology study. We employed the data anonymization tool ARX and the data synthetization framework ASyH individually and in combination. We evaluated the utility and shortcomings of the different approaches by statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores on the protected data sets. We observed only minimal deviations from the scores computed on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats.
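To make the idea of a re-identification risk analysis concrete, the following is a minimal sketch in Python and not the ARX implementation or the authors' procedure. It assumes a commonly used measure: records that share the same quasi-identifier values form an equivalence class, and a record's risk is taken as one divided by the size of its class. The function names, the column names (`age_group`, `sex`, `zip3`), and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Return, for each record, the size of its equivalence class.

    Two records belong to the same class when they agree on every
    quasi-identifier (hypothetical column names in this sketch).
    """
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def reidentification_risk(records, quasi_identifiers):
    """Summarize per-record risk, assumed here to be 1 / class size."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return {
        # Smallest class size: the data set is k-anonymous for this k.
        "k": min(sizes),
        # Risk of the most exposed record.
        "highest_risk": 1.0 / min(sizes),
        # Mean risk over all records.
        "average_risk": sum(1.0 / s for s in sizes) / len(sizes),
    }


# Toy, made-up records with already generalized attributes.
records = [
    {"age_group": "60-69", "sex": "f", "zip3": "691"},
    {"age_group": "60-69", "sex": "f", "zip3": "691"},
    {"age_group": "70-79", "sex": "m", "zip3": "691"},
    {"age_group": "70-79", "sex": "m", "zip3": "691"},
    {"age_group": "70-79", "sex": "m", "zip3": "691"},
]

risk = reidentification_risk(records, ["age_group", "sex", "zip3"])
print(risk)
```

In this sketch, coarser generalization of the quasi-identifiers produces larger equivalence classes and therefore lower per-record risk, at the cost of less detailed data.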
We demonstrated that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We share all generated data sets with the scientific community through a use and access agreement.