Pilgram Lisa, Meurers Thierry, Malin Bradley, Schaeffner Elke, Eckardt Kai-Uwe, Prasser Fabian
Junior Digital Clinician Scientist Program, Biomedical Innovation Academy, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany.
Department of Nephrology and Medical Intensive Care, Charité-Universitätsmedizin Berlin, Berlin, Germany.
J Med Internet Res. 2024 Apr 24;26:e49445. doi: 10.2196/49445.
Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that they can no longer reasonably be related to a person. Yet, such alterations have the potential to influence the data set's statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice.
The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study.
The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case-specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results.
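As an illustration of the transformation step described above, the following minimal Python sketch shows how a risk threshold can be turned into a minimum group size and enforced through generalization and record suppression. It is not the study's implementation: the field names (age, sex), the 10-year age bins, and the helper functions are hypothetical, and it assumes that the threshold bounds the risk 1/k of a record in a group of k records sharing the same quasi-identifier values.

```python
import math
from collections import Counter


def min_class_size(risk_threshold):
    """Smallest group size k with 1/k <= risk_threshold (assumed risk model)."""
    # The small epsilon guards against floating-point noise in 1 / threshold.
    return math.ceil(1 / risk_threshold - 1e-9)


def generalize_age(age, width):
    """Generalization: replace an exact age with an interval, e.g. 47 -> '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(records, quasi_identifiers, risk_threshold, age_width=10):
    """Generalize the hypothetical 'age' field, then suppress risky records."""
    k = min_class_size(risk_threshold)
    generalized = [dict(r, age=generalize_age(r["age"], age_width)) for r in records]

    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    # Count how many records share each combination of quasi-identifier values.
    sizes = Counter(key(r) for r in generalized)
    # Suppression: drop every record whose group is smaller than k.
    return [r for r in generalized if sizes[key(r)] >= k]


# Hypothetical toy records, not GCKD data.
toy = [
    {"age": 41, "sex": "F"},
    {"age": 45, "sex": "F"},
    {"age": 48, "sex": "F"},
    {"age": 72, "sex": "M"},
]
released = anonymize(toy, ["age", "sex"], risk_threshold=0.5)
```

Under this assumed risk model, a threshold of 0.02 corresponds to groups of at least 50 records, whereas a threshold of 1 imposes no grouping constraint, which is consistent with the range being chosen to span a large portion of the risk-utility space.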
Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics. For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50%. The use case-specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy.
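To make the reproducibility metric concrete, the following minimal Python sketch computes an overlap between two 95% CIs. The abstract does not give the exact formula, so this uses one commonly used definition as an assumption: the length of the shared interval, averaged relative to the lengths of the original and the anonymized CI, with nonoverlapping intervals scored as 0. The function name and the example intervals are hypothetical.

```python
def ci_overlap(original, anonymized):
    """Overlap of two confidence intervals, each given as (lower, upper).

    Assumed definition: shared length divided by each interval's own length,
    averaged over both intervals. Returns a value between 0.0 and 1.0.
    """
    (lo_o, hi_o), (lo_a, hi_a) = original, anonymized
    shared = min(hi_o, hi_a) - max(lo_o, lo_a)
    if shared <= 0:
        # Nonoverlapping (or merely touching) intervals are scored as 0.
        return 0.0
    return 0.5 * (shared / (hi_o - lo_o) + shared / (hi_a - lo_a))


# Hypothetical estimates, not results from the study.
identical = ci_overlap((1.0, 2.0), (1.0, 2.0))  # 1.0
shifted = ci_overlap((0.0, 2.0), (1.0, 3.0))    # 0.5
disjoint = ci_overlap((0.0, 1.0), (2.0, 3.0))   # 0.0
```

With such a measure, an overlap of 1 means that the anonymized analysis reproduces the original interval exactly, and an overlap of 0 corresponds to a nonoverlapping 95% CI as reported for 6 estimates.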
Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case-specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data.
German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1093/ndt/gfr456.