Institute of Data Analysis and Process Design, Zurich University of Applied Sciences, Winterthur, Switzerland.
Department of Population Health, London School of Hygiene and Tropical Medicine, Lilongwe, Malawi.
JMIR Public Health Surveill. 2022 Sep 2;8(9):e34472. doi: 10.2196/34472.
Data anonymization and sharing have become popular topics for individuals, organizations, and countries worldwide. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions or limitations.
This study aimed to highlight the requirements and possible solutions for sharing health surveillance event history data. The challenges lie in the anonymization of multiple event dates and time-varying variables.
A sequential approach that adds noise to event dates is proposed. This approach maintains the event order and preserves the average time between events. In addition, a distance-based matching approach under a nosy neighbor scenario is proposed to estimate the disclosure risk. For key variables that change over time, such as educational level or occupation, we make 2 proposals: one limits the intermediate statuses of an individual, and the other achieves k-anonymity within subsets of the data. The proposed approaches were applied to the Karonga health and demographic surveillance system (HDSS) core residency data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 events with time-varying socioeconomic variables and demographic information.
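The sequential noise idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `add_sequential_noise`, the parameters `max_shift_days` and `rel_noise`, and the choice of a symmetric multiplicative factor on each gap are all assumptions made for illustration. The first event date is shifted by a random offset, and each gap between consecutive events is then scaled by a random factor centered on 1, so gaps never become negative (order is kept) and each gap is unchanged in expectation, apart from rounding to whole days.

```python
import random
from datetime import date, timedelta


def add_sequential_noise(event_dates, max_shift_days=30, rel_noise=0.5, seed=None):
    """Perturb one individual's event dates while keeping their order.

    Hypothetical sketch: parameter names and the noise model are assumptions,
    not taken from the paper.
    """
    if not 0 <= rel_noise <= 1:
        raise ValueError("rel_noise must lie in [0, 1] so gaps stay non-negative")
    if not event_dates:
        return []

    rng = random.Random(seed)
    dates = sorted(event_dates)

    # Shift the first event by a uniform random number of days.
    shift = rng.randint(-max_shift_days, max_shift_days)
    noisy = [dates[0] + timedelta(days=shift)]

    # Rebuild later events sequentially from noisy gaps.
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        # Factor is symmetric around 1, so the expected gap equals the original gap.
        factor = rng.uniform(1 - rel_noise, 1 + rel_noise)
        noisy_gap = round(gap * factor)  # non-negative because factor >= 0
        noisy.append(noisy[-1] + timedelta(days=noisy_gap))

    return noisy


events = [date(2001, 3, 14), date(2004, 7, 2), date(2010, 11, 30)]
noisy_events = add_sequential_noise(events, seed=1)
```

Because every later date is built from the previous noisy date plus a non-negative gap, the output is always in non-decreasing order. Events that fall on the same day in the original data stay on the same day in this sketch.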
An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility, was created.
Applying the proposed anonymization to HDSS event history data comprising static and time-varying variables led to an acceptable disclosure risk and preserved utility, so the data can be shared as public use data. High utility was achieved even with the highest level of noise added to the core event dates. The details are important to ensure consistency or credibility. Importantly, the sequential noise addition approach presented in this study not only maintains the event order recorded in the original data but also maintains the time between events. We proposed an approach that preserves the data utility well but limits the number of response categories for the time-varying variables. Furthermore, using distance-based neighborhood matching, we simulated an attack under a nosy neighbor scenario and under a worst-case scenario in which attackers have full information on the original data. We showed that the disclosure risk is very low, even when assuming that the attacker's database and information are optimal. The HDSS and medical science research communities in low- and middle-income country settings will be the primary beneficiaries of the results and methods presented in this paper; however, the results will be useful for anyone working on anonymizing longitudinal event history data with time-varying variables for the purposes of sharing.